Graph Tiling Jobs

In addition to heatmap layers of individual data points, Aperture Tiles supports visualizations of graph datasets that contain edge and node information. Graph visualizations illustrate the relationships between nodes and communities across multiple zoom levels. The process of generating the tile pyramid that represents this type of layer is a graph tiling job.

Graph tiling jobs comprise several configuration and generation phases as described in the Graph Tiling Process section.

NOTE: The graph tiling capabilities of Aperture Tiles are considered experimental.

Graph Tiling Process

Aperture Tiles requires graph data to be in a comma- or tab-delimited format (CSV). If your data is in a GraphML format, the first step in the graph tiling process is to convert the GraphML data to a CSV format.

Once your source data is in a valid delimited format, you can apply one or both of the following optional preparation steps before proceeding with tile generation:

  1. Hierarchical Clustering: Hierarchically group nodes into communities using Louvain detection based on modularity-maximization.
  2. Graph Layout: Compute node locations using a force-directed algorithm.

Once your data is prepared, you can execute your tile generation job using standard or customized methods.

Converting GraphML Data to CSV Data

Aperture Tiles requires graph data to be in comma- or tab-delimited format (CSV). GraphML data can be converted to CSV using the following tools in com.oculusinfo.tilegen.graph.util:

GraphParseApp

GraphParseApp is an example Java application for converting GraphML data to CSV.

To execute the GraphParseApp and convert your GraphML data to a CSV file
  • Use the following command line syntax:

    java GraphParseApp -in source.graphml -out output.csv -longIDs true 
    -nAttr nAttr1,nAttr2 -eAttr eAttr1,eAttr2 -nCoordAttr NO
    

    Where:

    Argument Required? Description
    -in Yes Path and filename of GraphML input file.
    -out Yes Path and filename of tab-delimited output file.
    -longIDs No Indicates whether nodes should be assigned a unique long ID (true) regardless of the ID format in the original file. This ID convention is needed for data processing with Spark's GraphX library. Defaults to false.
    -nAttr No List of node attributes to parse. Enter as a list of attribute ID tags separated by commas. Defaults to all node attributes.
    -eAttr No List of edge attributes to parse. Enter as a list of attribute ID tags separated by commas. Defaults to all edge attributes.
    -nCoordAttr No Node attributes to use for node coordinates. Enter as a list of attribute ID tags separated by commas. Defaults to NO, which indicates that no coordinate data should be associated with nodes.

Output

The GraphParseApp outputs a file that contains tab-delimited data. The first column denotes whether each record is a node or an edge object.

Node object columns:
  • Long ID identifier
  • Original string name
  • Any additional node attributes in the source file, or just those passed in with the -nAttr argument

Edge object columns:
  • ID of the source node
  • ID of the destination node
  • Any additional edge attributes in the source file, or just those passed in with the -eAttr argument

GraphParseApp also creates a README file containing column labels for all node and edge records in the CSV file.
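As a concrete illustration, the record layout described above can be parsed with a few lines of Python. This is a hedged sketch: the exact field order for your dataset is defined by the generated README file, and the dictionary shape used here is purely illustrative.

```python
# Illustrative parser for the tab-delimited node/edge records described
# above. The field layout is an assumption based on the column
# descriptions; consult the generated README file for the authoritative
# labels of your dataset.

def parse_record(line):
    fields = line.rstrip("\n").split("\t")
    if fields[0] == "node":
        # node <long ID> <original string name> <attributes...>
        return {"type": "node", "id": int(fields[1]),
                "name": fields[2], "attrs": fields[3:]}
    elif fields[0] == "edge":
        # edge <source node ID> <destination node ID> <attributes...>
        return {"type": "edge", "src": int(fields[1]),
                "dst": int(fields[2]), "attrs": fields[3:]}
    raise ValueError("unrecognized record type: " + fields[0])
```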

Workflow Example

Consider a brain connectomics GraphML dataset with the following node and edge formats:

<node id="n0">
    <data key="v_region">rh-rostralmiddlefrontal</data>
    <data key="v_centroid">[59, 47, 91]</data>
    <data key="v_id">n0</data>
    <data key="v_latent_pos">NA</data>
    <data key="v_scan1">40</data>
    <data key="v_tri">30</data>
    <data key="v_clustcoeff">0.666667</data>
    <data key="v_degree">10</data>
</node>
...
<edge source="n361054" target="n364322">
    <data key="e_weight">19</data>
</edge>
...

To visualize this data, we are only interested in the v_region and v_degree node attributes and the e_weight edge attribute (i.e., all edge attributes). Therefore, the proper GraphParseApp command line syntax is:

java GraphParseApp -in <graphML file> -out <output CSV file> -longIDs true 
-nAttr v_region,v_degree

GraphParseApp then outputs a file containing records for all of your nodes with the following format:

Node record:
node 0 n0 rh-rostralmiddlefrontal 10

Edge record:
edge 361054 364322 19

Hierarchical Clustering

CSV datasets can be hierarchically clustered using the GraphClusterApp Scala application in com.oculusinfo.tilegen.graph.cluster. GraphClusterApp groups nodes into communities using Louvain community detection based on modularity-maximization.

The number of hierarchical levels (X) is dependent on your data, the modularity of the graph, how sparse or dense the dataset is, etc. Hierarchy level_0 corresponds to the raw unclustered data, while level_X is the most clustered (top-level communities).

The ID and label for a parent community are automatically taken from the underlying child node with the highest weighted degree (i.e., the sum of the weights of all edges incident on a given node). The metadata for a parent community is chosen in the same manner.
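The parent-labeling rule can be sketched as follows. This is an illustrative, single-machine rendition (the real implementation runs distributed on Spark/GraphX), and the data shapes (tuples and dicts) are assumptions for the example.

```python
# Sketch of the parent-labeling rule: a parent community inherits the ID
# and metadata of its member node with the highest weighted degree
# (sum of weights of incident edges).

def weighted_degrees(edges):
    """edges: iterable of (src, dst, weight) tuples."""
    deg = {}
    for src, dst, w in edges:
        deg[src] = deg.get(src, 0) + w
        deg[dst] = deg.get(dst, 0) + w
    return deg

def parent_label(members, edges, metadata):
    """Pick the member node with the highest weighted degree."""
    deg = weighted_degrees(edges)
    best = max(members, key=lambda n: deg.get(n, 0))
    return best, metadata.get(best)
```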

GraphClusterApp

GraphClusterApp's implementation of the Louvain clustering algorithm uses Sotera's distributed Louvain modularity algorithm in conjunction with Spark's GraphX graph processing library.

To execute the GraphClusterApp and hierarchically cluster your data
  • Use the following command line syntax:

    spark-submit --class com.oculusinfo.tilegen.graph.cluster.GraphClusterApp 
    <tile-generation-assembly.jar> -source <hdfs source location> -output 
    <hdfs output location> -onlyEdges false -nID 1 -nAttr 2 -eSrcID 1 -eDstID 2 
    -eWeight 3 -spark local -sparkhome /opt/spark -user <username> 
    

    Where:

    Argument Required? Description
    -source Yes HDFS location of the input data.
    -output Yes HDFS location to which to save clustered results.
    -onlyEdges No Indicates whether the source data contains only edges (true). Defaults to false.
    -parts No Number of partitions into which to break up the source dataset. Defaults to value chosen automatically by Spark.
    -p No Amount of parallelism for Spark-based data processing of source data. Defaults to value chosen automatically by Spark.
    -d No Source dataset delimiter. Defaults to tab-delimited.
    -progMin No Percent of nodes that must change communities for the algorithm to consider progress relative to total vertices in a level. Defaults to 0.15.
    -progCount No Number of times the algorithm can fail to make progress before exiting. Defaults to 1.
    -nID When -onlyEdges = false Number of the column in the raw data that contains the node IDs. Note that IDs must be of type long.
    -nAttr No Column numbers in the raw data that contain additional node metadata that should be parsed and saved with cluster results. Individual attribute tags should be separated by commas.
    -eSrcID Yes Number of the column in the raw data that contains the edge source IDs. Note that IDs must be of type long.
    -eDstID Yes Number of the column in the raw data that contains the edge destination IDs. Note that IDs must be of type long.
    -eWeight No Number of the column in the raw data that contains the edge weights. Defaults to -1, meaning that no edge weighting is used.
    -spark Yes Spark master location.
    -sparkhome Yes Spark HOME location.
    -user No Spark/Hadoop username associated with the Spark job.

Input

GraphClusterApp accepts two types of graph data formats:

  • Node and edge tab-delimited data, where the first column contains the keyword node or edge
  • Edge-only tab-delimited data with different columns for source ID, destination ID and edge weight (optional).

    In this case, all nodes will be inferred internally from the parsed edges, but no node attributes or metadata will be associated with the clustered nodes or communities.

NOTE: GraphClusterApp requires that node IDs and edge weights are of type long.

Output

Clustered results are stored in sub-directories under the -output HDFS location. Each hierarchical level from 0 to X has its own separate sub-directory.

Within each hierarchical level, clustered data is stored in the following tab-delimited format for nodes/communities and edges:

Node columns:
node <ID> <parent ID> <number of internal nodes> <node degree> <metadata (optional)>

Edge columns:
edge <srcID> <dstID> <edge weight>

The modularity (q-value) is also saved for each hierarchical level in a _qvalues sub-directory.
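As an illustration of consuming this output, the following sketch groups clustered node records by their parent community, assuming the tab-delimited field order shown above.

```python
# Illustrative grouping of clustered node records by parent community.
# Field order (node, ID, parent ID, ...) is assumed from the layout
# described above.

def group_by_parent(lines):
    children = {}
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if f[0] != "node":
            continue  # skip edge records
        node_id, parent_id = int(f[1]), int(f[2])
        children.setdefault(parent_id, []).append(node_id)
    return children
```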

Workflow Example

The following example illustrates command line syntax for GraphClusterApp with Spark in local mode:

spark-submit --class com.oculusinfo.tilegen.graph.cluster.GraphClusterApp 
<tile-generation-assembly.jar> -source <hdfs source location> -output 
<hdfs output location> -onlyEdges false -nID 1 -nAttr 2 -eSrcID 1 -eDstID 2 
-eWeight 3 -spark local -sparkhome /opt/spark -user <username>

In this case, the source dataset contains both nodes and edges (-onlyEdges false) with the following columns:

  • For nodes:
    • Column 1 contains the node IDs (-nID 1)
    • Column 2 contains node metadata (-nAttr 2)
  • For edges
    • Column 1 contains source IDs (-eSrcID 1)
    • Column 2 contains destination IDs (-eDstID 2)
    • Column 3 contains the edge weights (-eWeight 3)

Using the brain connectomics dataset as an example:

  • Graph community X at hierarchy level 2 contains 100 communities (from hierarchy 1)
  • If the child community with the highest weighted degree has ID 0 and metadata rh-rostralmiddlefrontal, community X at hierarchy 2 will be labeled with the same ID and metadata.

Graph Layout

Node positions can be computed using a hierarchical force-directed algorithm with the following tools in com.oculusinfo.tilegen.graph.util:

The hierarchical force-directed algorithm runs in a distributed manner using Spark's GraphX library. The layout of each hierarchy level is determined in sequence, starting with the highest hierarchical level. Graph communities are drawn as circles whose size is based on the number of internal raw nodes in the community.
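As a rough illustration of the circle sizing: if circle area is taken to be proportional to the number of internal raw nodes, the radius scales with the square root of that count. The proportionality constant here is a placeholder; the layout application's exact scaling may differ.

```python
import math

# Assumed relation: circle area proportional to internal node count,
# so radius = sqrt(count * area_per_node / pi). The area_per_node
# constant is a hypothetical placeholder for this sketch.

def community_radius(num_internal_nodes, area_per_node=1.0):
    return math.sqrt(num_internal_nodes * area_per_node / math.pi)
```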

ClusteredGraphLayoutApp

To execute the ClusteredGraphLayoutApp and position your node/community data
  • Use the following command line syntax:

    spark-submit --class com.oculusinfo.tilegen.graph.util.ClusteredGraphLayoutApp 
    <tile-generation-assembly.jar> -source <hdfs source location> -output 
    <hdfs output location> -i 1000 -maxLevel 4 -layoutLength 256 -nArea 45 
    -border 2 -eWeight true -g 0 -spark local -sparkhome /opt/spark
    -user <username>
    

    Where:

    Argument Required? Description
    -source Yes HDFS location of the clustered input graph data.
    -output Yes HDFS location to which to save graph layout results.
    -parts No Number of partitions into which to break up the source dataset. Defaults to value chosen automatically by Spark.
    -p No Amount of parallelism for Spark-based data processing of source data. Defaults to value chosen automatically by Spark.
    -d No Source dataset delimiter. Defaults to tab-delimited.
    -i No Maximum number of iterations for the force-directed algorithm. Defaults to 500.
    -maxLevel No Highest cluster hierarchic level to use for determining the graph layout. Defaults to 0.
    -border No Percent of the parent bounding box to leave as whitespace between neighbouring communities during initial layout. Defaults to 2 percent.
    -layoutLength No Desired width/height of the total graph layout region. Defaults to 256.0.
    -nArea No Area of all node circles within a given parent community. Controls the amount of whitespace in the graph layout. Defaults to 30 percent.
    -eWeight No Indicates whether to use edge weights to scale force-directed attraction forces (true). Defaults to false.
    -g No Amount of gravitational force to use for force-directed layout to prevent outer nodes from spreading out too far. Defaults to 0 (no gravity).
    -spark Yes Spark master location.
    -sparkhome Yes Spark HOME location.
    -user No Spark/Hadoop username associated with the Spark job.

Input

The source location should correspond to the root directory of hierarchically clustered data. The force-directed layout algorithm works in conjunction with hierarchically clustered data; it expects all hierarchical levels to be in separate sub-directories labeled level_##.

NOTE: Not all hierarchy levels are required to generate a layout.

ClusteredGraphLayoutApp requires the source clustered graph data to be in a tab-delimited format analogous to that used by the Louvain clustering application:

Node columns:
node <ID> <parent ID> <number of internal nodes> <node degree> <metadata (optional)>

Edge columns:
edge <srcID> <dstID> <edge weight>

Output

Graph layout results are stored separately for each hierarchical level. Each level has the following tab-delimited format:

Node columns:
node <ID> <XY coords> <radius> <parent ID> <parent XY coords> <parent radius> <number of internal nodes> <degree> <metadata (optional)>

Edge columns:
edge <srcID> <src XY coords> <dstID> <dst XY coords> <edge weight> <isInterCommunityEdge>

NOTE: The final edge column will be 0 for intra-community edges (both endpoints have the same parent community) or 1 for inter-community edges.
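The flag can be computed directly from parent community assignments, as in this minimal sketch (the parent mapping is a hypothetical dict from node ID to parent community ID):

```python
# isInterCommunityEdge: 0 when both endpoints share a parent community,
# 1 otherwise. `parent` maps node ID -> parent community ID.

def is_inter_community(src, dst, parent):
    return 0 if parent[src] == parent[dst] else 1
```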

ClusteredGraphLayoutApp also saves a separate stats sub-directory that contains general information about the layout of each hierarchical level, including:

  • Number of nodes and edges
  • Min and max radii for the graph communities

NOTE: Radii information for each hierarchical level is used to provide a minimum recommended zoom level for tile generation. From a graph visualization standpoint, it is advantageous for a parent community radius to correspond to approximately one tile length at a given zoom level.
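A back-of-envelope version of this recommendation: with a total layout length L, a tile at zoom level z spans L / 2^z layout units, so the zoom at which a community's diameter is about one tile length is roughly log2(L / diameter). This heuristic is an assumption based on the note above, not the library's exact formula.

```python
import math

# Estimate the zoom level at which a community of the given radius spans
# roughly one tile, assuming a total layout length of 256 units.
# This is a heuristic sketch, not the library's exact computation.

def recommended_zoom(radius, layout_length=256.0):
    diameter = 2.0 * radius
    z = math.log2(layout_length / diameter)
    return max(0, round(z))
```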

Example Workflow

Consider a graph dataset clustered up to hierarchy 4. The expected directory structure for the graph layout application is:

../<hdfs source location>
                        /level_4/
                        /level_3/
                        /level_2/
                        /level_1/
                        /level_0/

The following command line syntax defines a layout for the clustered graph dataset with:

  • 4 hierarchical levels (to use only a subset of hierarchy levels, set -maxLevel to exclude hierarchies from the layout process)
  • Force-directed algorithm set to 1000 iterations
  • 45% non-whitespace
  • Edge weights
  • No gravity force
spark-submit --class com.oculusinfo.tilegen.graph.util.ClusteredGraphLayoutApp 
<tile-generation-assembly.jar> -source <hdfs source location> -output 
<hdfs output location> -i 1000 -maxLevel 4 -layoutLength 256 -nArea 45 -border 2 
-eWeight true -g 0 -spark local -sparkhome /opt/spark -user <username>

Tile Generation

Once a graph dataset has been converted to CSV format and the nodes have been positioned (using raw positions for geo-located data, or using the hierarchical force-directed algorithm), tile generation can be performed. Standard heatmaps of nodes and edges can be generated using the CSVGraphBinner application in com.oculusinfo.tilegen.examples.apps.

NOTE: For information on performing custom tile generation jobs, see Graph Analytics.

CSVGraphBinner

In general, the CSVGraphBinner application works similarly to the standard Aperture Tiles CSVBinner. It ingests a properties file (*.bd) and creates a collection of Avro tile data files.

NOTE: Heatmaps of graph nodes and edges must be generated separately.

Nodes

The following BD file parameters are used to configure CSVGraphBinner to generate tiles of graph nodes. For additional information on BD file parameters, see the CSVBinner section of the Standard Tiling Jobs topic.

# Indicate that you want to create a tile set of node elements
oculus.binning.graph.data=nodes

# Specify the delimiter character. Defaults to tab (\t)
oculus.binning.parsing.separator=\t

# Specify the column numbers of your x/y coordinates
oculus.binning.parsing.x.index=2
oculus.binning.parsing.y.index=3

# Map the x/y coordinates to the cartesian index scheme used by the binner
oculus.binning.index.field.0=x
oculus.binning.index.field.1=y

# Define the projection (x/y cross-plot) over which to draw the nodes and 
# manually specify the min/max bounds
oculus.binning.projection.type=areaofinterest
oculus.binning.projection.autobounds=false
oculus.binning.projection.minX=0.0
oculus.binning.projection.maxX=256.0
oculus.binning.projection.minY=0.0
oculus.binning.projection.maxY=256.0

Given these parameters, the application parses columns 2 and 3 of each node object and uses them as x/y coordinates during tile generation. By default, the count of nodes is used as the binning value.
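To make the binning concrete, the following sketch maps an (x, y) point in the 0-256 area-of-interest projection to a tile and bin index at a given zoom level, assuming 256x256 bins per tile. The exact indexing conventions in Aperture Tiles (e.g., y-axis orientation) may differ.

```python
# Illustrative mapping from an area-of-interest (x, y) point to a tile
# and bin index at a given zoom level, assuming the 0-256 bounds
# configured above and 256x256 bins per tile.

def to_tile_and_bin(x, y, level, min_v=0.0, max_v=256.0, bins=256):
    n = 2 ** level                      # tiles per axis at this zoom level
    fx = (x - min_v) / (max_v - min_v)  # normalize to [0, 1]
    fy = (y - min_v) / (max_v - min_v)
    tile_x = min(int(fx * n), n - 1)    # clamp the upper bound into range
    tile_y = min(int(fy * n), n - 1)
    bin_x = min(int(fx * n * bins) - tile_x * bins, bins - 1)
    bin_y = min(int(fy * n * bins) - tile_y * bins, bins - 1)
    return (tile_x, tile_y), (bin_x, bin_y)
```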

Edges

The following BD file parameters are used to configure CSVGraphBinner to generate tiles of graph edges. For additional information on BD file parameters, see the CSVBinner section of the Standard Tiling Jobs topic.

# Indicate that you want to create a tile set of edge elements
oculus.binning.graph.data=edges

# Specify the delimiter character. Defaults to tab (\t)
oculus.binning.parsing.separator=\t

# Specify the edge types you want to include: inter, intra or all
oculus.binning.graph.edges.type=all

# Column number that indicates whether an edge is inter-community (1)
# or intra-community (0)
oculus.binning.graph.edges.type.index=0

# Indicate whether to draw edges as straight lines (false) or clockwise 
# arcs (true)
oculus.binning.line.style.arcs=true

# Specify the column numbers of your source (x/y) and destination (x2/y2)
# coordinates
oculus.binning.parsing.x.index=2
oculus.binning.parsing.y.index=3
oculus.binning.parsing.x2.index=5
oculus.binning.parsing.y2.index=6

# Specify the column number of your edge weights and indicate how to aggregate
# them
oculus.binning.parsing.v.index=7
oculus.binning.parsing.v.fieldAggregation=add

# Use the line segment index scheme
oculus.binning.index.type=segment

# Map the x/y coordinates to the line segment index scheme used by the binner
oculus.binning.xField=x
oculus.binning.yField=y
oculus.binning.xField2=x2
oculus.binning.yField2=y2

# Set the edge weights as the binning value
oculus.binning.value.field=v

# Define the projection (x/y cross-plot) over which to draw the nodes and 
# manually specify the min/max bounds
oculus.binning.projection.type=areaofinterest
oculus.binning.projection.autobounds=false
oculus.binning.projection.minX=0.0
oculus.binning.projection.maxX=256.0
oculus.binning.projection.minY=0.0
oculus.binning.projection.maxY=256.0

Given these parameters, the application parses two endpoints representing the source and destination coordinates of each edge object, and draws a line using the edge weight parsed from column 7.

Edge Length Considerations

At each zoom level, any edges longer than 1024 bins (4 tiles) or shorter than 2 bins are excluded. This is done for visualization purposes (it is not possible to discern two discrete endpoints on screen for very short or very long lines), as well as to optimize processing time at high zoom levels. These thresholds can be modified using the following BD file parameters:

oculus.binning.line.max.bins=1024
oculus.binning.line.min.bins=2 
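The filtering rule these parameters control can be sketched as follows, measuring edge length in bins at a given zoom level (assuming 256 bins per tile over the 0-256 coordinate range; the library's exact arithmetic may differ):

```python
import math

# Sketch of the edge-length filter: at a given zoom level, an edge whose
# length in bins falls outside [min_bins, max_bins] is dropped.

def keep_edge(x1, y1, x2, y2, level,
              min_bins=2, max_bins=1024, extent=256.0, bins=256):
    bins_per_unit = (2 ** level) * bins / extent
    length = math.hypot(x2 - x1, y2 - y1) * bins_per_unit
    return min_bins <= length <= max_bins
```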

Alternatively, long line segments can be included in the tiling job by setting:

oculus.binning.line.drawends=true

In this case, long line segments are rendered within one tile length of each endpoint (since both endpoints will not be on screen simultaneously), and the line intensity fades to 0 as it gets farther away from an endpoint.

Hierarchical Tile Generation

By default, tiles for nodes and edges are generated using a single hierarchy level for all zoom levels.

For example, even though hierarchical information about a graph's nodes may be available if the data has been Louvain-clustered, it may be preferable to use only the raw nodes (hierarchy level 0) for tile generation. However, for a dense graph with many edges, it may be worthwhile to assign different hierarchy levels to different zoom ranges. This can be accomplished using the following parameters:

oculus.binning.hierarchical.clusters=true
oculus.binning.source.levels.0=hdfs://hadoop/graphdata/level_3
oculus.binning.source.levels.1=hdfs://hadoop/graphdata/level_2
oculus.binning.source.levels.2=hdfs://hadoop/graphdata/level_1
oculus.binning.source.levels.3=hdfs://hadoop/graphdata/level_0

oculus.binning.levels.0=0-3
oculus.binning.levels.1=4-6
oculus.binning.levels.2=7,8
oculus.binning.levels.3=9-11

These settings instruct tile generation to use:

  • Hierarchy level 3 for zoom levels 0-3
  • Hierarchy level 2 for zoom levels 4-6
  • Hierarchy level 1 for zoom levels 7-8
  • Hierarchy level 0 for zoom levels 9-11
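The mapping these settings express can be sketched as a simple lookup, parsing the "0-3" / "7,8" range forms used in the configuration:

```python
# Illustrative lookup: given parsed oculus.binning.levels.* range
# strings, find which source hierarchy index feeds a given zoom level.

def parse_zoom_spec(spec):
    zooms = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            zooms.update(range(int(lo), int(hi) + 1))
        else:
            zooms.add(int(part))
    return zooms

def source_for_zoom(level_specs, zoom):
    """level_specs: dict of source index -> zoom spec string."""
    for idx, spec in level_specs.items():
        if zoom in parse_zoom_spec(spec):
            return idx
    return None
```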

Hierarchical Tile Generation of Graph Edges

For clustered graph data, it can be helpful to visualize intra- and inter-community edges on separate layers (using the oculus.binning.graph.edges.type switch).

For example, if intra-community edges are tiled for hierarchy L for a given zoom, then it may be desirable to show all inter-community edges going between the parent communities of that hierarchy. To accomplish this, tile all edges for hierarchy L+1.

Sample configuration for intra-community edges for hierarchy level 2 at zoom levels 4-6:

oculus.binning.hierarchical.clusters=true
oculus.binning.source.levels.0=hdfs://hadoop/graphdata/level_2
oculus.binning.levels.0=4-6
oculus.binning.graph.data=edges
oculus.binning.graph.edges.type=intra

Sample configuration for inter-community edges between parent communities (i.e., tiling all edges for hierarchy level 3 (L+1) at zoom levels 4-6):

oculus.binning.hierarchical.clusters=true
oculus.binning.source.levels.0=hdfs://hadoop/graphdata/level_3
oculus.binning.levels.0=4-6
oculus.binning.graph.data=edges
oculus.binning.graph.edges.type=all

Graph Analytics

You can also create custom analytics of key communities at each hierarchy for client-side rendering to display labels, metadata and add interactive features as desired. Custom analytics can be created using the GraphAnalyticsBinner in com.oculusinfo.tilegen.graph.analytics.

This application uses many of the same BD file parameters for hierarchical tile generation as described in the previous section. However, instead of generating tiles of 256x256 bins, each tile contains a single GraphAnalyticsRecord object.

By default, each GraphAnalyticsRecord contains stats about the 25 largest graph communities in a given tile. These stats include:

  • Community ID
  • x/y coordinates and radius
  • Number of internal nodes
  • Degree
  • Metadata
  • Parent community coordinates and radius

A list of BD file parameters for parsing each of these stats is given below.

The following parameters specify the column number of key graph community/node attributes:

Argument Description
oculus.binning.graph.x.index x-axis coordinate
oculus.binning.graph.y.index y-axis coordinate
oculus.binning.graph.id.index Long ID
oculus.binning.graph.r.index Radius
oculus.binning.graph.numnodes.index Number of nodes
oculus.binning.graph.degree.index Degree
oculus.binning.graph.metadata.index Metadata
oculus.binning.graph.parentID.index Long ID of the parent community

The following parameters specify the column number of key parent community attributes:

Argument Description
oculus.binning.graph.parentID.index Long ID
oculus.binning.graph.parentR.index Radius
oculus.binning.graph.parentX.index x-axis coordinate
oculus.binning.graph.parentY.index y-axis coordinate

It is possible to save stats on the 10 highest-weighted edges incident on a given community using the following parameters. NOTE: If these parameters are excluded, no edge analytics information will be saved in each GraphAnalyticsRecord.

Argument Description
oculus.binning.graph.edge.srcID.index Source ID of each graph edge
oculus.binning.graph.edges.dstID.index Destination ID of each graph edge
oculus.binning.graph.edges.weight.index Weight of each graph edge. Defaults to 1 (unweighted).

The number of communities to store per record can be tuned using the oculus.binning.graph.maxcommunities parameter (set to 25 by default).

Similarly, the number of edges to store per graph community can be tuned using the oculus.binning.graph.maxedges parameter (set to 10 by default).
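The trimming these two parameters control amounts to keeping the top-N items by size, as in this hypothetical sketch (the dict shape is an assumption for illustration):

```python
# Hypothetical sketch of keeping only the N largest communities per
# record, mirroring the maxcommunities behaviour described above
# (25 largest by number of internal nodes by default).

def top_communities(communities, max_communities=25):
    """communities: list of dicts with a 'num_nodes' key."""
    return sorted(communities, key=lambda c: c["num_nodes"],
                  reverse=True)[:max_communities]
```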