On our work with Twitter networks, we relied on the Webgraph framework developed by Sebastiano Vigna, Paolo Boldi, and other coworkers, to analyse and process large graphs. We gather here some notes on how we used it.
Let us assume that webgraph-3.0.9.jar and all required dependencies are located in ./lib with respect to current directory.
Given a graph in ASCIIGraph format, we can compress it from stdin as follows:
cat example.graph-txt | java -cp lib/'*' it.unimi.dsi.webgraph.BVGraph -g ASCIIGraph -1 null examplebv
Note that in our case the graph was piped directly from another script, cat example.graph-txt is just for illustration. The file example.graph-txt must contain a graph in ASCIIGraph format:
14 1 2 3 0 2 3 0 1 3 0 1 2 4 3 5 9 4 6 7 8 5 7 8 5 6 8 5 6 7 4 10 11 12 13 9 11 12 13 9 10 12 13 9 10 11 13 9 10 11 12
After compression, we will have three more files:
examplebv.graph examplebv.properties examplebv.offsets
In our work we also compressed more the resulting graph using the Layered Label Propagation method for relabelling vertices.
Before computing network statistics and proceeding with analysis, we computed connected components for each graph under analysis. Although Webgraph authors provide a method to compute the connected components based on parallel BFSs, we have implemented the classical alternative approach based on disjoint sets. You can find our implementation bellow in files ConnectedComponents.java and DisjointSet.java. You can also use the file webgraph-3.2.1-cc.patch to patch Webgraph source code. Note that our implementation will work both for undirected and directed graphs.
Assuming that we compiled above files in current directory, we can proceed as follows to compute connected components:
java -cp lib/'*':. ConnectedComponents -s examplebv
This command will generate files examplebv.scc and examplebv.sccsizes.
Then we can compute network statistics as follows:
java -cp lib/'*' it.unimi.dsi.webgraph.Stats -s examplebv
This command will generate files below with relevant statistics and distributions:
examplebv.outdegrees examplebv.indegrees examplebv.outdegree examplebv.indegree examplebv.sccdistr examplebv.stats
We computed also the neighbourhood function using the approximation implemented in Webgraph, namely the HyperANF method. Since this is an approximation, we computed it several times and we used a Jackknife based estimate as proposed by Webgraph authors (see file ApproximateNeighbourhoodFunctionStats.java below).
java -cp lib/'*' it.unimi.dsi.webgraph.Transform transpose examplebv examplebvtrans java -cp lib/'*' it.unimi.dsi.webgraph.algo.HyperApproximateNeighbourhoodFunction -l 12 examplebv examplebvtrans > examplebv.anf.1 java -cp lib/'*' it.unimi.dsi.webgraph.algo.HyperApproximateNeighbourhoodFunction -l 12 examplebv examplebvtrans > examplebv.anf.2 java -cp lib/'*' it.unimi.dsi.webgraph.algo.HyperApproximateNeighbourhoodFunction -l 12 examplebv examplebvtrans > examplebv.anf.3 java -cp lib/'*' it.unimi.dsi.webgraph.algo.HyperApproximateNeighbourhoodFunction -l 12 examplebv examplebvtrans > examplebv.anf.4 java -cp lib/'*' it.unimi.dsi.webgraph.algo.HyperApproximateNeighbourhoodFunction -l 12 examplebv examplebvtrans > examplebv.anf.5 java -cp lib/'*':. ApproximateNeighbourhoodFunctionStats examplebv.anf.*
The last command will produce several statistics (with standard error computed through Jackknife), including the average distance, harmonic diameter, effective diameter, SPID, etc. Note that HyperANF was extended and renamed to HyperBall in more recent versions of Webgraph.More details at http://law.di.unimi.it/tutorial.php and Webgraph API docs.
Name Size ApproximateNeighbourhoodFunctionStats.java 3.1K ConnectedComponents.java 7.0K DisjointSet.java 1.7K VertexBetweenness.java 5.8K webgraph-3.2.1-cc.patch 8.9K