On our work with Twitter networks, we relied on the Webgraph framework developed by Sebastiano Vigna, Paolo Boldi, and other coworkers, to analyse and process large graphs. We gather here some notes on how we used it.

Let us assume that `webgraph-3.0.9.jar` and all required dependencies are located in `./lib` with respect to current directory.

Given a graph in ASCIIGraph format, we can compress it from `stdin` as follows:

cat example.graph-txt | java -cp lib/'*' it.unimi.dsi.webgraph.BVGraph -g ASCIIGraph -1 null examplebv

Note that in our case the graph was piped directly from another script, `cat example.graph-txt` is just for illustration. The file `example.graph-txt` must contain a graph in ASCIIGraph format:

14 1 2 3 0 2 3 0 1 3 0 1 2 4 3 5 9 4 6 7 8 5 7 8 5 6 8 5 6 7 4 10 11 12 13 9 11 12 13 9 10 12 13 9 10 11 13 9 10 11 12

After compression, we will have three more files:

examplebv.graph examplebv.properties examplebv.offsets

In our work we also compressed more the resulting graph using the *Layered Label Propagation* method for relabelling vertices.

Before computing network statistics and proceeding with analysis, we computed connected components for each graph under analysis. Although Webgraph authors provide a method to compute the connected components based on parallel BFSs, we have implemented the classical alternative approach based on disjoint sets. You can find our implementation bellow in files `ConnectedComponents.java` and `DisjointSet.java`. You can also use the file `webgraph-3.2.1-cc.patch` to patch Webgraph source code. Note that our implementation will work both for undirected and directed graphs.

Assuming that we compiled above files in current directory, we can proceed as follows to compute connected components:

java -cp lib/'*':. ConnectedComponents -s examplebv

This command will generate files `examplebv.scc` and `examplebv.sccsizes`.

Then we can compute network statistics as follows:

java -cp lib/'*' it.unimi.dsi.webgraph.Stats -s examplebv

This command will generate files below with relevant statistics and distributions:

examplebv.outdegrees examplebv.indegrees examplebv.outdegree examplebv.indegree examplebv.sccdistr examplebv.stats

We computed also the neighbourhood function using the approximation implemented in Webgraph, namely the HyperANF method. Since this is an approximation, we computed it several times and we used a Jackknife based estimate as proposed by Webgraph authors (see file `ApproximateNeighbourhoodFunctionStats.java` below).

java -cp lib/'*' it.unimi.dsi.webgraph.Transform transpose examplebv examplebvtrans java -cp lib/'*' it.unimi.dsi.webgraph.algo.HyperApproximateNeighbourhoodFunction -l 12 examplebv examplebvtrans > examplebv.anf.1 java -cp lib/'*' it.unimi.dsi.webgraph.algo.HyperApproximateNeighbourhoodFunction -l 12 examplebv examplebvtrans > examplebv.anf.2 java -cp lib/'*' it.unimi.dsi.webgraph.algo.HyperApproximateNeighbourhoodFunction -l 12 examplebv examplebvtrans > examplebv.anf.3 java -cp lib/'*' it.unimi.dsi.webgraph.algo.HyperApproximateNeighbourhoodFunction -l 12 examplebv examplebvtrans > examplebv.anf.4 java -cp lib/'*' it.unimi.dsi.webgraph.algo.HyperApproximateNeighbourhoodFunction -l 12 examplebv examplebvtrans > examplebv.anf.5 java -cp lib/'*':. ApproximateNeighbourhoodFunctionStats examplebv.anf.*

The last command will produce several statistics (with standard error computed through Jackknife), including the average distance, harmonic diameter, effective diameter, SPID, etc. Note that HyperANF was extended and renamed to HyperBall in more recent versions of Webgraph.

More details at http://law.di.unimi.it/tutorial.php and Webgraph API docs.Name Size ApproximateNeighbourhoodFunctionStats.java 3.1K ConnectedComponents.java 7.0K DisjointSet.java 1.7K VertexBetweenness.java 5.8K webgraph-3.2.1-cc.patch 8.9K