Slide View : Parallel Computer Architecture and Programming : 15-418/618 Fall 2016


Holladay

To compare GraphLab to GraphX: GraphX claims to be faster on the PageRank use case (http://spark.apache.org/images/graphx-perf-comparison.png). When distributing the graph, GraphX cuts along vertices rather than edges. That is to say, each edge lives on exactly one machine, while a vertex may be copied onto multiple machines. The documentation argues that this can reduce communication and storage overhead. The library has a default way to partition the vertices and edges, although the programmer has the option to specify a different partitioning strategy. As another design decision, because most graphs have more edges than vertices, the vertex metadata is kept with the edges.
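To make the vertex-cut idea concrete, here is a rough Python sketch. It is not GraphX code: the placement rule (`src % num_parts`) and the names `vertex_cut` and `replication_factor` are hypothetical, chosen only for illustration. It shows that each edge is stored exactly once, while a vertex gets a copy on every partition that holds one of its edges.

```python
from collections import defaultdict


def vertex_cut(edges, num_parts):
    """Place every edge on exactly one partition and record vertex copies."""
    edge_parts = defaultdict(list)  # partition id -> edges stored there
    replicas = defaultdict(set)     # vertex id -> partitions holding a copy
    for src, dst in edges:
        # Hypothetical placement rule (an assumption, not GraphX's default):
        # pick the partition from the source vertex id.
        p = src % num_parts
        edge_parts[p].append((src, dst))
        # Both endpoints need a local copy wherever the edge lives.
        replicas[src].add(p)
        replicas[dst].add(p)
    return edge_parts, replicas


def replication_factor(replicas):
    """Average number of partitions that hold a copy of each vertex."""
    return sum(len(parts) for parts in replicas.values()) / len(replicas)


edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0)]
edge_parts, replicas = vertex_cut(edges, num_parts=2)

for p in sorted(edge_parts):
    print("partition", p, "edges:", edge_parts[p])
print("replication factor:", replication_factor(replicas))
```

The replication factor is the storage cost of the scheme: the closer it is to 1, the fewer duplicate vertices are kept, so the quality of the partitioning strategy matters a lot.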

cloudhary

Hey @Holladay, did you manage to find anything suggesting why graph operations might be faster on GraphX as a result of this design decision? As you mentioned, having more edges than vertices means that a lot of duplicate data (vertices) will likely be stored across machines, so they must have tested their claim empirically.

Holladay

A little bit @cloudhary. I found a paper by the people from GraphX (so they might be biased). Section 6 has some comparisons: https://arxiv.org/pdf/1402.2394v1.pdf