‘Develop’ is the important word here, as the research covers honing algorithms on synthetic data sets rather than on real big data.
“Graphs are widely used to represent and analyse real-world objects in many domains such as social networks, business intelligence, biology, and neuroscience,” said Kaist. “When developing and testing algorithms for a large-scale graph, a synthetic graph is usually used instead of a real graph. This is because sharing and utilising large-scale real graphs is very limited due to their being proprietary or being practically impossible to collect.”
Conventionally, according to Kaist, developing and testing graph algorithms is done via the following two-step approach:
Step one generates a synthetic graph and stores it on disks. The graph is usually generated by either parameter-based generation or graph up-scaling – the former extracts a small number of parameters that can capture some properties of a given real graph and generates the synthetic graph with the parameters, the latter up-scales a given real graph to a larger one so as to preserve the properties of the original real graph as much as possible.
Step two loads the stored graph into the main memory of a graph processing engine, such as Apache GraphX, and executes a given graph algorithm on the engine. “Since the graph is too large to fit in the main memory of a single computer the graph engine typically runs on a cluster of several tens or hundreds of computers,” said Kaist, “therefore the cost of the conventional two-step approach is high.”
The Korean team does not generate and store a large-scale synthetic graph.
Instead, it loads the initial small real graph into main memory. Then, using a technique dubbed T-GPS (trillion-scale graph processing simulation), the graph algorithm confronted with the small real graph as if the large-scale synthetic graph that should be generated from the real graph exists in main memory, said Kaist, adding that after the algorithm is done, T-GPS returns the same result as the conventional two-step approach.
“The key idea of T-GPS is generating only the part of the synthetic graph that the algorithm needs to access on the fly and modifying the graph processing engine to recognize the part generated on the fly as the part of the synthetic graph actually generated,” said Kaist.
T-GPS processed a graph of one trillion edges on one computer, while the conventional two-step approach needed a cluster of eleven computers of the same specification to process of a graph of one billion edges. Not needing network access, T-GPS was up to 43 times faster than the conventional approach which has a significant communication overhead.
The work was presented at the IEEE ICDE 2021 conference as ‘Trillion-scale Graph Processing Simulation based on Top-Down Graph Upscaling’.