In a distributed graph (where the data are spread across multiple machines), the default execution plan is as follows:
One machine will be selected as the execution hub, regardless of the number or distribution of starting point vertices.
All the computation work for the query will take place at the execution hub. The vertex and edge data from other machines will be copied to the hub machine for processing.
TigerGraph Enterprise Edition offers a Distributed Query mode which provides a more optimized execution plan for queries which are likely to start at several machines and continue their traversal across several machines.
A set of machines representing one full copy of the entire graph will participate in the query. If the cluster has a replication factor of 2 (so there are two copies of each piece of data), then half the machines will participate.
The query executes in parallel across all the machines which have source vertex data for a given hop in the query. That is, each SELECT statement defines a 1-hop traversal from a set of source vertices to a set of target vertices. Unlike the default mode where all the needed data are brought to one machine, in Distributed Query mode, the computation moves across the cluster, following the traversal pattern of the query.
The output results will be gathered at one machine.
To invoke Distributed Query mode, simply insert the keyword "DISTRIBUTED" before "QUERY" in a query definition:
The basic trade-off between distributed query mode and default mode is greater parallelism for the given query vs. using more system resources, which reduces the potential for concurrency with other operations. Each machine has a certain number of workers available for concurrent execution of queries. A query in default mode uses only one worker out of the whole system. (This one worker will have multiple threads for processing edge traversals in parallel.) However, a query in distributed mode uses one query worker per machine. This means this query can run faster, but it leaves fewer workers for other queries running concurrently.
We suggest the following guidelines for deciding whether to use default mode or distributed mode.
Queries with one or a few starting point vertices and which take only a few hops → default mode is better.
Queries which start at a very large set of starting point vertices and which traverse many hops → distributed mode is better. For example, algorithms which either compute a value for every vertex or one value for the entire graph should use Distributed Mode. This includes PageRank, Centrality, and Connected Component algorithms.
For applications where the same query (same logic but with different input parameters) will be run many times in production, the application designer can simply try both modes during development and chose the one which works better for their use case and data.
Currently, Distributed Query mode cannot be used for all queries. Please note the limitations carefully. In most cases, the GSQL parser and compiler will report an error if you try to write a Distributed Query using an unsupported feature.
The following GSQL features are not supported in Distributed Query mode:
(1) Items in the Supported column are listed only for clarity, so you can compare to the Unsupported column. If a feature which is supported in non-distributed queries is not mentioned in either column, then it is supported in Distributed Query mode .
(2) If the query contains "LIMIT N", and if the number of GPEs working on this query is G, then the output size will be N +/- (G-1). In a conventional cluster configuration, there is one GPE per machine. For example, if N=10 and the graph is distributed across 4 machines, then the output size will be between 7 and 13, inclusive.