Distributed Query Mode

Distributed Query Mode

In a distributed graph (where the data are spread across multiple machines), the default execution plan is as follows:

  • One machine will be selected as the execution hub, regardless of the number or distribution of starting point vertices.

  • All the computation work for the query will take place at the execution hub. The vertex and edge data from other machines will be copied to the hub machine for processing.

TigerGraph Enterprise Edition offers a Distributed Query mode which provides a more optimized execution plan for queries which are likely to start at several machines and continue their traversal across several machines.

  • A set of machines representing one full copy of the entire graph will participate in the query. If the cluster has a replication factor of 2 (so there are two copies of each piece of data), then half the machines will participate.

  • The query executes in parallel across all the machines which have source vertex data for a given hop in the query. That is, each SELECT statement defines a 1-hop traversal from a set of source vertices to a set of target vertices. Unlike the default mode where all the needed data are brought to one machine, in Distributed Query mode, the computation moves across the cluster, following the traversal pattern of the query.

  • The output results will be gathered at one machine.

To invoke Distributed Query mode, simply insert the keyword "DISTRIBUTED" before "QUERY" in a query definition:

createQuery := CREATE [OR REPLACE][DISTRIBUTED] QUERY queryName "(" [parameterList] ")"
FOR GRAPH graphName
[RETURNS "(" baseType | accumType ")"]
[API "(" stringLiteral ")"]
[SYNTAX syntaxName]
"{" queryBody "}"

Guidelines for Selecting Distributed Query Mode

The basic trade-off between distributed query mode and default mode is greater parallelism for the given query vs. using more system resources, which reduces the potential for concurrency with other operations. Each machine has a certain number of workers available for concurrent execution of queries. A query in default mode uses only one worker out of the whole system. (This one worker will have multiple threads for processing edge traversals in parallel.) However, a query in distributed mode uses one query worker per machine. This means this query can run faster, but it leaves fewer workers for other queries running concurrently.

We suggest the following guidelines for deciding whether to use default mode or distributed mode.

  1. Queries with one or a few starting point vertices and which take only a few hops → default mode is better.

  2. Queries which start at a very large set of starting point vertices and which traverse many hops → distributed mode is better. For example, algorithms which either compute a value for every vertex or one value for the entire graph should use Distributed Mode. This includes PageRank, Centrality, and Connected Component algorithms.

  3. For applications where the same query (same logic but with different input parameters) will be run many times in production, the application designer can simply try both modes during development and chose the one which works better for their use case and data.

Supported and Unsupported Features

Beginning in v3.0, Distributed Mode supports nearly all the features of default non-distributed mode.

Support was added for many features in TigerGraph 3.0. The table below has changed significantly between v2.6 and v3.0. In addition, 3.0 has more functionality for both Distributed and Non-distributed modes.

The following GSQL features are not supported in Distributed Query mode:

Feature

Not Supported

Supported as of v3.0 [1]

General

User-defined Exceptions

  • Data update to the graph

  • Access to target vertex's values in ACCUM

  • Query calling a distributed query

Statement Types

LOADACCUM

FOREACH, WHILE, UPDATE, INSERT, DELETE

SELECT clauses

exact count for LIMIT clause [2]

SAMPLE clause

Data types

ArrayAccum

Other accumulators, SET<> parameter

LIST, SET, BAG, JSONOBJECT, JSONARRAY

Operations and Operators

Any data update to the graph, including assignment statements to vertex attributes

Vertex and edge functions

-

ALL

Accumulator and collection functions

reallocate()

all other functions

Other functions

selectVertex(),

COALESCE(), EVALUATE()

to_vertex(), to_vertex_set()

[1] The Support column is not intended to be exhaustive. Items in the Supported column are listed clarity only, so you can compare to the Unsupported column. If a feature which is supported in non-distributed queries is not mentioned in either column, then it is supported in Distributed Query mode.

[2] If the query contains "LIMIT N", and if the number of GPEs working on this query is G, then the output size will be N +/- (G-1). In a conventional cluster configuration, there is one GPE per machine. For example, if N=10 and the graph is distributed across 4 machines, then the output size will be between 7 and 13, inclusive.