
Part 2 - Querying

This work is licensed under a Creative Commons Attribution 4.0 International License.

Distributed Query Mode

In a distributed graph (where the data are spread across multiple machines), the default execution plan is as follows:

One machine will be selected as the execution hub, regardless of the number or distribution of starting point vertices.
All the computation work for the query will take place at the execution hub. The vertex and edge data from other machines will be copied to the hub machine for processing.

TigerGraph Enterprise Edition offers a Distributed Query mode, which provides a better-optimized execution plan for queries that are likely to start on several machines and continue their traversal across several machines.

A set of machines representing one full copy of the entire graph will participate in the query. If the cluster has a replication factor of 2 (so there are two copies of each piece of data), then half the machines will participate.
The query executes in parallel across all the machines which have source vertex data for a given hop in the query. A hop here corresponds to one SELECT statement: each SELECT statement defines a 1-hop traversal from a set of source vertices to a set of target vertices. Unlike the default mode, where all the needed data are brought to one machine, in Distributed Query mode the computation moves across the cluster, following the traversal pattern of the query.
The output results will be gathered at one machine.

To invoke Distributed Query mode, simply insert the keyword "DISTRIBUTED" before "QUERY" in a query definition:

createQuery := CREATE [OR REPLACE] [DISTRIBUTED] QUERY queryName "(" [parameterList] ")"
               FOR GRAPH graphName
               [RETURNS "("  baseType | accumType ")"]
               [API "(" stringLiteral ")"]
               [SYNTAX syntaxName]
               "{" queryBody "}"

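As an illustration of the syntax above, here is a minimal sketch of a two-hop distributed query. The graph name Social, the vertex type Person, the directed edge type Follows, and the query and accumulator names are all hypothetical; substitute the names from your own schema. Each SELECT statement is one hop, so the second hop runs on whichever machines hold the vertices reached by the first.

```gsql
// Hypothetical schema: vertex type Person, directed edge type Follows, graph Social.
CREATE DISTRIBUTED QUERY countTwoHopPaths() FOR GRAPH Social {
  SumAccum<INT> @@pathCount;   // global accumulator; its final value is gathered at one machine
  SumAccum<INT> @inPaths;      // per-vertex count of 1-hop paths arriving at the vertex

  Start = {Person.*};          // all Person vertices, wherever they are stored

  // Hop 1: every machine holding Start vertices works in parallel.
  Hop1 = SELECT t
         FROM Start:s -(Follows:e)-> Person:t
         ACCUM t.@inPaths += 1;

  // Hop 2: the computation moves to the machines holding the Hop1 vertices.
  Hop2 = SELECT t
         FROM Hop1:s -(Follows:e)-> Person:t
         ACCUM @@pathCount += s.@inPaths;

  PRINT @@pathCount;
}
```

Once defined, the query would be installed and run like any other, for example with INSTALL QUERY countTwoHopPaths followed by RUN QUERY countTwoHopPaths(). Removing the DISTRIBUTED keyword gives the same logic in default mode.
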
Guidelines for Selecting Distributed Query Mode

The basic trade-off between distributed query mode and default mode is greater parallelism for the given query vs. greater use of system resources, which reduces the potential for concurrency with other operations. Each machine has a certain number of workers available for concurrent execution of queries. A query in default mode uses only one worker in the whole system. (This one worker has multiple threads for processing edge traversals in parallel.) A query in distributed mode, by contrast, uses one query worker per machine. The query can therefore run faster, but it leaves fewer workers for other queries running concurrently.

We suggest the following guidelines for deciding whether to use default mode or distributed mode.

Queries with one or a few starting point vertices and which take only a few hops → default mode is better.
Queries which start at a very large set of starting point vertices and which traverse many hops → distributed mode is better. For example, algorithms which compute either a value for every vertex or one value for the entire graph should use distributed mode. This includes PageRank, Centrality, and Connected Component algorithms.
For applications where the same query (same logic but with different input parameters) will be run many times in production, the application designer can simply try both modes during development and choose the one which works better for their use case and data.
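
The two ends of these guidelines can be sketched as a pair of queries. This is only an illustration under assumed names: the graph Social, the vertex type Person, the directed edge type Follows, and both query names are hypothetical. The first query starts from a single vertex and takes one hop, so default mode is the natural choice; the second computes a value for every vertex, which is the kind of workload the guidelines steer toward distributed mode.

```gsql
// One starting vertex, one hop: default (non-distributed) mode.
CREATE QUERY followeesOf(VERTEX<Person> p) FOR GRAPH Social {
  Start = {p};
  Followees = SELECT t
              FROM Start:s -(Follows:e)-> Person:t;
  PRINT Followees;
}

// One value per vertex, starting from every vertex: distributed mode.
CREATE DISTRIBUTED QUERY followeeCounts() FOR GRAPH Social {
  SumAccum<INT> @numFollowees;   // per-vertex out-degree over Follows edges

  Start = {Person.*};
  // Active contains only the Person vertices with at least one outgoing Follows edge.
  Active = SELECT s
           FROM Start:s -(Follows:e)-> Person:t
           ACCUM s.@numFollowees += 1;
  PRINT Active[Active.@numFollowees];
}
```

To compare modes for a query that falls between these extremes, the same body can be defined once with and once without the DISTRIBUTED keyword (under two query names) and timed on representative data.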

Supported and Unsupported Features

Beginning in v3.0, Distributed Mode supports nearly all the features of default non-distributed mode.

Support for many features was added in TigerGraph 3.0, so the table below has changed significantly between v2.6 and v3.0. In addition, v3.0 has more functionality in both distributed and non-distributed modes.

The following GSQL features are not supported in Distributed Query mode:

[1] The Supported column is not intended to be exhaustive. Items in the Supported column are listed for clarity only, so you can compare them to the Unsupported column. If a feature which is supported in non-distributed queries is not mentioned in either column, then it is supported in Distributed Query mode.

[2] If the query contains "LIMIT N", and if the number of GPEs working on this query is G, then the output size will be N +/- (G-1). In a conventional cluster configuration, there is one GPE per machine. For example, if N=10 and the graph is distributed across 4 machines, then the output size will be between 7 and 13, inclusive.
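
A sketch of a query that this footnote could affect follows. It assumes the footnote applies to a plain LIMIT clause written as shown here; the graph name Social, the vertex type Person, and the query name are hypothetical. With N = 10 and G = 4, the bound N +/- (G-1) gives 10 - 3 = 7 as the smallest and 10 + 3 = 13 as the largest possible output size.

```gsql
// Hypothetical schema: vertex type Person, graph Social.
CREATE DISTRIBUTED QUERY samplePeople() FOR GRAPH Social {
  Start = {Person.*};

  // LIMIT 10 with the data spread over 4 GPEs: per the footnote, the result
  // may hold anywhere from 7 to 13 vertices rather than exactly 10.
  Sample = SELECT s
           FROM Start:s
           LIMIT 10;

  PRINT Sample;
}
```

If an exact result size matters, defining the same query without the DISTRIBUTED keyword avoids the per-GPE variation described in the footnote.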

Appendix

Interpreted GSQL Limitations

GSQL features not yet supported in Interpreted Mode

Currently, interpreted mode cannot be combined with the distributed query execution model; that is, a query defined with CREATE DISTRIBUTED QUERY cannot be run in interpreted mode. However, interpreted queries can still run on a distributed graph with the regular, non-distributed execution model.
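
For instance, an anonymous interpreted query like the sketch below would run on a distributed graph using the regular, non-distributed execution model. The graph name Social and the vertex type Person are hypothetical, and the body is kept to basic constructs on the assumption that they fall outside the limitations that apply to interpreted mode.

```gsql
// Runs immediately, without CREATE or INSTALL; always uses the non-distributed model.
INTERPRET QUERY () FOR GRAPH Social {
  SumAccum<INT> @@numPeople;

  Start = {Person.*};
  Start = SELECT s
          FROM Start:s
          ACCUM @@numPeople += 1;

  PRINT @@numPeople;
}
```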

The table below lists additional limitations. These limitations are expected to be temporary. We are continuing to expand the capabilities of Interpreted Mode.

The "Not Supported" column is intended to be comprehensive, but the "Supported" column is not. Rather, it gives examples to show the contrast with what is not supported.