Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
This work is licensed under a Creative Commons Attribution 4.0 International License.
Table of Contents
In a distributed graph (where the data are spread across multiple machines), the default execution plan is as follows:
One machine will be selected as the execution hub, regardless of the number or distribution of starting point vertices.
All the computation work for the query will take place at the execution hub. The vertex and edge data from other machines will be copied to the hub machine for processing.
TigerGraph Enterprise Edition offers a Distributed Query mode which provides a more optimized execution plan for queries which are likely to start at several machines and continue their traversal across several machines.
A set of machines representing one full copy of the entire graph will participate in the query. If the cluster has a replication factor of 2 (so there are two copies of each piece of data), then half the machines will participate.
The query executes in parallel across all the machines which have source vertex data for a given hop in the query. That is, each SELECT statement defines a 1-hop traversal from a set of source vertices to a set of target vertices. Unlike the default mode where all the needed data are brought to one machine, in Distributed Query mode, the computation moves across the cluster, following the traversal pattern of the query.
The output results will be gathered at one machine.
To invoke Distributed Query mode, simply insert the keyword "DISTRIBUTED" before "QUERY" in a query definition:
The basic trade-off between distributed query mode and default mode is greater parallelism for the given query vs. using more system resources, which reduces the potential for concurrency with other operations. Each machine has a certain number of workers available for concurrent execution of queries. A query in default mode uses only one worker out of the whole system. (This one worker will have multiple threads for processing edge traversals in parallel.) However, a query in distributed mode uses one query worker per machine. This means this query can run faster, but it leaves fewer workers for other queries running concurrently.
We suggest the following guidelines for deciding whether to use default mode or distributed mode.
Queries with one or a few starting point vertices and which take only a few hops → default mode is better.
Queries which start at a very large set of starting point vertices and which traverse many hops → distributed mode is better. For example, algorithms which either compute a value for every vertex or one value for the entire graph should use Distributed Mode. This includes PageRank, Centrality, and Connected Component algorithms.
For applications where the same query (same logic but with different input parameters) will be run many times in production, the application designer can simply try both modes during development and chose the one which works better for their use case and data.
Beginning in v3.0, Distributed Mode supports nearly all the features of default non-distributed mode.
Support was added for many features in TigerGraph 3.0. The table below has changed significantly between v2.6 and v3.0. In addition, 3.0 has more functionality for both Distributed and Non-distributed modes.
The following GSQL features are not supported in Distributed Query mode:
[1] The Support column is not intended to be exhaustive. Items in the Supported column are listed clarity only, so you can compare to the Unsupported column. If a feature which is supported in non-distributed queries is not mentioned in either column, then it is supported in Distributed Query mode.
[2] If the query contains "LIMIT N", and if the number of GPEs working on this query is G, then the output size will be N +/- (G-1). In a conventional cluster configuration, there is one GPE per machine. For example, if N=10 and the graph is distributed across 4 machines, then the output size will be between 7 and 13, inclusive.
Feature
Not Supported
Supported as of v3.0 [1]
General
User-defined Exceptions
Data update to the graph
Access to target vertex's values in ACCUM
Query calling a distributed query
Statement Types
LOADACCUM
FOREACH, WHILE, UPDATE, INSERT, DELETE
SELECT clauses
exact count for LIMIT clause [2]
Data types
ArrayAccum
Other accumulators, SET<> parameter
Operations and Operators
Any data update to the graph, including assignment statements to vertex attributes
Vertex and edge functions
-
Accumulator and collection functions
reallocate()
Other functions
selectVertex(),
COALESCE(), EVALUATE()
SAMPLE clause
LIST, SET, BAG, JSONOBJECT, JSONARRAY
ALL
all other functions
to_vertex(), to_vertex_set()
Currently, interpreted mode cannot be combined with distributed query execution model, i.e., a query defined with CREATE DISTRIBUTED QUERY cannot be run in interpreted mode. However, interpreted queries can still run on a distributed graph with a regular, non-distributed execution model.
The table below lists additional limitations. These limitations are expected to be temporary. We are continuing to expand the capabilities of Interpreted Mode.
The "Not Supported" column is intended to be comprehensive, but the "Supported" column is not. Rather it gives examples to show the contrast with what is not supported.
Category
Supported (highlights, not a full list)
Not Supported
Modes
Queries in regular (non-distributed) mode
Distributed mode
Statement types
TYPEDEF tuple
SELECT ... FROM <edge_set>
SELECT ... FROM <vertex_set>
CASE ... WHEN ... THEN
IF ... ELSE ... THEN
WHILE
BREAK, CONTINUE
FOREACH, FOREACH... RANGE+
Assignment for accumulators or local variables
UPDATE
INSERT
DELETE
LOG
Exceptions:
RAISE
TRY
SELECT block clauses
ACCUM
POST-ACCUM
WHERE
HAVING
ORDER BY (but output will not be sorted)
LIMIT
SAMPLE clause in SELECT
Attributes and Accumulators
Global and local accumulators
Global and local variables
Most accumulator types
Attributes cannot be accessed outside of the ACCUM or POST-ACCUM clauses
ArrayAccum
Previous value of accumulator with ' operator, e.g., @@acc'
Functions and Operators
Math operators
Comparison operators
Boolean operators
to_vertex(), to_vertex_set()
Math functions
Most string functions: to_string(), float_to_int(), str_to_int(), lower(), upper()
IN, NOT IN
LIKE
BETWEEN ... AND
IS NULL, IS NOT NULL (checking whether parameters are absent/present)
built-in functions
Set functions: COUNT(), MAX(), .FILTER(), etc.
isDirected()
trim()
neighbor(), neighborAttribute()
COALESCE()
evaluate()
selectVertex()
LOADACCUM()
User-Defined Functions
Data types
Explicit lists, e.g., [1, 3, 2]
JSONOBJECT, JSONARRAY
STRING COMPRESS as accumulator type
Built-in Constants GSQL_INT_MAX, GSQL_INT_MIN, GSQL_UINT_MAX
Explicit sets, e.g., (1, 3, 2)
BAG type parameters
Output options
JSON format V1
PRINT ... WHERE
PRINT TO_CSV
FILE objects