# SELECT Statement

This section discusses the SELECT statement in depth and covers the following EBNF syntax:

The SELECT block selects a set of vertices FROM a vertex set or edge set. A number of optional clauses define or refine the selection by constraining the vertex or edge set or the result set. There are two types of SELECT: vertex-induced and edge-induced. Both produce a vertex set, known as the result set.

# SELECT Statement Data Flow

The SELECT statement is an assignment statement with a SELECT block on the right-hand side. The SELECT block has many possible clauses, which fit together in a logical flow. Overall, the SELECT block starts from a source set of vertices and returns a result set that is either a subset of the source vertices or a subset of their neighboring vertices. Along the way, computations can be performed on the selected vertices and edges. The figure below graphically depicts the overall SELECT data flow. While the ACCUM and POST-ACCUM clauses do not directly affect which vertices are included in the result set, they do affect the data (accumulators) attached to those vertices.

# FROM Clause: Vertex and Edge Sets

There are two options for the FROM clause: vertexSet or edgeSet. If vertexSet is used, then the query is a vertex-induced selection. If edgeSet is used, then the query is an edge-induced selection.

## Vertex-Induced Selection

A vertex-induced selection takes an input set of vertices and produces a result set, which is a subset of the input set. The FROM argument has the form Source:s, where Source is a vertex set. Source is optionally followed by :s, where s is a vertex alias that represents any vertex in the set Source.

This statement can be interpreted as "Select all vertices s from the vertex set Source." The result is a vertex set.

Below is a simple example of a vertex-induced selection.

## Edge-Induced Selection

Multiple types can also be specified by using the delimiter "|". Additionally, the keywords "_" or "ANY" can be used to denote a set that can include any vertex or edge type.

An edge-induced selection starts from a set of vertices, defines a set of edges incident to that set, and produces a result set of vertices that are also incident to those edges. Typically, this is used to traverse from a set of source vertices over a specific edge type to a set of target vertices. The FROM clause argument (defined formally by the EBNF edgeSet rule) is structured as an edge template: Source:s-(eType:e)->tType:t. The edge template has three parts: the source vertex set (Source), the edge type or types (eType), and the target vertex type or types (tType). Both s and t are vertex aliases, and e is the edge alias. The template defines a pattern s → e → t, from source vertex s, across eType edges, to tType target vertices. The edge alias e represents any edge that fits the complete pattern. Likewise, s and t are aliases that represent any source vertices and target vertices, respectively, that fit the complete pattern.

Either the source vertex set (s) or the target vertex set (t) can be used as the SELECT argument, which determines the result of the SELECT statement. Note the small difference in the two SELECT statements below.

resultSet1 is based on the source end of the selected edges. resultSet2 is based on the target end of the selected edges. However, resultSet1 is NOT identical to the Source vertex set. It contains only those members of Source which connect to an eType edge and then to a tType vertex. Other clauses (presented later in this "SELECT Statement" section) can do additional filtering of the Source set.
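The difference between selecting the source end and the target end can be modeled outside GSQL. The following is a minimal Python sketch, not GSQL and not the engine's implementation; the toy graph, the `vertex_type` table, and the `edge_induced` helper are all hypothetical names introduced for illustration.

```python
# Toy graph (hypothetical data): each edge is (source_id, edge_type, target_id).
vertex_type = {"a": "person", "b": "person", "c": "person", "p1": "post", "p2": "post"}
edges = [
    ("a", "posted", "p1"),
    ("a", "liked", "p2"),
    ("b", "liked", "p1"),
    ("b", "friend", "c"),
]


def edge_induced(source, e_types, t_types):
    """Return the edges that fit the template Source:s-(eType:e)->tType:t."""
    return [
        (s, e, t)
        for (s, e, t) in edges
        if s in source and e in e_types and vertex_type[t] in t_types
    ]


source = {"a", "b", "c"}
matched = edge_induced(source, {"liked"}, {"post"})

# Selecting s keeps the source end of each matched edge.
result_set_1 = {s for (s, _, _) in matched}
# Selecting t keeps the target end of each matched edge.
result_set_2 = {t for (_, _, t) in matched}

print(sorted(result_set_1))  # c is in source but has no matching edge
print(sorted(result_set_2))
```

In this sketch, vertex `c` belongs to the source set but does not appear in `result_set_1`, because no edge starting at `c` fits the complete pattern.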

## Edge Set and Target Vertex Set Options

The FROM clause chooses edges and target vertices by type. The EBNF symbol vertexEdgeType describes the options:

Note that eType and tType are optional. If eType/tType is omitted (or if ANY or _ is used), then the SELECT will seek out any edge or target vertex that is valid (i.e., there exists a valid path between two vertices over an edge). For the example below, if V1 and V2 are the only possible reachable vertex types via eType, we can omit the target vertex type, making all of the following SELECT statements equivalent. The system will infer the target vertex type at run time.

It is legal to declare an alias without explicitly stating an edge/target type. See the examples below.

Type inference is used whenever possible for the edge set and target vertex set to prune ineligible edges and thereby optimize performance. The vertex type in Source is checked against the graph schema to find all incident edge types. The knowledge of the graph schema is combined with the selection's explicit type conditions given by eType and tType, as well as explicit and implicit type conditions in the WHERE clause to determine a final set of eligible edge sets which match the pattern Source → eType → tType. With type inference, the user has the freedom to express only as much as necessary to select edges.
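The pruning idea can be illustrated with a small Python model. This is only a sketch under assumptions: the `schema` table and the `eligible_edge_types` function are hypothetical, and the real engine also takes WHERE-clause type conditions into account, which this sketch omits.

```python
# Hypothetical schema: edge type -> (source vertex type, target vertex type).
schema = {
    "posted": ("person", "post"),
    "liked": ("person", "post"),
    "friend": ("person", "person"),
}


def eligible_edge_types(source_type, e_types=None, t_types=None):
    """Return the edge types that can match Source -> eType -> tType.

    None means the type was omitted (or given as ANY / _), so it is not a constraint.
    """
    return sorted(
        name
        for name, (src, tgt) in schema.items()
        if src == source_type
        and (e_types is None or name in e_types)
        and (t_types is None or tgt in t_types)
    )


# No explicit types: every edge type leaving a person vertex is eligible.
print(eligible_edge_types("person"))
# Only the target type is given: the edge types are inferred from the schema.
print(eligible_edge_types("person", t_types={"post"}))
# Only the edge type is given: the target type is implied by the schema.
print(eligible_edge_types("person", e_types={"liked"}))
```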

Similarly, the GSQL engine will infer the edge type at run time. For example, if E1, E2, and E3 are the only possible edge types that can be traversed to reach vertices of type tType, we can omit specifying the edge type, making the following SELECT statements equivalent.

The following are a set of queries that demonstrate edge-induced SELECT blocks. The allPostsLiked and allPostsMade queries show how the target vertex type can be omitted. The allPostsLikedOrMade query uses the "|" operator to select multiple types of edges.

This example is another edge selection that uses the "|" operator to select edges that have target vertices of multiple types.

## Vertex and Edge Aliases

Vertex and edge aliases are declared within the FROM clause of a SELECT block, by using the character ":", followed by the alias name. Aliases can be accessed anywhere within the same SELECT block. They are used to reference a single selected vertex or edge of a set. It is through the vertex or edge aliases that attributes of these vertices or edges can be accessed.

For example, the following code snippet shows two different SELECT statements. The first SELECT statement starts from a vertex set called allVertices, and the vertex alias name v can access each individual vertex from allVertices. The second SELECT statement selects a set of edges. It can use the vertex alias s to reference the source vertices, or the alias t to reference the target vertices.

The following example shows an edge-based SELECT statement, declaring aliases for all three parts of the edge. In the ACCUM clause, the e and t aliases are assigned to local edge and vertex variables, respectively.

# SAMPLE Clause

The SAMPLE clause is an optional clause that selects a uniform random sample from the population of edges or target vertices specified in the FROM argument.

The SAMPLE clause draws from the edge population consisting of those edges which satisfy all three parts – source set, edge type, and target type – of the FROM clause. The SAMPLE clause is intended to provide a representative sample of the distribution of edges (or vertices) connected to hub vertices, instead of dealing with all edges. A hub vertex is a vertex with a relatively high degree. (The degree of a vertex is the number of edges which connect to it. If edges are directional, one can distinguish between indegree and outdegree.)

The expression following SAMPLE specifies the sample size, either an absolute number or a percentage of the population. The expression in sampleClause must evaluate to a positive integer. There are two sampling methods. One is sampling based on edge id. The other is based on target vertex id: if a target vertex id is sampled, all edges from this source vertex to the sampled target vertex are sampled.
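The two sampling methods can be contrasted with a short Python model. This is a sketch under stated assumptions, not the engine's algorithm: it assumes sampling is applied per source vertex with an absolute count `k`, and the function names and toy edge list are hypothetical.

```python
import random
from collections import defaultdict


def group_by_source(edges):
    """Group (source, target) edges by their source vertex."""
    by_source = defaultdict(list)
    for edge in edges:
        by_source[edge[0]].append(edge)
    return by_source


def sample_edges_per_source(edges, k, rng):
    """Edge-based sampling: keep up to k randomly chosen edges per source vertex."""
    sampled = []
    for out_edges in group_by_source(edges).values():
        sampled.extend(rng.sample(out_edges, min(k, len(out_edges))))
    return sampled


def sample_targets_per_source(edges, k, rng):
    """Target-based sampling: pick up to k target ids per source vertex,
    then keep every edge from that source to a picked target."""
    sampled = []
    for out_edges in group_by_source(edges).values():
        targets = sorted({t for (_, t) in out_edges})
        chosen = set(rng.sample(targets, min(k, len(targets))))
        sampled.extend(e for e in out_edges if e[1] in chosen)
    return sampled


# Toy data: vertex u has two parallel edges to x and one edge to y.
edges = [("u", "x"), ("u", "x"), ("u", "y"), ("w", "z")]
rng = random.Random(0)

by_edge = sample_edges_per_source(edges, 1, rng)
by_target = sample_targets_per_source(edges, 1, rng)
print(by_edge)
print(by_target)
```

With edge-based sampling, each source contributes exactly one edge here. With target-based sampling, source `u` contributes either both parallel edges to `x` or the single edge to `y`, depending on which target id is drawn.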

Given that the sampling is random, some of the details of each of the example queries may change each time they are run.

The following query displays two modes of sampling: an absolute number of edges from a source vertex and a percentage of edges from a source vertex. We use the computerNet graph (see Appendix D). In computerNet, there are 31 vertices and 43 edges, but only 7 vertices are source vertices. Moreover, c1, c12, and c23 are hub nodes, with at least 10 outgoing edges each. For the absolute count case, we set the size to 1 edge per source vertex, which is equivalent to a random walk. We expect exactly 7 edges to be selected. For the percentage sampling case, we sample 33% of the edges for vertices which have 3 or more outgoing edges. We expect about 15 edges, but the number may vary.

Below is an example of using SELECT to traverse only one edge for each source vertex. The vertex-attached accumulators @timesTraversedNoSample and @timesTraversedWithSample keep track of the number of times an edge is traversed to reach the target vertex. Without sampling, this occurs once for each edge; thus @timesTraversedNoSample equals the in-degree of the vertex. With edge sampling, the number of edges is restricted. This is reflected in the @timesTraversedWithSample accumulator. Notice the difference in the result set. Because only one edge per source vertex is traversed when the SAMPLE clause is used, not all target vertices are reached. The vertex company3 has 3 incident edges, but in one instance of the query execution, it is never reached. Additionally, company2 has 6 incident edges, but only 4 source vertices sampled an edge incident to company2.

# WHERE Clause

The WHERE clause is an optional clause that constrains edges and vertices specified in the FROM and SAMPLE clauses.

The WHERE clause uses a boolean condition to test each vertex or edge in the FROM set (or the sampled vertex and edge sets, if the SAMPLE clause was used).

If the expression evaluates to false for vertex/edge X, then X is excluded from further consideration in the result set. The expression may use constants or any variables or parameters within the scope of the SELECT, arithmetic operators (+, -, *, /, %), comparison operators (==, !=, <, <=, >, >=), boolean operators (AND, OR, NOT), set operators (IN, NOT IN), and parentheses to enforce precedence. The WHERE conditional expression may use any of the variables within its scope: global accumulators, vertex set variables, query input parameters, the FROM clause's vertex and edge sets (or their vertex and edge aliases), or any of the attributes or accumulators of the vertex/edge sets. For a more formal explanation of condition, see the EBNF definitions of condition and expr.

Using built-in vertex and edge attributes and functions, such as .type and .neighbors(), the WHERE clause can be used to implement sophisticated selection rules for the edge traversal. In the following example, the selection conditions are completely specified in the WHERE clause, with no edge types or vertex types mentioned in the FROM clause.

The following examples demonstrate using the WHERE clause to limit the resulting vertex set based on a vertex attribute.

The following example shows the equivalence of using WHERE as a type filter as well as its limitations.

The following example uses edge attributes to determine which workers are registered as full time for some company.

# ACCUM and POST-ACCUM Clauses

The optional ACCUM and POST-ACCUM clauses enable sophisticated aggregation and other computations across the set of vertices or edges selected by the preceding FROM, SAMPLE, and WHERE clauses. A query can contain one or both of these clauses. The statements in an ACCUM clause are applied for every edge in an edge-induced selection or every vertex in a vertex-induced selection.

If there is more than one statement in the ACCUM clause, the statements are separated by commas and executed sequentially for each selected element. However, the TigerGraph system uses parallelism to improve performance. Within an ACCUM clause, each edge is handled by a separate process. As such, there is no fixed order in which the edges are processed, and the edges should not be treated as executing sequentially. The accumulators are mutex variables shared among these processes. The result of any accumulation within the ACCUM clause is not complete until all edges have been traversed. Any inspection of an intermediate result within the ACCUM clause sees only partial data and may not be meaningful.
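The following Python sketch illustrates why the final accumulated value is reliable even though the processing order is not. It is an analogy only: it uses threads and a lock as a stand-in for the engine's parallel workers and mutex-protected accumulators, and the `SumAccum` class here is a hypothetical Python class, not the GSQL accumulator type.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SumAccum:
    """A lock-protected running sum, standing in for a shared accumulator."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, amount):
        # The lock makes each update atomic, so concurrent updates are not lost.
        with self._lock:
            self._value += amount

    @property
    def value(self):
        return self._value


# Toy edges: (source, target, weight). The data is hypothetical.
edges = [("a", "x", 1), ("a", "y", 2), ("b", "x", 3), ("c", "y", 4)]
total = SumAccum()


def accum(edge):
    """The ACCUM body: runs once per edge, in no guaranteed order."""
    _, _, weight = edge
    total.add(weight)


with ThreadPoolExecutor(max_workers=4) as pool:
    # Edges may be processed in any order and concurrently.
    list(pool.map(accum, edges))

# Only after every edge has been processed is the total complete.
print(total.value)
```

Reading `total.value` from inside `accum` would return whatever partial sum exists at that moment, which is the situation the paragraph above warns about.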

The optional POST-ACCUM clause enables aggregation and other computations across the set of vertices (but not edges) selected by the preceding clauses. POST-ACCUM can be used without ACCUM. If it is preceded by an ACCUM clause, then it can be used for 2-stage accumulative computation: a first stage in ACCUM followed by a second stage in POST-ACCUM.

In edge-induced selection, since the ACCUM clause iterates over edges, and often two edges will connect to the same source vertex or to the same target vertex, the ACCUM clause can be repeated multiple times for one vertex.

The primary purpose of the ACCUM or POST-ACCUM clause is to collect information about the graph by updating accumulators (via += or =). See the "Accumulator" section for details on the += operation. However, other kinds of statements (e.g., branching, iteration, local assignments) are permitted to support more complex computations or to log activity. The EBNF syntax below defines the allowable kinds of statements that can occur within an ACCUM or POST-ACCUM. The DMLSubStmt list is similar to the queryBodyStmt list which applies to statements outside of a SELECT block; it is important to note the differences. Each of these statement types is discussed in one of the main sections of this reference document.

## Aliases and ACCUM/POST-ACCUM Iteration Model

To reference each element of the selected set, use the aliases defined in the FROM clause. For example, assume that we have the following aliases:

Let (V1, V2, ... Vn) be the vertices in the vertex-induced selection. The following pseudocode emulates ACCUM clause behavior.

Let E = (E1, E2, ... En) be the edges in the edge-induced selected set. Further, let S = (S1, S2, ... Sn) and T = (T1, T2, ... Tn) be the multisets (bags) of source vertices and target vertices which correspond to the edge set. S and T are bags, because they can contain repeated elements.

Note that any reference to the source alias s or target alias t is for the endpoint vertices of the current edge.

Similarly, the POST-ACCUM clause acts like a FOREACH loop on the vertex result set specified in the SELECT clause (e.g., either S or T).
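The iteration model for an edge-induced selection can be emulated in Python. This is a sequential sketch of the semantics described above, not how the engine executes; the toy edges, the `in_count` dictionary (playing the role of a vertex-attached integer sum accumulator), and the `labels` dictionary are hypothetical.

```python
from collections import defaultdict

# Toy selected edge set E, as (source, target) pairs. Hypothetical data.
edges = [("s1", "t1"), ("s1", "t2"), ("s2", "t2")]

in_count = defaultdict(int)  # stands in for a vertex-attached accumulator
labels = {}

# ACCUM: the body runs once per selected edge. Here s and t are the
# endpoints of the current edge, so t2 is visited twice.
for (s, t) in edges:
    in_count[t] += 1

# POST-ACCUM: the body runs once per distinct vertex in the result set.
# This sketch assumes the statement selects t, so the result set is built from T.
result_set = {t for (_, t) in edges}
for t in result_set:
    labels[t] = "hub" if in_count[t] > 1 else "leaf"

print(dict(in_count))
print(labels)
```

The bag T contains `t2` twice, so the ACCUM body touches it twice, while the POST-ACCUM body sees it once.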

## Edge/Vertex Type Inference and Conflict

If multiple edge types are specified in an edge-induced selection, each statement in the ACCUM clause is checked for edge type conflicts. If only a subset of the edge types is applicable to an ACCUM statement, the statement is not executed on the other edge types. For example:

In the above example, line 6 is only executed on "liked" edges, because "actionTime" is an attribute of the "liked" edge only. Similarly, line 7 is only executed on "friend" edges, because "gender" is an attribute of "person" only, and only the "friend" edge uses "person" as its target vertex. However, line 8 causes a compilation error, because it combines references that no single edge type can satisfy: "liked" edges don't have t.gender, and "friend" edges don't have e.actionTime.

Similar to the ACCUM clause, if multiple source/target vertex types are specified in an edge-induced selection and the POST-ACCUM clause accesses the source/target vertex, each statement in the POST-ACCUM clause is checked for source/target vertex type conflicts. If only a subset of the source/target vertex types is applicable to a POST-ACCUM statement, the statement is not executed on the other source/target vertex types.

## Rules for Updating Vertex-Attached Accumulators

Prior to v1.0, a vertex-attached accumulator could only be updated in an ACCUM or POST-ACCUM clause and only if its vertex was selected by the preceding FROM-SAMPLE-WHERE clauses.

Beginning in v1.0, there are additional circumstances where a vertex-attached accumulator may be updated. Vertices which are referenced via a vertex-attached accumulator of a selected vertex may have their vertex-attached accumulators updated in the ACCUM clause (but not in the POST-ACCUM clause). That is, a vertex referenced by a selected vertex can be updated, with some limitations explained below. Some examples will help to illustrate this more complex condition.

• Suppose a query declares a vertex-attached accumulator which holds vertex information. We call this a vertex-holding accumulator. This could take several forms:

  • A scalar accumulator, e.g., MaxAccum<VERTEX> @maxV;

  • A collection accumulator, e.g., ListAccum<VERTEX> @listV;

  • An accumulator containing tuple(s), where the tuple type contains a VERTEX field.

• If a vertex V is selected, then not only can V's accumulators be updated, but the vertices stored in its vertex-holding accumulators can also be updated, in the ACCUM clause.

• Before these indirectly referenced vertices can be used, they need to be activated. There are two ways to activate an indirect vertex:

  • A vertex from a vertex-holding accumulator is first assigned to a local vertex variable. The vertex can now be updated through the local vertex variable.

  • A FOREACH loop can iterate on a vertex-holding collection accumulator. The vertices can now be updated through the loop variable.

The following query demonstrates updates to indirectly activated vertices.

## ACCUM and POST-ACCUM Examples

We now show several examples. This example demonstrates how ACCUM or POST-ACCUM can be used to count the number of vertices in the given set.

This example uses ACCUM to find all the subjects a user posted about.

This example shows each person's posted vertices and each person's like behaviors (liked edges).

This example counts the total number of times each topic is used.

This is an example of using ACCUM and POST-ACCUM in conjunction. The ACCUM traverses the graph and finds all people who live and work in the same country. After this is determined, POST-ACCUM examines each vertex (person) to see if they work where they live.

This is an example of a POST-ACCUM-only query that counts the number of people with a particular gender.

# HAVING Clause

The optional HAVING clause provides constraints on the result set of the SELECT. The constraints are applied after ACCUM and POST-ACCUM actions. This differs from the WHERE clause, which is applied before the ACCUM and POST-ACCUM actions.

A HAVING clause can only be used if there is an ACCUM or POST-ACCUM clause. The condition is applied to each vertex in the SELECT set (either source or target vertices) which also fulfilled the FROM and WHERE conditions. The HAVING clause is intended to test one or more of the accumulator variables that were updated in the ACCUM or POST-ACCUM clause, though the condition may be anything that evaluates to a boolean value. If the condition is false for a particular vertex, then that vertex is excluded from the result set.
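The ordering of WHERE, ACCUM, and HAVING can be modeled in a few lines of Python. This is a sketch of the evaluation order only, with hypothetical data and a hypothetical `select_sources` helper; it is not GSQL and does not reflect how the engine is implemented.

```python
from collections import defaultdict

# Toy edges: (source, edge_type, target). Hypothetical data.
edges = [
    ("ann", "liked", "post1"),
    ("ann", "liked", "post2"),
    ("ann", "posted", "post3"),
    ("bob", "liked", "post1"),
]


def select_sources(edges, where, having):
    """Model of: SELECT s FROM ... WHERE ... ACCUM s.@activity += 1 HAVING ..."""
    # WHERE: filters edges before any accumulation happens.
    selected = [e for e in edges if where(e)]

    # ACCUM: one update per surviving edge.
    activity = defaultdict(int)
    for (s, _, _) in selected:
        activity[s] += 1

    # HAVING: filters result vertices using the accumulated values.
    result = {s for s in activity if having(activity[s])}
    return result, dict(activity)


result, activity = select_sources(
    edges,
    where=lambda e: e[1] == "liked",
    having=lambda count: count >= 2,
)
print(activity)
print(sorted(result))
```

The edge of type `posted` never reaches the accumulation step, while `bob` is accumulated but then removed by the HAVING condition.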

The following example demonstrates using the HAVING clause to constrain a result set based on the vertex accumulator variable which was updated during the ACCUM clause.

If the activityThreshold parameter is set to 3, the query returns 5 vertices:

If the activityThreshold parameter is set to 2, the query would return 8 vertices. With activityThreshold = 4, the query would return no vertices.

The following example demonstrates the equivalence of a SELECT statement in which the condition for the HAVING clause is always true.

The following shows an example of equivalent result sets from using WHERE vs. HAVING. Recall that the WHERE clause is evaluated before the ACCUM and that the HAVING clause is evaluated after the ACCUM. Both constrain the result set based on a condition that vertices must meet.

The following example has a compilation error because the result set is taken from the source vertices, but the HAVING condition is checking the target vertices.

# ORDER BY Clause

The optional ORDER BY clause sorts the result set.

ASC specifies ascending order (least value first), and DESC specifies descending order (greatest value first). If neither is specified, then ascending order is used. Each expr must refer to the attributes or accumulators of a member of the result set, and the expr must evaluate to a sortable value (e.g., a number or a string). ORDER BY offers hierarchical sorting by allowing a comma-separated list of expressions, sorting first by the leftmost expr. It uses the next expression only to sort items where the current sort expr results in identical values. Any items in the result set which cannot be sorted (because the sort expressions do not pertain to them) will appear at the end of the set, after the sorted items.
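Hierarchical sorting with mixed directions can be sketched in Python. This is an illustrative model, not the engine's sort: the `order_by` helper and the vertex records are hypothetical, and placing unsortable items last follows the rule stated above.

```python
# Hypothetical vertex records; v4 lacks the fields used by the sort expressions.
vertices = [
    {"id": "v1", "friends": 2, "coworkers": 5},
    {"id": "v2", "friends": 3, "coworkers": 1},
    {"id": "v3", "friends": 2, "coworkers": 7},
    {"id": "v4"},
]


def order_by(items, *specs):
    """Sort by (field, descending) pairs; the leftmost pair has highest priority."""
    sortable = [v for v in items if all(field in v for field, _ in specs)]
    unsortable = [v for v in items if not all(field in v for field, _ in specs)]
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a hierarchical ordering.
    for field, descending in reversed(specs):
        sortable.sort(key=lambda v: v[field], reverse=descending)
    # Items the sort expressions do not pertain to go at the end.
    return sortable + unsortable


# Model of: ORDER BY friends DESC, coworkers ASC
ordered = order_by(vertices, ("friends", True), ("coworkers", False))
print([v["id"] for v in ordered])  # ['v2', 'v1', 'v3', 'v4']
```

The second expression only decides the order between `v1` and `v3`, which tie on the first expression.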

The following example demonstrates the use of ORDER BY with multiple expressions. The returned vertex set is first ordered by the number of friends of the vertex, and then ordered by the number of coworkers of that vertex.

# LIMIT Clause

The optional LIMIT clause sets constraints on the number and ranking of items included in the final result set.

Each expr must evaluate to a nonnegative integer. To understand LIMIT, note that the tentative result set is held in the computer as a list of vertices. If the query has an ORDER BY clause, the order is specified; otherwise the list order is unknown. Assume we number the vertices as v_1, v_2, ..., v_n. The LIMIT clause specifies a range of vertices, starting from a lower position in the list to an upper position.

There are three forms:

Case 1: LIMIT k

• When a single expr is provided, LIMIT returns the first k elements from the tentative result set. If there are fewer than k elements available, then all elements will be returned in the result set. If k=5 and the tentative result set has at least 5 items, then the final result list will be [v_1, v_2, v_3, v_4, v_5].

Case 2: LIMIT j, k

• When a comma separates two expressions, LIMIT treats the first expression j as an offset. That is, it skips the first j items in the list. The second expr k specifies the maximum number of items to include. If the list has at least 7 items, then LIMIT 2, 5 would return [v_3, v_4, v_5, v_6, v_7].

Case 3: LIMIT k OFFSET j

• The behavior of Case 3 is the same as that of Case 2, except that the syntax is different. The keyword OFFSET separates the two expressions, and the count comes before the offset, rather than vice versa. If the list has at least 7 items, then LIMIT 5 OFFSET 2 would return [v_3, v_4, v_5, v_6, v_7].

If any of the expressions evaluate to a negative integer, the results are undefined.
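All three forms reduce to the same slice of the tentative result list, which a short Python sketch makes concrete. The `limit` function is a hypothetical model, not part of GSQL. Because the behavior for negative values is undefined, the sketch chooses to reject them; that choice is an assumption of this example.

```python
def limit(tentative, k, offset=0):
    """Model of LIMIT: skip `offset` items, then return at most `k` items."""
    if k < 0 or offset < 0:
        # Undefined in the language; this sketch rejects it explicitly.
        raise ValueError("LIMIT expressions must be nonnegative")
    return tentative[offset:offset + k]


vs = [f"v_{i}" for i in range(1, 9)]  # v_1 .. v_8

case_1 = limit(vs, 5)             # LIMIT 5
case_2 = limit(vs, 5, offset=2)   # LIMIT 2, 5  (offset first, then count)
case_3 = limit(vs, 5, offset=2)   # LIMIT 5 OFFSET 2  (count first, then offset)
short = limit(vs, 20, offset=5)   # fewer than 25 items available

print(case_1)  # ['v_1', 'v_2', 'v_3', 'v_4', 'v_5']
print(case_2)  # ['v_3', 'v_4', 'v_5', 'v_6', 'v_7']
print(short)   # ['v_6', 'v_7', 'v_8']
```

Cases 2 and 3 differ only in the order in which the two numbers are written, so they map to the same call here.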

The following examples demonstrate the various forms of the LIMIT clause.

The first example shows the LIMIT clause when used as an upper limit. It returns a result set with a maximum size of 4 elements in the set.

The following example shows how to use the LIMIT clause with an offset.

The following example shows the alternative syntax for a result size limit with an offset. This time we try larger values for offset and size. In a large data set, limitTest(5,20) might return 20 vertices, but since we don't have 25 vertices in the original data, the output was fewer than 20 vertices.