After a graph schema has been created, it can be modified. Data already stored in the graph which is not logically part of the change will be retained. For example, if you had 100 Book vertices and then added an attribute to the Book schema, you would still have 100 Books, with default values for the new attribute. If you dropped a Book attribute, you would still have all your books, but that attribute would be gone.
To safely update the graph schema, the user should follow this procedure:
Create a SCHEMA_CHANGE JOB, which defines a sequence of ADD, ALTER and/or DROP statements.
Run the SCHEMA_CHANGE JOB (i.e., RUN SCHEMA_CHANGE JOB job_name), which will do the following:
Attempt the schema change.
If the change is successful, invalidate any loading job or query definitions which are incompatible with the new schema.
If the change is unsuccessful, report the failure and return to the state before the attempt.
A schema change will invalidate any loading jobs or query jobs which relate to an altered part of the schema. Specifically:
A loading job becomes invalid if it refers to a vertex or an edge which has been dropped (deleted) or altered.
A query becomes invalid if it refers to a vertex, an edge, or an attribute which has been dropped.
Invalid loading jobs are dropped, and invalid queries are uninstalled. After the schema update, the user will need to create and install new load and query jobs based on the new schema.
Jobs and queries for unaltered parts of the schema will still be available and do not need to be reinstalled. However, even though these jobs remain valid (i.e., they can still be run), the user may wish to examine whether they still perform the preferred operations (i.e., whether you still want to run them).
Load or query operations which begin before the schema change will be completed based on the pre-change schema. Load or query operations which begin after the schema change, and which have not been invalidated, will be completed based on the post-change schema.
Only a superuser or globaldesigner can add, alter, or drop global vertex types or global edge types, which are those that are created using CREATE VERTEX or CREATE ... EDGE. This rule applies even if the vertex or edge type is used in only one graph. To make these changes, the user uses a GLOBAL SCHEMA_CHANGE JOB.
An admin or designer user can add, alter, or drop local vertex types or local edge types which are created in the context of that graph. Local vertex and edge types are created using an ADD statement inside a SCHEMA_CHANGE JOB. To alter or drop any of these local types, the admin user uses a regular SCHEMA_CHANGE JOB.
Local graphs can define vertex and edge types independently of the vertex and edge types in other graphs. That is, the same name can be used in different graphs for (different) vertex or edge types.
It is even permitted for a local graph and the global graph to use the same name for their own vertex or edge types, as long as the global vertex/edge type is not used within the local graph.
The two types of schema change jobs are described below.
CREATE SCHEMA_CHANGE JOB
(local) The CREATE SCHEMA_CHANGE JOB block defines a sequence of ADD, ALTER, and/or DROP statements for changing a particular graph. It does not perform the schema change.
One use of CREATE SCHEMA_CHANGE JOB is to define an additional vertex type and edge type to be the structure for a secondary index. For example, if you wanted to index the postalCode attribute of the User vertex, you could create a postalCode_idx (PRIMARY_ID id string, code string) vertex type and hasPostalCode (FROM User, TO postalCode_idx) edge type. Then create an index structure having one edge from each User to a postalCode_idx vertex.
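A sketch of such a job, assuming a graph named user_graph (the graph and job names are assumptions; the vertex and edge names come from the example above):
CREATE SCHEMA_CHANGE JOB add_postal_index FOR GRAPH user_graph {
  ADD VERTEX postalCode_idx (PRIMARY_ID id STRING, code STRING);
  ADD UNDIRECTED EDGE hasPostalCode (FROM User, TO postalCode_idx);
}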
By its nature, a SCHEMA_CHANGE JOB may contain multiple statements. If the job block is used in the interactive GSQL shell, then the BEGIN and END commands should be used to permit the SCHEMA_CHANGE JOB to be entered on several lines. If the job is stored in a command file to be read in batch mode, then BEGIN and END are not needed.
Remember to include a semicolon at the end of each DROP, ALTER, or ADD statement within the JOB block.
If a SCHEMA_CHANGE JOB defines a new edge type which connects to a new vertex type, the ADD VERTEX statement should precede the related ADD EDGE statement. However, the ADD EDGE and ADD VERTEX statements can be in the same SCHEMA_CHANGE JOB.
ADD VERTEX | EDGE
(local) The ADD statement defines a new type of vertex or edge and automatically adds it to a graph schema. The syntax for the ADD VERTEX | EDGE statement is analogous to that of the CREATE VERTEX | EDGE | GRAPH statements. It may only be used within a SCHEMA_CHANGE JOB.
ALTER VERTEX | EDGE
The ALTER statement adds attributes to or removes attributes from an existing vertex type or edge type. It may only be used within a SCHEMA_CHANGE JOB. The basic format is as follows:
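A sketch of the basic format (placeholder names in lowercase):
ALTER VERTEX|EDGE object_type_name ADD|DROP ATTRIBUTE (attribute_list);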
ALTER ... ADD
ALTER ... ADD can add attributes to vertex or edge types.
Added attributes are appended to the end of the schema. The new attributes may include DEFAULT fields. To add attributes to a vertex type, the syntax is as follows:
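A sketch of the vertex syntax (square brackets denote optional parts):
ALTER VERTEX vertex_type_name ADD ATTRIBUTE (attr_name type [DEFAULT default_value] [, attr_name type]*);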
For example:
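A plausible example (the attribute names are assumptions):
ALTER VERTEX User ADD ATTRIBUTE (age UINT DEFAULT 0, country STRING);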
To add to an edge's endpoint vertex types or attributes, the syntax is as follows:
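A sketch of the edge form for attributes (adding endpoint vertex pairs is handled by ADD PAIR, described next):
ALTER EDGE edge_type_name ADD ATTRIBUTE (attr_name type [DEFAULT default_value]);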
ALTER EDGE ... ADD PAIR
ALTER EDGE ... ADD PAIR adds one or more edge pairs, which refer to the FROM and TO vertex types of an edge type. To add an edge pair, put the vertex type names in parentheses after the keywords FROM and TO.
Syntax
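A hedged sketch of the syntax (placeholders in lowercase):
ALTER EDGE edge_type_name ADD PAIR (FROM vertex_type_name, TO vertex_type_name [| FROM vertex_type_name, TO vertex_type_name]*);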
Example
In the example below, the first statement in the schema change job adds an edge pair (FROM person, TO company) to the edge type visit. The second statement adds two edge pairs to the edge type has_pet; the edge type can now connect both person and dog vertices, as well as person and bird vertices.
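A sketch of the job described above (the graph name social and job name add_pairs are assumptions):
CREATE SCHEMA_CHANGE JOB add_pairs FOR GRAPH social {
  ALTER EDGE visit ADD PAIR (FROM person, TO company);
  ALTER EDGE has_pet ADD PAIR (FROM person, TO dog | FROM person, TO bird);
}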
ALTER ... DROP
The syntax for ALTER ... DROP is analogous to that of ALTER ... ADD.
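For instance, a hedged sketch of dropping an attribute (names assumed):
ALTER VERTEX User DROP ATTRIBUTE (age);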
ALTER VERTEX ... WITH
(Beta) The statement ALTER VERTEX ... WITH TAGGABLE is used to mark a vertex type as taggable or untaggable. Vertex types are untaggable by default. When a vertex type is marked as taggable, it can be used to create a tag-based graph. Additionally, users with the tag-access privilege can tag vertices whose vertex type is marked as taggable.
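A sketch of the statement (the vertex type name is an assumption):
ALTER VERTEX User WITH TAGGABLE="true";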
DROP VERTEX | EDGE
(local) The DROP statement removes the specified vertex type or edge type from the database dictionary. The DROP statement should only be used when graph operations are not in progress.
DROP TUPLE
For tuples that are defined within a graph schema, you can drop them by using the following command.
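A sketch of the command (tuple_type_name is a placeholder):
DROP TUPLE tuple_type_name;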
ADD TAG
ADD TAG defines a tag for the graph. Tags can be used to create tag-based graphs, allowing for finer-grained access control.
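A minimal sketch inside a schema change job (the tag name vip is an assumption):
ADD TAG vip;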
DROP TAG
DROP TAG drops one or more tags from the schema, and deletes the tag from each vertex to which it is attached. DROP TAG cannot be run if the tag to be dropped is used in the definition of a tag-based graph; the graph must be dropped first.
RUN SCHEMA_CHANGE JOB
RUN SCHEMA_CHANGE JOB job_name performs the schema change job. After the schema has been changed, the GSQL system checks all existing GSQL queries. If an existing GSQL query uses a dropped vertex, edge, or attribute, the query becomes invalid, and GSQL will show the message "Query query_name becomes invalid after schema update, please update it."
Below is an example. The schema change job add_reviews adds a Review vertex type and two edge types to connect reviews to users and books, respectively.
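A sketch of the job described above (the Review attributes and edge type names are assumptions):
CREATE SCHEMA_CHANGE JOB add_reviews FOR GRAPH Book_rating {
  ADD VERTEX Review (PRIMARY_ID id UINT, review_date DATETIME);
  ADD UNDIRECTED EDGE wrote_review (FROM User, TO Review);
  ADD UNDIRECTED EDGE review_of_book (FROM Review, TO Book);
}
RUN SCHEMA_CHANGE JOB add_reviews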
DROP SCHEMA_CHANGE JOB
To drop (remove) a schema change job, run DROP JOB schema_change_job_name from the GSQL shell. The specified schema change job will be removed from GSQL. Refer to the Creating a Loading Job page for more information about dropping jobs.
USE GLOBAL
The USE GLOBAL command changes a superuser's mode to Global mode. In global mode, a superuser can define or modify global vertex and edge types, as well as specify which graphs use those global types. For example, the user should run USE GLOBAL before creating or running a GLOBAL SCHEMA_CHANGE JOB.
CREATE GLOBAL SCHEMA_CHANGE JOB
The CREATE GLOBAL SCHEMA_CHANGE JOB block defines a sequence of ADD, ALTER, and DROP statements that modify either the attributes or the graph membership of global vertex or edge types. Unlike the non-global schema change job, the header does not include a graph name; however, the ADD/ALTER/DROP statements in the body do mention graphs.
Although both global and local schema change jobs have ADD and DROP statements, they have different meanings. The table below outlines the differences.
Remember to include a semicolon at the end of each DROP, ALTER, or ADD statement within the JOB block.
ADD VERTEX | EDGE
(global) The ADD statement adds existing global vertex or edge types to one of the graphs.
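A sketch of such statements inside a global schema change job (the type and graph names are assumptions):
ADD VERTEX User TO GRAPH library;
ADD UNDIRECTED EDGE user_book_read TO GRAPH library;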
ALTER VERTEX | EDGE
The ALTER statement is used to add attributes to or remove attributes from an existing global vertex type or edge type. It can also be used to add or remove source (FROM) vertex types or destination (TO) vertex types of an edge type. It may only be used within a SCHEMA_CHANGE JOB. The ALTER VERTEX / EDGE syntax for global schema change jobs is the same as that for local schema change jobs.
ALTER ... ADD
Added attributes are appended to the end of the schema, and the new attributes may include DEFAULT fields. The syntax for adding attributes to a vertex type, or to an edge's endpoint vertex types or attributes, is the same as in a local schema change job (see the examples in the local ALTER ... ADD section above).
ALTER EDGE ... ADD PAIR
ALTER EDGE ... ADD PAIR adds one or more edge pairs, which refer to the FROM and TO vertex types of an edge type. To add an edge pair, put the vertex type names in parentheses after the keywords FROM and TO.
Syntax
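The syntax is identical to the local form, sketched here again for reference (placeholders in lowercase):
ALTER EDGE edge_type_name ADD PAIR (FROM vertex_type_name, TO vertex_type_name [| FROM vertex_type_name, TO vertex_type_name]*);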
Example
In the example below, the first statement in the schema change job adds an edge pair (FROM person, TO company) to the edge type visit. The second statement adds two edge pairs to the edge type has_pet; the edge type can now connect both person and dog vertices, as well as person and bird vertices.
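A sketch of the global variant of the job described above (the job name is an assumption):
CREATE GLOBAL SCHEMA_CHANGE JOB add_pairs_global {
  ALTER EDGE visit ADD PAIR (FROM person, TO company);
  ALTER EDGE has_pet ADD PAIR (FROM person, TO dog | FROM person, TO bird);
}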
ALTER ... DROP
The syntax for ALTER ... DROP is analogous to that of ALTER ... ADD.
ALTER VERTEX ... WITH
(Beta) The statement ALTER VERTEX ... WITH TAGGABLE is used to mark a vertex type as taggable or untaggable. Vertex types are untaggable by default. When a vertex type is marked as taggable, it can be used to create a tag-based graph. Additionally, users with the tag-access privilege can tag vertices whose vertex type is marked as taggable.
DROP VERTEX | EDGE
(global) The DROP statement removes specified global vertex or edge types from one of the graphs. The command does not delete any data.
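A sketch of the statement inside a global schema change job (the type and graph names are assumptions):
DROP VERTEX User FROM GRAPH library;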
RUN GLOBAL SCHEMA_CHANGE JOB
RUN GLOBAL SCHEMA_CHANGE JOB job_name performs the global schema change job. After the schema has been changed, the GSQL system checks all existing GSQL queries. If an existing GSQL query uses a dropped vertex, edge, or attribute, the query becomes invalid, and GSQL will show the message "Query query_name becomes invalid after schema update, please update it."
Below is an example. The schema change job alter_friendship_make_library drops the on_date attribute from the friend_of edge and adds the Book type to the library graph.
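A sketch of the job described above (statement forms follow the ADD/DROP examples earlier; the graph name library comes from the description):
CREATE GLOBAL SCHEMA_CHANGE JOB alter_friendship_make_library {
  ALTER EDGE friend_of DROP ATTRIBUTE (on_date);
  ADD VERTEX Book TO GRAPH library;
}
RUN GLOBAL SCHEMA_CHANGE JOB alter_friendship_make_library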
DROP GLOBAL SCHEMA_CHANGE JOB
Global schema change jobs can be dropped by using the DROP JOB command. Refer to the Creating a Loading Job page for more information about dropping jobs.
DROP ALL
The DROP ALL command clears all graph data, all graph schemas, all loading jobs, and all queries. It should only be used when the intent is to erase an entire database design and to start over.
This command is only available to superusers and only when they are in global mode.
ADD: (Local) Defines a new local vertex/edge type and adds it to the graph's domain. (Global) Adds one or more existing global vertex/edge types to a graph's domain.
DROP: (Local) Deletes a local vertex/edge type and its vertex/edge instances. (Global) Removes one or more existing global vertex/edge types from a graph's domain.
ALTER: (Local) Adds or drops attributes from a local vertex/edge type. (Global) Adds or drops attributes from a global vertex/edge type, which may affect several graphs.
In GSQL's Data Definition and Loading (DDL) language, users can define their own token functions if the built-in token functions do not meet their needs. This guide describes the procedures to define token functions.
All user-defined token functions are kept in a file named TokenBank.cpp. Use the GET TokenBank command in GSQL to download the current UDF file to any location on your machine. The path after the keyword TO specifies where the file will be output. The file and the directories will be created if they do not exist, and the file must end with the file extension .cpp. An example is shown below:
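A hypothetical invocation (the path is an assumption):
GET TokenBank TO "/home/tigergraph/udf/TokenBank.cpp"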
If you only supply a directory but not a filename, the file will be created with the default filename TokenBank.cpp.
Define the token function in the file you just downloaded. The function must be a C++ function. It can either return a value to be used in an attribute expression, or be used in a WHERE clause as a condition expression. Depending on the return type of the function, the signature of the function must match the allowed format.
If your token function is used to return an attribute expression, the signature of the function must follow the format specified in the table below depending on the attribute type.
The parameters iToken, iTokenLen, and iTokenNum must be named exactly as such, and are used to describe the input tokens:
iToken is an array of the string tokens,
iTokenLen is an array of the lengths of the string tokens,
iTokenNum is the number of tokens.
For token functions for attribute types STRING or STRING COMPRESS, the return type for the C++ function is void. Use the parameter oToken to store the string you want returned; in GSQL, the token function will return the string stored in oToken:
oToken is the returned string value,
oTokenLen is the length of the returned string.
Note that the input tokens are always in string (char*) format. If necessary, convert them to other types inside the function.
The built-in token function gsql_concat is used as an example below. It takes multiple token parameters and returns a string.
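A sketch of such a function, following the signature conventions described above (buffer management is simplified; verify against the shipped TokenBank.cpp):
extern "C" void gsql_concat(const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum,
                            char* const oToken, uint32_t& oTokenLen) {
  uint32_t k = 0;
  // Copy every input token, in order, into the output buffer.
  for (uint32_t i = 0; i < iTokenNum; i++) {
    for (uint32_t j = 0; j < iTokenLen[i]; j++) {
      oToken[k++] = iToken[i][j];
    }
  }
  oTokenLen = k;  // Report the length of the concatenated string.
}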
WHERE Clause
User-defined token functions (described above) can also be used to construct the boolean conditional expression in the WHERE clause. However, there are some restrictions. In the clause "WHERE conditions":
The only user-defined token functions allowed are those that return a boolean value.
If a user-defined token function is used in a WHERE clause, then it must constitute the entire condition; it cannot be combined with another function or operator to produce a subsequent value. However, the arguments of the UDF can include other functions.
The source code for the built-in token function gsql_token_equal is used below as an example of how to write a user-defined token function.
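A sketch consistent with the boolean signature rules above (the shipped source may differ in detail):
extern "C" bool gsql_token_equal(const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum) {
  if (iTokenNum != 2) return false;           // Expect exactly two tokens to compare.
  if (iTokenLen[0] != iTokenLen[1]) return false;
  for (uint32_t i = 0; i < iTokenLen[0]; i++) {
    if (iToken[0][i] != iToken[1][i]) return false;
  }
  return true;
}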
After defining the token functions, use the PUT TokenBank command to upload the functions. The path after the keyword FROM is the absolute path to the TokenBank.cpp file. An example is shown below:
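A hypothetical invocation (the path is an assumption):
PUT TokenBank FROM "/home/tigergraph/udf/TokenBank.cpp"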
Before data can be loaded into the graph store, the user must define a graph schema. A graph schema is a "dictionary" that defines the types of entities (vertices and edges) in the graph and how those types of entities are related to one another. Each vertex or edge type has a name and a set of attributes (properties) associated with it. For example, a Book vertex could have title, author, publication year, genre, and language attributes.
In the figure below, circles represent vertex types, and lines represent edge types. The labeling text shows the name of each type. This example has four types of vertices: User, Occupation, Book, and Genre. It also has three types of edges: user_occupation, user_book_rating, and book_genre. Note that this diagram does not say anything about how many users or books are in the graph database. It also does not indicate the cardinality of the relationship; for example, it does not specify whether a User may connect to multiple occupations.
An edge connects two vertices; in TigerGraph terminology, these two vertices are the source vertex and the target vertex. An edge type can be either directed or undirected. A directed edge has a clear semantic direction, from the source vertex to the target vertex. For example, if there is an edge type that represents a plane flight segment, each segment needs to distinguish which airport is the origin (source vertex) and which airport is the destination (target vertex). In the example schema below, all of the edges are undirected. A useful test to decide whether an edge should be directed or undirected is the following: "An edge type is directed if knowing there is a relationship from A to B does not tell me whether there is a relationship from B to A." Having nonstop service from Chicago to Shanghai does not automatically imply there is nonstop service from Shanghai to Chicago.
An expanded schema is shown below, containing all the original vertex and edge types plus three additional edge types: friend_of, sequel_of, and user_book_read. Note that friend_of joins a User to a User. The friendship is assumed to be bidirectional, so the edge type is undirected. Sequel_of joins a Book to a Book, but it is directed, as evidenced by the arrowhead: The Two Towers is the sequel of The Fellowship of the Ring, but the reverse is not true. User_book_read is added to illustrate that there may be more than one edge type between a pair of vertex types.
The TigerGraph system user designs a graph schema to fit the source data and the user's needs and interests. The TigerGraph system user should consider what type of relationships are of interest and what type of analysis is needed. The TigerGraph system lets the user modify an existing schema, so the user is not locked into the initial design decision.
In the first schema diagram above, there are seven entities: four vertex types and three edge types. You may wonder why it was decided to make Occupation a separate vertex type instead of an attribute of User. Likewise, why is Genre a vertex type instead of an attribute of Book? These are examples of design choices. Occupation and Genre were separated out as vertex types because in graph analysis, if an attribute will be used as a query variable, it is often easier to work with as a vertex type.
Once the graph designer has chosen a graph schema, the schema is ready to be formalized into a series of GSQL statements.
Graph Creation and Modification Privileges
Only superusers and globaldesigners can define global vertex types, global edge types, and graphs, using CREATE VERTEX / EDGE / GRAPH. However, once a graph has been created, its admin and designer users can customize its schema, including adding new local vertex types and local edge types, by using a SCHEMA_CHANGE JOB, described in the next section.
Available to superuser and globaldesigner roles only.
The CREATE VERTEX statement defines a new global vertex type, with a name and an attribute list. The syntax, assuming that the vertex ID is listed first, is sketched below:
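A sketch of the syntax (square brackets denote optional parts; placeholders in lowercase):
CREATE VERTEX vertex_type_name (PRIMARY_ID id_name id_type
  [, attribute_name type [DEFAULT default_value]]*)
  [WITH STATS="outdegree_by_edgetype"|"none"]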
Beginning with v2.3, there are two syntaxes for specifying the primary ID/key:
The legacy PRIMARY_ID syntax, which remains valid but now has additional options and additional flexibility (PRIMARY_ID with the WITH primary_id_as_attribute option).
The PRIMARY KEY syntax, which is modeled after SQL.
The primary_id is a required field whose purpose is to uniquely identify each vertex instance. GSQL creates a hash index on the primary id with O(1) time complexity. Its data type may be STRING, INT, or UINT. The syntax for the primary_id_name_type term is as follows:
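A sketch of the term (id_name is a placeholder):
PRIMARY_ID id_name (STRING | INT | UINT)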
NOTE: In default mode, the primary_id field is not one of the attribute fields. The purpose of this distinction is to minimize storage space for vertices. The functional consequence of this difference is that a query cannot read the primary_id or use it as part of an expression.
Example:
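A plausible example (the type and attribute names are assumptions):
CREATE VERTEX movie (PRIMARY_ID id UINT, name STRING DEFAULT "unknown", year UINT)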
Instead of the legacy PRIMARY_ID syntax, starting with v2.3, GSQL now offers another option for specifying the primary key. The keyword phrase PRIMARY KEY may be appended to any one of the attributes in the attribute list, though it is conventional for it to be the first attribute. Each vertex instance must have a unique value for the primary key attribute. GSQL creates a hash index on the PRIMARY KEY attribute with O(1) time complexity. It is recommended that the primary key data type be STRING, INT, or UINT.
Note the differences between PRIMARY_ID and PRIMARY KEY:
"PRIMARY_ID" precedes the (name, type) pair. "PRIMARY KEY" follows the (name, type) pair.
In default mode, a PRIMARY_ID is not an attribute, but the WITH primary_id_as_attribute="true" clause can be used to make it an attribute. In contrast, a PRIMARY KEY is always an attribute; the WITH option is not needed.
Example:
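A plausible example, mirroring the PRIMARY_ID example above but using the SQL-like syntax (names assumed):
CREATE VERTEX movie (id UINT PRIMARY KEY, name STRING DEFAULT "unknown", year UINT)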
PRIMARY KEY is not supported in GraphStudio. If you define a vertex type using the PRIMARY KEY syntax, you will not be able to operate on the graph with that vertex type or the global schema in GraphStudio.
GSQL supports composite keys: grouping multiple attributes to create a primary key for a specific vertex. To specify a composite key, use the keyword PRIMARY KEY followed by the attributes that form the composite key, enclosed in parentheses, in the CREATE VERTEX command.
Example:
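A plausible example of a composite key (names assumed):
CREATE VERTEX movie (id UINT, title STRING, year UINT, PRIMARY KEY (title, year))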
Composite keys are not supported in GraphStudio. If you define a vertex type with composite keys, you will not be able to operate on the graph with that vertex type or the global schema in GraphStudio.
The attribute list, enclosed in parentheses, is a list of one or more id definitions and attribute descriptions separated by commas:
The available attribute types, including user-defined types, are listed in the section Attribute Data Types.
Every attribute data type has a built-in default value (e.g., the default value for INT type is 0). The DEFAULT default_value option overrides the built-in value.
Any number of additional attributes may be listed after the primary_id attribute. Each attribute has a name, type, and optional default value (for primitive-type, DATETIME, or STRING COMPRESS attributes only).
Example: Create vertex types for the graph schema of Figure 1.
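A sketch of the four vertex types (the attribute names are assumptions; the type names come from Figure 1):
CREATE VERTEX User (PRIMARY_ID id UINT, name STRING)
CREATE VERTEX Occupation (PRIMARY_ID id UINT, occ_name STRING)
CREATE VERTEX Book (PRIMARY_ID id UINT, title STRING, pub_year UINT)
CREATE VERTEX Genre (PRIMARY_ID id UINT, genre_name STRING)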
Unlike the tables in a relational database, vertex types do not need to have a foreign key attribute for one vertex type to have a relationship to another vertex type. Such relationships are handled by edge types.
By default, when the loader stores a vertex and its attributes in the graph store, it also stores some statistics about the vertex's outdegree, that is, how many connections it has to other vertices. The optional WITH STATS clause lets the user control how much information is recorded. Recording the information in the graph store will speed up queries which need degree information, but it increases the memory usage. There are two options. If "outdegree_by_edgetype" is chosen, then each vertex records a list of degree count values, one value for each type of edge in the schema. If "none" is chosen, then no degree statistics are recorded with each vertex. If the WITH STATS clause is not used, the loader acts as if "outdegree_by_edgetype" were selected.
The graph below has two types of edges between persons: phone_call and text. For Bobby, the "outdegree_by_edgetype" option records how many phone calls Bobby made (1) and how many text messages Bobby sent (2). This information can be retrieved using the built-in vertex function outdegree(). To get the outdegree of a specific edge type, provide the edgetype name as a string parameter. To get the total outdegree, omit the parameter.
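A sketch illustrating the option and how the statistics are read back (the vertex type definition is an assumption):
CREATE VERTEX Person (PRIMARY_ID id STRING, name STRING) WITH STATS="outdegree_by_edgetype"
# In a query, for Bobby's vertex v:
#   v.outdegree("phone_call")  -> 1   (phone calls made)
#   v.outdegree("text")        -> 2   (text messages sent)
#   v.outdegree()              -> 3   (total outdegree)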
Available to superuser and globaldesigner roles only.
The CREATE EDGE statement defines a new global edge type. There are two forms of the CREATE EDGE statement, one for directed edges and one for undirected edges. Each edge type must specify that it connects FROM one vertex type TO another vertex type. Additional pairs of FROM and TO vertex types may be added, followed by additional attributes. Each attribute follows the same requirements as described in the Attribute List subsection of the "CREATE VERTEX" section.
As of v3.0, a single edge type can be defined between multiple pairs of vertex types, e.g.
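A sketch of a multi-pair edge type (the names and the attribute are assumptions):
CREATE UNDIRECTED EDGE visit (FROM Person, TO Person
                            | FROM Person, TO Company, visit_date DATETIME)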
Note that edges do not have a PRIMARY_ID field. Instead, each edge is uniquely identified by a FROM vertex, a TO vertex, and optionally other attributes. The edge type may also be a distinguishing characteristic. For example, as shown in Figure 2 above, there are two types of edges between User and Book; therefore, both types would have attribute lists which begin (FROM User, TO Book, ...).
An edge type can be defined which connects FROM and/or TO any of the currently defined types of vertices. Use the wildcard symbol * to indicate "any vertex type". For example, the any_edge type below can connect from any vertex to any other vertex:
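A minimal sketch of a wildcard edge type:
CREATE UNDIRECTED EDGE any_edge (FROM *, TO *)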
Note: If new vertex types are added after a wildcard edge type is defined, the new vertex types are NOT included in the wildcard. That is, "*" is an alias for the vertex types that existed at the point in time that the CREATE EDGE statement is executed.
If a CREATE DIRECTED EDGE statement includes the optional clause WITH REVERSE_EDGE="rev_name", then an additional directed edge type called rev_name is automatically created, with the FROM and TO vertices swapped. Moreover, whenever a new edge is created, a reverse edge is also created. The reverse edge will have the same attributes, and whenever the principal edge is updated, the corresponding reverse edge is also updated.
In a TigerGraph system, reverse edges provide the most efficient way to perform graph queries and searches that need to look "backwards". For example, referring to the schema of Figure 2, the query "What is the sequel of Book X, if it has one?" is a forward search, using sequel_of edges. However, the query "Is Book X a sequel? If so, what Book came before X?" requires examining reverse edges.
Example: Create undirected edges for the three edge types in Figure 1.
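A sketch of the three edge types (attribute names are assumptions; the rating attribute is described below):
CREATE UNDIRECTED EDGE user_occupation (FROM User, TO Occupation)
CREATE UNDIRECTED EDGE book_genre (FROM Book, TO Genre)
CREATE UNDIRECTED EDGE user_book_rating (FROM User, TO Book, rating UINT, date_time DATETIME)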
The user_occupation and book_genre edges have no attributes. A user_book_rating edge symbolizes that a user has assigned a rating to a book; therefore it includes an additional attribute, rating. In this case the rating attribute is defined to be an integer, but it could just as easily have been set to be a float attribute.
Example: Create the additional edges depicted in Figure 2.
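A sketch of the three additional edge types (attribute names are assumptions; on_date for friend_of is mentioned later in this document, and preceded_by is explained below):
CREATE UNDIRECTED EDGE friend_of (FROM User, TO User, on_date DATETIME)
CREATE DIRECTED EDGE sequel_of (FROM Book, TO Book) WITH REVERSE_EDGE="preceded_by"
CREATE UNDIRECTED EDGE user_book_read (FROM User, TO Book, on_date DATETIME)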
Every time the GSQL loader creates a sequel_of edge, it will also automatically create a preceded_by edge, pointing in the opposite direction.
TYPEDEF
User-defined tuple types defined in a query cannot be used outside of that query or across queries. To use a user-defined tuple or an accumulator that uses a user-defined tuple across queries (such as for the return type of a subquery), the tuple and the accumulator type must be defined at the catalog level as part of the schema. User-defined types at the catalog level can only be used for query return value types, and cannot be used as an attribute data type.
TYPEDEF statements can be used outside of a query to define tuple types, GroupBy accumulator types, and heap accumulator types. Once defined, all graphs in the database have access to these user-defined types, and subqueries can be defined to return the user-defined types.
The example below defines a tuple type myTuple and a heap accumulator type myHeap, so that the subquery subquery1 can return a value of myHeap type to its outer query query1.
The STRING COMPRESS data type achieves compression by mapping each unique attribute value to a small integer. The mapping table ("this string" = "this integer") is called the dictionary. If two such attributes have the same or similar sets of possible values, then it is desirable to have them share one dictionary because it uses less storage space.
When a STRING COMPRESS attribute is declared in a vertex or edge, the user can optionally provide a name for the dictionary. Any attributes which share the same dictionary name will share the same dictionary. For example, v1.attr1, v1.attr2, and e.attr1 below share the same dictionary named "e1".
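A sketch of the shared-dictionary declaration (the attribute names come from the description above):
CREATE VERTEX v1 (PRIMARY_ID id UINT, attr1 STRING COMPRESS e1, attr2 STRING COMPRESS e1)
CREATE UNDIRECTED EDGE e (FROM v1, TO v1, attr1 STRING COMPRESS e1)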
User-defined indexes (or secondary indexes, as they are commonly called in the database industry) are a valuable feature that enhances the performance of a database system. Indexes allow users to perform fast lookups on non-key columns or attributes without a full scan.
The TigerGraph database allows users to define indexes on vertex attributes. Users have the flexibility to create indexes on an empty graph initially or to add indexes later while the database is running. If an index is added on an existing vertex type, the index data is built in the background.
Indexes can be created on a single vertex attribute of the following data types only: STRING, UINT, INT, DATETIME, and STRING COMPRESS. Indexes will be used to optimize queries with all predicate types. However, if a predicate uses a built-in function, then the index will not be used to optimize the query. Also, built-in queries are not optimized using indexes.
Indexes are very important for data retrieval performance. However, adding indexes will affect write performance. For this reason, users should be judicious about adding indexes. Users should review their query patterns to decide where indexes can help.
Users can create and drop indexes using the ALTER VERTEX command, as shown below.
Example:
ALTER VERTEX User ADD INDEX user_country_index ON (country);
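A plausible sketch of the corresponding drop statement (assuming the index created above):
ALTER VERTEX User DROP INDEX user_country_index;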
Available to superuser and globaldesigner roles only.
Multiple Graph support
If the optional MultiGraph service is enabled, CREATE GRAPH can be invoked multiple times to define multiple graphs, and vertex types and edge types may be reused (shared) among multiple graphs. There is an option to assign an admin user for the new graph.
After all the required vertex and edge types are created, the CREATE GRAPH command defines a graph schema, which contains the given vertex types and edge types, and prepares the graph store to accept data. The vertex types and edge types may be listed in any order.
The optional WITH ADMIN clause sets the named user to be the admin for the new graph.
As a convenience, executing CREATE GRAPH will set the new graph to be the working graph.
Instead of providing a list of specific vertex types and edge types, it is also possible to define a graph that includes all the available vertex types and edge types. It is also legal to create a graph with an empty domain; a SCHEMA_CHANGE JOB can be used later to add vertex and edge types.
Examples:
Create graph Book_rating for the edge and vertex types defined for Figure 1:
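A sketch of the statement (the type names come from the Figure 1 examples above):
CREATE GRAPH Book_rating (User, Occupation, Book, Genre, user_occupation, user_book_rating, book_genre)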
The following code example shows the full set of statements to define the expanded user-book-rating graph:
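A sketch of the full set, collecting the statements from the earlier examples (attribute names are assumptions; the trailing CREATE GRAPH uses * to include all types defined above):
CREATE VERTEX User (PRIMARY_ID id UINT, name STRING)
CREATE VERTEX Occupation (PRIMARY_ID id UINT, occ_name STRING)
CREATE VERTEX Book (PRIMARY_ID id UINT, title STRING, pub_year UINT)
CREATE VERTEX Genre (PRIMARY_ID id UINT, genre_name STRING)
CREATE UNDIRECTED EDGE user_occupation (FROM User, TO Occupation)
CREATE UNDIRECTED EDGE book_genre (FROM Book, TO Genre)
CREATE UNDIRECTED EDGE user_book_rating (FROM User, TO Book, rating UINT, date_time DATETIME)
CREATE UNDIRECTED EDGE friend_of (FROM User, TO User, on_date DATETIME)
CREATE DIRECTED EDGE sequel_of (FROM Book, TO Book) WITH REVERSE_EDGE="preceded_by"
CREATE UNDIRECTED EDGE user_book_read (FROM User, TO Book, on_date DATETIME)
CREATE GRAPH Book_rating (*)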
CREATE GRAPH ... AS
(Beta) CREATE GRAPH ... AS creates a tag-based graph of an existing graph. Tag-based graphs include vertices with specific tags from their base graphs, and have their own access control. Users can be granted roles on a tag-based graph, and those roles grant privileges that apply only to the resources in the tag-based graph.
This command can only be run on the base graph and requires the user to have the schema-editing privilege on the base graph.
The syntax for creating a tag-based graph is the same as for creating a regular graph, except that a base graph must be specified with the AS clause after the CREATE GRAPH command, and the definition of the graph must include at least one tagged vertex type. Edges are not tagged in a tag-based graph, but edges with either a source or a target outside of the tag-based graph are not visible to users of the tag-based graph.
Use the ampersand operator (&) to express vertices with multiple tags:
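A sketch (the graph, vertex, edge, and tag names are all assumptions):
CREATE GRAPH vip_graph AS social_graph (person:vip&public, friend_of)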
Use a colon to specify tags directly after the graph name to include everything in the base graph that has the specified tags:
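A sketch of the colon form (names assumed as above):
CREATE GRAPH vip_graph AS social_graph:vip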
New requirement for MultiGraph support. Applies even if only one graph exists.
Before a user can make use of a graph, first the user must be granted a role on that graph by an admin user of that graph or by a superuser. (Superusers are automatically granted the admin role on every graph). Second, for each GSQL session, the user must set a working graph. The USE GRAPH command sets or changes the user's working graph, for the current session.
For more about roles and privileges, see the document Managing User Privileges and Authentication.
Instead of the USE GRAPH command, gsql can be invoked with the -g <graph_name> option.
Available to superuser and globaldesigner roles only. The effect of this command takes into account shared domains.
The DROP GRAPH command deletes the logical definition of the named graph. Furthermore, it will also delete all local vertex or edge types. Local vertex and edge types are created by an ADD VERTEX/EDGE statement within a SCHEMA_CHANGE JOB and so belong only to that graph. Any shared types are unaffected. To delete only selected vertex types or edge types, see DROP VERTEX | EDGE in the section "Modifying a Graph Schema".
Available only to superusers.
The DROP ALL statement clears the graph store (i.e. deletes all data) and removes all definitions from the catalog: vertex types, edge types, graph types, jobs, and queries.
DROP ALL, along with all DROP operations, is non-reversible.
The SHOW command can be used to show certain aspects of the graph, instead of manually filtering through the entire graph schema when using the ls command. You can either type the exact identifier or use regular expressions / Linux globbing to search.
This feature supports the ? and * wildcards from Linux globbing, as well as regular expression matching. Usage of the feature is limited to the scope of the graph the user is currently in: if you are using a global graph, you will not be able to see vertices that are not included in your current graph.
Regular expression searching will not work with escape characters.
To use regular expressions, you will need to use the -r flag after the part of the schema you wish to show. If you wish to dive deeper into regular expressions, see "Java Patterns". The following are a few examples of what is supported by the SHOW command.
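A few plausible invocations (the vertex type name is an assumption):
SHOW VERTEX Person          # exact identifier
SHOW VERTEX Pers*           # Linux globbing
SHOW VERTEX -r "Pers.n"     # regular expression, using the -r flag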
Identifiers: Identifiers are user-defined names. An identifier consists of letters, digits, and the underscore. Identifiers may not begin with a digit. Identifiers are case-sensitive.
Keywords and Reserved Words: Keywords are words with a predefined semantic meaning in the language. Keywords are not case-sensitive. Reserved words are set aside for use by the language, either now or in the future, and may not be reused as user-defined identifiers. In most cases, a keyword is also a reserved word. For example, VERTEX is a keyword; it is also a reserved word, so VERTEX may not be used as an identifier.
Statements: Each line corresponds to one statement (except in multi-line mode). Usually, there is no punctuation at the end of a top-level statement. Some statements, such as CREATE LOADING JOB, are block statements which enclose a set of statements within themselves. Some punctuation may be needed to separate the statements within a block.
Comments: Within a command file, comments are text that is ignored by the language interpreter. Single-line comments begin with either # or //. A comment may be on the same line as interpreted code: text to the left of the comment marker is interpreted, and text to the right of the marker is ignored. Multi-line comment blocks begin with /* and end with */.
To enter the GSQL shell and work in interactive mode, type gsql from an operating system shell prompt. A username, password, and graph name may also be provided on the command line.
If a user name is provided but not a password, the GSQL system will then ask for the user's password:
If a user name is not given, then GSQL will assume that you are attempting to log in as the default tigergraph user:
To exit the GSQL shell, type either exit or quit at the GSQL prompt:
GSQL> EXIT
GSQL> QUIT
For added security, you can configure your GSQL client session to automatically time out after a period of inactivity. Set the bash environment variable GSQL_CLIENT_IDLE_TIMEOUT_SEC=<num_sec>. Then, every time you start a GSQL session, the idle timeout will be applied. To disable the timeout, unset the variable. The default setting is no timeout. The example below uses the Linux export command to set the environment variable:
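For instance (the 300-second value is an arbitrary assumption):
export GSQL_CLIENT_IDLE_TIMEOUT_SEC=300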
Multiple shell sessions of GSQL may be run at the same time. This feature can be used to have multiple clients (human or machine) using the system to perform concurrent operations. A basic locking scheme is used to maintain isolation and consistency.
In interactive mode, the default behavior is to treat each line as one statement; the GSQL interpreter will activate as soon as the End-Of-Line character is entered.
Multi-line mode allows the user to enter several lines of text without triggering immediate execution. This is useful when a statement is very long and the user would like to split it into multiple lines. It is also useful when defining a JOB, because jobs typically contain multiple statements.
To enter multi-line mode, use the command BEGIN. The end-of-line character is now disabled from triggering execution. The shell remains in multi-line mode until the command END is entered. The END command also triggers the execution of the multi-line block. In the example below, BEGIN and END are used to allow the SELECT statement to be split into several lines:
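A minimal sketch of a multi-line block (the SELECT statement is a hypothetical built-in query):
BEGIN
SELECT *
FROM Person
LIMIT 3
END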
Alternately, the ABORT command exits multi-line mode and discards the multi-line block.
A command file is a text file containing a series of GSQL statements. Blank lines and comments are ignored. By convention, GSQL command files end with the suffix .gsql, but this is not a requirement. Command files are automatically treated as multi-line mode, so BEGIN and END statements are not needed. Command files may be run either from within the GSQL shell by prefixing the filename with an @ symbol:
GSQL> @file.gsql
or from the operating system (i.e., a Linux shell) by giving the filename as the argument after gsql:
os$ gsql file.gsql
Similarly, a single GSQL command can be run by enclosing the command string in quotation marks and placing it at the end of the gsql invocation. Either single or double quotation marks may be used. It is recommended to use single quotation marks to enclose the entire command and double quotation marks to enclose any strings within the command.
In the example below, the file name_query.gsql contains the multi-line CREATE QUERY block to define the query namesSimilar.
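A hypothetical sequence (the graph name and the query installation step are assumptions):
os$ gsql name_query.gsql
os$ gsql -g Book_rating 'INSTALL QUERY namesSimilar'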
The help command displays a summary of the available GSQL commands:
GSQL> HELP [BASIC|QUERY]
Note that the HELP command has options for showing more details about certain categories of commands.
The ls command displays the catalog: all the vertex types, edge types, graphs, queries, jobs, and session parameters which have been defined by the user.
The --reset option will clear the entire graph data store and erase all related definitions (graph schema, loading jobs, and queries) from the Dictionary. The data deletion cannot be undone; use with extreme caution. The REST++, GPE, and GSE modules will be turned off.
The table below summarizes the basic system commands introduced so far.
Notes on the LS command
Starting with v1.2, the output of the LS command is sensitive to the user and the active graph:
If the user has not set an active graph or specified "USE GLOBAL":
If the user is a superuser, then LS displays global vertices, global edges, and all graph schemas.
If the user is not a superuser, then LS displays nothing (null).
If the user has set an active graph, then LS displays the schema, jobs, queries, and other definitions for that particular graph.
Session parameters are built-in system variables whose values are valid during the current session; their values do not endure after the session ends. In interactive command mode, a session starts and ends when entering and exiting interactive mode, respectively. When running a command file, the session lasts during the execution of the command file.
Use the SET command to set the value of a session parameter:
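A plausible example (the parameter value is an assumption):
SET sys.data_root = "/home/tigergraph/data"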
Each attribute of a vertex or edge has an assigned data type. The following types are currently supported.
For FLOAT and DOUBLE values, the GSQL Loader supports exponential notation as shown (e.g., 1.25 E-7).
The GSQL Query Language currently only reads values without exponents. It may display output values with exponential notation, however.
Some numeric expressions may return a non-numeric string result, such as "inf" for Infinity or "NaN" for Not a Number.
Additionally, GSQL also supports the following complex data types:
LIST: A list is an ordered collection of elements of the same type.
Default value: an empty list [].
Supported element types: INT, UINT, DOUBLE, FLOAT, STRING, STRING COMPRESS, DATETIME, and UDT.
To declare a list type, use angle brackets <> to enclose the element type, e.g., LIST<STRING>.
Due to multithreaded GSQL loading, the initial order of elements loaded into a LIST might differ from the order in which they appeared in the input data.
SET: A set is an unordered collection of unique elements of the same type.
Default value: an empty set ().
Supported element types: INT, UINT, DOUBLE, FLOAT, STRING, STRING COMPRESS, DATETIME, and UDT.
To declare a set type, use angle brackets <> to enclose the element type, e.g., SET<INT>.
MAP: A map is a collection of key-value pairs. It cannot contain duplicate keys, and each key maps to one value.
Default value: an empty map.
Supported key types: INT, STRING, STRING COMPRESS, and DATETIME.
Supported value types: INT, DOUBLE, STRING, STRING COMPRESS, DATETIME, and UDT.
To declare a map type, use <> to enclose the types, with a comma to separate the key and value types, e.g., MAP<INT, DOUBLE>.
A User-Defined Tuple (UDT) represents an ordered structure of several fields of the same or different types. The supported field types are listed below. Each field in a UDT has a fixed size. A STRING field must be given a size in characters, and the loader will only load the first given number of characters. An INT or UINT field can optionally be given a size in bytes.
A UDT must be defined before being used as a field in a vertex type or edge type. To define a UDT, use the TYPEDEF TUPLE statement. Below is an example of a TYPEDEF TUPLE statement:
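A sketch reconstructed from the description that follows:
TYPEDEF TUPLE <field1 INT(1), field2 UINT(4), field3 STRING(10), field4 DOUBLE> myTuple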
In the above example, myTuple is the name of the UDT. It contains four fields: a 1-byte INT field named field1, a 4-byte UINT field named field2, a 10-character STRING field named field3, and an (8-byte) DOUBLE field named field4.
After a graph schema has been created, the system is ready to load data into the graph store. The GSQL language offers easy-to-understand and easy-to-use commands for data loading which perform many of the same data conversion, mapping, filtering, and merging operations that are found in enterprise ETL (Extract, Transform, and Load) systems.
The GSQL system can read structured or semistructured data from text files. The loading language syntax is geared towards tabular or JSON data, but conditional clauses and data manipulation functions allow for reading data that is structured in a more complex or irregular way. For tabular data, each line in the data file contains a series of data values, separated by commas, tabs, spaces, or any other designated ASCII characters (only single character separators are supported). A line should contain only data values and separators, without extra whitespace. From a tabular view, each line of data is a row, and each row consists of a series of column values.
Loading data is a two-step process. First, a loading job is defined. Next, the job is executed with a RUN LOADING JOB statement. These two statements, and the components of the loading job, are detailed below.
The structure of a loading job will be presented hierarchically, top-down:
CREATE LOADING JOB, which may contain a set of DEFINE and LOAD statements
DEFINE statements
LOAD statements, which can have several clauses
All blank spaces are meaningful in string fields in CSV and JSON. Either pre-process your data files to remove extra spaces, or use GSQL's token processing functions gsql_trim, gsql_ltrim, and gsql_rtrim (see Built-in Loader Token Functions).
TigerGraph's syntax for defining and running loading jobs offers several advantages:
The TigerGraph platform can handle concurrent loading jobs, which can greatly increase throughput.
The data file locations can be specified at compile time or at run time. Run-time settings override compile-time settings.
A loading job definition can include several input files. When running the job, the user can choose to run only part of the job by specifying only some of the input files.
Loading jobs can be monitored, aborted, and restarted.
Among its several duties, the RESTPP component manages loading jobs. There can be multiple RESTPP-LOADER subcomponents, each of which can handle a loading job independently. The maximum number of concurrent loading jobs is set by the configuration parameter FileLoader.ReplicaNumber.
Furthermore, if the TigerGraph graph is distributed (partitioned) across multiple machine nodes, each machine's RESTPP-LOADER(s) can be put into action. Each RESTPP-LOADER only reads local input data files, but the resulting graph data can be stored on any machine in the cluster.
To maximize loading performance in a cluster, use at least two loaders per machine, and assign each loader approximately the same amount of data.
A concurrent-capable loading job can logically be separated into parts according to each file variable. When a concurrent-capable loading job is compiled, a RESTPP endpoint is generated for the loading job, which you can call to load data into your graph as an alternative to RUN LOADING JOB.
Example loading jobs and data files for the book_rating schema defined earlier in the document are available in the $(gadmin config get System.AppRoot)/document/examples folder in your TigerGraph platform installation.
The CREATE LOADING JOB statement is used to define a block of DEFINE, LOAD, and DELETE statements for loading data to or removing data from a particular graph. The sequence of statements is enclosed in curly braces. Each statement in the block, including the last one, should end with a semicolon.
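A minimal sketch of a loading job (the job name, file path, and column mapping are assumptions):
CREATE LOADING JOB load_books FOR GRAPH Book_rating {
  DEFINE FILENAME file1 = "/home/tigergraph/data/book.csv";
  LOAD file1 TO VERTEX Book VALUES ($0, $1, $2) USING SEPARATOR=",", HEADER="false";
}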
LOAD or DELETE Statements
A loading job may contain either LOAD or DELETE statements, but not both. A loading job that includes both will be rejected when the CREATE statement is executed.
To drop (remove) a job, run DROP JOB job_name. The job will be removed from GSQL. To drop all jobs, run either of the following commands:
The scope of ALL depends on the user's current scope. If the user has set a working graph, then DROP ALL removes all the jobs for that graph. If a superuser has set their scope to be global, then DROP ALL removes all jobs across all graph spaces.
DEFINE statements
A DEFINE statement is used to define a local variable or expression to be used by the subsequent LOAD statements in the loading job.
DEFINE FILENAME
The DEFINE FILENAME statement defines a filename variable. The variable can then be used later in the JOB block by a LOAD statement to identify its data source. Every concurrent loading job must have at least one DEFINE FILENAME statement.
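A sketch of the syntax (square brackets denote the optional part):
DEFINE FILENAME filevar ["=" filepath_string];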
The filevar is optionally followed by a filepath_string, which tells the job where to find input data. As the name suggests, filepath_string is a string value, so it should start and end with double quotes. There are four options for filepath_string:
path: either an absolute path or relative path for either a file or a folder on the machine where the job is run. If it is a folder, then the loader will attempt to load each non-hidden file in the folder. If this path is not valid when CREATE LOADING JOB is executed, GSQL will report an error.
An absolute path may begin with the session parameter $sys.data_root. Then, when running the loading job, first set a value for the parameter, and then run the job:
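A sketch of this pattern (the paths and job name are assumptions):
# In the loading job definition:
DEFINE FILENAME file1 = "$sys.data_root/users.csv";
# At run time:
SET sys.data_root = "/home/tigergraph/data"
RUN LOADING JOB load_books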
As the name implies, session parameters only retain their value for the duration of the current GSQL session. If the user exits GSQL, the settings are lost.
"all:" path
: If the path is prefixed with all:
, then the loading job will attempt to run on every machine in the cluster which has a RESTPP component, and each machine will look locally for data at path. If the path is invalid on any of the machines, the job will be aborted. Also, the session parameter $sys.data_root may not be used.
"any:" path
: If the path is prefixed with any:
, then the loading job will attempt to run on every machine in the cluster which has a RESTPP component, and each machine will look locally for data at the specified path. **If the path is invalid on any of the machines, those machines where the path is not valid are skipped. **Also, the session parameter $sys.data_root may not be used.
A list of machine-specific paths: A machine_alias is a name such as m1, m2, etc., which is defined when the cluster configuration is set. For this option, the filepath_string may include a list of paths, separated by commas. If several machines have the same path, the paths can be grouped together by using a list of machine aliases, with the vertical bar "|" as a separator. The loading job will run on whichever machines are named; each RESTPP-LOADER will work on its local files.
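A sketch of such a definition (the aliases and paths are assumptions):
DEFINE FILENAME file2 = "m1:/data1/file1.csv, m2|m3:/data2/file2.csv";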
The DEFINE HEADER statement defines a sequence of column names for an input data file. The first column name maps to the first column, the second column name maps to the second column, etc.
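A sketch (the header name and column names are assumptions):
DEFINE HEADER header1 = "name", "age", "gender";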
The DEFINE INPUT_LINE_FILTER statement defines a named Boolean expression whose value depends on column attributes from a row of input data. When combined with a USING reject_line_rule clause in a LOAD statement, the filter determines whether an input line is ignored or not.
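A hedged sketch of the pattern (placeholders in lowercase; the pairing with reject_line_rule is described here):
DEFINE INPUT_LINE_FILTER filter_name = boolean_expression_using_column_values;
LOAD file1 TO VERTEX Book VALUES ($0, $1, $2) USING reject_line_rule=filter_name;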
LOAD statement
A LOAD statement describes how to parse a data line into column values (tokens), and then describes how the values should be used to create a new vertex or edge instance. One LOAD statement can be used to generate multiple vertices or edges, each vertex or edge having its own destination clause, as shown below. Additionally, two or more LOAD statements may refer to the same input data file. In this case, the GSQL loader will merge their operations so that all of them are executed in a single pass through the data file.
The LOAD statement has many options. This reference guide provides examples of key features and options. The Knowledge Base and FAQs, and the tutorials such as GSQL 101, provide additional solution- and application-oriented examples.
Different LOAD statement types have different rules for the USING clause; see the USING clause section below for specifics.
filevar must have been previously defined in a DEFINE FILENAME statement.
filepath_string must satisfy the same rules given above in the DEFINE FILENAME section.
A destination clause describes how the tokens from a data source should be used to construct one of three types of data objects: a vertex, an edge, or a row in a temporary table (TEMP_TABLE). The destination clause formats for the three types are very similar, but we show them separately for clarity:
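A sketch of the three formats (square brackets denote optional parts):
TO VERTEX vertex_type_name VALUES (id_expr [, attr_expr]*)
TO EDGE edge_type_name VALUES (source_id_expr [source_type_expr], target_id_expr [target_type_expr] [, attr_expr]*)
TO TEMP_TABLE table_name (id_name [, attr_name]*) VALUES (id_expr [, attr_expr]*)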
For the TO VERTEX and TO EDGE destination clauses, the following rules for their parameters apply:
The vertex_type_name or edge_type_name must match the name of a vertex or edge type previously defined in a CREATE VERTEX or CREATE UNDIRECTED|DIRECTED EDGE statement.
The values in the value list (id_expr, attr_expr1, attr_expr2, ...) are assigned to the ID(s) and attributes of a new vertex or edge instance, in the same order in which they are listed in the CREATE statement.
id_expr obeys the same attribute rules as attr_expr, except that only attr_expr can use the reducer function, which is introduced later.
For edge clauses, the source_id_expr and target_id_expr can each optionally be followed by a source_type_expr and target_type_expr, respectively. The source_type_expr and target_type_expr must evaluate to one of the allowed endpoint vertex types for the given edge type. By specifying the vertex type, this tells the loader what ID types to expect. This may be important when the edge type is defined to accept more than one type of source/target vertex.
The TO TEMP_TABLE clause defines a new, temporary data structure. Its unique characteristics are described in a separate subsection. For now, we focus on TO VERTEX and TO EDGE.
For fast loading of edge data, referential integrity checking is disabled by default.
For an edge to be valid, it must refer to endpoint vertices that exist. To support fast, out-of-order loading, if one or both of the endpoint vertices do not yet exist, the loader will create vertices with the necessary IDs and default attribute values. Due to the loader's UPSERT semantics, if the vertex data is loaded later, it will be automatically merged with the dummy vertices. The user can disable this feature and perform regular referential integrity checking by setting the VERTEX_MUST_EXIST=true option.
Suppose we have the following vertex and edge types:
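A sketch consistent with the description that follows (attribute names are assumptions):
CREATE VERTEX Person (PRIMARY_ID name STRING, age INT)
CREATE VERTEX Company (PRIMARY_ID id INT, industry STRING)
CREATE UNDIRECTED EDGE Visit (FROM Person, TO Person | FROM Person, TO Company, year INT)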
A Visit edge can connect two Person vertices or a Person to a Company. A Person has a string ID, while a Company has an INT ID. Then suppose the Visit edge source data comes from a single CSV file, containing both variants of edges. Note that the 2nd column ($1) contains either Person or Company, and the 3rd column ($2) contains either a string or an integer.
Using the optional target_type_expr field, we can load both variants of the Visit edge with a single clause:
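A sketch of such a clause, with hypothetical input lines shown as comments (the source type Person is fixed; the target type is read from column $1):
# Sample input lines (assumed):
#   Sam,Person,Chris,2012
#   Sam,Company,1000,2017
LOAD file1 TO EDGE Visit VALUES ($0 Person, $2 $1, $3) USING SEPARATOR=",";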
Known issue: you must include a USING clause when loading data into edge types with different FROM-TO vertex pairs, even if all options are default.
A LOAD statement processes each line of an input file, splitting each line (according to the SEPARATOR character; see the section "Other Optional LOAD Clauses" for more details) into a sequence of tokens. Each destination clause provides a token-to-attribute mapping which defines how to construct a new vertex, edge, or temp table row instance (i.e., one data object). The tokens can also be thought of as the column values in a table. There are two ways to refer to a column: by position or by name. Assuming a column has a name, either method may be used, and both methods may be used within one expression.
By Position: The columns (tokens) are numbered from left to right, starting with $0. The next column is $1, and so on.
By Name: Columns can be named, either through a header line in the input file or through a DEFINE HEADER statement. If a header line is used, then the first line of the input file should be structured like a data line, using the same separator characters, except that each column contains a column name string instead of a data value. Names are enclosed in double quotes, e.g., $"age".
Data file name: $sys.file_name refers to the current input data file.
In a simple case, a token value is copied directly to an attribute. For example, in the following LOAD statement:
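A sketch reconstructed from the description that follows (the vertex type person and its attribute order are assumptions):
LOAD "xx/yy/a.csv" TO VERTEX person VALUES ($0, $1, $sys.file_name);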
The PRIMARY_ID of a person vertex comes from column $0 of the file "xx/yy/a.csv".
The next attribute of a person vertex comes from column $1.
The next attribute of a person vertex is given the value "xx/yy/a.csv" (the filename itself).
Users do not need to explicitly define a primary ID. Given the attributes, one will be selected as the primary key.
A basic principle in the GSQL Loader is cumulative loading. Cumulative loading means that a particular data object might be written to (i.e., loaded) multiple times, and the result of the multiple loads may depend on the full sequence of writes. This usually means that if a data line provides a valid data object, and the WHERE clause and OPTION clause are satisfied, then the data object is loaded.
Valid input: For each input data line, each destination clause constructs one or more new data objects. To be a valid data object, it must have an ID value of the correct type, have correctly typed attribute values, and satisfy the optional WHERE clause. If the data object is not valid, the object is rejected (skipped) and counted as an error in the log file. The rules for invalid attribute values are summarized below:
UINT: any non-digit character. (Out-of-range values cause overflow instead of rejection.)
INT: any non-digit or non-sign character. (Out-of-range values cause overflow instead of rejection.)
FLOAT and DOUBLE: any wrong format.
STRING, STRING COMPRESS, FIXED_BINARY: N/A.
DATETIME: wrong format, invalid date-time, or out of range.
BOOL: any value not listed later.
Complex type: depends on the field type or element type. Any invalid field (in UDT), element (in LIST or SET), or key or value (in MAP) causes rejection.
**New data objects:** If a valid data object has a new ID value, then the data object is added to the graph store. Any attributes which are missing are assigned the default value for that data type or for that attribute.
Overwriting existing data objects: If a valid data object has an ID value for an existing object, then the new object overwrites the existing data object, with the following clarifications and exceptions:
The attribute values of the new object overwrite the attribute values of the existing data object.
Missing tokens: If a token is missing from the input line so that the generated attribute is missing, then that attribute retains its previous value.
A STRING token is never considered missing; if there are no characters, then the string is the empty string.
Skipping an attribute: A LOAD statement can specify that a particular attribute should not be loaded by using the special character _ (underscore) as its attribute expression (attr_expr), as in the example below, which skips the next-to-last attribute. This technique is used when it is known that the input data file does not contain data for every attribute.
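A one-line sketch (the person vertex type and its five-column input are assumptions):

```
LOAD f TO VERTEX person VALUES ($0, $1, $2, _, $4);
```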
If the load operation is creating a new vertex or edge, then the skipped attribute will be assigned the default value.
If the load operation is overwriting an existing vertex or edge, then the skipped attribute will retain its existing value.
An attribute expression may use column tokens (e.g., $0), literals (constant numeric or string values), any of the built-in loader token functions, or a user-defined token function. Attribute expressions may **not** contain mathematical or boolean operators (such as +, *, AND). The rules for attribute expressions are the same as those for id expressions, but an attribute expression can additionally use a reducer function:
id_expr := $column_number | $"column_name" | constant | $sys.file_name | token_function_name( id_expr [, id_expr ]* )
attr_expr := id_expr | REDUCE( reducer_function_name( id_expr ) )
Note that token functions can be nested, that is, a token function can be used as an input parameter for another token function. The built-in loader token/reducer functions and user-defined token functions are described in the section "Built-In Loader Token Functions".
The subsections below describe details about loading particular data types.
A floating point value has one of two basic formats, sketched below.
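A reconstruction of the two formats, consistent with the regular expression given in the attribute data type table later in this document (square brackets mark optional parts):

```
[+/-] [digits] . digits [ e/E [+/-] digits ]    # e.g., -0.5, .0065e14
[+/-] digits [ . digits ] [ e/E [+/-] digits ]  # e.g., 3, 3.14159, 7E23
```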
In the first case, the decimal point and following digits are required. In the second case, some digits are required (looking like an integer), and the following decimal point and digits are optional.
In both cases, the leading sign ( "+" or "-") is optional. The exponent, using "e" or "E", is optional. Commas and extra spaces are not allowed.
DATETIME Attribute
When loading data into a DATETIME attribute, the GSQL loader will automatically read a string representation of DateTime information and convert it to the internal DateTime representation. The loader accepts any of the following string formats:
- %Y-%m-%d %H:%M:%S (e.g., 2011-02-03 01:02:03)
- %Y/%m/%d %H:%M:%S (e.g., 2011/02/03 01:02:03)
- %Y-%m-%dT%H:%M:%S.000z (e.g., 2011-02-03T01:02:03.123z; the 123 will be ignored)
- %Y-%m-%d (date only, no time, e.g., 2011-02-03)
- %Y/%m/%d (date only, no time, e.g., 2011/02/03)
- Any integer value (Unix Epoch time, where Jan 1, 1970 at 00:00:00 is integer 0)
Format notation:

- %Y is a 4-digit year. A 2-digit year is not a valid value.
- %m and %d represent month (1 to 12) and day (1 to 31), respectively. Leading zeroes are optional.
- %H, %M, and %S represent hours (0 to 23), minutes (0 to 59), and seconds (0 to 59), respectively. Leading zeroes are optional.
When loading data, the loader checks whether the values of year, month, day, hour, minute, second are out of the valid range. If any invalid value is present, e.g. '2010-13-05' or '2004-04-31 00:00:00', the attribute is invalid and the object (vertex or edge) is not created.
When loading data from CSV files, the following values are accepted for BOOL attributes:
- True: TRUE, True, true, 1
- False: FALSE, False, false, 0
When loading data from JSON documents, the valid BOOL values are true and false.
To load a UDT attribute, state the name of the UDT type, followed by the list of attribute expressions for the UDT's fields, in parentheses. See the example below.
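The original example is not shown here; the following sketch assumes a hypothetical UDT and vertex type:

```
# assuming: TYPEDEF TUPLE <age UINT, name STRING(20)> Person_Info
# and:      CREATE VERTEX Test_Vertex (PRIMARY_ID id STRING, info Person_Info)
LOAD f TO VERTEX Test_Vertex VALUES ($0, Person_Info($1, $2));
```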
There are three methods to load a LIST or a SET.
The first method is to load multiple rows of data that share the same ID values and append the individual attribute values to form a collection of values. The collections are formed incrementally by reading one value from each eligible data line and appending the new value to the collection. When the loading job processes a line, it checks whether a vertex or edge with that id value already exists. If the id value is new, then a new vertex or edge is created with a new list/set containing the single value. If the id has been used before, then the value from the new line is appended to the existing list/set. An example is shown below:
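A sketch of this example (the data file layout is an assumption consistent with the result described below):

```
# data file list_set.csv:
#   1,10
#   3,30
#   1,20
#   3,30
#   3,40
#   1,20

CREATE VERTEX test_vertex (PRIMARY_ID id UINT, iset SET<INT>, ilist LIST<INT>)
CREATE GRAPH test_graph (*)

CREATE LOADING JOB load_set_list FOR GRAPH test_graph {
  DEFINE FILENAME f = "list_set.csv";
  LOAD f TO VERTEX test_vertex VALUES ($0, $1, $1) USING SEPARATOR=",";
}
```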
The job load_set_list will load two test_vertex vertices because there are two unique id values in the data file. Vertex 1 has attribute values iset = [10,20] and ilist = [10,20,20]. Vertex 3 has values iset = [30,40] and ilist = [30,30,40]. Note that a set doesn't contain duplicate values, while a list can contain duplicate values.
Because GSQL loading is multi-threaded, the order of values loaded into a LIST might not match the input order.
If the input file contains multiple columns which should all be added to the LIST or SET, then a second method is available: use the LIST() or SET() function, as in the example below.
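A one-line sketch (the vertex type v and the three-value layout are assumptions):

```
# id in $0; three list elements in $1, $2, $3
LOAD f TO VERTEX v VALUES ($0, LIST($1,$2,$3));
```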
The third method is to use the SPLIT() function to read a compound token and split it into a collection of elements, to form a LIST or SET collection. The SPLIT() function takes two arguments: the column index and the element separator. The element separator should be distinct from the separator used throughout the file. An example is shown below:
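A sketch (the vertex type and separators are assumptions; note that the file separator "|" differs from the element separator ";"):

```
# data line example:  1|10;20;30
LOAD f TO VERTEX v VALUES ($0, SPLIT($1, ";")) USING SEPARATOR="|";
```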
The SPLIT() function cannot be used for UDT type elements.
There are three methods to load a MAP.
The first method is to load multiple rows of data that share the same id values. The maps are formed incrementally by reading one key-value pair from each eligible data line. When the loading job processes a line, it checks to see whether a vertex or edge with that id value(s) already exists or not. If the id value(s) is new, then a new vertex or edge is created with a new map containing the single key-value pair. If the id(s) has been used before, then the loading job checks whether the key exists in the map or not. If the key doesn't exist in the map, the new key-value pair is inserted. Otherwise, the value will be replaced by the new value.
The loading order might not be the same as the order in the raw data. If a data file contains multiple lines with the same id and same key but different values, loading them together results in a nondeterministic final value for that key.
Method 1: Below is the syntax to load a MAP by the first method: use an arrow (->) to separate the map's key and value.
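A one-line sketch (vertex type and column layout assumed; $1 is the key and $2 is the value):

```
LOAD f TO VERTEX v VALUES ($0, ($1 -> $2)) USING SEPARATOR=",";
```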
Method 2: The second method is to use the MAP() function. If there are multiple key-value pairs among multiple columns, MAP() can load them together. Below is an example:
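A sketch (vertex type and column layout assumed; two key-value pairs per line):

```
LOAD f TO VERTEX v VALUES ($0, MAP(($1 -> $2), ($3 -> $4))) USING SEPARATOR=",";
```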
Method 3: The third method is to use the SPLIT() function. Similar to SPLIT() when loading a LIST or SET, the SPLIT() function can be used when the key-value pair is in one column and separated by a key-value separator, or when multiple key-value pairs are in one column and separated by element separators and key-value separators. SPLIT() here has three parameters: the first is the column index, the second is the key-value separator, and the third is the element separator. The third parameter is optional; if one row of raw data has only one key-value pair, the third parameter can be skipped. Below are examples without and with the element separator.
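Sketches of both variants (vertex type, separators, and layouts are assumptions):

```
# one key-value pair per line, e.g.:  1|a:1
LOAD f TO VERTEX v VALUES ($0, SPLIT($1, ":")) USING SEPARATOR="|";

# several pairs in one column, e.g.:  1|a:1,b:2,c:3
LOAD f TO VERTEX v VALUES ($0, SPLIT($1, ":", ",")) USING SEPARATOR="|";
```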
The SPLIT() function cannot be used for UDT type elements.
Loading a Composite Key for a vertex works no differently than normal loading. Simply load all the attributes as you would for a vertex with a single-attribute primary key. The primary key will automatically be constructed from the appropriate attributes.
When loading to an edge where either the TO_VERTEX or the FROM_VERTEX has a composite key, the composite set of attributes must be enclosed in parentheses. See the example below.
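A one-line sketch, assuming a hypothetical edge whose FROM vertex has a composite key (name, ssn) in columns $0 and $1:

```
LOAD f TO EDGE Has_Account VALUES (($0, $1), $2) USING SEPARATOR=",";
```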
If an edge has been defined using a wildcard vertex type, a vertex type name must be specified, following the vertex id, in a LOAD statement for the edge. An example is shown below:
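A sketch (the vertex and edge type names are hypothetical):

```
# assuming: CREATE UNDIRECTED EDGE Any_Connection (FROM *, TO *)
LOAD f TO EDGE Any_Connection VALUES ($0 User, $1 User) USING SEPARATOR=",";
```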
Token functions are functions which operate on tokens. Some may be used to construct attribute expressions, and some may be used for conditional expressions in the WHERE clause.
To use a token function, replace the attribute in the destination clause of the LOAD statement with the function call. The arguments of the function can be a column from the FILE object.
The following token functions can be used in an id or attribute expression:
The timestamp parameter should be in one of the following formats:
"%Y-%m-%d %H:%M:%S"
"%Y/%m/%d %H:%M:%S"
"%Y-%m-%dT%H:%M:%S.000z" // text after the dot . is ignored
Users can write their own token functions in C++ and install them in the GSQL system. To learn how to add a user-defined token function, see Add a User-Defined Token Function.
A reducer function aggregates multiple values of a non-id attribute into one attribute value of a single vertex or edge. Reducer functions are computed incrementally; that is, each time a new input token is applied, a new resulting value is computed.
To reduce and load aggregate data to an attribute, the attribute expression has the form REDUCE(reducer_function(input_expr)), where reducer_function is one of the functions in the table below. input_expr can include non-reducer functions, but reducer functions cannot be nested.
Each reducer function is overloaded so that one function can be used for several different data types. For primitive data types, the output type is the same as the input_expr type. For LIST, SET, and MAP containers, the input_expr type is one of the allowed element types for these containers (see "Complex Types" in the Attribute Data Types section). The output is the entire container.
Each function supports a certain set of attribute types. Calling a reducer function with an incompatible type crashes the service. To prevent this, use the WHERE clause (introduced below) together with IS NUMERIC or other type-checking operators, functions, and predicates where necessary.
The WHERE clause is an optional clause. The WHERE clause's condition is a boolean expression. The expression may use column token variables, token functions, and the operators which are described below. The expression is evaluated for each input data line. If the condition is true, then the vertex or edge instance is loaded into the graph store. If the condition is false, then this instance is skipped. Note that all attribute values are treated as string values in the expression, so the type conversion functions to_int() and to_float(), which are described below, are provided to enable numerical conditions.
The GSQL Loader language supports most of the standard arithmetic, relational, and boolean operators found in C++. Standard operator precedence applies, and parentheses provide the usual override of precedence.
**Arithmetic Operators: +, -, *, /, ^** Numeric operators can be used to express complex operations between numeric types. Just as in ordinary mathematical expressions, parentheses can be used to define a group and to modify the order of precedence.
Because computers can only store approximations for most DOUBLE and FLOAT type values, it is not recommended to test these data types for exact equality or inequality. Instead, one should allow for an acceptable amount of error. The following example checks whether $0 equals 5, with an error of 0.00001 permitted:
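A sketch of such a condition (the destination clause is hypothetical):

```
LOAD f TO VERTEX v VALUES ($0, $1)
  WHERE to_float($0) BETWEEN 4.99999 AND 5.00001;
```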
**Relational Operators: <, >, ==, !=, <=, >=** Comparisons can be performed between two numeric values or between two string values.

**Predicate Operators:**

- **AND, OR, NOT** operators are the same as in SQL. They can be used to combine multiple conditions together. E.g., $0 < "abc" AND $1 > "abc" selects the rows with the first token less than "abc" and the second token greater than "abc". E.g., NOT $1 < "abc" selects the rows with the second token greater than or equal to "abc".
- **IS NUMERIC**: token IS NUMERIC returns true if token is in numeric format. Numeric format includes integers, decimal notation, and exponential notation. Specifically, IS NUMERIC is true if token matches the regular expression (+/-)?[0-9]+(.[0-9])?[0-9]*((e/E)(+/-)?[0-9]+)?. Any leading and trailing spaces are skipped, but no other spaces are allowed. E.g., $0 IS NUMERIC checks whether the first token is in numeric format.
- **IS EMPTY**: token IS EMPTY returns true if token is an empty string. E.g., $1 IS EMPTY checks whether the second token is empty.
- **IN**: token IN ( set_of_values ) returns true if token is equal to one member of a set of specified values. The values may be string or numeric types. E.g., $2 IN ("abc", "def", "lhm") tests whether the third token equals one of the three strings in the given set. E.g., to_int($3) IN (10, 1, 12, 13, 19) tests whether the fourth token equals one of the five specified numbers.
- **BETWEEN ... AND**: token BETWEEN lowerVal AND upperVal returns true if token is within the specified range, inclusive of the endpoints. The values may be string or numeric types. E.g., $4 BETWEEN "abc" AND "def" checks whether the fifth token is greater than or equal to "abc" and also less than or equal to "def". E.g., to_float($5) BETWEEN 1 AND 100.5 checks whether the sixth token is greater than or equal to 1.0 and less than or equal to 100.5.
The GSQL loading language provides several built-in functions for the WHERE clause.
The token functions in the WHERE clause and the token functions used in attribute expressions are different. They cannot be used interchangeably.
OPTION clause
There are no supported options for the OPTION clause at this time.
TAGS clause (Beta)
The TAGS clause specifies the tags to be applied to the vertices loaded by the LOAD statement.

If a LOAD statement has a TAGS clause, it will tag the vertices with the tags specified in the TAGS clause. Before vertices can be loaded and tagged with a LOAD statement, the vertex type must first be marked as taggable, and the tags must be defined.
Users have two options for how to merge tags if the target vertices already exist in the graph, as shown in the sketch after this list:

- BY OR: Add the new tags to the existing set of tags.
- BY OVERWRITE: Overwrite the existing tags with the new tags.
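A one-line sketch (the Person vertex and the tag names are hypothetical):

```
LOAD f TO VERTEX Person VALUES ($0, $1) TAGS ("vip", "test") BY OR USING SEPARATOR=",";
```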
USING clause
A USING clause contains one or more optional parameter-value pairs:
If multiple LOAD statements use the same source (the same file path, the same TEMP_TABLE, or the same file variable), the USING clauses in these LOAD statements must be the same. Therefore, we recommend that if multiple destination clauses share the same source, put all of these destination clauses into the same LOAD statement.
The parser will not treat separator characters found within a pair of quotation marks as separators. For example, if the parsing conditions are QUOTE="double", SEPARATOR=",", the comma in "Leonard,Euler" will not separate Leonard and Euler into separate tokens.
If QUOTE is not declared, quotation marks are treated as ordinary characters.
If QUOTE is declared, but a string does not contain a matching pair of quotation marks, then the string is treated as if QUOTE is not declared.
Only the string inside the first pair of quote marks (reading from left to right) is loaded. For example, with QUOTE="double", the string a"b"c"d"e will be loaded as b.
There is no escape character in the loader, so the only way to include quotation marks within a string is for the string body to use one type of quote (single or double) and to declare the other type as the string boundary marker.
Previously, ill-formatted strings such as **a"a,b"ac,d** would be parsed as a,b,d, ignoring a,a,c. The expected input string should be a,"a,b",ac,d. In v2.4, incorrectly formatted strings such as this example will be parsed normally, giving you this result: a"a,b"ac and d.
When the USING option JSON_FILE="true" is used, the loader loads JSON objects instead of tabular data. A JSON object is an unordered set of key/value pairs, where each value may itself be an array or object, leading to nested structures. A colon separates each key from its value, and a comma separates items in a collection. A more complete description of JSON format is available at www.json.org. The JSON loader requires that each input line has exactly one JSON object. Instead of using column values as tokens, the JSON loader uses JSON values as tokens, that is, the second part of each JSON key/value pair. In a GSQL loading job, a JSON field is identified by a dollar sign $ followed by the colon-separated sequence of nested key names to reach the value from the top level. For example, given the JSON object {"abc":{"def": "this_value"}}, the identifier $"abc":"def" is used to access "this_value". The double quotes are mandatory.

An example is shown below:
To specify an end-of-line character other than the standard one, use the EOL option, as shown below.
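A sketch covering both the JSON loader and the EOL option (the encoding vertex type, its attributes, and the data lines are assumptions consistent with the results discussed below):

```
# encoding.json (one JSON object per line):
#   {"encoding":"UTF-7","indent":{"length":30,"use_space":true}}
#   {"indent":{"use_space":"dontloadme"},"encoding":"UTF-1"}
#   {"encoding":"UTF-6","indent":{"length":3,"use_space":false}}

CREATE LOADING JOB json_load FOR GRAPH Test_Graph {
  LOAD "encoding.json" TO VERTEX encoding VALUES ($"encoding", $"indent":"length")
    USING JSON_FILE="true", EOL="\n";
}
```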
In the data file encoding.json above, the order of fields is not fixed, and some fields are missing. The JSON loader ignores the order and accesses the fields by their nested key names. The missing fields are loaded with default values. The resulting vertices are:
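| id | attr1 |
| --- | --- |
| "UTF-7" | 30 |
| "UTF-1" | 0 |
| "UTF-6" | 3 |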
TigerGraph can load data from Parquet files if they are stored in AWS S3 buckets. For more details on how to set up S3 data sources and loading jobs, read the AWS S3 Loader User Guide. In the background, TigerGraph uses the JSON loading functionality to read data from Parquet files, so the JSON-specific information in the previous section applies.
In order to load Parquet data, you need to:
- Specify "file.reader.type": "parquet" in the S3 file configuration file or argument
- Specify JSON_FILE="true" in the USING clause of the LOAD statements
- Refer to JSON keys (≈ Parquet "column names") instead of column numbers
You will probably want to add USING EOF="true" to your RUN LOADING JOB statement to explicitly indicate that the loading job should stop after consuming all data from the Parquet source, rather than expecting further entries.
An example of a Parquet loading setup is shown below:
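A sketch of such a setup (the data source name, configuration path, bucket path, graph, and vertex definitions are all placeholders; see the AWS S3 Loader User Guide for the authoritative syntax):

```
CREATE DATA_SOURCE S3 s3ds = "/path/to/s3_config.json" FOR GRAPH My_Graph

CREATE LOADING JOB load_parquet FOR GRAPH My_Graph {
  DEFINE FILENAME f = "$s3ds:{\"file.uris\":\"s3://my-bucket/my-data.parquet\",\"file.reader.type\":\"parquet\"}";
  LOAD f TO VERTEX My_Vertex VALUES ($"col1", $"col2") USING JSON_FILE="true";
}

RUN LOADING JOB load_parquet USING EOF="true"
```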
VERTEX_MUST_EXIST Parameter
Normally, if the endpoint vertices do not exist when loading data to edges, the needed vertices will be created for the connecting edge, using default values for all attributes. Using the VERTEX_MUST_EXIST="true" option will load an edge only if the vertices on both sides of the edge already exist, so no extra vertices are created.
TEMP_TABLE and Flatten Functions
The keyword TEMP_TABLE triggers the use of a temporary data table which is used to store data generated by one LOAD statement, for use by a later LOAD statement. Earlier we introduced the syntax for loading data to a TEMP_TABLE:
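A sketch of the destination-clause syntax, reconstructed from the surrounding description:

```
TO TEMP_TABLE table_name (id_name [, attr_name]*) VALUES (id_expr [, attr_expr]*)
  [WHERE conditions] [OPTION (options)]
```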
This clause is designed to be used in conjunction with the flatten or flatten_json_array function in one of the attr_expr expressions. The flatten function splits a multi-value field into a set of records. Those records can first be stored in a temporary table, and then the temporary table can be loaded into vertices and/or edges. Only one flatten function is allowed in one temp table destination clause.
There are two versions of the flatten function: One parses single-level groups and the other parses two-level groups. There are also two versions of the flatten_json_array function: One splits an array of primitive values, and the other splits an array of JSON objects.
flatten( column_to_be_split, separator, 1 ) is used to parse a one-level group into individual elements. An example is shown below:
The following loading job contains two LOAD statements. The first one loads input data to Book vertices and to a TEMP_TABLE. The second one loads the TEMP_TABLE data to Genre vertices and book_genre edges.
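A sketch of the example (the book titles and the Book/Genre attribute lists are assumptions; the genre lists match the TEMP_TABLE contents shown below). The job is arranged so that, counting from the CREATE LOADING JOB line, the TEMP_TABLE destination clause falls on line 5 and the second LOAD statement spans lines 8 to 10, matching the discussion that follows:

```
# book1.dat:
#   101,"Book Title One","fiction,fantasy,young_adult"
#   102,"Book Title Two","fiction,science_fiction,Chinese"

CREATE LOADING JOB load_books_flatten_1 FOR GRAPH Book_rating {
  DEFINE FILENAME f = "book1.dat";
  LOAD f
    TO VERTEX Book VALUES ($0, $1, _),
    TO TEMP_TABLE t1 (bookcode, genre) VALUES ($0, flatten($2, ",", 1))
    USING QUOTE="double", SEPARATOR=",";

  LOAD TEMP_TABLE t1
    TO VERTEX Genre VALUES ($"genre", $"genre"),
    TO EDGE book_genre VALUES ($"bookcode", $"genre");
}
```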
Line 5 says that the third column ($2) of each input line should be split into separate tokens, with comma (,) as the separator. Each token will have its own row in table t1. The first column is labeled bookcode, with the value of $0, and the second column is genre, with one of the $2 tokens. The contents of TEMP_TABLE t1 are shown below:
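| bookcode | genre |
| --- | --- |
| 101 | fiction |
| 101 | fantasy |
| 101 | young_adult |
| 102 | fiction |
| 102 | science_fiction |
| 102 | Chinese |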
Then, lines 8 to 10 say to read TEMP_TABLE t1 and to do the following for each row:

- Create a Genre vertex for each new value of genre.
- Create a book_genre edge from bookcode to genre. In this case, each row of TEMP_TABLE t1 generates one book_genre edge.
The final graph will contain two Book vertices (101 and 102), five Genre vertices, and six book_genre edges.
flatten( column_to_be_split, group_separator, sub_field_separator, number_of_sub_fields_in_one_group ) is used to parse a two-level group into individual elements. Each token in the main group may itself be a group, so there are two separators: one for the top level and one for the second level. An example is shown below.
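A sketch (file layout, separators, and attribute lists are assumptions consistent with the TEMP_TABLE contents shown below):

```
# book2.dat:
#   101,"Book Title One","FIC:fiction|FTS:fantasy|YA:young adult"
#   102,"Book Title Two","FIC:fiction|SF:science fiction|CHN:Chinese"

CREATE LOADING JOB load_books_flatten_2 FOR GRAPH Book_rating {
  DEFINE FILENAME f = "book2.dat";
  LOAD f
    TO VERTEX Book VALUES ($0, $1, _),
    TO TEMP_TABLE t2 (bookcode, genre_id, genre_name) VALUES ($0, flatten($2, "|", ":", 2))
    USING QUOTE="double", SEPARATOR=",";

  LOAD TEMP_TABLE t2
    TO VERTEX Genre VALUES ($"genre_id", $"genre_name"),
    TO EDGE book_genre VALUES ($"bookcode", $"genre_id");
}
```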
The flatten function now has four parameters instead of three. The additional parameter is used to record the genre_name in the Genre vertices.
In this example, in the genres column ($2), there are multiple groups, and each group has two sub-fields, genre_id and genre_name. After running the loading job, the file book2.dat will be loaded into the TEMP_TABLE t2 as shown below.
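| bookcode | genre_id | genre_name |
| --- | --- | --- |
| 101 | FIC | fiction |
| 101 | FTS | fantasy |
| 101 | YA | young adult |
| 102 | FIC | fiction |
| 102 | SF | science fiction |
| 102 | CHN | Chinese |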
flatten_json_array( $"array_name" ) parses a JSON array of primitive (string, numeric, or bool) values, where "array_name" is the name of the array. Each value in the array creates a record. Below is an example:
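A sketch (the data lines and the temp-table schema are assumptions consistent with the result table below):

```
# encoding2.json:
#   {"plug-ins":["C","c++"],"encoding":"UTF-6","indent":{"length":3,"use_space":false}}
#   {"plug-ins":["golang","noidea"],"encoding":"UTF-8"}

CREATE LOADING JOB json_flatten_1 FOR GRAPH Test_Graph {
  LOAD "encoding2.json"
    TO TEMP_TABLE t3 (id, length) VALUES (flatten_json_array($"plug-ins"), $"indent":"length")
    USING JSON_FILE="true";
}
```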
The above data and loading job creates the following temporary table:
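| id | length |
| --- | --- |
| C | 3 |
| c++ | 3 |
| "golang" | default |
| "noidea" | default |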
flatten_json_array( $"array_name", $"sub_obj_1", $"sub_obj_2", ..., $"sub_obj_n" ) parses a JSON array of JSON objects. "array_name" is the name of the array, and the following parameters $"sub_obj_1", $"sub_obj_2", ..., $"sub_obj_n" are the field key names in each object in the array. See the complete example below:
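A sketch (the data lines and temp-table schema are assumptions consistent with the description and result table below; note that the 4th line's array contains no JSON objects):

```
# encoding3.json:
#   {"plug-ins":[{"name":"pascal","score":1.0,"age":"old"},{"name":"c++","score":2.0}],"indent":{"length":12}}
#   {"plug-ins":[{"name":"java","score":2.22}],"indent":{"length":30}}
#   {"plug-ins":[{"name":"python","score":3.0},{"name":"go","score":4.0,"age":"new"}],"indent":{"length":30}}
#   {"plug-ins":[1,2,3],"indent":{"length":0}}

CREATE LOADING JOB json_flatten_2 FOR GRAPH Test_Graph {
  LOAD "encoding3.json"
    TO TEMP_TABLE t4 (id, score, age, length)
    VALUES (flatten_json_array($"plug-ins", $"name", $"score", $"age"), $"indent":"length")
    USING JSON_FILE="true";
}
```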
When splitting a JSON array of JSON objects, the primitive values in the array are skipped and only the JSON objects are processed. As in the example above, the 4th line's "plug-ins" array will not generate any record because it doesn't contain any JSON objects. Any field which does not exist in an object will be loaded with its default value. The above example generates the temporary table shown below:
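| id | score | age | length |
| --- | --- | --- | --- |
| "pascal" | 1.0 | "old" | 12 |
| "c++" | 2.0 | default | 12 |
| "java" | 2.22 | default | 30 |
| "python" | 3.0 | default | 30 |
| "go" | 4.0 | "new" | 30 |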
**flatten_json_array()** can also be used to split a column of a tabular file, where the column contains JSON arrays. An example is given below:
The second column in the csv file is a JSON array which we want to split. flatten_json_array() can be used in this case without the USING JSON_FILE="true" clause:
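A sketch (the data lines and temp-table schema are assumptions consistent with the result table below):

```
# books.csv (column separator "|"):
#   |[{"name":"golang"},{"name":"noidea"}]
#   |[{"name":"pascal","score":1,"age":"old"}]
#   12|[{"name":"c++","score":2}]
#   2|[{"name":"java","score":2.22,"age":"new"}]
#   4|[{"name":"python","score":3}]
#   |[{"name":"go","score":4,"age":"new"}]

CREATE LOADING JOB csv_flatten FOR GRAPH Test_Graph {
  DEFINE FILENAME f = "books.csv";
  LOAD f
    TO TEMP_TABLE t5 (id, score, age, length)
    VALUES (flatten_json_array($1, $"name", $"score", $"age"), $0)
    USING SEPARATOR="|";
}
```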
The above example generates the temporary table shown below:
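| id | score | age | length |
| --- | --- | --- | --- |
| golang | -1 (default) | | |
| noidea | -1 (default) | | |
| pascal | 1 | old | -1 (default) |
| c++ | 2 | unknown (default) | 12 |
| java | 2.22 | new | 2 |
| python | 3 | unknown (default) | 4 |
| go | 4 | new | -1 (default) |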
flatten_json_array() does not work if the separator also appears within the JSON array column. For example, if the separator is a comma, the csv loader will erroneously divide the JSON array into multiple columns. Therefore, it is recommended that the csv file use a special column separator, such as "|" in the above example.
In addition to loading data, a LOADING JOB can be used to perform the opposite operation: deleting vertices and edges, using the DELETE statement. DELETE cannot be used in offline loading. Just as a LOAD statement uses the tokens from each input line to set the id and attribute values of a vertex or edge to be created, a DELETE statement uses the tokens from each input line to specify the id value of the item(s) to be deleted.
In the v2.0 syntax, there is now a "FROM (filepath_string | filevar)" clause just before the WHERE clause.
There are four variations of the DELETE statement. The syntax of the four cases is shown below.
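A sketch of the four forms, reconstructed from the v2.0 syntax description above (the exact grammar may differ):

```
DELETE VERTEX vertex_type_name (PRIMARY_ID id_expr) FROM (filepath_string | filevar) [WHERE condition];
DELETE EDGE edge_type_name (FROM id_expr, TO id_expr) FROM (filepath_string | filevar) [WHERE condition];
DELETE EDGE edge_type_name (FROM id_expr) FROM (filepath_string | filevar) [WHERE condition];
DELETE EDGE * (FROM id_expr vertex_type_name) FROM (filepath_string | filevar) [WHERE condition];
```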
An example using book_rating data is shown below:
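A sketch (the graph, edge type, and file layout are assumptions):

```
CREATE LOADING JOB clean_ratings FOR GRAPH Book_rating {
  DEFINE FILENAME f = "book_rating_delete.csv";
  DELETE EDGE user_book_rating (FROM $0, TO $1) FROM f;
}
```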
There is a separate DELETE statement in the GSQL Query Language. The query delete statement can leverage the query language's ability to explore the graph and to use complex conditions to determine which items to delete. In contrast, the loading job delete statement requires that the id values of the items to be deleted must be specified in advance in an input file.
The figures below illustrate the sequence of steps and the dependencies to progress from no graph to a loaded graph and a query result, for TigerGraph platform version 0.8 and higher. Note that online and offline loading follow the same flow.
v3.1: Added the reserved prefix gsql_sys_.

These keywords in the language are non-reserved, so users may use them for user-defined identifiers.

v3.1: Added DESCRIPTION, TAG, TAGS.
There are two aspects to clearing the system: flushing the data and clearing the schema definitions in the catalog. Two different commands are available.
Available only to superusers.
The CLEAR GRAPH STORE command flushes all the data out of the graph store (database). By default, the system will ask the user to confirm the intention to discard all the graph data. To force the clear operation and bypass the confirmation question, use the -HARD option, e.g.,
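```
CLEAR GRAPH STORE -HARD
```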
Clearing the graph store does not affect the schema.
Use the -HARD option with extreme caution. There is no undo option. -HARD must be in all capital letters.
CLEAR GRAPH STORE stops all the TigerGraph servers (GPE, GSE, RESTPP, Kafka, and Zookeeper).
Loading jobs and queries are aborted.
Running a loading job executes a previously installed loading job. The job reads lines from an input source, parses each line into data tokens, and applies loading rules and conditions to create new vertex and edge instances to store in the graph data store. The input sources could be defined in the loading job or could be provided when running the job. Additionally, loading jobs can also be run by submitting an HTTP request to the REST++ server.
Note that the keyword LOADING is included. This makes it clearer to users and to GSQL that the job is a loading job and not some other type of job (such as a SCHEMA_CHANGE JOB).
When a concurrent loading job is submitted, it is assigned a job ID number, which is displayed on the GSQL console. The user can use this job ID to refer to the job, for a status update, to abort the job, or to re-start the job. These operations are described later in this section.
-noprint
By default, the command will print several lines of status information while the loading is running. If the -noprint option is included, the output will omit the progress and summary details, but it will still display the job id and the location of the log file.
-dryrun
If -dryrun is used, the system will read the data files and process the data as instructed by the job, but will NOT load any data into the graph. This option can be a useful diagnostic tool.
-n [i,] j
The -n option limits the loading job to processing only a range of lines of each input data file. The -n flag accepts one or two arguments. For example, -n 50 means read lines 1 to 50. -n 10, 50 means read lines 10 to 50. The special symbol $ is interpreted as "last line", so -n 10,$ means read from line 10 to the end.
Below are the parameters available for the RUN LOADING JOB command, introduced by the USING clause.
filevar list
The optional USING clause may contain a list of file variables. Each file variable may optionally be assigned a filepath_string, obeying the same format as in CREATE LOADING JOB. This list of file variables determines which parts of a loading job are run and what data files are used.
When a loading job is compiled, it generates one RESTPP endpoint for each filevar and filepath_string. As a consequence, a loading job can be run in parts. When RUN LOADING JOB is executed, only those endpoints whose filevar or file identifier ("__GSQL_FILENAME_n__") is mentioned in the USING clause will be used. However, if the USING clause is omitted, then the entire loading job will be run.
If a filepath_string is given, it overrides the filepath_string defined in the loading job. If a particular filevar is not assigned a filepath_string either in the loading job or in the RUN LOADING JOB statement, then an error is reported and the job exits.
CONCURRENCY
The CONCURRENCY parameter sets the maximum number of concurrent requests that the loading job may send to the GPE. The default is 256.
BATCH_SIZE
The BATCH_SIZE parameter sets the number of data lines included in each concurrent request sent to the GPE. The default is 8192.
Starting with v2.0, there are commands to check loading job status, abort a loading job, and restart a loading job.
When a loading job starts, the GSQL server assigns it a job id and displays it for the user to see. The job id format is typically the name of the graph, followed by the machine alias, followed by a code number, e.g., gsql_demo_m1.1525091090494.
By default, an active loading job will display periodic updates of its progress. There are two ways to inhibit these automatic output displays:
Run the loading job with the -noprint option.
After the loading job has started, enter CTRL+C. This will abort the output display process, but the loading job will continue.
The command SHOW LOADING JOB shows the current status of either a specified loading job or all current jobs:
The display format is the same as that displayed during the periodic progress updates of the RUN LOADING JOB command. If you do not know the job id, but you know the job name and possibly the machine, then the ALL option is a handy way to see a list of active job ids.
The command ABORT LOADING JOB aborts either a specified load job or all active loading jobs:
The output will show a summary of aborted loading jobs.
The command RESUME LOADING JOB will restart a previously-run job which ended for some reason before completion.
If the job is finished, this command will do nothing. The RESUME command should pick up where the previous run ended; that is, it should not load the same data twice.
Every loading job creates a log file. When the job starts, it will display the location of the log file. Typically, the file is located at
<TigerGraph.root.dir>/logs/restpp/restpp_loader_logs/<graph_name>/<job_id>.log
This file contains the following information which most users will find useful:
A list of all the parameter and option settings for the loading job
A copy of the status information that is printed
Statistics report on the number of lines successfully read and parsed
The statistics report includes how many objects of each type were created and how many lines were invalid, broken down by cause. The report also shows which lines caused the errors. There are two levels of statistics: file level (the number of lines) and data object level (the number of objects). If a file-level error occurs, e.g., a line does not have enough columns, that line of data is skipped for all LOAD statements in the loading job. If an object-level error or failed condition occurs, only the corresponding object is not created; all other objects generated from the same line are still created.
Note that failing a WHERE clause is not necessarily a bad result. If the user's intent for the WHERE clause is to select only certain lines, then it is natural for some lines to pass and some lines to fail.
Below is an example.
The above loading job and data generate the following report
There are a total of 7 data lines. The report shows that:

- Six of the lines are valid data lines.
- One line (Line 7) does not have enough tokens.

Of the 6 valid lines:

- Three of the 6 valid lines generate valid movie vertices.
- One line has an invalid attribute (Line 1: year).
- Two lines (Lines 4 and 5) do not pass the WHERE clause.
The following words are reserved for use by the Data Definition Language. That is, a graph schema or loading job may not use any of these words for the name of a vertex type, edge type, graph name, tag, or attribute. There is a separate list of reserved words for the GSQL Query Language. The compiler will reject the use of a reserved word, as well as any word beginning with a reserved prefix, as a user-defined identifier.
DROP ALL clears both the data and the schema.
Another way to run a loading job is through the POST /ddl/{graph_name} endpoint of the REST++ server. Since the REST++ server has more direct access to the graph processing engine, this can execute more quickly than a RUN LOADING JOB statement in GSQL. For details on how to use the endpoint, please see the REST++ API documentation.
| Attribute type | Function signature | Function return type |
| --- | --- | --- |
| STRING or STRING COMPRESS | extern "C" void funcName (const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum, char* const oToken, uint32_t& oTokenLen) | void. The value of oToken will be returned to GSQL. |
| BOOL | extern "C" bool funcName (const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum) | bool |
| UINT | extern "C" uint64_t funcName (const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum) | uint64_t |
| INT | extern "C" int64_t funcName (const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum) | int64_t |
| FLOAT | extern "C" float funcName (const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum) | float |
| DOUBLE | extern "C" double funcName (const char* const iToken[], uint32_t iTokenLen[], uint32_t iTokenNum) | double |
| WITH STATS option (case insensitive) | Bobby.outdegree() | Bobby.outdegree("text") | Bobby.outdegree("phone_call") |
| --- | --- | --- | --- |
| "none" | not available | not available | not available |
| "outdegree_by_edgetype" (default) | 3 | 2 | 1 |
| Command | Description |
| --- | --- |
| HELP [BASIC\|QUERY] | Display the help menu for all or a subset of the commands |
| LS | Display the catalog, which records all the vertex types, edge types, graphs, queries, jobs, and session parameters that have been defined for the current active graph. See the notes below concerning graph- and role-dependent visibility of the catalog. |
| BEGIN | Enter multi-line edit mode (only for console mode within the shell) |
| END | Finish multi-line edit mode and execute the multi-line block |
| ABORT | Abort multi-line edit mode and discard the multi-line block |
| @file.gsql | Run the gsql statements in the command file file.gsql from within the GSQL shell |
| os$ gsql file.gsql | Run the gsql statements in the command file file.gsql from an operating system shell |
| os$ gsql 'command_string' | Run a single gsql statement from the operating system shell |
| os$ gsql --reset | Clear the graph store and erase the dictionary |
| Session Parameter | Meaning and Usage |
| --- | --- |
| sys.data_root | The value should be a string, representing the absolute or relative path to the folder where data files are stored. After the parameter has been set, a loading statement can reference this parameter with $sys.data_root. |
| gsql_src_dir | The value should be a string, representing the absolute or relative path to the root folder for the gsql system installation. After the parameter has been set, a loading statement can reference this parameter with $gsql_src_dir. |
| exit_on_error | When this parameter is true (default), if a semantic error occurs while running a GSQL command file, the GSQL shell will terminate. Accepted parameter values: true, false (case insensitive). If the parameter is set to false, then a command file which is syntactically correct will continue running, even if certain runtime errors in individual commands occur. Specifically, this affects the CREATE, INSTALL QUERY, and RUN JOB commands. |

Semantic errors include a reference to a nonexistent entity or an improper reuse of an entity.

This session parameter does not affect GSQL interactive mode; GSQL interactive mode does not exit on any error. It also does not affect syntactic errors: GSQL will always exit on a syntactic error.
| Name | Default value | Valid input format (regex) | Range and precision | Description |
| --- | --- | --- | --- | --- |
| INT | 0 | [-+]?[0-9]+ | -2^63 to +2^63 - 1 (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807) | 8-byte signed integer |
| UINT | 0 | [0-9]+ | 0 to 2^64 - 1 (18,446,744,073,709,551,615) | 8-byte unsigned integer |
| FLOAT | 0.0 | [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)? | +/- 3.4 E +/-38, ~7 decimal digits of precision | 4-byte single-precision floating point number. Examples: 3.14159, .0065e14, 7E23. See the note below. |
| DOUBLE | 0.0 | [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)? | +/- 1.7 E +/-308, ~15 decimal digits of precision | 8-byte double-precision floating point number. Has the same input and output format as FLOAT, but the range and precision are greater. See the note below. |
| BOOL | false | true, false (case insensitive), 1, 0 | true, false | Boolean true and false, represented within GSQL as true and false, and represented in input and output as 1 and 0 |
| STRING | Empty string | .* | UTF-8 | Character string. The string value can optionally be enclosed by single quote marks or double quote marks. Please see the QUOTE parameter in Section "Other Optional LOAD Clauses". |
| Field type | User-specified size | Size | Range (N is size) |
| --- | --- | --- | --- |
| INT | Optional | 1, 2, 4 (default), or 8 bytes | -2^(N*8 - 1) to 2^(N*8 - 1) - 1 |
| UINT | Optional | 1, 2, 4 (default), or 8 bytes | 0 to 2^(N*8) - 1 |
| FLOAT | No | 4 bytes | -3.4 E38 to 3.4 E38 |
| DOUBLE | No | 8 bytes | -1.7 E308 to 1.7 E308 |
| DATETIME | No | | 1582-10-15 00:00:00 to 9999-12-31 23:59:59 |
| BOOL | No | | true or false |
| STRING | Required | Any number of characters | Any string under N characters |
| Function | Output type | Description |
| --- | --- | --- |
| gsql_reverse( in_string ) | string | Returns a string with the characters in the reverse order of the input string in_string. |
| gsql_concat( string1, string2, ..., stringN ) | string | Returns a string which is the concatenation of all the input strings. |
| gsql_uuid_v4() | string | Returns a version-4 UUID. |
| gsql_split_by_space( in_string ) | string | Returns a modified version of in_string, in which each space character is replaced with ASCII 30 (decimal). |
| gsql_substring( str, beginIndex [, length] ) | string | Returns the substring beginning at beginIndex, having the given length. |
| gsql_find( str, substr ) | int | Returns the start index of the substring within the string. If it is not found, then returns -1. |
| gsql_length( str ) | int | Returns the length of the string. |
| gsql_replace( str, oldToken, newToken [, max] ) | string | Returns the string resulting from replacing all matches of oldToken with newToken in the original string. If a max count is provided, there can only be up to that many replacements. |
| gsql_regex_replace( str, regex, replaceSubstr ) | string | Returns the string resulting from replacing all substrings in the input string that match the given regex token with the substitute string. |
| gsql_regex_match( str, regex ) | bool | Returns true if the given string token matches the given regex token and false otherwise. |
| gsql_to_bool( in_string ) | bool | Returns true if in_string is either "t" or "true", with case-insensitive checking. Returns false otherwise. |
| gsql_to_uint( in_string ) | uint | If in_string is the string representation of an unsigned int, the function returns that integer. If in_string is the string representation of a nonnegative float, the function returns that number cast as an int. |
| gsql_to_int( in_string ) | int | If in_string is the string representation of an int, the function returns that integer. If in_string is the string representation of a float, the function returns that number cast as an int. |
| gsql_ts_to_epoch_seconds( timestamp ) | uint | Converts a timestamp in canonical string format to Unix epoch time, which is the int number of seconds since Jan. 1, 1970. Refer to the timestamp input format note below. |
| gsql_current_time_epoch(0) | uint | Returns the current time in Unix epoch seconds. By convention, the input parameter should be 0, but it is ignored. |
| flatten( column_to_be_split, group_separator, 1 ) flatten( column_to_be_split, group_separator, sub_field_separator, number_of_sub_fields_in_one_group ) | | See the section "TEMP_TABLE and Flatten Functions" above. |
| flatten_json_array ( $"array_name" ) flatten_json_array ( $"array_name", $"sub_obj_1", $"sub_obj_2", ..., $"sub_obj_n" ) | | See the section "TEMP_TABLE and Flatten Functions" above. |
| split( column_to_be_split, element_separator ) | | See the section "Loading a LIST or SET Attribute" above. |
| split( column_to_be_split, key_value_separator, element_separator ) | | See the section "Loading a MAP Attribute" above. |
| gsql_upper( in_string ) | string | Returns the input string in upper-case. |
| gsql_lower( in_string ) | string | Returns the input string in lower-case. |
| gsql_trim( in_string ) | string | Trims whitespace from the beginning and end of the input string. |
| gsql_ltrim( in_string ) gsql_rtrim( in_string ) | string | Trims whitespace from either the beginning or the end of the input string (left or right). |
| gsql_year( timestamp ) | int | Returns the 4-digit year from the timestamp. Refer to the timestamp input format note below. |
| gsql_month( timestamp ) | int | Returns the month (1-12) from the timestamp. Refer to the timestamp input format note below. |
| gsql_day( timestamp ) | int | Returns the day (1-31) from the timestamp. Refer to the timestamp input format note below. |
| gsql_year_epoch( epoch ) | int | Returns the 4-digit year from Unix epoch time, which is the int number of seconds since Jan. 1, 1970. |
| gsql_month_epoch( epoch ) | int | Returns the month (1-12) from Unix epoch time, which is the int number of seconds since Jan. 1, 1970. |
| gsql_day_epoch( epoch ) | int | Returns the day (1-31) from Unix epoch time, which is the int number of seconds since Jan. 1, 1970. |
| Function name | Data type of arg: Description of function's return value |
| --- | --- |
| max( arg ) | INT, UINT, FLOAT, DOUBLE: maximum of all arg values cumulatively received |
| min( arg ) | INT, UINT, FLOAT, DOUBLE: minimum of all arg values cumulatively received |
| add( arg ) | INT, UINT, FLOAT, DOUBLE: sum of all arg values cumulatively received. STRING: concatenation of all arg values cumulatively received. LIST, SET element: list/set of all arg values cumulatively received. MAP (key -> value) pair: key-value dictionary of all key-value pair arg values cumulatively received. |
| and( arg ) | BOOL: AND of all arg values cumulatively received. INT, UINT: bitwise AND of all arg values cumulatively received. |
| or( arg ) | BOOL: OR of all arg values cumulatively received. INT, UINT: bitwise OR of all arg values cumulatively received. |
| overwrite( arg ) | non-container: arg. LIST, SET: new list/set containing only arg. |
| ignore_if_exists( arg ) | Any: If an attribute value already exists, return (retain) the existing value. Otherwise, return (load) arg. |
| Function | Output type | Description |
| --- | --- | --- |
| to_int( main_string ) | int | Converts main_string to an integer value. |
| to_float( main_string ) | float | Converts main_string to a float value. |
| concat( string1, string2 ) | string | Returns a string which is the concatenation of string1 and string2. |
| token_len( main_string ) | int | Returns the length of main_string. |
| gsql_is_not_empty_string( main_string ) | bool | Returns true if main_string is not empty after removing white space. Returns false otherwise. |
| gsql_token_equal( string1, string2 ) | bool | Returns true if string1 is exactly the same (case sensitive) as string2. Returns false otherwise. |
| gsql_token_ignore_case_equal( string1, string2 ) | bool | Returns true if string1 is exactly the same (case insensitive) as string2. Returns false otherwise. |
| gsql_is_true( main_string ) | bool | Returns true if main_string is either "t" or "true" (case insensitive). Returns false otherwise. |
| gsql_is_false( main_string ) | bool | Returns true if main_string is either "f" or "false" (case insensitive). Returns false otherwise. |
| Parameter | Meaning of Value | Allowed Values |
| --- | --- | --- |
| SEPARATOR | Specifies the special character that separates tokens (columns) in the data file | Any single ASCII character. Default is comma ",". "\t" for tab, "\xy" for ASCII decimal code xy |
| EOL | The end-of-line character | Any ASCII sequence. Default: "\n" (system-defined newline character or character sequence) |
| QUOTE (see note below) | Specifies explicit boundary markers for string tokens, either single or double quotation marks. See more details below. | "single" for ', "double" for " |
| HEADER | Whether the data file's first line is a header line. The header assigns names to the columns. The LOAD statement must refer to an actual file with a valid header. | "true", "false". Default is "false" |
| USER_DEFINED_HEADER | Specifies the name of the header variable, when a header has been defined in the loading job rather than in the data file | The variable name in the preceding DEFINE HEADER statement |
| REJECT_LINE_RULE | If the filter expression evaluates to true, then do not use this input data line | Name of the filter from a preceding DEFINE INPUT_LINE_FILTER statement |
| JSON_FILE (see the "Loading JSON Data" section above) | Whether each line is a JSON object (see the "JSON Loader" section above for more details) | "true", "false". Default is "false" |
| NEW_VERTEX_ONLY | If true, treat vertices as insert-only: if the input data refers to a vertex which already exists, do not update it. If false, upsert vertices. | "true", "false". Default is "false" |
| VERTEX_MUST_EXIST (see the VERTEX_MUST_EXIST section above) | If true, only insert or update an edge if both endpoint vertices already exist. If false, always insert new edges, creating endpoint vertices as needed, using the given id and default values for other attributes. | "true", "false". Default is "false" |
"UTF-7"
30
"UTF-1"
0
"UTF-6"
3
101
fiction
101
fantasy
101
young_adult
102
fiction
102
science_fiction
102
Chinese
101
FIC
fiction
101
FTS
fantasy
101
YA
young adult
102
FIC
fiction
102
SF
science fiction
102
CHN
Chinese
C
3
c++
3
"golang"
default
"noidea"
default
"pascal"
1.0
"old"
12
"c++"
2.0
default
12
"java"
2.22
default
30
"python"
3.0
default
30
"go"
4.0
"new"
30
golang
-1 (default)
noidea
-1 (default)
pascal
1
old
-1 (default)
c++
2
unknown (default)
12
java
2.22
new
2
python
3
unknown (default)
4
go
4
new
-1 (default)
| File level statistics | Explanation |
| --- | --- |
| Valid lines | The number of valid lines in the source file |
| Reject lines | The number of lines which are rejected by reject_line_rules |
| Invalid Json format | The number of lines with invalid JSON format |
| Not enough token | The number of lines with missing column(s) |
| Oversize token | The number of lines with oversize token(s). Please increase "OutputTokenBufferSize" in the |
| Object level statistics | Explanation |
| --- | --- |
| Valid Object | The number of objects which have been loaded successfully |
| No ID found | The number of objects in which the PRIMARY_ID is empty |
| Invalid Attributes | The number of invalid objects caused by wrong data format for the attribute type |
| Invalid primary id | The number of invalid objects caused by wrong data format for the PRIMARY_ID type |
| Incorrect fixed binary length | The number of invalid objects caused by a mismatch between the length of the data and the type defined in the schema |
| Name | Default value | Supported data format | Range and precision | Description |
| --- | --- | --- | --- | --- |
| STRING COMPRESS | Empty string | .* | UTF-8 | String with a finite set of categorical values. More compact storage than STRING if there is a limited number of different values and the value is rarely accessed. Otherwise, it may use more memory. |
| DATETIME | UTC time 0 | See the section "Loading DATETIME Attribute" | 1582-10-15 00:00:00 to 9999-12-31 23:59:59 | Date and time (UTC) as the number of seconds elapsed since the start of Jan 1, 1970. Time zones are not supported. Displayed in YYYY-MM-DD hh:mm:ss format. |
| FIXED_BINARY(n) | N/A | N/A | | Stream of n binary-encoded bytes (suitable only in limited circumstances) |