Creating a Loading Job

After a graph schema has been created, the system is ready to load data into the graph store. The GSQL language offers easy-to-understand and easy-to-use commands for data loading which perform many of the same data conversion, mapping, filtering, and merging operations which are found in enterprise ETL (Extract,Transform, and Load) systems.

The GSQL system can read structured or semistructured data from text files. The loading language syntax is geared towards tabular or JSON data, but conditional clauses and data manipulation functions allow for reading data that is structured in a more complex or irregular way. For tabular data, each line in the data file contains a series of data values, separated by commas, tabs, spaces, or any other designated ASCII characters (only single character separators are supported). A line should contain only data values and separators, without extra whitespace. From a tabular view, each line of data is a row, and each row consists of a series of column values.

Loading data is a two-step process. First, a loading job is defined. Next, the job is executed with the RUN statement. These two statements, and the components with the loading job, are detailed below.

The structure of a loading job will be presented hierarchically, top-down:

CREATE ... JOB, which may contain a set of DEFINE and LOAD statements

  • DEFINE statements

  • LOAD statements, which can have several clauses

New LOADING JOB Capabilities

Beginning with v2.0, the TigerGraph platform introduces an extended syntax for defining and running loading jobs which offers several advantages:

  • The TigerGraph platform can handle concurrent loading jobs, which can greatly increase throughput.

  • The data file locations can be specified at compile time or at run time. Run-time settings override compile-time settings.

  • A loading job definition can include several input files. When running the job, the user can choose to run only part of the job by specifying only some of the input files.

  • Loading jobs can be monitored, aborted, and restarted.

Concurrent Loading

Among its several duties, the RESTPP component manages loading jobs. Previously, RESTPP could manage only one loading job at a time. In v2.0, there can be multiple RESTPP-LOADER subcomponents, each of which can handle a loading job independently. The maximum number of concurrent loading jobs is set by the configuration parameter RESTPP-LOADER.Replicas.

Furthermore, if the TigerGraph graph is distributed (partitioned) across multiple machine nodes, each machine's RESTPP-LOADER(s) can be put into action. Each RESTPP-LOADER only reads local input data files, but the resulting graph data can be stored on any machine in the cluster.

To provide this added capability for loading, there is an expanded syntax for creating loading jobs and running loading jobs. Below is a summary of changes and additions. Full details are then presented, in the remainder of this document (GSQL Language Reference Part 1).

  • A loading job begins with CREATE LOADING JOB. (Note that the keyword "LOADING" is included.)

  • A new statement type, DEFINE FILENAME, is added, to define filename variables.

  • The file locations can refer either to the local machine, to specific machines, or to all machines.

  • When a job starts, it is assigned a job_id. Using the job_id, you can check status, abort a job, or restart a job.

Below is a simple example:

A concurrent-capable loading job can logically be separated into parts according to each file variable. When a concurrent-capable loading job is compiled, a RESTPP endpoint is generated for each file variable. Then, the job can be run in portions, according to each file variable.

CREATE LOADING JOB Block

The v2.0 CREATE LOADING JOB can be distinguished from the pre-v2.0 loading jobs first by its header, and then by whether its contains DEFINE FILENAME statements or not. Once the loading type has been determined, there are subsequent rules for the format of the individual LOAD statements and then the RUN statement.

CREATE LOADING JOB

The CREATE LOADING JOB statement is used to define a block of DEFINE, LOAD, and DELETE statements for loading data to or removing data from a particular graph. The sequence of statements is enclosed in curly braces. Each statement in the block, including the last one, should end with a semicolon.

DROP JOB statement

To drop (remove) a job, run "DROP JOB job_name". The job will be removed from GSQL. To drop all jobs, run either of the following commands: DROP JOB ALL DROP JOB *

DEFINE statements

A DEFINE statement is used to define a local variable or expression to be used by the subsequent LOAD statements in the loading job.

DEFINE FILENAME

The DEFINE FILENAME statement defines a filename variable. The variable can then be used later in the JOB block by a LOAD statement to identify its data source. Every concurrent loading job must have at least one DEFINE FILENAME statement.

The filevar is optionally followed by a filepath_string , which tells the job where to find input data. As the name suggests, filepath_string is a string value. Therefore, it should start and end with double quotes.

filepath_string

There are four options for filepath_string :

  • path : either an absolute path or relative path for either a file or a folder on the machine where the job is run. If it is a folder, then the loader will attempt to load each non-hidden file in the folder.

An absolute path may begin with the session variable $sys.data_root.

Then, when running this loading job, first set a value for the parameter, and then run the job:

As the name implies, session parameters only retain their value for the duration of the current GSQL session. If the user exits GSQL, the settings are lost.

  • "all:" path : If the path is prefixed with all: , then the loading job will attempt to run on every machine in the cluster which has a RESTPP component, and each machine will look locally for data at path . I f the path is not valid on any of the machines, the job will be aborted . Also, the session parameter $sys.data_root may not be used.

  • "any:" path : If the path is prefixed with any: , then the loading job will attempt to run on every machine in the cluster which has a RESTPP component, and each machine will look locally for data at path . If the path is not valid on any of the machines, those machines are skipped. Also, the session parameter $sys.data_root may not be used.

  • A list of machine-specific paths : A machine_alias is a name such as m1, m2, etc. which is defined when the cluster configuration is set. For this option, the filepath_string may include a list of paths, separated by commas. If several machines have the same path, the paths can be grouped together by using a list of machine aliases, with the vertical bar "|" as a separator. The loading job will run on whichever machines are named; each RESTPP-LOADER will work on its local files.

DEFINE HEADER

The DEFINE HEADER statement defines a sequence of column names for an input data file. The first column name maps to the first column, the second column name maps to the second column, etc.

DEFINE INPUT_LINE_FILTER

The DEFINE INPUT_LINE_FILTER statement defines a named Boolean expression whose value depends on column attributes from a row of input data. When combined with a USING reject_line_rule clause in a LOAD statement, the filter determines whether an input line is ignored or not.

LOAD statements

A LOAD statement tells the GSQL loader how to parse a data line into column values (tokens), and then describes how the values should be used to create a new vertex or edge instance. One LOAD statement can be used to generate multiple vertices or edges, each vertex or edge having its own Destination_Clause , as shown below. Additionally, two or more LOAD statements may refer to the same input data file. In this case, the GSQL loader will merge their operations so that both of their operations are executed in a single pass through the data file.

The LOAD statement has many options. This reference guide provides examples of key features and options. The Platform Knowledge Base / FAQs and the tutorials, such as Get Started with TigerGraph , provide additional solution- and application-oriented examples.

Different LOAD statement types have different rules for the USING clause; see the USING clause section below for specifics.

LOAD statement

The filevar must have been previously defined in a DEFINE FILENAME statement.

The filepath_string must satisfy the same rules given above in the DEFINE FILENAME section.

The remainder of this section of the document will provide details on the format and use of the file_path, Destination_Clause, its subclauses. USING clause is introduced later in Section "Other Optional LOAD Clauses".

Destination Clause

A Destination_Clause describes how the tokens from a data source should be used to construct one of three types of data objects : a vertex, an edge, or a row in a temporary table (TEMP_TABLE). The destination clause formats for the three types are very similar, but we show them separately for clarity:

For the TO VERTEX and TO EDGE destination clauses, the vertex_type_name or edge_type_name must match the name of a vertex or edge type previously defined in a CREATE VERTEX or CREATE UNDIRECTED|DIRECTED EDGE statement. The values in the VALUE list(id_expr, attr_expr1, attr_expr2,...) are assigned to the id(s) and attributes of a new vertex or edge instance, in the same order in which they are listed in the CREATE statement. id_expr obeys the same attribute rules as attr_expr , except that only attr_expr can use the reducer function, which is introduced later.

In contrast, the TO TEMP_TABLE clause is defining a new, temporary data structure. Its unique characteristics will be described in a separate subsection. For now, we focus on TO VERTEX and TO EDGE.

Attributes and Attribute Expressions

A LOAD statement processes each line of an input file, splitting each line (according to the SEPARATOR character, see Section "Other Optional LOAD Clauses" for more details) into a sequence of tokens. Each destination clause provides a token-to-attribute mapping which defines how to construct a new vertex, an edge, or a temp table row instance (e.g., one data object). The tokens can also be thought of as the column values in a table. There are two ways to refer to a column, by position or by name. Assuming a column has a name, either method may be used, and both methods may be used within one expression.

By Position : The columns (tokens) are numbered from left to right, starting with $0. The next column is $1, and so on.

By Name : Columns can be named, either through a header line in the input file, or through a DEFINE HEADER statement. If a header line is used, then the first line of the input file should be structured like a data line, using the same separator characters, except that each column contains a column name string instead of a data value. Names are enclosed in double quotes, e.g. $"age".

Data file name: $sys.file_name refers to the current input data file.

In a simple case, a token value is copied directly to an attribute. For example, in the following LOAD statement,

  • The PRIMARY_ID of a person vertex comes from column $0 of the file "xx/yy/a.csv".

  • The next attribute of a person vertex comes from column $1.

  • The next attribute of a person vertex is given the value "xx/y/a.csv" (the filename itself).

Cumulative Loading

A basic principle in the GSQL Loader is cumulative loading. Cumulative loading means that a particular data object might be written to (i.e., loaded) multiple times, and the result of the multiple loads may depend on the full sequence of writes. This usually means that If a data line provides a valid data object, and the WHERE clause and OPTION clause are satisfied, then the data object is loaded.

  • Valid input : For each input data line, each destination clause constructs one or more new data objects. To be a valid data object, it must have an ID value of the correct type, have correctly typed attribute values, and satisfy the optional WHERE clause. If the data object is not valid, the object is rejected (skipped) and counted as an error in the log file. The rules for invalid attributes values are summarized below:

  1. UINT: Any non-digit character. (Out-of-range values cause overflow instead of rejection)

  2. INT: Any non-digit or non-sign character. (Out-of-range values cause overflow instead of rejection)

  3. FLOAT and DOUBLE: Any wrong format

  4. STRING, STRING COMPRESS, FIXED_BINARY: N/A

  5. DATETIME: Wrong format, invalid date time, or out of range.

  6. Complex type: Depends on the field type or element type. Any invalid field (in UDT), element (in LIST or SET), key or value (in MAP) causes rejection.

  • New data objects: If a valid data object has a new ID value, then the data object is added to the graph store. Any attributes which are missing are assigned the default value for that data type or for that attribute.

  • Overwriting existing data objects : If a valid data object has a ID value for an existing object, then the new object overwrites the existing data object, with the following clarifications and exceptions:

  1. The attribute values of the new object overwrite the attribute values of the existing data object.

  2. Missing tokens : If a token is missing from the input line so that the generated attribute is missing, then that attribute retains its previous value.

  • Skipping an attribute : A LOAD statement can specify that a particular attribute should NOT be loaded by using the special character _ (underscore) as its attribute expression (attr_expr). For example,

means to skip the next-to-last attribute. This technique is used when it is known that the input data file does not contain data for every attribute.

  1. If the LOAD is creating a new vertex or edge, then the skipped attribute will be assigned the default value.

  2. If the LOAD is overwriting an existing vertex or edge, then the skipped attribute will retain its existing value.

More Complex Attribute Expressions

An attribute expression may use column tokens (e.g., $0), literals (constant numeric or string values), any of the built-in loader token functions, or a user-defined token function. Attribute expressions may not contain mathematical or boolean operators (such as +, *, AND). The rules for attribute expressions are the same as those for id expressions, but an attribute expression can additionally use a reducer function:

  • id_expr := $column_number | $"column_name" | constant | $sys.file_name | token_function_name( id_expr [, id_expr ]* )

  • attr_expr := id_expr | REDUCE(reducer_function_name(id _expr ))

Note that token functions can be nested, that is, a token function can be used as an input parameter for another token function. The built-in loader token/reducer functions and user-defined token functions are described in the section "Built-In Loader Token Functions".

The subsections below describe details about loading particular data types.

Loading a DOUBLE or FLOAT Attribute

A floating point value has the basic format

In the first case, the decimal point and following digits are required. In the second case, some digits are required (looking like an integer), and the following decimal point and digits are optional.

In both cases, the leading sign ( "+" or "-") is optional. The exponent, using "e" or "E", is optional. Commas and extra spaces are not allowed.

Loading a DATETIME Attribute

When loading data into a DATETIME attribute, the GSQL loader will automatically read a string representation of datetime information and convert it to internal datetime representation. The loader accepts any of the following string formats:

  • %Y-%m-%d %H:%M:%S (e.g., 2011-02-03 01:02:03)

  • %Y/%m/%d %H:%M:%S (e.g., 2011/02/03 01:02:03)

  • %Y-%m-%dT%H:%M:%S.000z (e.g., 2011-02-03T01:02:03.123z, 123 will be ignored)

  • %Y-%m-%d (only date, no time, e.g., 2011-02-03 )

  • %Y/%m/%d (only date, no time, e.g., 2011/02/03)

  • Any integer value (Unix Epoch time, where Jan 1, 1970 at 00:00:00 is integer 0)

Format notation:

%Y is a 4-digit year. A 2-digit year is not a valid value.

%m and %s are a month (1 to 12) and a day (1 to 31), respectively. Leading zeroes are optional.

%H, %M, %S are hours (0 to 23), minutes (0 to 59) and seconds (0 to 59), respectively. Leading zeroes are optional.

When loading data, the loader checks whether the values of year, month, day, hour, minute, second are out of the valid range. If any invalid value is present, e.g. '2010-13-05' or '2004-04-31 00:00:00', the attribute is invalid and the object (vertex or edge) is not created.

Loading a User-Defined Type (UDT) Attribute

To load a UDT attribute, state the name of the UDT type, followed by the list of attribute expressions for the UDT's fields, in parentheses. See the example below.

Loading a LIST or SET Attribute

There are three methods to load a LIST or a SET.

The first method is to load multiple rows of data which share the same id values and append the individual attribute values to form a collection of values. The collections are formed incrementally by reading one value from each eligible data line and appending the new value into the collection. When the loading job processes a line, it checks to see whether a vertex or edge with that id value(s) already exists or not. If the id value(s) is new, then a new vertex or edge is created with a new list/set containing the single value. If the id(s) has been used before, then the value from the new line is appended to the existing list/set. Below shows an example:

The job load_set_list will load two test_vertex vertices because there are two unique id values in the data file. Vertex 1 has attribute values with iset = [10,20] and ilist = [10,20,20]. Vertex 3 has values iset = [30,40] and ilist = [30, 30, 40]. Note that a set doesn't contain duplicate values, while a list can contain duplicate values.

If the input file contains multiple columns which should be all added to the LIST or SET, then a second method is available. Use the LIST() or SET() function as in the example below:

The third method is to use the SPLIT () function to read a compound token and split it into a collection of elements, to form a LIST or SET collection. The SPLIT() function takes two arguments: the column index and the element separator. The element separator should be distinct from the separator through the whole file. Below shows an example:

Loading a MAP Attribute

There are three methods to load a MAP.

The first method is to load multiple rows of data which share the same id values. The maps are formed incrementally by reading one key-value pair from each eligible data line. When the loading job processes a line, it checks to see whether a vertex or edge with that id value(s) already exists or not. If the id value(s) is new, then a new vertex or edge is created with a new map containing the single key-value pair. If the id(s) has been used before, then the loading job checks whether the key exists in the map or not. If the key doesn't exist in the map, the new key-value pair is inserted. Otherwise, the value will be replaced by the new value.

Method 1 : Below is the syntax to load a MAP by the first method: Use an arrow (->) to separate the map's key and value.

Method 2 : The second method is to use the MAP() function. If there are multiple key-value pairs among multiple columns, MAP() can load them together. Below is an example:

Method 3 : The third method is to use the SPLIT() function. Similar to the SPLIT() in loading LIST or SET, the SPLIT() function can be used when the key-value pair is in one column and separated by a key-value separator, or multiple key-value pairs are in one column and separated by element separators and key-value separators. SPLIT() here has three parameters: The first is the column index, the second is the key-value separator, and the third is the element separator. The third parameter is optional. If one row of raw data only has one key-value pair, the third parameter can be skipped. Below are the examples without and with the given element separator.

Loading Composite Key Attributes

Loading a Composite Key for a vertex works no differently that normal loading. Simply load all the attributes as you word for a vertex with a single-attribute primary key. The primary key will automatically be constructed from the appropriate attributes.

When loading to an edge where either TO_VERTEX or FROM_VERTEX contains a composite key, the composite set of attributes must be enclosed in parameters. See the example below.

Loading Wildcard Type Edges

If an edge has been defined using a wildcard vertex type, a vertex type name must be specified, following the vertex id, in a load statement for the edge. An example is shown below:

Built-in Loader Token Functions

The GSQL Loader provides several built-in functions which operate on tokens. Some may be used to construct attribute expressions and some may be used for conditional expressions in the WHERE clause.

Token Functions for Attribute Expressions

The following token functions can be used in an id or attribute expression

Timestamp Input Format

Reducer Functions

A reducer function aggregates multiple values of a non-id attribute into one attribute value of a single vertex or edge. Reducer functions are computed incrementally; that is, each time a new input token is applied, a new resulting value is computed.

To reduce and load aggregate data to an attribute, the attribute expression has the form

where reducer_function is one of the functions in the table below. input_expr can include non-reducer functions, but reducer functions cannot be nested.

Each reducer function is overloaded so that one function can be used for several different data types. For primitive data types, the output type is the same as the input_expr type. For LIST, SET, and MAP containers, the input_expr type is one of the allowed element types for these containers (see "Complex Types" in the Attribute Data Types section). The output is the entire container.

WHERE Clause

The WHERE clause is an optional clause. The WHERE clause's condition is a boolean expression. The expression may use column token variables, token functions, and operators which are described below. The expression is evaluated for each input data line. If the condition is true, then the vertex or edge instance is loaded into the graph store. If the condition is false, then this instance is skipped. Note that all attribute values are treated as string values in the expression, so the type conversion functions to_int() and to_float(), which are described below, are provided to enable numerical conditions.

Operators in the WHERE Clause

The GSQL Loader language supports most of the standard arithmetic, relational, and boolean operators found in C++. Standard operator precedence applies, and parentheses provide the usual override of precedence.

  • Arithmetic Operators: +, -, *, /, ^ Numeric operation can be used to express complex operation between numeric types. Just as in ordinary mathematical expressions, parentheses can be used to define a group and to modify the order of precedence.

  • Relational Operators: <, >, ==, !=, <=, >= Comparisons can be performed between two numeric values or between two string values.

  • Predicate Operators:

    • AND, OR, NOT operators are the same as in SQL. They can be used to combine multiple conditions together. E.g., $0 < "abc" AND $1 > "abc" selects the rows with the first token less than "abc" and the second token greater than "abc". E.g., NOT $1 < "abc" selects the rows with the second token greater than or equal to "abc".

    • IS NUMERIC token IS NUMERIC returns true if token is in numeric format. Numeric format include integers, decimal notation, and exponential notation. Specifically, IS NUMERIC is true if token matches the following regular expression: (+/-) ? [0-9] + (.[0-9]) ? [0-9] * ((e/E)(+/-) ? [0-9] +) ? . Any leading space and trailing space is skipped, but no other spaces are allowed. E.g., $0 IS NUMERIC checks whether the first token is in numeric format.

    • IS EMPTY token IS EMPTY returns true if token is an empty string. E.g., $1 IS EMPTY checks whether the second token is empty.

    • IN token IN ( set_of_values ) returns true if token is equal to one member of a set of specified values. The values may be string or numeric types. E.g., $2 IN ("abc", "def", "lhm") tests whether the third token equals one of the three strings in the given set. E.g., to_int($3) IN (10, 1, 12, 13, 19) tests whether the fourth token equals one of the specified five numbers.

    • BETWEEN ... AND token BETWEEN lowerVal AND upperVal returns true if token is within the specified range, inclusive of the endpoints. The values may be string or numeric types. E.g., $4 BETWEEN "abc" AND "def" checks whether the fifth token is greater than or equal to "abc" and also less than or equal to "def" E.g., to_float($5) BETWEEN 1 AND 100.5 checks whether the sixth token is greater than or equal to 1.0 and less than or equal to 100.5.

Token functions in the WHERE clause

The GSQL loading language provides several built-in functions for the WHERE clause.

User-Defined Token Functions

Users can write their own token functions in C++ and install them in the GSQL system. The system installation already contains a source code file containing sample functions. Users simply add their customized token functions to this file. The file for user-defined token functions for attribute expressions or WHERE clauses is at <tigergraph.root.dir>/dev/gdk/gsql/src/TokenBank/TokenBank.cpp. There are a few examples in this file, and details are presented below .

Testing your functions is simple. In the same directory with the TokenBank.cpp file is a command script called compile.

1. To test that your function compiles:

2. To test that your function works correctly, write your own test and add it to the main() procedure in the TokenBank.cpp. Then, compile the file and run it. Note that files located in ../TokenLib need to be included:

User-defined Token Functions for Attribute Expressions

The parameters are as follows: iToken is the array of string tokens, iTokenLen is the array of the length of the string tokens, and iTokenNum is the number of tokens. Note that the input tokens are always in string (char*) format.

If the attribute type is not string nor string compress, the return type should be the corresponding type: bool for bool; uint64_t for uint; int64_t for int; float for float double for double. If the attribute type is string or string compress, the return type should be void, and use the extra parameters ( char *const oToken, uint32_t& oTokenLen) for storing the return string. oToken is the returned string value, and oTokenLen is the length of this string.

The built-in token function gsql_concat is used as an example below. It takes multiple-token parameter and returns a string.

User-defined Token Functions for WHERE Clause

User-defined token functions (described above) can also be used to construct the boolean conditional expression in the WHERE clause. However, there are some restrictions in the WHERE clause:

The source code for the built-in token function gsql_token_equal is used as an example for how to write a user-defined token function.

Other Optional LOAD Clauses

OPTION clause

There are no supported options for the OPTION clause at this time.

USING clause

A USING clause contains one or more optional parameter value pairs:

QUOTE parameter

The parser will not treat separator characters found within a pair of quotation marks as a separator. For example, if the parsing conditions are QUOTE="double", SEPARATOR=",", the comma in "Leonard,Euler" will not separate Leonard and Euler into separate tokens.

  • If QUOTE is not declared, quotation marks are treated as ordinary characters.

  • If QUOTE is declared, but a string does not contain a matching pair of quotation marks, then the string is treated as if QUOTE is not declared.

  • Only the string inside the first pair of quote (from left to right) marks are loaded. For example QUOTE="double", the string a"b"c"d"e will be loaded as b.

  • There is no escape character in the loader, so the only way to include quotation marks within a string is for the string body to use one type of quote (single or double) and to declare the other type as the string boundary marker.

Loading JSON Data

When the USING option JSON_FILE="true" is used, the loader loads JSON objects instead of tabular data. A JSON object is an unordered set of key/value pairs, where each value may itself be an array or object, leading to nested structures. A colon separates each key from its value, and a comma separates items in a collection. A more complete description of JSON format is available at www.json.org . The JSON loader requires that each input line has exactly one JSON object. Instead of using column values as tokens, the JSON loader uses JSON values as tokens, that is, the second part of each JSON key/value pair. In a GSQL loading job, a JSON field is identified by a dollar sign $ followed by the colon-separated sequence of nested key names to reach the value from the top level. For example, given the JSON object {"abc":{"def": "this_value"}}, the identifier $"abc":"def" is used to access "this_value". T he double quotes are mandatory.

An example is shown below:

To specify an end-of-line character other than the standard one, use the EOL option, as shown below.

In the above data encoding.json, the order of fields are not fixed and some fields are missing. The JSON loader ignores the order and accesses the fields by the nested key names. The missing fields are loaded with default values. The result vertices are:

VertexMustExist Parameter

Normally, if vertices do not exist when loading data to edges, a vertex will be created for the connecting edge, using default values for all attributes. Using the VERTEXMUSTEXIST="true" option will load data only if the vertices on both sides of an edge already exist, therefore no longer creating extra vertices.

TEMP_TABLE and Flatten Functions

The keyword TEMP_TABLE triggers the use of a temporary data table which is used to store data generated by one LOAD statement, for use by a later LOAD statement. Earlier we introduced the syntax for loading data to a TEMP_TABLE:

This clause is designed to be used in conjunction with the flatten or flatten_json_array function in one of the attr_expr expressions. The flatten function splits a multi-value field into a set of records. Those records can first be stored into a temporary table, and then the temporary table can be loaded into vertices and/or edges. Only one flatten function is allowed in one temp table destination clause.

There are two versions of the flatten function: One parses single-level groups and the other parses two-level groups. There are also two versions of the flatten_json_array function: One splits an array of primitive values, and the other splits an array of JSON objects.

One-Level Flatten Function

flatten( column_to_be_split, separator, 1 ) is used to parse a one-level group into individual elements. An example is shown below:

The loading job contains two LOAD statements. The first one loads input data to Book vertices and to a TEMP_TABLE. The second one loads the TEMP_TABLE data to Genre vertices and book_genre edges.

Line 5 says that the third column ($2) of each input line should be split into separate tokens, with comma "," as the separator. Each token will have its own row in table t1. The first column is labeled "bookcode" with value $0 and the second column is "genre" with one of the $2 tokens. The contents of TEMP_TABLE t1 are shown below:

Then, lines 8 to 10 say to read TEMP_TABLE t1 and to do the following for each row:

  • Create a Genre vertex for each new value of "genre".

  • Create a book_genre edge from "bookcode" to "genre". In this case, each row of TEMP_TABLE t1 generates one book_genre edge.

The final graph will contain two Book vertices (101 and 102), five Genre vertices, and six book_genre edges.

Two-Level Flatten Function

flatten( column_to_be_split, group_separator, sub_field_separator, number_of_sub_fields_in_one_group ) is used for parse a two-level group into individual elements. Each token in the main group may itself be a group, so there are two separators: one for the top level and one for the second level. An example is shown below.

The flatten function now has four parameters instead of three. The additional parameter is used to record the genre_name in the Genre vertices.

In this example, in the genres column ($2), there are multiple groups, and each group has two sub-fields, genre_id and genre_name. After running the loading job, the file book2.dat will be loaded into the TEMP_TABLE t2 as shown below.

Flatten a JSON Array of Primitive Values

flatten_json_array($" array_name ") parses a JSON array of primitive (string, numberic, or bool) values, where "array_name" is the name of the array. Each value in the array creates a record. Below is an example:

The above data and loading job creates the following temporary table:

Flatten a JSON Array of JSON Objects

flatten_json_array ( $"array_name", $"sub_obj_1", $"sub_obj_2", ..., $"sub_obj_n" ) parses a JSON array of JSON objects. "array_name" is the name of the array, and the following parameters $"sub_obj_1", $"sub_obj_2", ..., $"sub_obj_n" are the field key names in each object in the array. See complete example below:

When splitting a JSON array of JSON objects, the primitive values are skipped and only JSON objects are processed. As in the example above, the 4th line's "plug-ins" field will not generate any record because its "plug-ins" array doesn't contain any JSON object. Any field which does not exist in the object will be loaded with default value. The above example generates the temporary table shown below:

Flatten a JSON Array in a CSV file

flatten_json_array() can also be used to split a column of a tabular file, where the column contains JSON arrays. An example is given below:

The second column in the csv file is a JSON array which we want to split. flatten_json_array() can be used in this case without the USING JSON_FILE="true" clause:

The above example generates the temporary table shown below:

DELETE statement

In addition to loading data, a LOADING JOB can be used to perform the opposite operation: deleting vertices and edges, using the DELETE statement. DELETE cannot be used in offline loading. Just as a LOAD statement uses the tokens from each input line to set the id and attribute values of a vertex or edge to be created, a DELETE statement uses the tokens from each input line to specify the id value of the item(s) to be deleted.

There are four variations of the DELETE statement. The syntax of the four cases is shown below.

An example using book_rating data is shown below:

offline2online Job Conversion (DEPRECATED)

The gsql command offline2online converts an installed offline loading job to an equivalent online loading job or set of jobs.

Online Job Names

An offline loading job contains one or more LOAD statements, each one specifying the name of an input data file. The offline2online will convert each LOAD statement into a separate online loading job. The data filename will be appended to the offline job name, to create the new online job name. For example, if the offline job has this format:

then running the GSQL command offline2online loadEx will create two new online loading jobs, called loadEx_fileA and loadEx_fileB . The converted loading jobs are installed in the GSQL system; they are not available as text files. However, if there are already jobs with these names, then a version number will be appended: first "_1", then "_2", etc.

For example, if you were to execute offline2online loadEx three times, this would generate the following online jobs:

  • 1st time: loadEx_fileA, loadEx_fileB

  • 2nd time: loadEx_fileA_1, loadEx_fileB_1

  • 3rd time: loadEx_fileA_2, loadEx_fileB_2

Conversion and RUN JOB Details

When running any online loading job, the input data filename and the separator character must be provided. See sections on the USING clause and Running a Loading Job for more details.

If an online loading job is run with the HEADER="true" option, it will skip the first line in the data file, but it will not read that line to get the column names. Therefore, offline jobs which read and use column header names must be manually converted to online jobs.

The following example is taken from the Social Network case in the GSQL Tutorial with Real-Life Examples . In version 0.2 of the tutorial, we used offline loading. The job below uses the same syntax as v0.2, but some names have been updated:

To run, this job:

Note that the first LOAD statement has HEADER="true", but is does not make use of column names. It simply uses column indices $0, $1, $2, and $3. Therefore, the HEADER option can still be used with the converted job. Running offline2online load_social1 , creates two new jobs called load_social_social_users.csv and load_social_social_connection.csv.

The equivalent run commands for the jobs are the following: