Creating a Loading Job
After a graph schema has been created, the system is ready to load data into the graph store. The GSQL language offers commands for data loading which perform many of the same data conversion, mapping, filtering, and merging operations that are found in enterprise ETL (Extract, Transform, and Load) systems.
The GSQL system can read structured or semi-structured data from text files. The loading language syntax is geared towards tabular or JSON data, but conditional clauses and data manipulation functions allow for reading data that is structured in a more complex or irregular way. For tabular data, each line in the data file contains a series of data values, separated by commas, tabs, spaces, or any other designated ASCII characters (only single character separators are supported). A line should contain only data values and separators, without extra whitespace. From a tabular view, each line of data is a row, and each row consists of a series of column values.
Loading data is a two-step process.
-
First, a loading job is defined with a
CREATE LOADING JOB
statement. -
Next, the job is executed with a
RUN LOADING JOB
statement.
These two statements, and the components of the loading job, are detailed below.
The structure of a loading job will be presented hierarchically, top-down:
CREATE LOADING JOB
, which may contain a set of DEFINE and LOAD statements
-
DEFINE
statements -
LOAD
statements, which can have several clauses
All blank spaces are meaningful in string fields in CSV and JSON.
Either pre-process your data files to remove extra spaces, or use GSQL’s token processing functions gsql_trim , gsql_ltrim , and gsql_rtrim ([_token_functions]).
|
Loading job capabilities
TigerGraph’s syntax for defining and running loading jobs offers several advantages:
-
The TigerGraph platform can handle concurrent loading jobs, which can greatly increase throughput.
-
The data file locations can be specified at compile time or at run time. Run-time settings override compile-time settings.
-
A loading job definition can include several input files. When running the job, the user can choose to run only part of the job by specifying only some of the input files.
-
Loading jobs can be monitored, aborted, and restarted.
Concurrent Loading
Among its several duties, the RESTPP component manages loading jobs. There can be multiple RESTPP-LOADER subcomponents, each of which can handle a loading job independently. The maximum number of concurrent loading jobs is set by the configuration parameter FileLoader.ReplicaNumber
.
Furthermore, if the TigerGraph graph is distributed (partitioned) across multiple machine nodes, each machine’s RESTPP-LOADER(s) can be put into action. Each RESTPP-LOADER only reads local input data files, but the resulting graph data can be stored on any machine in the cluster.
To maximize loading performance in a cluster, use at least two loaders per machine, and assign each loader approximately the same amount of data. |
A concurrent-capable loading job can logically be separated into parts according to each file variable. When a concurrent-capable loading job is compiled, a RESTPP endpoint is generated for the loading job, which you can call to load data into your graph as an alternative to RUN LOADING JOB
.
Example loading jobs and data files for the |
CREATE LOADING JOB
The CREATE LOADING JOB
statement is used to define a block of DEFINE
, LOAD
, and DELETE
statements for loading data to or removing data from a particular graph.
The sequence of statements is enclosed in curly braces.
Each statement in the block, including the last one, should end with a semicolon.
CREATE LOADING JOB job_name FOR GRAPH Graph_Name {
[zero or more DEFINE statements;] (1)
[zero or more LOAD statements;] | [zero or more DELETE statements;] (2)
}
1 | While one loading job may define multiple data sources (files), keep the number below 100 for best performance. |
2 | A loading job may contain either LOAD or DELETE statements but not both.
A loading job that includes both will be rejected when the CREATE statement is executed. |
Loading data to global vertices and edges
The CREATE VERTEX
and CREATE EDGE
commands create global vertices and edges.
See the CREATE VERTEX and CREATE EDGE pages for more detail.
Global vertex and edge type data is shared across all graphs, no matter which graph it is loaded to.
In other words, if v1
and e2
are of GLOBAL
type, the data loaded in this loading job will be accessible in other graphs that use v1
and e2
vertices and edges.
To create a local vertex or edge, run a schema change job after a graph has already been created and use the ADD VERTEX
or ADD EDGE
command to add a vertex or edge to a single graph.
See Global vs. local schema changes for more information.
Example
CREATE LOADING JOB job_1 FOR GRAPH Graph_1 {
DEFINE FILENAME file1 = "/data/v1.csv"; (1)
DEFINE FILENAME file2;
LOAD file1 TO VERTEX v1 VALUES ($0, $1, $2);
LOAD file2 TO EDGE e2 VALUES ($0, $1);
}
RUN LOADING JOB job1 USING file1="m1:/data/v1_1.csv", file2="m2:/data/e2.csv" (2)
1 | File path specified at compile time. |
2 | Run-time specification will override path specified at compile time. |
DROP JOB
statement
To drop (remove) a job, run DROP JOB job_name
. The job will be removed from GSQL.
To drop all jobs, run either of the following commands:
DROP JOB ALL
DROP JOB *
The scope of DROP JOB ALL
depends on the user’s current scope.
If the user has set a working graph, then DROP ALL
removes all the jobs for that graph.
If a superuser has set their scope to be global, then DROP JOB ALL
removes all jobs across all graph spaces.
DEFINE
statements
A DEFINE
statement is used to define a local variable or expression to be used by the subsequent LOAD
statements in the loading job.
DEFINE FILENAME
The DEFINE FILENAME
statement defines a filename variable.
The variable can then be used later in the JOB
block by a LOAD
statement to identify its data source.
Every concurrent loading job must have at least one DEFINE FILENAME
statement.
Having more than 100 file or folder sources will degrade performance. Consider either consolidating sources or splitting your work into separate loading jobs. |
DEFINE FILENAME filevar ["=" filepath_string ];
filepath_string = (path | " all :" path | " any :" path | mach_aliases " :" path ["," mach_aliases ":" path ]*)
mach_aliases = name["|"name]*
Parameters
filevar
-
The name of the filename variable.
filepath_string
-
The
filevar
is optionally followed by afilepath_string
, which tells the job where to find input data.filepath_string
is a string value and should start and end with double quotes.There are four options for
filepath_string
:-
An absolute or relative path for either a file or a folder on the machine where the job is run. If it is a folder, then the loader will attempt to load each non-hidden file in the folder. If this path is not valid when
CREATE LOADING JOB
is executed, GSQL will report an error.An absolute path may begin with the session parameter
$sys.data_root
. -
An absolute or relative path for either a file or a folder on all machines in the cluster: If the path is prefixed with
all:
, then the loading job will attempt to run on every machine in the cluster which has a RESTPP component, and each machine will look locally for data at path. If the path is invalid on any of the machines, the job will be aborted. Also, the session parameter$sys.data_root
may not be used.ALL:path
examples"ALL:/data/graph.csv"
-
An absolute or relative path for either a file or a folder on any machine in the cluster: If the path is prefixed with
any:
, then the loading job will attempt to run on every machine in the cluster which has a RESTPP component, and each machine will look locally for data at the specified path. If the path is invalid on any of the machines, those machines where the path is not valid are skipped. Also, the session parameter$sys.data_root
may not be used.ANY:path
examples"ANY:/data/graph.csv"
-
A list of machine-specific paths: A machine alias is a name such as m1, m2, etc. which is defined when the cluster configuration is set. For this option, the
filepath_string
may include a list of paths, separated by commas. If several machines have the same path, the paths can be grouped together by using a list of machine aliases, with the vertical bar "|" as a separator. The loading job will run on whichever machines are named; each RESTPP-LOADER will work on its local files.machine-specific path example"m1:/data1.csv, m2|m3|m5:/data/data2.csv"
-
DEFINE HEADER
The DEFINE HEADER
statement defines a sequence of column names for an input data file.
The first column name maps to the first column, the second column name maps to the second column, etc.
DEFINE HEADER header_name = " column_name "[," column_name "]*;
DEFINE INPUT_LINE_FILTER
The DEFINE INPUT_LINE_FILTER
statement defines a named Boolean expression whose value depends on column attributes from a row of input data.
When combined with a USING reject_line_rule
clause in a LOAD
statement, the filter determines whether an input line is ignored or not.
DEFINE INPUT_LINE_FILTER filter_name = boolean_expression_using_column_variables ;
Load statement
A LOAD
statement describes how to parse a data line into column values (tokens), and then describes how the values should be used to create a new vertex or edge instance.
One LOAD
statement can be used to generate multiple vertices or edges, each vertex or edge having its own destination clause, as shown below.
Additionally, two or more LOAD statements may refer to the same input data file.
In this case, the GSQL loader will merge their operations so that both of their operations are executed in a single pass through the data file.
The LOAD
statement has many options. This reference guide provides examples of key features and options. Tutorials such as GSQL 101 provide additional solution- and application-oriented examples.
Different LOAD
statement types have different rules for the USING
clause; see the USING
clause section below for specifics.
Syntax
LOAD [ filepath_string|filevar|TEMP_TABLE table_name ]
destination_clause [, destination_clause ]*
[ TAGS clause ]
[ USING clause ];
filevar
must have been previously defined in a DEFINE FILENAME
statement.
filepath_string
must satisfy the same rules given above in the DEFINE FILENAME
section.
Destination clause
A destination clause describes how the tokens from a data source should be used to construct one of three types of data objects: a vertex, an edge, or a row in a temporary table (TEMP_TABLE
).
This section describes destination clauses for loading to vertices and edges.
TO VERTEX vertex_type_name VALUES (id_expr [, attr_expr]*)
[WHERE conditions] [OPTION (options)]
TO EDGE edge_type_name VALUES (source_id_expr [source_type_expr],
target_id_expr [target_type_expr]
[, attr_expr]*)
[WHERE conditions] [OPTION (options)]
For the TO VERTEX
and TO EDGE
destination clauses, the following rules for its parameters apply:
-
The
vertex_type_name
oredge_type_name
must match the name of a vertex or edge type previously defined in aCREATE VERTEX
orCREATE UNDIRECTED|DIRECTED EDGE
statement. -
The values in the value list_(
id_expr
,attr_expr1
,attr_expr2
,…)_ are assigned to the ID(s) and attributes of a new vertex or edge instance, in the same order in which they are listed in theCREATE
statement. -
id_expr
obeys the same attribute rules asattr_expr
, except that onlyattr_expr
can use the reducer function, which is introduced later. -
For edge clauses, the
source_id_expr
andtarget_id_expr
can each optionally be followed by asource_type_expr
andtarget_type_expr
, respectively. Thesource_type_expr
andtarget_type_expr
must evaluate to one of the allowed endpoint vertex types for the given edge type. By specifying the vertex type, this tells the loader what id types to expect. This may be important when the edge type is defined to accept more than one type of source/target vertex.
To support fast, out-of-order loading, if one or both of the endpoint vertices of an edge do not yet exist, the loader will create vertices with the necessary IDs and default attribute values. If the vertex data is loaded later, it will be automatically merged with the automatically created vertices. The user can disable this feature and perform regular referential integrity checking by setting the |
Examples
Suppose we have the following vertex and edge types:
CREATE VERTEX Person (pid STRING PRIMARY KEY, birthdate DATETIME)
CREATE VERTEX Company (cid INT PRIMARY KEY, industry STRING)
CREATE DIRECTED EDGE Visit (FROM Person, TO Person
| FROM Person, TO Company, year INT)
Vertex loading
The Person
vertex type has a primary ID pid
and an attribute birthdate
.
If we have the following CSV file with two columns:
Name,DOB
Sam,1994-05-01
Chris,1985-12-25
Pat,1996-11-13
Joe,1975-11-02
Taylor,1988-04-25
If we want to use a person’s name as their primary ID, and load the DOB column into their birthdate
attribute, we can use the following LOAD
statement:
LOAD file1 TO VERTEX Person VALUES ($"Name", $"DOB") USING HEADER = "TRUE"
Edge loading
A Visit
edge can connect two Person
vertices or a Person
to a Company
.
A Person
has a string ID, while a Company has an INT
ID.
Then suppose the Visit
edge source data comes from a single CSV file, containing both variants of edges.
Note that the 2nd column ($1
) contains either Person
or Company
, and that the 3rd column ($2
) contains either a string or an integer.
Sam,Person,Joe,2012
Sam,Company,4057,2017
Chris,Company,9401,2016
Pat,Person,Taylor,2020
Using the optional target_type_expr
field, we can load both variants of the Visit
edge with a single clause.
LOAD file2 TO EDGE Visit VALUES ($0, $2 $1, $3) USING SEPARATOR=",";
Known issue: When loading data into edge types defined with multiple === WHERE clause The ==== Operators in the WHERE Clause The GSQL Loader language supports most of the standard arithmetic, relational, and boolean operators found in C++. Standard operator precedence applies, and parentheses provide the usual override of precedence.
|
Because computers can only store approximations for most DOUBLE
and FLOAT
type values, testing these data types for exact equality or inequality is not recommended.
Instead, one should allow for an acceptable amount of error.
The following example checks if $0
is equal to 5
, with an error of 0.00001 permitted:
WHERE to_float($0) BETWEEN 5-0.00001 AND 5+0.00001
- Relational Operators
-
<
,>
,==
,!=
,⇐
,>=
. Comparisons can be performed between two numeric values or between two string values. - Predicate Operators
-
AND
,OR
,NOT
-
operators are the same as in SQL. They can be used to combine multiple conditions together. E.g., $0 < "abc" AND $1 > "abc" selects the rows with the first token less than "abc" and the second token greater than "abc". E.g., NOT $1 < "abc" selects the rows with the second token greater than or equal to "abc".
IS NUMERIC
-
<token> IS NUMERIC
returns true if<token>
is in numeric format. Numeric format include integers, decimal notation, and exponential notation. Specifically,IS NUMERIC
is true if token matches the following regular expression:(/-) ? [0-9] + (.[0-9]) ? [0-9] * ((e/E)(/-) ? [0-9] +) ?
. Any leading space and trailing space is skipped, but no other spaces are allowed. For example,$0 IS NUMERIC
checks whether the first token is in numeric format. IS EMPTY
-
<token> IS EMPTY
returns true if<token>
is an empty string. For example,$1 IS EMPTY
checks whether the second token is empty. IN
-
<token> IN ( <set_of_values> )
returns true if<token>
is equal to one member of a set of specified values. The values may be string or numeric types. For example,$2 IN ("abc", "def", "lhm")
tests whether the third token equals one of the three strings in the given set. For example,to_int($3) IN (10, 1, 12, 13, 19)
tests whether the fourth token equals one of the specified five numbers. BETWEEN … AND
-
<token> BETWEEN <lower_val> AND <upper_val>
returns true if<token>
is within the specified range, inclusive of the endpoints. The values may be string or numeric types. For example,$4 BETWEEN "abc" AND "def"
checks whether the fifth token is greater than or equal to "abc" and also less than or equal to "def";to_float($5) BETWEEN 1 AND 100.5
checks whether the sixth token is greater than or equal to 1.0 and less than or equal to 100.5.
==== Token functions in the WHERE clause
The GSQL loading language provides several built-in functions for the WHERE
clause.
The token functions in the WHERE clause and those token functions used for attribute expression are different. They cannot be used interchangeably.
Function name | Output type | Description of function |
---|---|---|
|
Converts main_string to an integer value. |
|
|
Converts main_string to a float value. |
|
|
Returns a string which is the concatenation of string1 and string2 . |
|
INT |
Returns the length of main_string. |
|
|
Returns true if main_string is empty after removing white space. Returns false otherwise. |
|
|
Returns true if string1 is exactly the same (case-sensitive) as string2 . Returns false otherwise. |
|
|
Returns true if string1 is exactly the same (case-insensitive) as string2. Returns false otherwise. |
|
|
Returns true if main_string is either "t" or "true" (case-insensitive). Returns false otherwise. |
|
|
Returns true if main_string is either "f" or "false" (case-insensitive). Returns false otherwise. |
=== TAGS
clause (Deprecated)
Tag-based Vertex-Access Access Control is deprecated as of October 2023. TigerGraph will be introducing a more flexible scheme in an upcoming release. |
The TAGS
clause specifies the tags to be applied to the vertices loaded by the LOAD
statement.
TAGS "(" tag_name (, tag_name)* ")" BY [ OR | OVERWRITE ]
If a LOAD
statement has a TAGS
clause, it will tag the vertices with the tags specified in the TAGS
clause.
Before vertices can be loaded and tagged with a LOAD
statement, the vertex type must first be marked as taggable, and the tags must be defined.
Users have two options when it comes to how to merge tags if the target vertices exist in the graph:
-
BY OR
: Add the new tags to the existing set of tags. -
BY OVERWRITE
: Overwrite existing tags with the new tags.
=== USING
clause
A USING
clause contains one or more optional parameter value pairs:
USING parameter=value [,parameter=value]*
If multiple LOAD
statements use the same source (the same file path, the same temporary data table, or the same file variable), the USING
clauses in these LOAD
statements must be the same.
Therefore, we recommend that if multiple destination clauses share the same source, put all of these destination clauses into the same LOAD
statement.
Parameter | Description | Allowed values |
---|---|---|
|
Specifies the special character that separates tokens (columns) in the data file. |
Any single ASCII character. Default: comma
|
|
Specifies the end-of-line character. |
Any ASCII sequence Default: |
|
Specifies explicit boundary markers for string tokens, either single or double quotation marks. The parser will not treat separator characters found within a pair of quotation marks as a separator.
For example, if the parsing conditions are
|
|
|
Whether the data file’s first line is a header line. The header assigns names to the columns. The |
Default is |
|
Specifies the name of the header variable, when a header has been defined in the loading job, rather than in the data file |
The variable name in the preceding |
|
If the filter expression evaluates to true, then do not use this input data line. |
name of filter from a preceding |
|
Whether each line is a json object (see Loading JSON Data below for more details) |
Default is |
|
If true, treat vertices as insert-only. If the input data refers to a vertex which already exists, do not update it. If false, upsert vertices. |
Default is |
|
If true, only insert or update an edge if both endpoint vertices already exist. If false, always insert new edges, creating endpoint vertices as needed, using given id and default values for other parameters. |
Default is "false" |
=== Loading JSON Data
When the USING option JSON_FILE="true"
is used, the loader loads JSON objects instead of tabular data.
A JSON object is an unordered set of key/value pairs, where each value may itself be an array or object, leading to nested structures.A colon separates each key from its value, and a comma separates items in a collection.
The JSON loader requires that each input line has exactly one JSON object (see JSON lines.
The JSON loader uses JSON values as tokens, that is, the second part of each JSON key/value pair.
In a GSQL loading job, a JSON field is identified by a dollar sign $
followed by the colon-separated sequence of nested key names to reach the value from the top level.
For example, given the JSON object {"abc":{"def": "this_value"}}
, the identifier $"abc":"def"
is used to access "this_value"
.The double quotes are mandatory.
An example is shown below:
- Loading job
-
JSON loading using EOL
CREATE VERTEX encoding (PRIMARY_ID id STRING, length FLOAT default 10) CREATE UNDIRECTED EDGE encoding_edge (FROM encoding, TO encoding) CREATE GRAPH encoding_graph (*) CREATE LOADING JOB json_load FOR GRAPH encoding_graph { LOAD "encoding.jsonl" TO VERTEX encoding VALUES ($"encoding", $"indent":"length") USING JSON_FILE="true"; } RUN LOADING JOB json_load
- .json file
-
encoding.jsonl
{"encoding": "UTF-7","plug-ins":["c"],"indent" : { "length" : 30, "use_space": true }} {"encoding":"UTF-1","indent":{"use_space": "dontloadme"}, "plug-ins" : [null, true, false] } {"plug-ins":["C","c++"],"indent":{"length" : 3, "use_space": false},"encoding":"UTF-6"}
In the aforementioned file, the order of fields are not fixed and some fields are missing. The JSON loader ignores the order and accesses the fields by the nested key names. The missing fields are loaded with default values. The result vertices are:
id | attr1 |
---|---|
"UTF-7" |
30 |
"UTF-1" |
0 |
"UTF-6" |
3 |
=== Loading Parquet Data
TigerGraph can load data from Parquet files if they are stored in AWS S3 buckets. For more details on how to set up S3 data sources and loading jobs, read the AWS S3 Loader User Guide. In the background TigerGraph uses the JSON loading functionality to read data from Parquet files, so the JSON specific information in the previous section applies.
In order to load Parquet data, you need to:
-
Specify
"file.reader.type": "parquet"
in the S3 file configuration file or argument -
Specify
JSON_FILE="true"
in the USING clause of the LOAD statements -
Refer to JSON keys (≈ Parquet "column names") instead of column numbers
You will probably want to add USING EOF="true"
to your RUN LOADING JOB
statement to explicitly indicate to the loading job to stop after consuming all data from the Parquet source, not to expect further entries.
An example of a Parquet loading setup is shown below:
CREATE DATA_SOURCE S3 s3ds = "{\"file.reader.settings.fs.s3a.access.key\":\"myaccesskey\",\"file.reader.settings.fs.s3a.secret.key\":\"mysecretkey\"}" FOR GRAPH companyGraph
CREATE LOADING JOB parquet_load FOR GRAPH companyGraph {
DEFINE FILENAME f = "$s3ds:{\"file.uris\": \"s3://mybucket/mydata.parquet\", \"file.reader.type\": \"parquet\"}";
LOAD f
TO VERTEX members VALUES($"members", $"members") USING JSON_FILE="true";
}
RUN LOADING JOB parquet_load USING EOF="true"
== Attributes and attribute expressions
A LOAD
statement processes each line of an input file, splitting each lin into a sequence of tokens.
Each destination clause provides a token-to-attribute mapping which defines how to construct a new vertex, an edge, or a temp table row instance (e.g., one data object).
The tokens can also be thought of as the column values in a table.
There are two ways to refer to a column, by position or by name.
Assuming a column has a name, either method may be used, and both methods may be used within one expression.
By Position: The columns (tokens) are numbered from left to right, starting with $0. The next column is $1, and so on.
By Name: Columns can be named, either through a header line in the input file, or through a DEFINE HEADER
statement.
If a header line is used, then the first line of the input file should be structured like a data line, using the same separator characters, except that each column contains a column name string instead of a data value.Names are enclosed in double quotes, e.g. $"age"
.
Data file name: $sys.file_name refers to the current input data file.
In a simple case, a token value is copied directly to an attribute.For example, in the following LOAD
statement,
LOAD "xx/yy/a.csv" TO VERTEX Person VALUES ($0, $1, $sys.file_name)
-
The
PRIMARY_ID
of a person vertex comes from column$0
of the file "xx/yy/a.csv". -
The next attribute of a person vertex comes from column
$1
. -
The next attribute of a person vertex is given the value "xx/y/a.csv" (the filename itself).
Users do not need to explicitly define a primary ID. Given the attributes, one will be selected as the primary key.
=== Cumulative loading
A basic principle in the GSQL Loader is cumulative loading. Cumulative loading means that a particular data object might be written to (i.e., loaded) multiple times, and the result of the multiple loads may depend on the full sequence of writes. This usually means that If a data line provides a valid data object, and the WHERE
clause and OPTION
clause are satisfied, then the data object is loaded.
-
Valid input: For each input data line, each destination clause constructs one or more new data objects. To be a valid data object, it must have an ID value of the correct type, have correctly typed attribute values, and satisfy the optional
WHERE
clause. If the data object is not valid, the object is rejected (skipped) and counted as an error in the log file. The rules for invalid attributes values are summarized below:-
UINT
: Any non-digit character. (Out-of-range values cause overflow instead of rejection) -
INT
: Any non-digit or non-sign character. (Out-of-range values cause overflow instead of rejection) -
FLOAT
andDOUBLE
: Any wrong format -
STRING
,FIXED_BINARY
: N/A -
DATETIME
: Wrong format, invalid date time, or out of range. -
BOOL
: Any value not listed later. -
Complex type: Depends on the field type or element type. Any invalid field (in
UDT
), element (inLIST
orSET
), key or value (inMAP
) causes rejection.
-
-
New data objects: If a valid data object has a new ID value, then the data object is added to the graph store. Any attributes which are missing are assigned the default value for that data type or for that attribute.
-
Overwriting existing data objects: If a valid data object has an ID value for an existing object, then the new object overwrites the existing data object, with the following clarifications and exceptions:
-
The attribute values of the new object overwrite the attribute values of the existing data object.
-
Missing tokens: If a token is missing from the input line so that the generated attribute is missing, then that attribute retains its previous value.
-
A STRING token is never considered missing; if there are no characters, then the string is the empty string
-
Skipping an attribute: A
LOAD
statement can specify that a particular attribute should not be loaded by using the special character _ (underscore) as its attribute expression (attr_expr). For example,
LOAD TO VERTEX Person VALUES ($0, $1, _, $2)
means to skip the next-to-last attribute. This technique is used when it is known that the input data file does not contain data for every attribute.
-
If the load operation is creating a new vertex or edge, then the skipped attribute will be assigned the default value.
-
If the load operation is overwriting an existing vertex or edge, then the skipped attribute will retain its existing value.
=== More complex attribute expressions
An attribute expression may use column tokens (e.g., $0
), literals (constant numeric or string values), any of the built-in loader token functions, or a user-defined token function. Attribute expressions may not contain mathematical or boolean operators (such as +
, *
, AND
). The rules for attribute expressions are the same as those for id expressions, but an attribute expression can additionally use a reducer function:
-
id_expr := $column_number | $"column_name" | constant | $sys.file_name | token_function_name(id_expr [, id_expr ]*)
-
attr_expr := id_expr | REDUCE(reducer_function_name(id _expr))
Note that token functions can be nested, that is, a token function can be used as an input parameter for another token function. The built-in loader token/reducer functions and user-defined token functions are described in the section "Built-In Loader Token Functions".
The subsections below describe details about loading particular data types.
=== Loading a DOUBLE
or FLOAT
attribute
A floating point value has the basic format
[sign][digits].[digits](e|E)[sign][digits]
In the first case, the decimal point and following digits are required. In the second case, some digits are required (looking like an integer), and the following decimal point and digits are optional.
In both cases, the leading sign ("+" or "-") is optional. The exponent, using "e" or "E", is optional. Commas and extra spaces are not allowed.
# Valid floating point values
-198256.03
+16.
-.00036
7.14285e15
9.99E-22
# Invalid floating point values
-198,256.03
9.99 E-22
=== Loading a DATETIME
attribute
When loading data into a DATETIME
attribute, the GSQL loader will automatically read a string representation of DateTime information and convert it to internal DateTime representation. The loader accepts any of the following string formats:
-
%Y-%m-%d %H:%M:%S
(e.g., 2011-02-03 01:02:03) -
%Y/%m/%d %H:%M:%S
(e.g., 2011/02/03 01:02:03) -
%Y-%m-%dT%H:%M:%S.000z
(e.g., 2011-02-03T01:02:03.123z, 123 will be ignored) -
%Y-%m-%d
(only date, no time, e.g., 2011-02-03) -
%Y/%m/%d
(only date, no time, e.g., 2011/02/03) -
Any integer value (Unix Epoch time, where Jan 1, 1970 at 00:00:00 is integer 0)
Format notation:
%Y is a 4-digit year. A 2-digit year is not a valid value.
%m and %s are a month (1 to 12) and a day (1 to 31), respectively. Leading zeroes are optional.
%H, %M, %S are hours (0 to 23), minutes (0 to 59) and seconds (0 to 59), respectively. Leading zeroes are optional.
When loading data, the loader checks whether the values of year, month, day, hour, minute, second are out of the valid range. If any invalid value is present, e.g. '2010-13-05' or '2004-04-31 00:00:00', the attribute is invalid and the object (vertex or edge) is not created.
=== Loading a BOOL
attribute
When loading data from CSV files the following values are accepted for BOOL attributes :
-
True:
TRUE
,True
,true
,1
-
False:
FALSE
,False
,false
,0
When loading data from JSON documents, the valid BOOL values are true
and false
.
=== Loading a User-defined type (UDT) attribute
To load a UDT attribute, state the name of the UDT type, followed by the list of attribute expressions for the UDT’s fields, in parentheses. See the example below.
TYPEDEF TUPLE <f1 INT (1), f2 UINT, f3 STRING (10), f4 DOUBLE > My_Tuple // define a UDT
CREATE VERTEX Vertex_UDT (id STRING PRIMARY KEY, att_udt My_Tuple)
CREATE GRAPH Test_Graph (Vertex_UDT)
CREATE LOADING JOB load_udt FOR GRAPH Test_Graph {
DEFINE FILENAME f;
LOAD f TO VERTEX Vertex_UDT VALUES ($0, My_Tuple($1, $2, $3, $4));
// $1 is loaded as f1, $2 is loaded as f2, and so on
}
RUN LOADING JOB load_udt USING f="./udt.csv"
=== Loading a LIST or SET attribute
There are three methods to load a LIST
or a SET
.
The first method is to load multiple rows of data that share the same ID values and append the individual attribute values to form a collection of values. The collections are formed incrementally by reading one value from each eligible data line and appending the new value into the collection. When the loading job processes a line, it checks to see whether a vertex or edge with that id value(s) already exists or not. If the id value(s) is new, then a new vertex or edge is created with a new list/set containing the single value. If the id(s) has been used before, then the value from the new line is appended to the existing list/set. Below shows an example:
- Schema and loading job
-
Example: Cumulative loading of multiple rows to a SET/LIST
CREATE VERTEX Test_Vertex (PRIMARY_ID id STRING, iset SET<INT>, ilist LIST<INT>) CREATE UNDIRECTED EDGE Test_Edge(FROM Test_vertex, TO Test_vertex) CREATE GRAPH Test_Set_List (*) CREATE LOADING JOB load_set_list FOR GRAPH Test_Set_List { DEFINE FILENAME f; LOAD f TO VERTEX Test_Vertex VALUES ($0, $1, $1); } RUN LOADING JOB load_set_list USING f="./list_set_vertex.csv"
- Data file
-
list_set_vertex.csv
1,10 3,30 1,20 3,30 3,40 1,20
The job load_set_list
will load two test_vertex
vertices because there are two unique id values in the data file.Vertex 1 has attribute values with iset = [10,20]
and ilist = [10,20,20]
.Vertex 3 has values iset = [30,40]
and ilist = [30, 30, 40]
.Note that a set doesn’t contain duplicate values, while a list can contain duplicate values.
Because GSQL loading is multi-threaded, the order of values loaded into a LIST might not match the input order.
If the input file contains multiple columns which should be all added to the LIST or SET, then a second method is available.Use the LIST() or SET() function as in the example below:
CREATE VERTEX V_Set (PRIMARY_ID id STRING, nick_names SET<STRING>)
CREATE VERTEX V_List (PRIMARY_ID id STRING, lucky_nums LIST<INT>)
CREATE GRAPH Test_Graph (*)
CREATE LOADING JOB load_set_list FOR GRAPH Test_Graph {
DEFINE FILENAME f;
LOAD f TO VERTEX v_set VALUES ($0, SET($1,$2,$3));
LOAD f TO VERTEX v_list VALUES ($0, LIST($2,$4));
}
The third method is to use the SPLIT()
function to read a compound token and split it into a collection of elements, to form a LIST
or SET
collection.
The SPLIT()
function takes two arguments: the column index and the element separator.
The element separator should be distinct from the separator throughout the whole file.Below shows an example:
- Schema and loading job
-
Example: SET/LIST loading by SPLIT() example
CREATE VERTEX Test_Vertex (PRIMARY_ID id STRING, ustrset SET<STRING>, ilist LIST<INT>) CREATE UNDIRECTED EDGE Test_Edge(FROM Test_Vertex, TO Test_Vertex) CREATE GRAPH Test_Split (*) CREATE LOADING JOB set_list_job FOR GRAPH Test_Split { DEFINE FILENAME f; LOAD f TO VERTEX Test_Vertex VALUES ($0, SPLIT($1,"|") , SPLIT($2,"#")); } RUN LOADING JOB set_list_job USING f="./split_list_set.csv"
- Data file
-
split_list_set.csv
vid,names,numbers v1,mike|tom|jack, 1 # 2 # 3 v2,john, 5 # 4 # 8
The SPLIT()
function cannot be used for UDT type elements.
=== Loading a MAP attribute
There are three methods to load a MAP
.
The first method is to load multiple rows of data that share the same ID values. The maps are formed incrementally by reading one key-value pair from each eligible data line. When the loading job processes a line, it checks to see whether a vertex or edge with that id value(s) already exists or not:
-
If the ID value(s) is new, then a new vertex or edge is created with a new map containing the single key-value pair.
-
If the ID(s) has been used before, then the loading job checks whether the key exists in the map or not.
-
If the key doesn’t exist in the map, the new key-value pair is inserted.
-
If the key does exist, the newly loaded value replaces the old value.
-
The loading order might not be the same as the order in the raw data. If a data file contains multiple lines with the same ID and same key but different values, loading them together results in a nondeterministic final value for that key.
==== Load rows of key-value pairs by ID
To load rows of key-value pairs by ID, use the arrowhead ->
to indicate the key-value pair relationship between the columns, and enclose them in parentheses.
For example, key-value pairs with tokens in column 2 as keys and tokens in column 3 as values is denoted by the following: ($1 -> $2)
.
CREATE VERTEX V_Map (PRIMARY_ID id STRING, att_map MAP<INT, STRING>)
CREATE GRAPH Test_Graph (*)
CREATE LOADING JOB load_map FOR GRAPH Test_Graph {
DEFINE FILENAME f;
LOAD f TO VERTEX v_map VALUES ($0, ($1 -> $2));
}
==== Load key-value pairs in multiple columns with MAP()
The second method is to use the MAP()
function.
If there are multiple key-value pairs among multiple columns, MAP()
can load them together:
CREATE VERTEX V_Map (PRIMARY_ID id STRING, att_map MAP<INT, STRING>)
CREATE GRAPH Test_Graph (*)
CREATE LOADING JOB load_map FOR GRAPH Test_Graph {
DEFINE FILENAME f;
LOAD f TO VERTEX V_Map VALUES ($0, MAP(($1 -> $2), ($3 -> $4)));
// $1 and $3 are keys and $2 and $4 are the corresponding values.
}
==== Load key-value pairs by splitting values in one column
The third method is to use the SPLIT()
function.
The SPLIT()
function can be used when the key-value pair is in one column and separated by a key-value separator, or multiple key-value pairs are in one column and separated by element separators and key-value separators.
SPLIT()
has three parameters: The first is the column index, the second is the key-value separator, and the third is the element separator.
The third parameter is optional.
If one row of raw data only has one key-value pair, the third parameter can be skipped.
Below are the examples without and with the given element separator.
- Data file 1
-
Example data with one key-value pair per line
vid,key_value v1,1:mike v2,2:tom v1,3:lucy
- Data file 2
-
Example data with multiple key-value pairs per line
vid,key_value_list v1,1:mike#4:lin v2,2:tom v1,3:lucy#1:john#6:jack
- Loading job
-
Loading a MAP by method 3: SPLIT() function
CREATE VERTEX v_map (PRIMARY_ID id STRING, att_map MAP<INT, STRING>) CREATE GRAPH Test_Graph (*) CREATE LOADING JOB load_map FOR GRAPH Test_Graph { DEFINE FILENAME f; LOAD f TO VERTEX v_map VALUES ($0, SPLIT($1, ":", "#")); }
The SPLIT()
function cannot be used for UDT type elements.
=== Loading composite key attributes
Loading a Composite Key for a vertex works no differently than normal loading. Simply load all the attributes as you would for a vertex with a single-attribute primary key. The primary key will automatically be constructed from the appropriate attributes.
When loading to an edge where either TO_VERTEX
or FROM_VERTEX
contains a composite key, the composite set of attributes must be enclosed in parentheses. See the example below.
// Schema setup
CREATE VERTEX Composite_Person (id uint, name string, PRIMARY KEY (name,id))
CREATE VERTEX Composite_Movie (id uint, title string, country string, year uint, primary key (title,year,id))
CREATE DIRECTED EDGE Composite_Roles (from Composite_Person,to Composite_Movie, role string) with reverse_edge="composite_roles_reverse"
CREATE GRAPH My_Graph(*)
// Loading job
CREATE LOADING JOB composite_load FOR GRAPH My_Graph {
LOAD "$sys.data_root/movies.csv" TO VERTEX Composite_Movie VALUES
($"id", $"title", $"country" ,$"year") USING header ="true", separator=",";
LOAD "$sys.data_root/persons.csv" TO VERTEX Composite_Person VALUES
($"id",$"name") USING header = "true", separator =",";
LOAD "$sys.data_root/compositeroles.csv" TO EDGE Composite_Roles VALUES
(($"personName", $"personId"),($"movieTitle",$"movieYear",$"movieId"),$"role")
USING header="true", separator = ",";
}
=== Loading Wildcard Type Edges
If an edge has been defined using a wildcard vertex type, a vertex type name must be specified, following the vertex id, in a LOAD
statement for the edge. An example is shown below:
// Schema setup
CREATE VERTEX User (PRIMARY_ID id UINT)
CREATE VERTEX Product (PRIMARY_ID id UINT)
CREATE VERTEX Picture (PRIMARY_ID id UINT)
CREATE UNDIRECTED EDGE Purchase (FROM *, TO *)
CREATE GRAPH Test_Graph(*)
// Loading job
CREATE LOADING JOB test_2 FOR GRAPH Test_Graph {
DEFINE FILENAME f;
LOAD f
TO EDGE Purchase VALUES ($0 User, $1 Product),
TO EDGE Purchase VALUES ($0 User, $2 Picture);
}
== Token functions
Token functions are functions in the DDL language that operate on tokens.Some may be used to construct attribute expressions and some may be used for conditional expressions in the WHERE
clause.
To use a token function, replace the attribute in the destination clause of the LOAD
statement with the function call.The arguments of the function can be a column from the FILE
object.
==== Example
CREATE LOADING JOB load_orders {
DEFINE FILENAME f;
LOAD f
TO VERTEX Order VALUES (gsql_trim($"id"), $"date");
}
=== Token functions for attribute expressions
The following token functions can be used in an ID or attribute expression
Function | Output type | Description |
---|---|---|
gsql_reverse(in_string) |
string |
Returns a string with the characters in the reverse order of the input string in_string. |
gsql_concat(string1, string2,…,stringN) |
string |
Returns a string which is the concatenation of all the input strings. |
gsql_uuid_v4() |
string |
Returns a version-4 UUID. |
gsql_split_by_space(in_string) |
string |
Returns a modified version of in_string, in which each space character is replaced with ASCII 30 (decimal). |
gsql_substring(str, beginIndex [, length]) |
string |
Returns the substring beginning at beginIndex, having the given length. |
gsql_find(str, substr) |
int |
Returns the start index of the substring within the string. If it is not found, then return -1. |
gsql_length(str) |
int |
Returns the length of the string. |
gsql_replace(str, oldToken, newToken [, max]) |
string |
Returns the string resulting from replacing all matchings of oldToken with newToken in the original string. If a max count is provided, there can only be up to that many replacements. |
gsql_regex_replace(str, regex, replaceSubstr) |
string |
Returns the string resulting from replacing all substrings in the input string that match the given regex token with the substitute string. |
gsql_regex_match(str, regex) |
bool |
Returns true if the given string token matches the given regex token and false otherwise. |
gsql_to_bool(in_string) |
bool |
Returns true if the in_string is either "t" or "true", with case-insensitive checking. Returns false otherwise. |
gsql_to_uint(in_string) |
uint |
|
gsql_to_int(in_string) |
int |
|
gsql_ts_to_epoch_seconds(timestamp) |
uint |
Converts a timestamp in canonical string format to Unix epoch time, which is the int number of seconds since Jan. 1, 1970. Refer to the timestamp input format note below. |
gsql_current_time_epoch(0) |
uint |
Returns the current time in Unix epoch seconds. *By convention, the input parameter should be 0, but it is ignored. |
flatten (column_to_be_split, group_separator, 1) flatten (column_to_be_split, group_separator, sub_field_separator, number_of_sub_fields_in_one_group) |
||
flatten_json_array ($"array_name") flatten_json_array ($"array_name", $"sub_obj_1", $"sub_obj_2", …, $"sub_obj_n") |
||
split(column_to_be_split, element_separator) split(column_to_be_split, key_value_separator, element _separator) |
||
gsql_upper(in_string) |
string |
Returns the input string in upper-case. |
gsql_lower(in_string) |
string |
Returns the input string in lower-case. |
gsql_trim(in_string) |
string |
Trims whitespace from the beginning and end of the input string. |
gsql_ltrim(in_string) gsql_rtrim(in_string) |
string |
Trims white space from either the beginning or the end of the input string (Left or right). |
gsql_year(timestamp) |
int |
Returns 4-digit year from timestamp. |
gsql_month(timestamp) |
int |
Returns month (1-12) from timestamp. |
gsql_day(timestamp) |
int |
Returns day (1-31) from timestamp. |
gsql_year_epoch(epoch) |
int |
Returns 4-digit year from Unix epoch time, which is the int number of seconds since Jan. 1, 1970. |
gsql_month_epoch(epoch) |
int |
Returns month (1-12) from Unix epoch time, which is the int number of seconds since Jan. 1, 1970. |
gsql_day_epoch(epoch) |
int |
Returns day (1-31) from Unix epoch time, which is the int number of seconds since Jan. 1, 1970. |
==== Timestamp input format
The timestamp parameter should be in one of the following formats:
-
"%Y-%m-%d %H:%M:%S"
-
"%Y/%m/%d %H:%M:%S"
-
"%Y-%m-%dT%H:%M:%S.000z"
-
Text after the dot
.
is ignored
-
=== Token functions in the WHERE
clause
=== User-defined token functions
Users can write their own token functions in C++ and install them in the GSQL system. To learn how to add a user-defined token function, see Add a User-defined Token Function.
=== Reducer functions
A reducer function aggregates multiple values of a non-ID attribute into one attribute value of a single vertex or edge. Reducer functions are computed incrementally; that is, each time a new input token is applied, a new resulting value is computed.
To reduce and load aggregate data to an attribute, the attribute expression has the following form:
REDUCE(reducer_function (input_expr))
reducer_function
is one of the functions in the following table.
input_expr
can include non-reducer functions, but reducer functions cannot be nested.
Each reducer function is overloaded so that one function can be used for several data types:
-
For primitive data types, the output type is the same as the type of the input expression
input_expr
. -
For
LIST
,SET
, andMAP
containers, the input_expr type is one of the allowed element types for these containers. The output is the entire container.
Function name | Return value |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
Any: If an attribute value already exists, return(retain) the existing value.Otherwise, return(load) |
Each function supports a certain set of attribute types.
Calling a reducer function with an incompatible type crashes the service.
In order to prevent that, use the WHERE
clause together with IS NUMERIC
or other operators, functions, predicates for type checking if necessary.
== TEMP_TABLE
and flatten functions
The keyword TEMP_TABLE
triggers the use of a temporary data table which is used to store data generated by one LOAD
statement, for use by a later LOAD
statement.Earlier we introduced the syntax for loading data to a TEMP_TABLE
:
TO TEMP_TABLE table_name (id_name [, attr_name]*) VALUES (id_expr [, attr_expr]*)
[WHERE conditions] [OPTION (options)]
This clause is designed to be used in conjunction with the flatten
or flatten_json_array
function in one of the attr_expr
expressions.The flatten function splits a multi-value field into a set of records.Those records can first be stored in a temporary table, and then the temporary table can be loaded into vertices and/or edges.Only one flatten function is allowed in one temp table destination clause.
There are two versions of the flatten function: One parses single-level groups and the other parses two-level groups.There are also two versions of the flatten_json_array function: One splits an array of primitive values, and the other splits an array of JSON objects.
=== One-level flatten function
flatten (column_to_be_split, separator, 1)
is used to parse a one-level group into individual elements.
An example is shown below:
The following loading job contains two LOAD
statements. The first one loads input data to Book
vertices and to a TEMP_TABLE
. The second one loads the TEMP_TABLE
data to Genre
vertices and Book_Genre
edges.
- Schema and loading jobs
-
One-level Flatten Function loading (load_book_flatten1.gsql)
CREATE LOADING JOB load_books_flatten_1 FOR GRAPH Book_Rating { DEFINE FILENAME f; LOAD f TO VERTEX Book VALUES ($0, $1, _), TO TEMP_TABLE t1 (bookcode,genre) VALUES ($0, flatten ($2,",",1)) (1) USING QUOTE="double", SEPARATOR="|"; LOAD TEMP_TABLE t1 TO VERTEX Genre VALUES ($"genre", $"genre"), TO EDGE Book_Genre VALUES ($"bookcode", $"genre"); } RUN LOADING JOB load_books_flatten_1 USING f="../data/book1.dat"
- Data file
-
book1.dat
101|"Harry Potter and the Philosopher's Stone"|"fiction,fantasy,young adult" 102|"The Three-Body Problem"|"fiction,science fiction,Chinese"
Line 5 says that the third column ($2
) of each input line should be split into separate tokens, with comma (,
) as the separator.
Each token will have its own row in table t1
.
The first column is labeled bookcode
with value $0
and the second column is genre
with one of the $2
tokens.
The contents of TEMP_TABLE t1
are shown below:
bookcode | genre |
---|---|
101 |
fiction |
101 |
fantasy |
101 |
young_adult |
102 |
fiction |
102 |
science_fiction |
102 |
Chinese |
Then, lines 8 to 10 say to read TEMP_TABLE t1
and to do the following for each row:
-
Create a
Genre
vertex for each new value ofgenre
. -
Create a
Book_Genre
edge frombookcode
togenre
. In this case, each row ofTEMP_TABLE t1
generates oneBook_Genre
edge.
The final graph will contain two Book
vertices (101 and 102), five Genre vertices, and six Book_Genre
edges.
{
"results": [{"@@edgeSet": [
{
"from_type": "Book",
"to_type": "Genre",
"directed": false,
"from_id": "101",
"to_id": "fiction",
"attributes": {},
"e_type": "Book_Genre"
},
{
"from_type": "Book",
"to_type": "Genre",
"directed": false,
"from_id": "101",
"to_id": "fantasy",
"attributes": {},
"e_type": "Book_Genre"
},
{
"from_type": "Book",
"to_type": "Genre",
"directed": false,
"from_id": "102",
"to_id": "sciencevfiction",
"attributes": {},
"e_type": "Book_Genre"
},
{
"from_type": "Book",
"to_type": "Genre",
"directed": false,
"from_id": "101",
"to_id": "young adult",
"attributes": {},
"e_type": "Book_Genre"
},
{
"from_type": "Book",
"to_type": "Genre",
"directed": false,
"from_id": "102",
"to_id": "fiction",
"attributes": {},
"e_type": "Book_Genre"
},
{
"from_type": "Book",
"to_type": "Genre",
"directed": false,
"from_id": "102",
"to_id": "Chinese",
"attributes": {},
"e_type": "Book_Genre"
}
]}]
}
=== Two-level flatten function
flatten (column_to_be_split, group_separator, sub_field_separator, number_of_sub_fields_in_one_group) is used for parse a two-level group into individual elements. Each token in the main group may itself be a group, so there are two separators: one for the top level and one for the second level. An example is shown below.
101|"Harry Potter and the Philosopher's Stone"|"FIC:fiction,FTS:fantasy,YA:young adult"
102|"The Three-Body Problem"|"FIC:fiction,SF:science fiction,CHN:Chinese"
The flatten function now has four parameters instead of three. The additional parameter is used to record the genre_name in the Genre vertices.
CREATE LOADING JOB load_books_flatten_2 FOR GRAPH Book_Rating {
DEFINE FILENAME f;
LOAD f
TO VERTEX Book VALUES ($0, $1, _),
TO TEMP_TABLE t2(bookcode,genre_id,genre_name) VALUES ($0, flatten($2,",",":",2))
USING QUOTE="double", SEPARATOR="|";
LOAD TEMP_TABLE t2
TO VERTEX Genre VALUES($"genre_id", $"genre_name"),
TO EDGE Book_Genre VALUES($"bookcode", $"genre_id");
}
RUN LOADING JOB load_books_flatten_2 USING f="book2.dat"
In this example, in the genres column ($2), there are multiple groups, and each group has two sub-fields, genre_id and genre_name. After running the loading job, the file book2.dat will be loaded into the TEMP_TABLE t2 as shown below.
bookcode | genre_id | |
---|---|---|
101 |
FIC |
fiction |
101 |
FTS |
fantasy |
101 |
YA |
young adult |
102 |
FIC |
fiction |
102 |
SF |
science fiction |
102 |
CHN |
Chinese |
=== Flatten a JSON array of primitive values
When loading JSON lines, each line is a JSON object.
If a value field of the JSON object is a JSON array of primitive values , you can use flatten_json_array()
to flatten the array:
- Data file
-
For example, if we have the following JSON line file. The
plug-in
field is a JSON array of primitive values:encoding2.json{"plug-ins" : ["C", "c++"],"encoding" : "UTF-6","indent" : { "length" : 3, "use_space": false}}
- Loading job
-
The function flattens the array, and loads each item in the array into a temporary table. Each value in the array creates a record.
CREATE VERTEX Encoding (PRIMARY_ID id STRING, length FLOAT DEFAULT 10) CREATE UNDIRECTED EDGE Encoding_Edge (FROM encoding, TO encoding) CREATE GRAPH Encoding_Graph (*) CREATE LOADING JOB json_flatten FOR GRAPH Encoding_Graph { LOAD "encoding2.json" TO TEMP_TABLE t2 (name, length) VALUES (flatten_json_array($"plug-ins"), $"indent":"length") USING JSON_FILE ="true"; LOAD TEMP_TABLE t2 TO VERTEX encoding VALUES ($"name", $"length"); } RUN LOADING JOB json_flatten
- Temporary table
-
The data and loading job creates the following temporary table:
id length C
3
c++
3
=== Flatten a JSON array of JSON objects
When loading JSON lines, each line is a JSON object.
If a value field of the JSON object is a JSON array of JSON objects , you can use flatten_json_array()
to flatten the array:
When splitting a JSON array of JSON objects, the primitive values in the array are skipped and only JSON objects are processed.
- Data file
-
For example, if we have the following JSON line file. There is a
plug-in
field in each JSON object, and it is an array.encoding3.json{"encoding":"UTF-1","indent":{"use_space": "dontloadme"}, "plug-ins" : [null, true, false, {"lang":"golang","prop":{"age":"noidea"}}]} {"encoding": "UTF-8", "plug-ins" : [{"lang": "pascal", "score":"1.0", "prop":{"age":"old"}}, {"lang":"c++", "score":2.0}],"indent":{"length" :12,"use_space": true}} {"encoding": "UTF-7", "plug-ins" : [{"lang":"java", "score":2.22}, {"lang":"python", "score":3.0},{"lang":"go", "score":4.0, "prop":{"age":"new"}}],"indent" : { "length" : 30, "use_space": true }} {"plug-ins" : ["C", "c++"],"encoding" : "UTF-6","indent" : { "length" : 3, "use_space": false}}
- Loading job
-
The
flatten_json_array()
function in the destination clause returns three values to the three columns in the temporary table. Thelang
field corresponds to thename
column; thescore
field corresponds to thescore
column; theage
field of theprop
field corresponds to theprop_age
column. These columns are subsequently loaded into to the vertices.The fourth line in the original file does not generate any value, because the
plug-in
field does not contain any JSON object.json_flatten_array_test.gsqlCREATE VERTEX Encoding3 (PRIMARY_ID id STRING, score FLOAT default -1.0, age STRING default "Unknown", length INT default -1) CREATE UNDIRECTED EDGE Encoding3_Edge (FROM Encoding3, TO Encoding3) CREATE GRAPH Encoding_Graph (*) CREATE LOADING JOB json_flatten_array FOR GRAPH Encoding_Graph { LOAD "encoding3.json" TO TEMP_TABLE t3 (name, score, prop_age, indent_length) VALUES (flatten_json_array($"plug-ins", $"lang", $"score", $"prop":"age"), $"indent":"length") USING JSON_FILE="true"; LOAD TEMP_TABLE t3 TO VERTEX encoding3 VALUES ($"name", $"score", $"prop_age", $"indent_length"); } RUN LOADING JOB json_flatten_array
=== Flatten a JSON object in a CSV file The function can also be used to flatten a JSON object in a CSV file.
- Data file
-
If we have the following CSV file. The second column in the CSV file is a JSON object we want to flatten.
encoding.csvgolang|{"prop":{"age":"noidea"}} pascal|{"score":"1.0", "prop":{"age":"old"}} c++|{"score":2.0, "indent":{"length":12, "use_space": true}} java|{"score":2.22, "prop":{"age":"new"}, "indent":{"use_space":"true", "length":2}} python|{ "prop":{"compiled":"false"}, "indent":{"length":4}, "score":3.0} go|{"score":4.0, "prop":{"age":"new"}}
- Loading job
-
flatten_json_array()
can be used in this case without theUSING JSON_FILE="true"
clause:json_flatten_cvs.gsqlCREATE VERTEX Encoding3 (PRIMARY_ID id STRING, score FLOAT default -1.0, age STRING default "Unknown", length INT default -1) CREATE UNDIRECTED EDGE Encoding3_Edge (FROM Encoding3, TO Encoding3) CREATE GRAPH Encoding_Graph (*) CREATE LOADING JOB json_flatten_cvs FOR GRAPH Encoding_Graph { LOAD "encoding.csv" TO TEMP_TABLE t4 (name, score, prop_age, indent_length) VALUES ($0,flatten_json_array($1, $"score", $"prop":"age", $"indent":"length")) USING SEPARATOR="|"; LOAD TEMP_TABLE t4 TO VERTEX Encoding3 VALUES ($"name", $"score", $"prop_age", $"indent_length"); } RUN LOADING JOB json_flatten_cvs
- Temporary table
-
The example generates the temporary table shown below:
id score age length golang
-1 (default)
noidea
-1 (default)
pascal
1
old
-1 (default)
c++
2
unknown (default)
12
java
2.22
new
2
python
3
unknown (default)
4
go
4
new
-1 (default)
== DELETE statement
In addition to loading data, a LOADING JOB can be used to perform the opposite operation: deleting vertices and edges, using the DELETE statement. DELETE cannot be used in offline loading. Just as a LOAD statement uses the tokens from each input line to set the id and attribute values of a vertex or edge to be created, a DELETE statement uses the tokens from each input line to specify the id value of the item(s) to be deleted.
In the v2.0 syntax, there is now a " FROM (filepath_string | filevar)
" clause just before the WHERE clause.
There are four variations of the DELETE statement.The syntax of the four cases is shown below.
CREATE LOADING JOB abc FOR GRAPH Graph_Name {
DEFINE FILENAME f;
// 1. Delete each vertex which has the given vertex type and primary id.
DELETE VERTEX vertex_type_name (PRIMARY_ID id_expr) FROM f [WHERE condition] ;
// 2. Delete each edge which has the given edge type, source vertex id, and destination vertex id.
DELETE EDGE edge_type_name (FROM id_expr, TO id_expr) FROM f [WHERE condition] ;
// 3. Delete all edges which have the given edge type and source vertex id. (Destination vertex id is left open.)
DELETE EDGE edge_type_name (FROM id_expr) FROM f [WHERE condition] ;
// 4. Delete all edges which have the given source vertex id. (Edge type and destination vertex id are left open.)
DELETE EDGE * (FROM id_expr vertex_type_name) FROM f [WHERE condition] ;
}
An example using Book_Rating data is shown below:
// Delete all user occupation edges if the user is in the new files, then load the new files
CREATE LOADING JOB clean_user_occupation FOR GRAPH Book_Rating {
DEFINE FILENAME f;
DELETE EDGE User_Occupation (FROM $0) FROM f;
}
CREATE LOADING JOB load_user_occupation FOR GRAPH Book_Rating {
DEFINE FILENAME f;
LOAD f TO EDGE User_Occupation VALUES ($0,$1);
}
RUN LOADING JOB clean_user_occupation USING f="./data/User_Occupation_update.dat"
RUN LOADING JOB load_user_occupation USING f="./data/User_Occupation_update.dat"
There is a separate DELETE statement in the GSQL Query Language. The query delete statement can leverage the query language’s ability to explore the graph and to use complex conditions to determine which items to delete. In contrast, the loading job delete statement requires that the id values of the items to be deleted must be specified in advance in an input file.