Load from a Data Warehouse

After you have defined a graph schema, you can create a loading job, specify your data sources, and run the job to load data.

The steps for loading from local files, cloud storage, or any other supported sources are similar. We will call out whether a particular step is common for all loading or specific to a data source or loading mode.

Example Schema

Example schema taken from LDBC_SNB

//Vertex Types:
CREATE VERTEX Person(PRIMARY_ID id UINT, firstName STRING, lastName STRING,
  gender STRING, birthday DATETIME, creationDate DATETIME, locationIP STRING,
  browserUsed STRING, speaks SET<STRING>, email SET<STRING>)
  WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
CREATE VERTEX Comment(PRIMARY_ID id UINT, creationDate DATETIME,
  locationIP STRING, browserUsed STRING, content STRING, length UINT)
  WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
//Edge Types:
CREATE DIRECTED EDGE HAS_CREATOR(FROM Comment, TO Person)
  WITH REVERSE_EDGE="HAS_CREATOR_REVERSE"

Create Data Source Object

A data source object provides a standard interface for all supported data source types, so that loading jobs can be written without regard for the data source.

When you create the object, you specify its details (type, access credentials, etc.) in the form of a JSON object. The JSON object can either be read in from a file or provided inline.

Inline mode is required when creating data sources for TigerGraph Cloud instances.

In the following example, we create a data source named s1, and read its configuration information from a file called ds_config.json.

USE GRAPH ldbc_snb
CREATE DATA_SOURCE s1 = "ds_config.json" FOR GRAPH ldbc_snb

Older versions of TigerGraph required a keyword after DATA_SOURCE such as STREAM or KAFKA.

Inline JSON data format when creating a data source

CREATE DATA_SOURCE s1 = "{
  type: <type>,
  key: <value>
}" FOR GRAPH ldbc_snb

String Literals

String literals can be represented according to the following options:

Enclosed with double quote ".
Enclosed with triple double quotes """.
Enclosed with triple single quotes '''.

In the case of JSON that does not contain a colon : or a comma , the double quotes " can be omitted.

Alternate quote syntax for inline JSON data

CREATE DATA_SOURCE s1 = """{
  "type": "<type>",
  "key": "<value>"
}""" FOR GRAPH ldbc_snb

Either a period . or _ can be used for separation in the specified key name. Example: first.second or first_second.

We currently support BigQuery, Snowflake, and PostgreSql data warehouses. More data warehouses will be supported in future releases.

BigQuery

TigerGraph’s BigQuery loader makes use of the BigQuery JDBC connector provided by Google, in collaboration with Simba. Use the following configuration for the DATA_SOURCE.

Data source configuration for BigQuery

{
  "type":"bigquery",
  "ProjectId":"tigergraph-dev",
  "OAuthType":2,
  "parameters" : {
    "OAuthRefreshToken":"<refresh token>",
    "OAuthClientId":"<client ID>.apps.googleusercontent.com",
    "OAuthClientSecret":"<client secret>"
    #other Simba JDBC parameters
  }
}

In addition, for large query results, we highly recommended specifying the following parameters:

Addition parameter setting for large BigQuery results

"EnableHighThroughputAPI":"1" -> Storage Read API
"AllowLargeResults":"1" -> Large Query Result Support
"LargeResultDataset":"<target_dataset>" -> Storage for Temp Result
"LargeResultsDatasetExpirationTime":"<time_ms>" -> Expiration of Temp Result

For more information about Simba/Google BigQuery JDBC connection parameters, please refer the BigQuery JDBC Installation and Configuration Guide.

Snowflake

Data source configuration for Snowflake:

{
    "type":"snowflake",
    "connection.url":"jdbc:snowflake://https://<account_id>.snowflakecomputing.com/?db=<db>&schema=<schema>&role=<role>",
    "connection.user": "<username>",
    "connection.password": "<password>"
}

Alternately, key-pair authentication can be used. See Snowflake’s document for more details on their support for key-pair authentication and rotation.

Data source configuration with key-pair authentication:

{
    "type":"snowflake",
    "connection.url":"jdbc:snowflake://https://<account_id>.snowflakecomputing.com/?db=<db>&schema=<schema>&role=<role>&private_key_file=<key_file>",
    "connection.user": "<userwithrsa>",
    "connection.password": "<anystring>"
}

We recommend using the latest version of OpenSSL (3.0.11 19 Sep 2023) to generate keys.
When a private key is used, connection.password is still required, but it can be set to any non-empty string.
Passing a private key directly or using a private key file with a password is NOT supported.

For Snowflake loading jobs only the default setting for batch.max.rows is set to 100000 for better performance.

Use the required fields below for a Snowflake data DATA_SOURCE:

Table 1. Required fields for Snowflake `DATA_SOURCE`
Field	Example	Notes
connection.url	`"jdbc:snowflake://https:// <account_id>.snowflakecomputing.com/?db=<db>&schema=<schema>&role=<role>",`	It should start with `jdbc:snowflake:`
connection.user	"tigergraph"
connection.password	"password"	This will be masked and shown as `**`
type	"snowflake"	This must be `snowflake`.

PostgreSql

Data source configuration for PostgreSql:

{
    "type":"postgresql",
    "host":"pg_address",
    "port": 5432,
    "connection.user":"postgres",
    "connection.password":"postgres",
    "db.name":"postgres"
}

These are some required and optional configuration parameters for connection and authentication with a PostgreSql data DATA_SOURCE.

Table 2. Fields for PostgreSql `DATA_SOURCE`
Field	Example	Notes
host	"postgresql_server_address"	The PostgreSql server’s address. This is required.
port	"5432"	The port used by the PostgreSql server. The default is set to `5432`.
connection.user	"DBuser"	The user of the PostgreSql server. This is required
connection.password	"MyPassword"	The password of the user. This is required
db.name	"postgres"	The database name. This is required

Create a loading job

A loading job tells the database how to construct vertices and edges from data sources. The loading job body has two parts:

DEFINE statements create variables to refer to data sources. These can refer to actual files or be placeholder names. The actual data sources can be given when running the loading job.
LOAD statements specify how to take the data fields from files to construct vertices or edges.

Example loading job from a data warehouse

If single quote characters ( ' ) need to be included in the query in bash, use ( "'" ) to escape.

If double quotes are needed, use the JSON format for a query instead.

Big Query

The following is an example loading job from Google BigQuery.

Example loading job for BigQuery

CREATE DATA_SOURCE s1 = """{
    "type":"bigquery",
    "ProjectId":"tigergraph-dev",
    "OAuthType":2,
    "parameters" : {
        "OAuthRefreshToken":"<refresh token>",
        "OAuthClientId":"<client ID>.apps.googleusercontent.com",
        "OAuthClientSecret":"<client secret>"
    }
}""" FOR GRAPH ldbc_snb

USE GRAPH ldbc_snb
CREATE LOADING JOB load_data FOR GRAPH ldbc_snb {
DEFINE FILENAME file_Comment =
  "$s1:SELECT * FROM tigergraph-ldbc-benchmark.snb_bi_sf01.Comment";
DEFINE FILENAME file_Person =
  "$s1:SELECT id, firstName, lastName, gender, birthday, creationDate, locationIP, browserUsed, language, email FROM tigergraph-ldbc-benchmark.snb_bi_sf01.Person";
DEFINE FILENAME file_Comment_hasCreator_Person =
  "$s1:SELECT * FROM tigergraph-ldbc-benchmark.snb_bi_sf01.Comment_hasCreator_Person";
LOAD file_Comment
  TO VERTEX Comment
    VALUES ($1, $0, $2, $3, $4, $5)
    USING separator="|";
LOAD file_Person
  TO VERTEX Person
    VALUES ($1, $2, $3, $4, $5, $0, $6, $7, SPLIT($8,";"), SPLIT($9,";"))
    USING separator="|";
LOAD file_Comment_hasCreator_Person
  TO EDGE HAS_CREATOR
    VALUES ($1, $2) USING separator="|";
}

Snowflake

The following is an example loading job from Snowflake.

Example loading job for Snowflake

CREATE DATA_SOURCE s1= """{
    "type":"snowflake",
    "connection.url": "jdbc:snowflake:/https:/<account_id>.snowflakecomputing.com/?db=<db>&schema=<schema>&role=<role>",
    "connection.user": "<username>",
    "connection.password": "<password>"
}""" FOR GRAPH ldbc_snb

USE GRAPH ldbc_snb
CREATE LOADING JOB load_data FOR GRAPH ldbc_snb {
DEFINE FILENAME file_Comment =
  "$s1:SELECT * FROM tigergraph-ldbc-benchmark.snb_bi_sf01.Comment";
DEFINE FILENAME file_Person =
  "$s1:SELECT id, firstName, lastName, gender, birthday, creationDate, locationIP, browserUsed, language, email FROM tigergraph-ldbc-benchmark.snb_bi_sf01.Person";
DEFINE FILENAME file_Comment_hasCreator_Person =
  "$s1:SELECT * FROM tigergraph-ldbc-benchmark.snb_bi_sf01.Comment_hasCreator_Person";
LOAD file_Comment
  TO VERTEX Comment
    VALUES ($1, $0, $2, $3, $4, $5)
    USING separator="|";
LOAD file_Person
  TO VERTEX Person
    VALUES ($1, $2, $3, $4, $5, $0, $6, $7, SPLIT($8,";"), SPLIT($9,";"))
    USING separator="|";
LOAD file_Comment_hasCreator_Person
  TO EDGE HAS_CREATOR
    VALUES ($1, $2) USING separator="|";
}

PostgreSql

The following is an example loading job from PostgreSql.

Example loading job for PostgreSql

CREATE DATA_SOURCE s1 = """{
    "type":"postgresql",
    "host":"pg_address",
    "port":5432,
    "connection.user":"postgres",
    "connection.password":"postgres",
    "db.name":"postgres"
}""" FOR GRAPH ldbc_snb

CREATE LOADING JOB load_data FOR GRAPH ldbc_snb {
DEFINE FILENAME file_Comment =
  "$s1:SELECT * FROM tigergraph-ldbc-benchmark.snb_bi_sf01.Comment";
DEFINE FILENAME file_Person =
  "$s1:SELECT id, firstName, lastName, gender, birthday, creationDate, locationIP, browserUsed, language, email FROM tigergraph-ldbc-benchmark.snb_bi_sf01.Person";
DEFINE FILENAME file_Comment_hasCreator_Person =
  "$s1:SELECT * FROM tigergraph-ldbc-benchmark.snb_bi_sf01.Comment_hasCreator_Person";
LOAD file_Comment
  TO VERTEX Comment
    VALUES ($1, $0, $2, $3, $4, $5)
    USING separator="|";
LOAD file_Person
  TO VERTEX Person
    VALUES ($1, $2, $3, $4, $5, $0, $6, $7, SPLIT($8,";"), SPLIT($9,";"))
    USING separator="|";
LOAD file_Comment_hasCreator_Person
  TO EDGE HAS_CREATOR
    VALUES ($1, $2) USING separator="|";
}

Define filenames

First we define filenames, which are local variables referring to data files (or data objects).

The terms FILENAME and filevar are used for legacy reasons, but a filevar can also be an object in a data object store.

DEFINE FILENAME syntax

DEFINE FILENAME filevar ["=" file_descriptor ];

The file descriptor can be specified at compile-time or at runtime. Runtime settings override compile-time settings:

Specifying file descriptor at runtime

RUN LOADING JOB job_name USING filevar=file_descriptor_override

While a loading job may have multiple FILENAME variables , they must all refer to the same DATA_SOURCE object.

Data warehouse file descriptors

For data warehouses, you run a SQL query to get the data. The file descriptor has three valid formats. You can simply provide the SQL query statement. Or, you can provide optional configuration details, either in a JSON file or as inline JSON content.

DEFINE FILENAME file_name = "$[data source name]:[SQL]";
DEFINE FILENAME file_name = "$[data source name]:[json config file]";
DEFINE FILENAME file_name = "$[data source name]:[inline json content]";

For example:

// Format 1: SQL query statement
DEFINE FILENAME query_person = "$s1:SELECT id,name,gender FROM ldbc.person";

// Format 2: Configuration file
DEFINE FILENAME bq_inline_json = """$s1:myfile.json""";

// Format 3: Inline JSON
DEFINE FILENAME query_person="""$s1:{
  "query": "SELECT id,name,gender
            FROM ldbc.person where age < 10;
            SELECT id,name,gender
            FROM ldbc.person where age > 50",
  "num.partitions": 6,
  "tasks.max": 2
}""";

Filename parameters

These are the required and optional configuration parameters:

Parameter Description Required? Default value

Parameter	Description	Required?	Default value
query	One or more SQL queries separated by commas. To avoid timeout in a large query, you can break it into multiple smaller queries based on the partitioning key. These queries may be assigned to multiple tasks for execution, while the execution order is not guaranteed.	Yes	N/A
batch.max.rows	Maximum number of rows to include in a single batch when polling for new data from the query result.	No	1000 For Snowflake loading jobs only: The default setting for `batch.max.rows` is set to `100000` for better performance.
tasks.max	The maximum number of tasks used to execute queries. When `query` contains multiple queries, you can increase this parameter to execute queries in parallel.	No	1
poll.interval.ms	Time interval in ms for periodic executoion of the query.	No	5000
loading.timeout.ms	The loading timeout per batch in milliseconds. If set to 0, then the global parameter `RESTPP.Factory.DefaultLoadingTimeoutSec` is used.	No	0
loading.max.retry.attempts	The maximum number of retries after a retryable server error. 0 means an unlimited number of retries.	No	0
io.max.retry.attempts	The maximum number of HTTP transport retries. 0 means an unlimited number of retries.	No	0
batch.size.bytes	The maximum data batch size in bytes, overriding the `batch.size.rows` limit. Example: If each row of data is 1000b, `batch.size.rows` = 10, and `batch.size.bytes` = 5000, then the data will be split into 2 5000b batches.	No	2,097,152 (2 MiB)
num.partitions	[Legacy: only used when gadmin parameter `KafkaConnect.EnableV2` is false] The number of partitions to use. When loading data, each partition is distributed evenly across each node. If one filename contains much more data than others, consider increasing this value.	No	3

query

One or more SQL queries separated by commas. To avoid timeout in a large query, you can break it into multiple smaller queries based on the partitioning key. These queries may be assigned to multiple tasks for execution, while the execution order is not guaranteed.

Yes

N/A

batch.max.rows

Maximum number of rows to include in a single batch when polling for new data from the query result.

1000

For Snowflake loading jobs only: The default setting for batch.max.rows is set to 100000 for better performance.

tasks.max

The maximum number of tasks used to execute queries. When query contains multiple queries, you can increase this parameter to execute queries in parallel.

poll.interval.ms

Time interval in ms for periodic executoion of the query.

5000

loading.timeout.ms

The loading timeout per batch in milliseconds.

If set to 0, then the global parameter RESTPP.Factory.DefaultLoadingTimeoutSec is used.

loading.max.retry.attempts

The maximum number of retries after a retryable server error.

0 means an unlimited number of retries.

io.max.retry.attempts

The maximum number of HTTP transport retries.

0 means an unlimited number of retries.

batch.size.bytes

The maximum data batch size in bytes, overriding the batch.size.rows limit.

Example: If each row of data is 1000b, batch.size.rows = 10, and batch.size.bytes = 5000, then the data will be split into 2 5000b batches.

2,097,152 (2 MiB)

num.partitions

[Legacy: only used when gadmin parameter KafkaConnect.EnableV2 is false] The number of partitions to use. When loading data, each partition is distributed evenly across each node. If one filename contains much more data than others, consider increasing this value.

Specify the data mapping

Next, we use LOAD statements to describe how the incoming data will be loaded to attributes of vertices and edges. Each LOAD statement handles the data mapping, and optional data transformation and filtering, from one filename to one or more vertex and edge types.

LOAD statement syntax

LOAD [ source_object|filevar|TEMP_TABLE table_name ]
  destination_clause [, destination_clause ]*
  [ TAGS clause ] (1)
  [ USING clause ];

1	As of v3.9.3, TAGS are deprecated.

Let’s break down one of the LOAD statements in our example:

Example loading job for local files

LOAD file_Person TO VERTEX Person
    VALUES($1, $2, $3, $4, $5, $0, $6, $7,
       SPLIT($8, ";"), SPLIT($9, ";"))
    USING SEPARATOR="|", HEADER="true", EOL="\n";

$0, $1,… refer to the first, second, … columns in each line a data file.
SEPARATOR="|" says the column separator character is the pipe (|). The default is comma (,).
HEADER="true" says that the first line in the source contains column header names instead of data. These names can be used instead of the columnn numbers.
SPLIT is one of GSQL’s ETL functions. It says that there is a multi-valued column, which has a separator character to mark the subfields in that column.

Refer to Creating a Loading Job in the GSQL Language Reference for descriptions of all the options for loading jobs.

Data Mapping from BigQuery reuslts

The columns of SQL results are joined by a specified separator to form delimited content.

LOAD bq_sql TO VERTEX Comment VALUES ($1, $0, $2, $3, $4, $5) USING separator="|";

In order to load nested or repeated records from BigQuery, some conversion functions need to be applied to the SQL statement.

Querying STRUCT Data

Method 1:
1. Apply the BigQuery TO_JSON_STRING function to the columns of the STRUCT, e.g.,
  SELECT TO_JSON_STRING(col) FROM table
2. Flatten the JSON object to CSV format.
Method 2:
- Retrieve the fields from the STRUCT directly, e.g.,
  SELECT col.field1, col.field2, col.field3 FROM table

Querying Arrays

Apply function ARRAY_TO_STRING to the columns of ARRAY type, e.g.,
```
SELECT ARRAY_TO_STRING(col_arr,separator) FROM table
```
In the LOAD statement, use the GSQL SPLIT function.

Run the loading job

Use the command RUN LOADING JOB to run the loading job.

RUN LOADING JOB basic syntax (some options omitted)

RUN LOADING JOB [-noprint] job_name [
  USING filevar [="file_descriptor"][, filevar [="file_descriptor"]]*
  [,EOF="eof_mode"]
]

-noprint

By default, the loading job will run in the foreground and print the loading status and statistics after you submit the job. If the -noprint option is specified, the job will run in the background after displaying the job ID and the location of the log file.

filevar list

The optional USING clause may contain a list of file variables. Each file variable may optionally be assigned a file_descriptor, obeying the same format as in CREATE LOADING JOB. This list of file variables determines which parts of a loading job are run and what data files are used.

When a loading job is compiled, it generates one RESTPP endpoint for each filevar and source_object. As a consequence, a loading job can be run in parts. When RUN LOADING JOB is executed, only those endpoints whose filevar or file identifier (GSQL_FILENAME_n) is mentioned in the USING clause will be used. However, if the USING clause is omitted, then the entire loading job will be run.

If a file_descriptor is given, it overrides the file_descriptor defined in the loading job. If a particular filevar is not assigned a file_descriptor either in the loading job or in the RUN LOADING JOB statement, an error is reported and the job exits.

Continuous Loading from Data Warehouses

If EOF="true" (the default), then the query is executed once, and its output will be loaded.

If EOF="false", the query will be executed periodically every poll.interval.ms and its output loaded. This will continuous indefinitely until the job is aborted.

Prior to version 3.9.2, the default value of EOF was "False". Beginning with 3.9.2, the default value is "True".

Manage and monitor your loading job

When a loading job starts, the GSQL server assigns it a job ID and displays it for the user to see. There are four key commands to monitor and manage loading jobs:

SHOW LOADING STATUS job_id|ALL
ABORT LOADING JOB job_id|ALL
RESUME LOADING JOB job_id
SHOW LOADING ERROR job_id

SHOW LOADING STATUS shows the current status of either a specified loading job or all current jobs, this command should be within the scope of a graph:

GSQL > USE GRAPH graph_name
GSQL > SHOW LOADING STATUS ALL

For each loading job, the above command reports the following information:

Loading status
Loaded lines/Loaded objects/Error lines
Average loading speed
Size of loaded data
Duration

When inspecting all current jobs with SHOW LOADING STATUS ALL, the jobs in the FINISHED state will be omitted as they are considered to have successfully finished. You can use SHOW LOADING STATUS job_id to check the historical information of finished jobs. If the report for this job contains error data, you can use SHOW LOADING ERROR job_id to see the original data that caused the error.

See Managing and Inspecting a Loading Job for more details.

Manage loading job concurrency

See Loading Job Concurrency for how to manage the concurrency of loading jobs.

Known Issues with Loading

TigerGraph does not store NULL values. Therefore, your input data should not contain any NULLs.