Load from Cloud Storage

Table of Contents

Example Schema
Create Data Source Object
Create a loading job
Run the loading job
- Continuous Loading from Cloud Storage
Manage and monitor your loading job
Manage loading job concurrency
Known Issues with Loading

After you have defined a graph schema, you can create a loading job, specify your data sources, and run the job to load data.

The steps are similar whether you are loading from local files, from cloud storage, or any of the other supported sources. We will call out whether a particular step is common for all loading or specific to a data source or loading mode.

Example Schema

This example uses part of the LDBC_SNB schema:

Example schema taken from LDBC_SNB

//Vertex Types:
CREATE VERTEX Person(PRIMARY_ID id UINT, firstName STRING, lastName STRING,
  gender STRING, birthday DATETIME, creationDate DATETIME, locationIP STRING,
  browserUsed STRING, speaks SET<STRING>, email SET<STRING>)
  WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
CREATE VERTEX Comment(PRIMARY_ID id UINT, creationDate DATETIME,
  locationIP STRING, browserUsed STRING, content STRING, length UINT)
  WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
//Edge Types:
CREATE DIRECTED EDGE HAS_CREATOR(FROM Comment, TO Person)
  WITH REVERSE_EDGE="HAS_CREATOR_REVERSE"

Create Data Source Object

A data source code provides a standard interface for all supported data source types, so that loading jobs can be written without regard for the data source.

When you create the object, you specify its details (type, access credentials, etc.) in the form of a JSON object. The JSON object can either be read in from a file or provided inline. Inline mode is required when creating data sources for TigerGraph Cloud instances.

In the following example, we create a data source named s1, and read its configuration information from a file called ds_config.json.

USE GRAPH ldbc_snb
CREATE DATA_SOURCE s1 = "ds_config.json" FOR GRAPH ldbc_snb

Older versions of TigerGraph required a keyword after DATA_SOURCE such as STREAM or KAFKA.

Inline JSON data format when creating a data source

CREATE DATA_SOURCE s1 = "{
type: <type>,
key: <value>
}" FOR GRAPH ldbc_snb

String literals can be enclosed with a double quote ", triple double quotes """, or triple single quotes '''. Double quotes " in the JSON can be omitted if the key name does not contain a colon : or comma ,.

Alternate quote syntax for inline JSON data

CREATE DATA_SOURCE s1 = """{
"type": "<type>",
"key": "<value>"
}""" FOR GRAPH ldbc_snb

Key names accept a separator of either a period . or underscore _, so for example, key_name and key.name are both valid key names.

Three cloud source types are supported:

Amazon S3
Azure Blob Storage
Google Cloud Storage

Also, three data object formats are supported: CSV, JSON, and Parquet.

AWS S3

AWS uses the standard IAM credential provider and uses your access key for authentication.

Access keys can be used for an individual user or for an IAM role. See Using IAM Roles for Amazon EC2 for more information.

{
"type": "s3",
"access.key": "<access key>",
"secret.key": "<secret key>"
}

Azure Blob Storage

We support two types of authentication:

Shared key authentication:

Get the account key on the Access Keys tab of your storage account. TigerGraph can automatically extract the account name from the file URI, so there’s no need to provide the account name.

{
"type" : "abs",
"account.key" : "<account key>"
}

Service principal authentication*:

To use service principal authentication, you must first register your TigerGraph instance as an application and grant it access to your storage account.

{
"type" : "abs",
"client.id" : "<client id>",
"client.secret" : "<client secrect>",
"tenant.id" : "<tenant id>"
}

Google Cloud Storage

For GCS, the TigerGraph data source configuration object is based on the GCS service account key format.

{
"type": "gcs",
"project_id": "<project id>",
"private_key_id": "<private key id>",
"private_key": "<private key>",
"client_email": "<email address>"
}

You can follow Google Cloud’s instructions for creating a service account key, and then replace the "type" value with "gcs".

Create a loading job

A loading job tells the database how to construct vertices and edges from data sources. The loading job body has two parts:

DEFINE statements create variables to refer to data sources. These can refer to actual files or be placeholder names. The actual data sources can be given when running the loading job.
LOAD statements specify how to take the data fields from files to construct vertices or edges.

Example loading job from cloud storage

The following example uses AWS S3 as the source.

Example loading job from cloud storage

USE GRAPH ldbc_snb
CREATE LOADING JOB load_data FOR GRAPH ldbc_snb {
  DEFINE FILENAME file_Comment =
    "$s1:s3://s3-loading-test/tg_ldbc_snb/sf0.1_csv/dynamic/Comment";
  DEFINE FILENAME file_Person =
    "$s1:s3://s3-loading-test/tg_ldbc_snb/sf0.1_csv/dynamic/Person";
  DEFINE FILENAME file_Comment_hasCreator_Person =
    "$s1:s3://s3-loading-test/tg_ldbc_snb/sf0.1_csv/dynamic/Comment_hasCreator_Person";
  LOAD file_Comment
    TO VERTEX Comment
      VALUES ($1, $0, $2, $3, $4, $5) USING header="true", separator="|";
  LOAD file_Person
    TO VERTEX Person
      VALUES ($1, $2, $3, $4, $5, $0, $6, $7, SPLIT($8,";"), SPLIT($9,";"))
      USING header="true", separator="|";
  LOAD file_Comment_hasCreator_Person
    TO EDGE HAS_CREATOR
      VALUES ($1, $2) USING header="true", separator="|";
}

Define filenames

First we define filenames, which are local variables referring to data files (or data objects).

The terms FILENAME and filevar are used for legacy reasons, but a filevar can also be an object in a data object store.

DEFINE FILENAME syntax

DEFINE FILENAME filevar ["=" file_descriptor ];

The file descriptor can be specified at compile time or at runtime. Runtime settings override compile-time settings:

Specifying file descriptor at runtime

RUN LOADING JOB job_name USING filevar=file_descriptor_override

While a loading job may have multiple FILENAME variables , they must all refer to the same DATA_SOURCE object.

Cloud file descriptors

For cloud sources, the file descriptor has three valid formats. You can simply provide the file URI. Or, you can provide optional configuration details, either in a JSON file or as inline JSON content.

DEFINE FILENAME file_name = "$[data source name]:[URI]";
DEFINE FILENAME file_name = "$[data source name]:[json config file]";
DEFINE FILENAME file_name = "$[data source name]:[inline json content]";

See the following examples.

// Format 1: URI only
DEFINE FILENAME uri_s3 = "$s_s3:s3://s3-loading-test/tg_ldbc_snb/sf0.1_csv/dynamic/Comment";
DEFINE FILENAME uri_gcs = "$s_gcs:gs://tg_ldbc_snb/sf0.1_csv/dynamic/Person";
DEFINE FILENAME uri_abs = "$s_abs:abfss://person@yandblobstorage.dfs.core.windows.net/persondata.csv";

// Format 2: URI and configuration file
DEFINE FILENAME uri_s3 = "$s1:myfile.json";

// Format 3: URI and inline JSON
DEFINE FILENAME parquet_s3 = """$s1:{
  "file.uris":"s3://s3-loading-test/tg_ldbc_snb/sf0.1_parquet/dynamic/Comment",
  "file.type":"parquet"}""";
DEFINE FILENAME csv_gcs = """$s1:{
  "file.uris": "gs://tg_ldbc_snb/sf0.1_csv/dynamic/Person",
  "file.type": "text",
  "num.partitions": 6}""";

Filename parameters

These are the required and optional configuration parameters:

Parameter Description Required? Default value

Parameter	Description	Required?	Default value
file.uris	The URI or URIs separated by commas.	Required	N/A
file.type	`text` for CSV and JSON files; `parquet` for Parquet files.	Optional	If the file extension is `parquet`, then file.type default is Parquet; otherwise the default is `text`.
num.partitions	The number of partitions to use. When loading data, each partition is distributed evenly across each node. If one filename contains much more data than others, consider using a larger partition number.	Optional	3
batch.size	The number of CSV lines or JSON objects that will be processed per batch.	Optional	10000
recursive	If the URI refers to a directory, whether to search subdirectories recursively for files to load.	Optional	true
regexp	A regular expression to filter filenames to be loaded. Uses Java regular expression patterns.	Optional	.*, which permits all filenames.
default	The default value for any field left empty.	Optional	"", an empty string.
archive.type	The file type for archive files. Accepted values: `auto` (where it uses the file extension as the file type), `tar`, `zip`, `gzip`, `tar.gz` and `none` (loading from an uncompressed file).	Optional	auto
tasks.max	The number of threads used to download data.	Optional	1

file.uris

The URI or URIs separated by commas.

Required

N/A

file.type

text for CSV and JSON files; parquet for Parquet files.

Optional

If the file extension is parquet, then file.type default is Parquet; otherwise the default is text.

num.partitions

The number of partitions to use. When loading data, each partition is distributed evenly across each node. If one filename contains much more data than others, consider using a larger partition number.

Optional

batch.size

The number of CSV lines or JSON objects that will be processed per batch.

Optional

10000

recursive

If the URI refers to a directory, whether to search subdirectories recursively for files to load.

Optional

true

regexp

A regular expression to filter filenames to be loaded. Uses Java regular expression patterns.

Optional

.*, which permits all filenames.

default

The default value for any field left empty.

Optional

"", an empty string.

archive.type

The file type for archive files. Accepted values: auto (where it uses the file extension as the file type), tar, zip, gzip, tar.gz and none (loading from an uncompressed file).

Optional

auto

tasks.max

The number of threads used to download data.

Optional

Specify the data mapping

Next, we use LOAD statements to describe how the incoming data will be loaded to attributes of vertices and edges. Each LOAD statement handles the data mapping, and optional data transformation and filtering, from one filename to one or more vertex and edge types.

LOAD statement syntax

LOAD [ source_object|filevar|TEMP_TABLE table_name ]
  destination_clause [, destination_clause ]*
  [ TAGS clause ] (1)
  [ USING clause ];

1	As of v3.9.3, TAGS are deprecated.

Let’s break down one of the LOAD statements in our example:

Example loading job for local files

LOAD file_Person TO VERTEX Person
    VALUES($1, $2, $3, $4, $5, $0, $6, $7,
       SPLIT($8, ";"), SPLIT($9, ";"))
    USING SEPARATOR="|", HEADER="true", EOL="\n";

$0, $1,… refer to the first, second, … columns in each line a data file.
SEPARATOR="|" says the column separator character is the pipe (|). The default is comma (,).
HEADER="true" says that the first line in the source contains column header names instead of data. These names can be used instead of the columnn numbers.
SPLIT is one of GSQL’s ETL functions. It says that there is a multi-valued column, which has a separator character to mark the subfields in that column.

Refer to Creating a Loading Job in the GSQL Language Reference for descriptions of all the options for loading jobs.

When loading JSON or Parquet data, please make sure:

The USING option JSON_FILE="true" is used
Refer to JSON keys (≈ Parquet "column names") instead of column indices.

E.g., LOAD file_Comment TO VERTEX Comment VALUES ($"id", $"content") USING JSON_FILE="TRUE"

Run the loading job

Use the command RUN LOADING JOB to run the loading job.

RUN LOADING JOB basic syntax (some options omitted)

RUN LOADING JOB [-noprint] job_name [
  USING filevar [="file_descriptor"][, filevar [="file_descriptor"]]*
  [,EOF="eof_mode"]
]

-noprint

By default, the loading job will run in the foreground and print the loading status and statistics after you submit the job. If the -noprint option is specified, the job will run in the background after displaying the job ID and the location of the log file.

filevar list

The optional USING clause may contain a list of file variables. Each file variable may optionally be assigned a file_descriptor, obeying the same format as in CREATE LOADING JOB. This list of file variables determines which parts of a loading job are run and what data files are used.

When a loading job is compiled, it generates one RESTPP endpoint for each filevar and source_object. As a consequence, a loading job can be run in parts. When RUN LOADING JOB is executed, only those endpoints whose filevar or file identifier (GSQL_FILENAME_n) is mentioned in the USING clause will be used. However, if the USING clause is omitted, then the entire loading job will be run.

If a file_descriptor is given, it overrides the file_descriptor defined in the loading job. If a particular filevar is not assigned a file_descriptor either in the loading job or in the RUN LOADING JOB statement, an error is reported and the job exits.

Continuous Loading from Cloud Storage

EOF (End-of-file) is a boolean parameter. The loader has two modes: streaming mode ("False") and EOF mode ("True").

Prior to version 3.9.2, the default value of EOF was "False". Beginning with 3.9.2, the default value is "True".

If EOF = "True" (EOF mode), loading will stop after consuming the provided file objects, i.e, when it reaches the EOF character for each file object.
If EOF = "False" (streaming mode), the loading job remains active and keeps for new data until the job is aborted. The loader can detect both new lines in existing files and new files added to the designated source folder.

Streaming mode checks for new content based on increased line number. Only new lines in existing files and new files will be loaded. If any existing lines are changed or deleted, these changes will not be part of the loading job.

For example, consider a file data.txt in cloud storage

Initial state of data.txt

line-1

1) We load data.txt using streaming mode.

RUN LOADING JOB stream_csv USING EOF="false"

The line of data is loaded successfully into the loading job for ingestion to TigerGraph.

2) If a user edits the file and adds a new line, the stream loader notices the addition and loads the new line, starting from where it previously left off.

data.txt after a line is added

line-1
line-2

In this case, the new line line-2 is successfully ingested to TigerGraph,for a total of two lines.

3) If a user edits the file and inserts a line before the end, as shown below, the entire file is loaded again.

data.txt after a new line is added before the end

line-1
added-line
line-2

Because two lines had already been loaded, the first two lines are skipped, even though the second contains new data. The third line from the file is then loaded, resulting in a repeat of line-2.

Data in TigerGraph

line-1
line-2
line-2

To insure data is loaded correctly, only use stream mode when there is no chance of data being altered or added to the middle of a file.

Manage and monitor your loading job

When a loading job starts, the GSQL server assigns it a job ID and displays it for the user to see. There are three key commands to monitor and manage loading jobs:

SHOW LOADING STATUS job_id|ALL
ABORT LOADING JOB job_id|ALL
RESUME LOADING JOB job_id

SHOW LOADING STATUS shows the current status of either a specified loading job or all current jobs, this command should be within the scope of a graph:

GSQL > USE GRAPH graph_name
GSQL > SHOW LOADING STATUS ALL

For each loading job, the above command reports the following information:

Loading status
Loaded lines/Loaded objects/Error lines
Average loading speed
Size of loaded data
Duration

When inspecting all current jobs with SHOW LOADING STATUS ALL, the jobs in the FINISHED state will be omitted as they are considered to have successfully finished. You can use SHOW LOADING STATUS job_id to check the historical information of finished jobs.

See Managing and Inspecting a Loading Job for more details.

Manage loading job concurrency

See Loading Job Concurrency for how to manage the concurrency of loading jobs.

Known Issues with Loading

TigerGraph does not store NULL values. Therefore, your input data should not contain any NULLs.

[3.9.2+] If the connector is manually deleted before reaching EOF, the corresponding loading job will never stop. Please use ABORT LOADING JOB to terminate the loading pipeline instead of directly manipulating the connector.