Load from Local Files

Table of Contents

Example Schema
Create a loading job
Run the loading job
Manage and monitor your loading job
Manage loading job concurrency
Files Loader Auto-Restart
Known Issues with Loading

After you have defined a graph schema, you can create a loading job, specify your data sources, and run the job to load data.

The steps for loading from local files, cloud storage, or any other supported sources are similar. We will call out whether a particular step is common for all loading or specific to a data source or loading mode.

Example Schema

Example schema taken from LDBC_SNB

//Vertex Types:
CREATE VERTEX Person(PRIMARY_ID id UINT, firstName STRING, lastName STRING,
  gender STRING, birthday DATETIME, creationDate DATETIME, locationIP STRING,
  browserUsed STRING, speaks SET<STRING>, email SET<STRING>)
  WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
CREATE VERTEX Comment(PRIMARY_ID id UINT, creationDate DATETIME,
  locationIP STRING, browserUsed STRING, content STRING, length UINT)
  WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
//Edge Types:
CREATE DIRECTED EDGE HAS_CREATOR(FROM Comment, TO Person)
  WITH REVERSE_EDGE="HAS_CREATOR_REVERSE"

If we were loading from a remote data source, the next step would be to create a DATA_SOURCE object. A DATA_SOURCE provides a standard interface for all supported data source types, so that loading jobs can be written without regard for the data source. This is not necessary for local files.

Create a loading job

A loading job tells the database how to construct vertices and edges from data sources. The loading job body has two parts:

DEFINE statements create variables to refer to data sources. These can refer to actual files or be placeholder names. The actual data sources can be given when running the loading job.
LOAD statements specify how to take the data fields from files to construct vertices or edges.

Example loading job for local files

The following is an example loading job for local files.

USE GRAPH ldbc_snb
CREATE LOADING JOB load_data FOR GRAPH ldbc_snb {
  DEFINE FILENAME file_Person = "/data/person.csv";
  DEFINE FILENAME file_Comment = "m3:/data/comment.csv";
  DEFINE FILENAME file_Comment_hasCreator_Person=
    "ALL:/data/hasCreator.json";
  LOAD file_Person TO VERTEX Person
    VALUES($1, $2, $3, $4, $5, $0, $6, $7,
      SPLIT($8, ";"), SPLIT($9, ";"))
    USING SEPARATOR="|", HEADER="true", EOL="\n";
  LOAD file_Comment TO VERTEX Comment
    VALUES($1, $0, $2, $3, $4, $5)
    USING SEPARATOR="|", HEADER="true", EOL="\n";
  LOAD file_Comment_hasCreator_Person TO EDGE HAS_CREATOR
    VALUES($1 Comment, $2 Person)
    USING JSON_FILE="true";
}

Define filenames

First we define filenames, which are local variables referring to data files (or data objects).

The terms FILENAME and filevar are used for legacy reasons, but a filevar can also be an object in a data object store.

DEFINE FILENAME syntax

DEFINE FILENAME filevar ["=" file_descriptor ];

The file descriptor can be specified at compile-time or at runtime. Runtime settings override compile-time settings:

Specifying file descriptor at runtime

RUN LOADING JOB job_name USING filevar=file_descriptor_override
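
As a minimal sketch of a runtime override, assume the load_data job from the example above and a hypothetical replacement file /data/person_2.csv. Only the filevar name file_Person comes from the example job; the path is an assumption for illustration.

```gsql
// Hypothetical override: load Person vertices from a different file
// than the one compiled into load_data. The path is illustrative only.
RUN LOADING JOB load_data USING file_Person="/data/person_2.csv"
```

The descriptor given here takes precedence over the "/data/person.csv" path compiled into the job, and the job definition itself is left unchanged.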

Local file descriptors

For local file loading, the file_descriptor is a file path or folder path string, enclosed in quotation marks. Here are examples of the syntax for various cases:

An absolute or relative path for either a file or a folder on the machine where the job is run:
```
"/data/graph.csv"
```
An absolute or relative path for either a file or a folder on all machines in the cluster:
```
"ALL:/data/graph.csv"
```
An absolute or relative path for either a file or a folder on any machine in the cluster:
```
"ANY:/data/graph.csv"
```
A list of machine-specific paths:
```
"m1:/data1.csv, m2|m3|m5:/data/data2.csv"
```
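
To show how these descriptors might be used, the fragment below places each form in a DEFINE FILENAME statement, as it could appear inside a CREATE LOADING JOB body. The filevar names (f_local, f_all, f_any, f_list, f_runtime) are hypothetical. The last statement relies on the descriptor being optional in the DEFINE FILENAME syntax, which leaves the path to be supplied when the job is run.

```gsql
// Fragment only: these statements belong inside a CREATE LOADING JOB body.
// The filevar names are hypothetical.

// A path on the machine where the job is run
DEFINE FILENAME f_local = "/data/graph.csv";

// The same path on all machines in the cluster
DEFINE FILENAME f_all = "ALL:/data/graph.csv";

// The same path on any machine in the cluster
DEFINE FILENAME f_any = "ANY:/data/graph.csv";

// A list of machine-specific paths
DEFINE FILENAME f_list = "m1:/data1.csv, m2|m3|m5:/data/data2.csv";

// No descriptor at compile time; one must be given in RUN LOADING JOB
DEFINE FILENAME f_runtime;
```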

Specify the data mapping

Next, we use LOAD statements to describe how the incoming data will be loaded to attributes of vertices and edges. Each LOAD statement handles the data mapping, and optional data transformation and filtering, from one filename to one or more vertex and edge types.

LOAD statement syntax

LOAD [ source_object|filevar|TEMP_TABLE table_name ]
  destination_clause [, destination_clause ]*
  [ TAGS clause ] (1)
  [ USING clause ];

(1) As of v3.9.3, TAGS are deprecated.

Let’s break down one of the LOAD statements in our example:

Example loading job for local files

LOAD file_Person TO VERTEX Person
    VALUES($1, $2, $3, $4, $5, $0, $6, $7,
       SPLIT($8, ";"), SPLIT($9, ";"))
    USING SEPARATOR="|", HEADER="true", EOL="\n";

$0, $1, … refer to the first, second, … columns in each line of a data file.
SEPARATOR="|" says the column separator character is the pipe (|). The default is comma (,).
HEADER="true" says that the first line in the source contains column header names instead of data. These names can be used instead of the column numbers.
SPLIT is one of GSQL’s ETL functions. It says that a column is multi-valued and names the separator character (here, a semicolon) that marks the subfields in that column.
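
Because HEADER="true" makes the header names available, the same kind of mapping can be written with names instead of positions. The sketch below rewrites the Comment load from the example job this way, using the $"name" form to refer to a column by its header name. It assumes the header line of comment.csv uses exactly the names shown in the comment; the example job does not state the header, so adjust the names to match the actual file.

```gsql
// Sketch: name-based column references. Assumes the header line of
// comment.csv is: creationDate|id|locationIP|browserUsed|content|length
LOAD file_Comment TO VERTEX Comment
    VALUES($"id", $"creationDate", $"locationIP", $"browserUsed",
      $"content", $"length")
    USING SEPARATOR="|", HEADER="true", EOL="\n";
```

Name-based references keep the mapping readable when the column order in the file differs from the attribute order in the schema, as it does here with creationDate in position $0.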

Refer to Creating a Loading Job in the GSQL Language Reference for descriptions of all the options for loading jobs.

Run the loading job

Use the command RUN LOADING JOB to run the loading job.

RUN LOADING JOB basic syntax (some options omitted)

RUN LOADING JOB [-noprint] job_name [
  USING filevar [="file_descriptor"][, filevar [="file_descriptor"]]*
  [,EOF="eof_mode"]
]

-noprint

By default, the loading job will run in the foreground and print the loading status and statistics after you submit the job. If the -noprint option is specified, the job will run in the background after displaying the job ID and the location of the log file.

filevar list

The optional USING clause may contain a list of file variables. Each file variable may optionally be assigned a file_descriptor, obeying the same format as in CREATE LOADING JOB. This list of file variables determines which parts of a loading job are run and what data files are used.

When a loading job is compiled, it generates one RESTPP endpoint for each filevar and source_object. As a consequence, a loading job can be run in parts. When RUN LOADING JOB is executed, only those endpoints whose filevar or file identifier (GSQL_FILENAME_n) is mentioned in the USING clause will be used. However, if the USING clause is omitted, then the entire loading job will be run.

If a file_descriptor is given, it overrides the file_descriptor defined in the loading job. If a particular filevar is not assigned a file_descriptor either in the loading job or in the RUN LOADING JOB statement, an error is reported and the job exits.

Streaming mode is not available for local file loading, so the EOF parameter will be ignored.
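
The following invocations sketch how the USING clause and the -noprint option apply to the example load_data job. They use only the filevars defined in that job and do not override any descriptors, so the paths compiled into the job apply.

```gsql
// Run the entire job: all three filevars, with their compile-time paths.
RUN LOADING JOB load_data

// Run only the Person and Comment parts; HAS_CREATOR edges are not loaded.
RUN LOADING JOB load_data USING file_Person, file_Comment

// The same partial run, submitted in the background.
RUN LOADING JOB -noprint load_data USING file_Person, file_Comment
```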

Manage and monitor your loading job

When a loading job starts, the GSQL server assigns it a job ID and displays it for the user to see. There are four key commands to monitor and manage loading jobs:

SHOW LOADING STATUS job_id|ALL
ABORT LOADING JOB job_id|ALL
RESUME LOADING JOB job_id
SHOW LOADING ERROR job_id

SHOW LOADING STATUS shows the current status of either a specified loading job or all current jobs. This command should be run within the scope of a graph:

GSQL > USE GRAPH graph_name
GSQL > SHOW LOADING STATUS ALL

For each loading job, the above command reports the following information:

Loading status
Loaded lines/Loaded objects/Error lines
Average loading speed
Size of loaded data
Duration

When inspecting all current jobs with SHOW LOADING STATUS ALL, the jobs in the FINISHED state will be omitted as they are considered to have successfully finished. You can use SHOW LOADING STATUS job_id to check the historical information of finished jobs. If the report for this job contains error data, you can use SHOW LOADING ERROR job_id to see the original data that caused the error.

See Managing and Inspecting a Loading Job for more details.

Manage loading job concurrency

See Loading Job Concurrency for how to manage the concurrency of loading jobs.

Files Loader Auto-Restart

See High Availability (HA) Overview.

Known Issues with Loading

TigerGraph does not store NULL values. Therefore, your input data should not contain any NULLs.