Running a Loading Job

Clearing and Initializing the Graph Store

There are two aspects to clearing the system: flushing the data and clearing the schema definitions in the catalog. Two different commands are available.

CLEAR GRAPH STORE

The CLEAR GRAPH STORE command flushes all the data out of the graph store (database). By default, the system asks you to confirm that you really want to discard all the graph data. To force the clear operation and bypass the confirmation question, use the -HARD option:

CLEAR GRAPH STORE -HARD

Clearing the graph store does not affect the schema.

  • Use the -HARD option with extreme caution. There is no undo option. -HARD must be in all capital letters.

  • CLEAR GRAPH STORE stops all the TigerGraph servers (GPE, GSE, RESTPP, Kafka, and Zookeeper).

  • Loading jobs and queries are aborted.

DROP ALL clears both the data and the schema.

Running a Loading Job

Running a loading job executes a previously installed loading job. The job reads lines from an input source, parses each line into data tokens, and applies loading rules and conditions to create new vertex and edge instances to store in the graph data store.

The input sources can be defined in the loading job or provided when running the job. Loading jobs can also be run by submitting an HTTP request to the REST++ server.

User privileges for running loading jobs are separate from the privileges for reading and writing vertex and edge data. A user can create and run loading jobs even without the privileges to modify vertex and edge data. For more information, see Access Control Model in TigerGraph.

RUN LOADING JOB

RUN LOADING JOB syntax for concurrent loading
RUN LOADING JOB [-noprint] [-dryrun] [-n [i],j] [-max_percent_error mpe]
job_name [
    USING filevar [="filepath_string"]
        [, filevar [="filepath_string"]]*
    [, CONCURRENCY=cnum] [, BATCH_SIZE=bnum] [, EOF="eof_mode"]
]

When a concurrent loading job is submitted, it is assigned a job ID number, which is displayed on the GSQL console. You can use this job ID to request a status update, abort the job, or restart the job. These operations are described later in this section.
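
For illustration, here is a hypothetical invocation (the job name load_person and the file variable f1 are assumed to have been defined in a prior CREATE LOADING JOB statement):

Example: running a loading job with an explicit input file
RUN LOADING JOB load_person USING f1="/data/persons.csv"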

Options

-noprint

By default, the command prints several lines of status information while the loading job is running.
If the -noprint option is included, the output omits the progress and summary details, but it still displays the job ID and the location of the log file.

Example of minimal output when -noprint option is used
Kick off the following job:
  JobName: load_videoE, jobid: gsql_demo_m1.1525091090494
  Loading log: '/usr/local/tigergraph/logs/restpp/restpp_loader_logs/gsql_demo/gsql_demo_m1.1525091090494.log'

-dryrun

If -dryrun is used, the system will read the data files and process the data as instructed by the job, but will NOT load any data into the graph. This option can be a useful diagnostic tool.
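
For instance, the following hypothetical invocation parses the input and reports loading statistics without modifying the graph (load_person and f1 are assumed names):

RUN LOADING JOB -dryrun load_person USING f1="/data/persons.csv"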

-n [i], j

The -n option limits the loading job to processing only a range of lines in each input data file. The -n flag accepts one or two arguments.

For example, -n 50 means read lines 1 to 50. -n 10,50 means read lines 10 to 50. The special symbol $ is interpreted as "last line", so -n 10,$ means read from line 10 to the end.
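
For example, to test a hypothetical job load_person against only the first 100 lines of each input file:

RUN LOADING JOB -n 100 load_person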

-max_percent_error mpe

The -max_percent_error option instructs the loader to abort the loading job if the percentage of rejected data lines in the input batch exceeds mpe (a value between 0 and 100). By default, loading continues until complete, which is equivalent to mpe being set to 100.

For example, if the input source has 1000 data lines and mpe = 25 (i.e., 25%), then the loading job aborts as soon as the 251st line is rejected.

If mpe = 0, then the loading job aborts if any input line is rejected.

There are two possible cases where an input line may be rejected:

  • Input format error: the input line cannot be parsed properly.

  • Filtering error: the input line can be parsed properly, but it does not pass certain filtering rules defined in the loading job itself.

The percentage number following -max_percent_error is required if the option is provided; omitting it causes a parsing error.
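
For example, this hypothetical invocation aborts the job as soon as more than 5% of the input lines have been rejected:

RUN LOADING JOB -max_percent_error 5 load_person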

Parameters

Below are the parameters for the RUN LOADING JOB command that are introduced by the USING clause.

filevar list

The optional USING clause may contain a list of file variables. Each file variable may optionally be assigned a filepath_string, obeying the same format as in the CREATE LOADING JOB. This list of file variables determines which parts of a loading job are run and what data files are used.

  • When a loading job is compiled, it generates one RESTPP endpoint for each filevar and filepath_string. As a consequence, a loading job can be run in parts. When RUN LOADING JOB is executed, only those endpoints whose filevar or file identifier (__GSQL_FILENAME_n__) is mentioned in the USING clause will be used. However, if the USING clause is omitted, then the entire loading job will be run.

  • If a filepath_string is given, it overrides the filepath_string defined in the loading job. If a particular filevar is not assigned a filepath_string either in the loading job or in the RUN LOADING JOB statement, then an error is reported and the job exits.
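
For example, if a hypothetical job load_person defines file variables f1 and f2, the following statement runs only the part of the job that reads f2, overriding its filepath:

RUN LOADING JOB load_person USING f2="/data/persons_update.csv"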

CONCURRENCY

The CONCURRENCY parameter sets the maximum number of concurrent requests that the loading job may send to the GPE. The default is 256.

BATCH_SIZE

The BATCH_SIZE parameter sets the number of data lines included in each concurrent request sent to the GPE. The default is 8192.
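
For example, a hypothetical run that reduces both the request concurrency and the request size:

RUN LOADING JOB load_person USING f1="/data/persons.csv", CONCURRENCY=64, BATCH_SIZE=4096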

EOF

The EOF option applies only for Kafka-based or connector-based loading. Direct loading of local files is always in EOF mode.

This is a boolean parameter. The loader has two modes: streaming mode ("False") and EOF mode ("True"). The default is EOF mode ("True").

  • In EOF mode, loading will stop after consuming the provided file objects.

  • In streaming mode, loading will never stop until the job is aborted.
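
For example, a hypothetical Kafka-based job could be run in streaming mode, consuming new messages until the job is aborted:

RUN LOADING JOB load_person_kafka USING EOF="false"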

Running Loading Jobs as REST Requests

Another way to run a loading job is through the POST /ddl/{graph_name} endpoint of the REST++ server. Since the REST++ server has more direct access to the graph processing engine, this can execute more quickly than a RUN LOADING JOB statement in GSQL. For details on how to use the endpoint, please see Run a loading job.
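
For illustration, a request might look like the following sketch (the graph name demo_graph, job name load_person, and file variable f1 are hypothetical; the default REST++ port 9000 is assumed). Here tag names the loading job, filename names the file variable to bind, and sep and eol describe the payload's separator and end-of-line characters:

curl -X POST --data-binary @./persons.csv \
    "http://localhost:9000/ddl/demo_graph?tag=load_person&filename=f1&sep=,&eol=\n"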

Troubleshooting Loading Job Delays

In v3.9.3, the current timestamp and the last updated timestamp were introduced to help keep track of the most recent progress of a loading job.

If you notice that a loading job has not been updated for an extended period of time, please follow the steps below to troubleshoot and resolve the issue.

  1. Verify GPE/GSE Status

    Confirm the status of the GPE and GSE for the loading job that is not being updated by checking their online status with gadmin status.

    Example
    $ gadmin status
    +--------------------+-------------------------+-------------------------+
    |    Service Name    |     Service Status      |      Process State      |
    +--------------------+-------------------------+-------------------------+
    |        GPE         |         Online          |         Running         |
    |        GSE         |         Online          |         Running         |
    +--------------------+-------------------------+-------------------------+

    Verify that both GPE and GSE are marked as "Online."

    If they are not "Online", please restart the services by running gadmin restart all -y.

  2. Re-run the Loading Job

    Once you have ensured that GPE and GSE are online and functioning, resolve the loading job delay by re-running the job.
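
    For example, using the job ID from the earlier example output, you could check the job's most recent progress in GSQL and then re-run it:

    SHOW LOADING STATUS gsql_demo_m1.1525091090494
    RUN LOADING JOB load_videoE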

If you encounter any further issues or require additional assistance, please consult our support team for expert guidance.