File Input Policy
GSQL file input policy allows users to apply restrictions on the location of local files used to load data to TigerGraph.
Allowlist and Blocklist
The policy consists of an allowlist and a blocklist:
-
Only paths that belongs to one of the
allowlists
will be allowed for input. -
The blocklist has higher priority over the
allowlist
.-
Ex. If a path belongs to both the
allowlist
andblocklist
simultaneously, the path will be blocked.
-
-
Users can use
gadmin config
to set theblocklist
andallowlist
for the file input policy.
gadmin config set GSQL.FileInputPolicy '["/__TG_DATA_ROOT__/gui/loading_data", "!/etc", "!/usr/lib"]' gadmin config apply -y gadmin restart gsql restpp -y
Similar to the File Output Policy the value for the config is a list of strings that start with either !/
for the blocklist
or /
for the allowlist
.
Format
The format of an entry in the blocklist
or allowlist
follow the rules below:
-
Paths can use data root:
__TG_DATA_ROOT__
Which will be translated to the actual data root when it’s used.
This is the same as gadmin config get System.DataRoot
. -
Paths must be an absolute path.
-
Paths must start with either
/
forallowlist
, or!/
forblocklist
. -
Dots are not allowed in the path (EX.
/a/../b
,/a/./b
). -
Paths should not end with
/
(EX./tmp/ → /tmp
).-
The only exception is for the root path.
-
-
Some commands will implicitly convert the input relative path to an absolute path with the GSQL client working directory (where the GSQL client was invoked)
-
Examples include:
CREATE LOADING JOB
,RUN LOADING JOB
,CREATE CONNECTOR
.
-
-
-
No hidden path/files can be added to the
allowlist
(EX./home/.ssh
).-
Hidden paths/files are allowed fto be added to the
blocklist
(EX.!/home/.ssh
).
-
Default value for File Input Policy
The default value for the GSQL.FileInputPolicy is the same as File Output Policy:
/
Unmodifiable Implicit Blocklist
3.10 Input policy may cause backward compatibility issues for customer upgrading from an older version. See Backward Compatibility for more details. |
This is applicable for both file INPUT
and OUTPUT
policy.
The unmodifiable implicit blocklist includes:
-
Any paths with hidden directory or file, (EX.
/home/tg/.ssh/id_rsa
,/tmp/.mydata
) -
Any TigerGraph sensitive paths, including:
[ APP_ROOT, DATA_ROOT/configs, DATA_ROOT/kafka, DATA_ROOT/zk, DATA_ROOT/etcd, DATA_ROOT/gstore, DATA_ROOT/ssh ]
APP_ROOT
is thePARENT
directory from the pathgadmin config get System.AppRoot
. AndDATA_ROOT
is same fromgadmin config get System.DataRoot
. -
Linux’s system sensitive directories, including:
[ /bin, /boot, /dev, /etc, /lib, /lib32, /libx32, /lib64, /opt, /proc, /run, /sbin, /srv, /sys, /usr, /var ]
Affected Operations
During an upgrade, TigerGraph checks all existing queries and loading jobs and shows a WARNING
message, if there exists blocked constant paths in queries or loading jobs.
If there are any WARNING
messages users are also prompted with a Y/N
during upgrade to let the user decide to continue or abort upgrade.
For a non-interactive upgrade the default is |
Loading Job
For a loading job, the input policy only takes into account the loading jobs on local files that are RUN
and/or CREATE
.
“CREATE LOADING JOB” and “RUN LOADING JOB” can use relative path, and GSQL server will use the working directory (where GSQL client was invoked) as base path to resolve absolute path. |
RUN LOADING JOB
When starting to run a file loading job check the given path and apply the file input policy to it.
Below are two example cases:
-
Path for local data file:
CREATE LOADING JOB load_ip FOR GRAPH test_graph { DEFINE FILENAME f1 = "any:/tmp/data1.csv";; LOAD f1 TO VERTEX account VALUES ($0, $0), TO VERTEX actorIP VALUES ($1, $1), TO EDGE event_property VALUES ($0 account, $1 actorIP) ; } // Notice the path can be either absolute or relative RUN LOADING JOB load_ip USING f1="m1:./resources/data_set/gsql/k_step_neighber.csv"
-
Path for config file:
CREATE DATA_SOURCE KAFKA ka = "/tmp/kafka_broker.json" FOR GRAPH g CREATE LOADING JOB load_kafka { // This path should also be considered DEFINE FILENAME f1 = "$ka:/tmp/kafka_topic.json"; LOAD f1 TO VERTEX v1 VALUES($0, $1); }
Loading job from Directory
Additionally, a loading job can use a directory instead of specific data file path for FILENAME
.
For example, when using the directory /dir_1
as FILENAME
, the TigerGraph loader will traverse all files in the directory to load data.
If users define the input policy as ['/dir_1', '!/dir_1/data_1']
, so that the directory /dir_1
is in allowlist
, while a file /dir_1/data_1
is in blocklist
the TigerGraph Loader will skip the data file /dir_1/data_1
.
While still loading other files that are not in blocklist
. (Ex. /dir_1/data_2
, /dir_1/data_3
, etc…).
[WARNING] The file "/dir_1/data_1" is skipped because it violates file input policy.
There are 2 ways to run loading jobs:
-
The GSQL command:
gsql -g G1 'run loading job load_job1'
-
The GSQL API:
Use the API to start the loading job and pass in the configuration json directly in string:curl --user tigergraph:tigergraph -d ' [ { "name":"load_person", "dataSources":[ { "filename":"f1", "name":"k1", "path":"", "config":{ "topic":"kiwi", "partition_list":[ { "start_offset":-2, "partition":0 } ] } } ], "streaming":false } ] ' -X POST "http://localhost:8123/gsql/loading-jobs?graph=test_graph&action=start"
User-Created Query
Installed Mode
selectVertex
selectVertex
will read existing vertices from a local file directly.
Users should check filepath for the function. |
CREATE QUERY selectVertexEx(STRING filename) FOR GRAPH socialNet { S1 = {SelectVertex(filename, $"c1", $1, ",", true)}; S2 = {SelectVertex(filename, $0, person, ",", true)}; PRINT S1, S2; # Both sets of inputs product the same result }
LoadAccum
LoadAccum
is supported in a query to load data from local file into global accumulator.
CREATE QUERY load_accum_ex (STRING filename) FOR GRAPH Social_Net { TYPEDEF TUPLE<STRING aaa, VERTEX<Post> ddd> Your_Tuple; MapAccum<VERTEX<Person>, MapAccum<INT, Your_Tuple>> @@test_map; GroupByAccum<STRING a, STRING b, MapAccum<STRING, STRING> strList> @@test_group_by; @@test_map = { LOADACCUM (filename, $0, $1, $2, $3, ",", false)}; @@test_group_by = { LOADACCUM ( filename, $1, $2, $3, $3, ",", true) }; PRINT @@test_map, @@test_group_by; }
Path for Configurations
We also allow parse
and read
configurations from local file system.
These commands can be protected by file input policy as well, including:
CREATE DATA_SOURCE KAFKA k1 = "/path/to/config" // from 3.9.0 CREATE CONNECTOR FROM "/tmp/conn.cfg" CREATE DATA_SOURCE STREAM s1 = "/tmp/ds_config.json"
If the object Also, |
Put UDF file
File input policy can also be applied to where UDF files are uploaded from.
Notice the PUT
command can also use relative path (implicitly converted to absolute path within the GSQL client working directory)
gsql ' PUT ExprFunctions FROM "resources/gsql/common/ExprFunctions.hpp" '
Similar to Execute GSQL File, this restriction does not apply to remote GSQL client. It only apply to local GSQL client. |
Execute GSQL File
There are 2 ways to execute GSQL file.
-
In a GSQL shell:
GSQL > @hello.gsql
-
From GSQL client directly:
gsql hello.gsql gsql -f hello.gsql
For a remote GSQL client, users do not need to apply file input policy. However, users need to apply the file input policy to local GSQL clients to avoid reading local files. |
Backward Compatibility
When upgrading from an old version:
-
GSQL will scan constant file paths in all queries and loading jobs in all graphs.
-
If violations of default file input/output policy are found (due to Unmodifiable Implicit Blocklist, a message will prompt the user to let them choose to continue or abort the upgrade.
3.9.3 to 3.10.0
For more details, here is an example when upgrading from 3.9.3 to 3.10.0 and when there exists some violations of file input and output paths.
-
The TigerGraph upgrade displays 1 loading job (
loadData_1
) and 2 queries (q1_FileInput
andq2_FileOutput
). These 3 objects include constant paths that do not comply with the unmodifiable blocklist of file input or output policy.
Users can choose to continue or abort the upgrade.
If continued, after the upgrade, the affected queries will fail to install and the loading jobs will fail to run. Users must rewrite the query/loading job in order to install them again. |
>> bash install.sh -U ... Do you want to switch platform to the new version now (it can be delayed to a later time)? (y/N): y ... [PROGRESS]: 23:30:16 Verify dict and UDF file ... ======= UPGRADE_OLD_VERSION: 3.9.3 ======= Run UDF Policy check since the config GSQL.UDF.Policy.Enable is true. Verify UDF file at path: /home/tigergraph/tigergraph/data/gsql/udf/ExprUtil.hpp Uploaded UDF file does not exist. Skip compatibility check on it. Verify UDF file at path: /home/tigergraph/tigergraph/data/gsql/udf/ExprFunctions.hpp Uploaded UDF file does not exist. Skip compatibility check on it. Successfully finished verifying UDF compatibility. ======== Start: Backward Compatibility Check on File Input/Output Policy ======== Collecting constant paths that violate file policy ... ------ Start graph: test_graph ------ - Query q1_FileInput: [/etc/os_data] - Query q2_FileOutput: [/home/tigergraph/tigergraph/data/gstore/tg_data] - Loading Job loadData_1: [/tmp/.hidden/data.csv] ------ Finish graph: test_graph ------ ======== Complete: Backward Compatibility Check on File Input/Output Policy ======== File input/output violations were observed on constant paths from the listed objects above. The default file input/output policy requires file path must: 1. Use absolute path, not use relative path. 2. Not use hidden path or file like "/.ssh/data", "/.mydata". 3. Not use any TigerGraph sensitive paths, including System.AppRoot and some paths under System.DataRoot. 4. Not use Linux system sensitive paths, like "/etc", "/sys" For more details, please check our documentation on file input/output policy. If continue, after upgrade the affected queries will fail to install, and loading jobs will fail to run. Successfully finished verifying catalog. [GTEST_IL] Please check debug log at: /home/tigergraph/tigergraph-3.10.0-offline-fileinput/gsql-checker-log/DEBUG.20240208-233017.776 Do you want to continue this upgrade? [y/N]:
For a non-interactive upgrade ( |
Limitations
Interpret Mode
Features listed are not yet supported in INTERPRET mode yet.
-
selectVertex
-
LoadAccum
-
UDF functions
Dynamic Paths during Upgrade
During a TigerGraph upgrade, file input policy is not able to collect dynamic file paths during the check.
CREATE QUERY qDynFile(STRING filename) { FILE file1(filename); file1.println("first line"); }
After an upgrade, TigerGraph may throw runtime errors when a user tries to run an old script to run query or loading jobs.
GSQL >> RUN QUERY qDynFile("/etc/os_data") Runtime Error: Linux system sensitive directory '/etc' is not allowed in path. Please use another path.
File Link
Similar to file output policy a Linux file system supports symbolic links (symlinks) and hard links. Both symlinks and hard links are attributes of a file. A file can be marked a symlink or be a hard link to another file.
TigerGraph file input policy can only control the link in the path to block or allow it.