Write Query Output to Cloud

Before you can write query results to an S3 bucket, you need to set up the connection to the S3 service. This requires credentials like an AWS Access Key ID and Secret Access Key. Make sure the necessary read/write permissions are granted for the S3 bucket. There are two methods to configure these credentials:

Using gadmin config

This method configures the S3 connection for the entire TigerGraph cluster. Once you set the credentials, they apply to all users in the cluster. Here’s how to do it:

gadmin config set GPE.QueryOutputS3AWSAccessKeyID <YOUR_AWS_ACCESS_KEY_ID>
gadmin config set GPE.QueryOutputS3AWSSecretAccessKey <YOUR_AWS_SECRET_ACCESS_KEY>
gadmin config apply -y
gadmin restart gpe -y
  • Replace <YOUR_AWS_ACCESS_KEY_ID> and <YOUR_AWS_SECRET_ACCESS_KEY> with your actual AWS credentials. The `gadmin config apply` and `gadmin restart gpe` steps are required for the new values to take effect.

Using GSQL Session Parameters

This method configures the credentials only for the current session and user.

SET s3_aws_access_key_id = "<YOUR_AWS_ACCESS_KEY_ID>"
SET s3_aws_secret_access_key = "<YOUR_AWS_SECRET_ACCESS_KEY>"
  • These parameters are set per session, so each user will need to provide their own credentials if they’re working with S3.

  • All queries run during this session will use these credentials to write to S3.
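If you script your GSQL sessions, one way to keep the secret values out of shell history is to compose the SET statements from environment variables. The sketch below is illustrative only (the environment variable names follow the common AWS convention and are an assumption, not a TigerGraph requirement):

```python
import os

# Fall back to placeholders if the environment variables are unset,
# so the generated script still shows where credentials belong.
access_key = os.environ.get("AWS_ACCESS_KEY_ID", "<YOUR_AWS_ACCESS_KEY_ID>")
secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY", "<YOUR_AWS_SECRET_ACCESS_KEY>")

# GSQL session-parameter statements, ready to paste into a GSQL session
# or feed to the gsql client as a script.
statements = [
    f'SET s3_aws_access_key_id = "{access_key}"',
    f'SET s3_aws_secret_access_key = "{secret_key}"',
]
print("\n".join(statements))
```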

[NOTE]: If both methods (gadmin config and GSQL session parameters) are configured at the same time, the GSQL session parameters take precedence.

Unique File Paths to Avoid Conflicts

Because S3 is a shared storage system, multiple nodes in a cluster can upload to the same S3 bucket. To avoid naming conflicts (multiple nodes writing to the same file), the S3 path includes a prefix based on the instance name:

  • Instance Name: A prefix like GPE_{PartitionId}_{ReplicaId} ensures uniqueness by identifying the instance that generated the output.

  • Role: For distributed queries, a role suffix further distinguishes the coordinator and worker roles, which can run on the same GPE:

    • Coordinator: The node managing the query (written as .coordinator).

    • Worker: The node processing the query (written as .worker).

So, your S3 file paths might look like:

  • GPE_{PartitionId}_{ReplicaId}.coordinator

  • GPE_{PartitionId}_{ReplicaId}.worker
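The naming scheme above can be sketched as a small Python helper. This is an illustration of the convention, not part of TigerGraph itself, and the single-server case (no role suffix) is an assumption:

```python
def s3_output_prefix(partition_id: int, replica_id: int, role: str = None) -> str:
    """Build the unique per-instance prefix used in S3 output paths.

    role is "coordinator" or "worker" for distributed queries; pass no role
    to get the bare instance name (illustrative convention).
    """
    prefix = f"GPE_{partition_id}_{replica_id}"
    return f"{prefix}.{role}" if role else prefix

print(s3_output_prefix(0, 1, "coordinator"))  # GPE_0_1.coordinator
print(s3_output_prefix(2, 0, "worker"))       # GPE_2_0.worker
```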

Example:

Consider a scenario where a 3 x 2 cluster is executing a distributed query, and the results are being saved to a file called queryResults.

In this case, the cluster has 3 partitions, each with 2 replicas, and the query is distributed across multiple nodes.

The output files generated by the query would be named as follows:

  • GPE_0_0.worker.queryResults.csv – output from the worker node in partition 0, replica 0.

  • GPE_0_1.coordinator.queryResults.csv – output from the coordinator node in partition 0, replica 1.

These unique file names help ensure that no conflicts occur when multiple nodes are writing query outputs at the same time.
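Continuing the 3 x 2 example, the full set of output names can be enumerated with a short sketch. Note the layout here is an assumption for illustration: every instance is assumed to write a worker file, and GPE_0_1 is assumed to be the coordinator (the actual role assignment happens at query time):

```python
partitions, replicas = 3, 2
base = "queryResults.csv"
coordinator = (0, 1)  # assumed coordinator instance for this illustration

files = []
for p in range(partitions):
    for r in range(replicas):
        # Each instance writes its own worker output file.
        files.append(f"GPE_{p}_{r}.worker.{base}")
        # The coordinator instance additionally writes a coordinator file.
        if (p, r) == coordinator:
            files.append(f"GPE_{p}_{r}.coordinator.{base}")

for name in files:
    print(name)
```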