JDBC Spark Connection
An open-source solution exists for loading data into TigerGraph using JDBC and Spark. This guide will show you how to set up a connection to read data from PostgreSQL through PySpark and write the data to TigerGraph.
For large volumes of data with fault-tolerance, we recommend our Kafka-based streaming data connector. |
This guide is written for Ubuntu 18.04.
Prerequisites
First make sure you have current versions of Java, Python, and Spark on the machine that will be the Spark server.
$ sudo apt update
$ sudo apt-get -y install openjdk-8-jdk-headless
$ sudo apt install python3
$ wget https://downloads.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop2.tgz
Download the PostgreSQL JDBC driver.
$ wget https://jdbc.postgresql.org/download/postgresql-42.5.0.jar
Download the TigerGraph JDBC driver.
$ wget https://repo1.maven.org/maven2/com/tigergraph/tigergraph-jdbc-driver/1.3.6/tigergraph-jdbc-driver-1.3.6.jar
Copy both drivers into the $SPARK_HOME/jars/
folder.
$ cp tigergraph-jdbc-driver-1.3.6.jar postgresql-42.5.0.jar $SPARK_HOME/jars
Test your Spark installation by running this command:
$ spark-submit --class org.apache.spark.examples.SparkPi spark/examples/jars/spark-examples_2.12-3.3.1.jar 10
Set up SSL connection
TigerGraph Cloud uses SSL connections, so we need to create the certificate for Spark to use.
For this step and other steps, you need your TigerGraph Cloud cluster URL, which can be found on the Clusters page.
A typical cluster URL looks like 7c141e8c7669404f923623083ba01da9.i.tgcloud.io
.
$ echo | openssl s_client -host <your TigerGraph Cloud URL> -port 443 | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > ~/tgcloud.crt
This SSL certificate, a .crt
file, cannot be used directly with the JDBC driver.
Instead, you must use it to create a Java KeyStore (JKS) file using the keytool
utility.
Create a password and use it where the sample command below has the placeholder <your password>
.
$ keytool -import -alias <your TigerGraph Cloud URL> -file ~/tgcloud.crt -keystore ~/trust.jks -storepass <your password>
Create the secret key on your TigerGraph instance
Go to Admin Portal, then on the sidebar, go to Management > Users.
Choose the graph you want to access. Create a secret key by following the instructions on the Manage Secrets page.
Save the secret key separately on your server. This key will be your password to access the TigerGraph database.
Create and run the PySpark script
Create a PySpark script file using the following template.
In this example, the file is named postgres_to_tg.py
.
You must decide how you want to map your source data to the graph.
The template example loads data into a graph vertex type called Account.
For more details on how to use the TigerGraph JDBC Connector, see the documentation at https://github.com/tigergraph/ecosys/blob/master/tools/etl/tg-jdbc-driver/README.md.
The last section of the file, the Loading to TigerGraph section, requires several parameters. Some are filled in already in the example, but others require your input.
-
url
: Your TigerGraph Cloud URL -
password
: Your secret key created with Admin Portal -
graph
: The graph for which you created the secret key -
truststore
: The JKS file you created previously -
trustStorePassword
: The JKS password
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("Python Spark pg to TigerGraph") \
.config("spark.driver.extraClassPath", "/home/ubuntu/postgresql-42.5.0.jar:/home/ubuntu/tigergraph-jdbc-driver-1.3.6.jar") \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/data_postgres") \
.option("dbtable", "account") \
.option("user", "postgres") \
.option("password", "postgres") \
.option("driver", "org.postgresql.Driver") \
.load()
//Loading to TigerGraph
df.write \
.mode("overwrite") \
.format("jdbc") \
.option("driver", "com.tigergraph.jdbc.Driver") \
.option("url", "jdbc:tg:https://7c141e8c3221404f923623083fa01da6.i.tgcloud.io:443") \
.option("user", "__GSQL__secret") \
.option("password", "<your secret>") \
.option("graph", "<your graph name>") \
.option("dbtable", "vertex Account") \
.option("debug", "1") \
.option("trustStore", "trust.jks") \
.option("trustStorePassword", "password") \
.option("trustStoreType", "JKS") \
.save()
To run the file, execute this command:
spark-submit postgres_to_tg.py