Troubleshooting Guide

Introduction

The Troubleshooting Guide teaches you how to check the status of your TigerGraph system and, when needed, how to find the log files so you can better understand why certain errors are occurring. This section covers log file debugging for data loading and querying.

General

Before any deeper investigation, always run these general system checks:

$ gadmin status        (Make sure all TigerGraph services are UP.)

$ df -lh               (Make sure all servers have enough free disk space.)

$ free -g              (Make sure all servers have enough memory.)

$ tsar                 (Make sure there is no irregular memory usage on the system.)

$ dmesg -T | tail      (Make sure there are no Out of Memory or other kernel errors.)
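
If you run these checks frequently, they can be wrapped in a small script. The sketch below is a minimal example that assumes a single-node installation and that tsar is installed; adjust it for clusters (for example, by running it on each server with grun).

#!/bin/bash
# quick_checks.sh (hypothetical name): run the general system checks in one pass.
gadmin status                        # All TigerGraph services should be UP.
df -lh                               # Disk space on this server.
free -g                              # Memory and swap usage.
command -v tsar >/dev/null && tsar   # Only run tsar if it is installed.
dmesg -T | tail                      # Look for Out of Memory or other kernel errors.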

Location of Log Files

The following command reveals the location of the log files:

gadmin log

You will be presented with a list of log files. The component name on the left of each file path indicates which TigerGraph component writes to that log file. The majority of the time, these files will contain what you are looking for. You may notice that there are multiple files for each TigerGraph component.

Files with the .out extension record errors, while files with the .INFO extension record normal behavior.

To diagnose an issue for a given component, check the .out log file for that component.
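
For example, to look at recent GPE errors, you can list the GPE log paths and then tail the .out file that gadmin log prints. The path below is illustrative only; substitute the actual path from your installation.

$ gadmin log -v gpe
$ tail -n 100 /home/tigergraph/tigergraph/logs/GPE_1_1/GPE_1_1.out      (Illustrative path; use the .out path printed above.)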

Other log files that are not listed by the gadmin log command are those for Zookeeper and Kafka, which can be found here:

zookeeper : ~/tigergraph/zk/zookeeper.out.*
kafka     : ~/tigergraph/kafka/kafka.out

Setting Up Automated Collection of Logs

Customers can set up automated collection of logs as well as automated restart of services. This ensures that all logs related to a critical process crash are collected in a timely fashion, minimizes the need for remote access to the customer environment to diagnose issues, and avoids delayed restarts of services.

Please follow the up-to-date instructions on the TigerGraph Ecosystem GitHub site. Here are the two steps:

  1. Save the following script as "~/.gsql/dump_log_auto_start.sh":

#!/bin/bash
# Lock file prevents overlapping runs when the cron job fires every minute.
DUMP_LOG_AUTO_START_LOCK="/tmp/.dump_log_auto_start_lock"

if [[ -f ${DUMP_LOG_AUTO_START_LOCK} ]]; then
    exit 0
fi

# If GPE or GSE is reported as not running, collect logs and restart the services.
if ~/.gium/gadmin status -v gpe gse | grep PROC | grep False; then
    touch ${DUMP_LOG_AUTO_START_LOCK}
    # dump 600 seconds logs before gpe/gse is down
    ~/.gium/gcollect -t 600 -o /tmp/dumplog-`date +"%Y%m%d-%T"` collect
    ~/.gium/gadmin start gpe gse
    rm ${DUMP_LOG_AUTO_START_LOCK}
fi

Note: By default, when GPE or GSE goes down, the script dumps the logs for the preceding 10 minutes (600 seconds) to the folder "/tmp/dumplog-$timestamp" (e.g., /tmp/dumplog-20200620-03:18:03) and restarts GPE and GSE automatically. The log collection window is set by the -t value in the script and can be changed as needed.

2. After configuring the script, add a cron job by running "crontab -e" and adding the following line at the end:

* * * * * /home/tigergraph/.gsql/dump_log_auto_start.sh >/dev/null 2>&1 </dev/null
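
After saving both pieces, a quick verification looks like this (the chmod is only needed if the script was not saved as executable):

$ chmod +x /home/tigergraph/.gsql/dump_log_auto_start.sh      (Make the script executable.)
$ crontab -l | grep dump_log_auto_start                       (Confirm the cron entry is in place.)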

Query Debugging

Checking the Logs - Flow of a query in the system

To better help you understand the flow of a query within the TigerGraph system, we've provided the diagram below with arrows showing the direction of information flow. We'll walk through the execution of a typical query to show you how to observe the information flow as recorded in the log files.

From calling a query to returning the result, here is how the information flows (a combined tracing sketch follows the steps below):

1. Nginx receives the request.

grep <QUERY_NAME> /home/tigergraph/tigergraph/logs/nginx/nginx_1.access.log


2. Nginx sends the request to Restpp.

grep <QUERY_NAME> /home/tigergraph/tigergraph/logs/RESTPP_1_1/log.INFO

3. Restpp sends an ID translation task to GSE and a query request to GPE.

4. GSE sends the translated IDs to GPE, and GPE starts to process the query.

5. GPE sends the query result to Restpp and sends a translation task to GSE, which then sends the translation result to Restpp.

grep <REQUEST_ID> /home/tigergraph/tigergraph/logs/GPE_1_1/log.INFO
grep <REQUEST_ID> /home/tigergraph/tigergraph/logs/GSE_1_1/log.INFO

6. Restpp sends the result back to Nginx.

grep <REQUEST_ID> /home/tigergraph/tigergraph/logs/RESTPP_1_1/log.INFO

7. Nginx sends the response.

grep <QUERY_NAME> /home/tigergraph/tigergraph/logs/nginx/nginx_1.access.log
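
Putting the steps above together, a rough end-to-end trace looks like the sketch below. It assumes that the RESTPP log lines mentioning the query name also carry the request ID; the exact log format can vary by version.

$ grep <QUERY_NAME> /home/tigergraph/tigergraph/logs/nginx/nginx_1.access.log | tail -n 5
$ grep <QUERY_NAME> /home/tigergraph/tigergraph/logs/RESTPP_1_1/log.INFO | tail -n 5
# Note the request ID in the RESTPP lines above, then follow it through GPE, GSE, and back.
$ grep <REQUEST_ID> /home/tigergraph/tigergraph/logs/GPE_1_1/log.INFO
$ grep <REQUEST_ID> /home/tigergraph/tigergraph/logs/GSE_1_1/log.INFO
$ grep <REQUEST_ID> /home/tigergraph/tigergraph/logs/RESTPP_1_1/log.INFO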

Other Useful Commands for Query Debugging

Check recently executed queries:
$ grep UDF:: /home/tigergraph/tigergraph/logs/GPE_1_1/log.INFO | tail -n 50

Get the number of queries executed recently:
$ grep UDF::End /home/tigergraph/tigergraph/logs/GPE_1_1/log.INFO | wc -l

Grep distributed query log:
$ grep "Action done" /home/tigergraph/tigergraph/logs/GPE_1_1/log.INFO | tail -n 50


Grep logs from all servers:
$ grun all "grep UDF:: /home/tigergraph/tigergraph/logs/GPE_*/log.INFO | tail -n 50"

Slow Query Performance

Multiple situations can lead to slower than expected query performance:

  • Insufficient Memory When a query begins to use too much memory, the engine starts to put data onto the disk, and memory swapping kicks in. Use the Linux command free -g to check available memory and swap status (a combined triage sketch follows this list). To combat this, you can either optimize the data structures used within the query or increase the physical memory on the machine.

  • GSQL Logic Usually, a single server can process up to 20 million edges per second. If the actual rate of vertices or edges processed is much lower, it is most often due to inefficient query logic; that is, the query logic is not following the natural execution flow of GSQL. You will need to optimize your query to tune the performance.

  • Disk IO When the query writes its result to the local disk, disk IO may be the bottleneck for the query's performance. Disk performance can be checked with the Linux command sar 1 10. If you are writing (PRINT) one line at a time and there are many lines, accumulating the data in one data structure before printing may improve query performance.

  • Huge JSON Response If the JSON response size of a query is too massive, it may take longer to compose and transfer the JSON result than to actually traverse the graph. To see if this is the cause, check the GPE log.INFO file. If the query execution is already completed in GPE but has not been returned, and CPU usage is at about 200%, this is the most probable cause. If possible, please reduce the size of the JSON being printed.

  • Memory Leak This is a very rare issue. The query will progressively become slower and slower, while GPE's memory usage increases over time. If you experience these symptoms on your system, please report this to the TigerGraph team.

  • Network Issues When there are network issues during communication between servers, the query can be slowed down drastically. To identify that this is the issue, you can check the CPU usage of your system along with the GPE log.INFO file. If the CPU usage stays at a very low level and GPE keeps printing ??? , this means network IO is very high.

  • Frequent Data Ingestion in Small Batches Small batches of data can increase the data loading overhead and query processing workload. Please increase the batch size to prevent this issue.
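
As mentioned in the Insufficient Memory and Disk IO items above, a quick triage pass while a slow query is running can be done with commands already covered in this guide:

$ free -g                                                        (Is memory exhausted? Is swap in use?)
$ sar 1 10                                                       (Sample CPU and disk activity for 10 seconds.)
$ tail -n 20 /home/tigergraph/tigergraph/logs/GPE_1_1/log.INFO   (What is GPE currently doing?)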

Query Hangs

When a query hangs, or seems to run forever, it can be attributed to these possibilities:

  • Services are down Please check that TigerGraph services are online and running. Run gadmin status and check the logs of any service that is not up.

  • Query infinite loop To verify this is the issue, check the GPE log.INFO file to see whether graph iteration log lines are still being produced (a tailing sketch follows this list). If they are, and the edgeMaps report the same number of edges every few iterations, you have an infinite loop in your query. If this is the case, restart GPE to stop the query: gadmin restart gpe -y. Then refine your query and make sure every loop in it can reach its exit condition.

  • Query is still running, it is just slow If you have a very large graph, please be patient. Ensure that there is no infinite loop in your query, and refer to the slow query performance section for possible causes.

  • GraphStudio Error If you are running the query from GraphStudio, the loading bar may continue spinning as if the query has not finished running. You can right-click the page and select inspect->console (in the Google Chrome browser) and try to find any suspicious errors there.
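
For the infinite-loop case above, a simple way to watch whether GPE is still producing iteration lines is to re-read the end of its log every few seconds; this is only a quick sketch, not a definitive diagnostic:

$ watch -n 5 "tail -n 5 /home/tigergraph/tigergraph/logs/GPE_1_1/log.INFO"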

Query Returns No Result

If a query runs and does not return a result, it could be due to two reasons:

1. Data is not loaded. From the Load Data page in GraphStudio, you can check the total number of loaded vertices and edges, as well as the number of each vertex and edge type. Please ensure that all the vertices and edges needed for the query are loaded.

2. Properties are not loaded. The number of vertices and edges traversed can be observed in the GPE log.INFO file. If for one of the iterations you see "activated 0 vertices", no target vertex satisfied your search condition; for example, the query may have filtered everything out in a WHERE clause or a HAVING clause. If you see 0 vertex reduces while the edge map number is not 0, all edges were filtered out by the WHERE clause, and no vertices entered the POST-ACCUM phase. If you see more than 0 vertex reduces but "activated 0 vertices", all the vertices were filtered out by the HAVING clause.

To confirm the reasoning within the log file, use GraphStudio to pick a few vertices or edges that should have satisfied the conditions and check their attributes for any unexpected errors.
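
To find the iteration lines described above without reading the whole log, you can grep for the activation messages; the exact wording of these lines may differ slightly between versions:

$ grep -i "activated" /home/tigergraph/tigergraph/logs/GPE_1_1/log.INFO | tail -n 20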

Query Installation Failed

Query Installation may fail for a handful of reasons. If a query fails to install, please check the GSQL log file. The default location for the GSQL log is here:

/home/tigergraph/tigergraph/logs/gsql_server_log/GSQL_LOG

Scroll to the last error entry; it will point you to the cause of the failure and show any query errors behind the failed installation. If you have created a user-defined function, you could potentially have a C++ compilation error.

If you have a C++ error in a user-defined function, your query will fail to install, even if it does not utilize the UDF.
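
A quick way to jump to the most recent problems in the GSQL log is shown below; the "error" pattern is a general guess at the wording, so adjust it to what you see in your log:

$ tail -n 200 /home/tigergraph/tigergraph/logs/gsql_server_log/GSQL_LOG
$ grep -in "error" /home/tigergraph/tigergraph/logs/gsql_server_log/GSQL_LOG | tail -n 20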

Data Loading Debugging

Checking the Logs

GraphStudio

Using GraphStudio, you can see, at a high level, the number of errors that may have occurred during loading. This is accessible from the Load Data page. Click on one of your data sources, then click on the second tab of the graph statistics chart. There, you will see the status of the data source loading, the number of loaded lines, the number of lines missing data, and the lines that may have an incorrect number of columns. (Refer to the picture below.)

Command Line

If you see a number of issues on the GraphStudio Load Data page, you can dive deeper into the cause by examining the log files. Check the loading log located here:

/home/tigergraph/tigergraph/logs/restpp/restpp_loader_logs/<GRAPH_NAME>/

Open up the latest .log file and you will be able to see details about each data source. The picture below is an example of a correctly loaded data file.

Here is an example of a loading job with errors:

From this log entry, you can see the errors marked as lines with invalid attributes. The log provides the line number in the data source that contains the loading error, along with the attribute it was attempting to load.
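
To find the latest loader log and pull out only the problem lines, something like the following works; the "invalid" pattern matches the wording shown above but may vary by version:

$ ls -t /home/tigergraph/tigergraph/logs/restpp/restpp_loader_logs/<GRAPH_NAME>/ | head -n 1
$ grep -i "invalid" /home/tigergraph/tigergraph/logs/restpp/restpp_loader_logs/<GRAPH_NAME>/<LATEST_LOG_FILE> | head -n 20      (Substitute the file name printed by the first command.)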

Slow Loading

Normally, a single server running TigerGraph will be able to load from 100k to 1000k lines per second, or 100GB to 200GB of data per hour. This can be impacted by any of the following factors:

  • Loading Logic How many vertices/edges are generated from each line loaded?

  • Data Format Is the data formatted as JSON or CSV? Are multi-level delimiters in use? Does the loading job intensively use temp_tables?

  • Hardware Configuration Is the machine set up with HDD or SSD? How many CPU cores are available on this machine?

  • Network Issue Is this machine doing local loading or remote POST loading? Any network connectivity issues?

  • Size of Files How large are the files being loaded? Many small files may decrease the performance of the loading job.

  • High Cardinality Values Being Loaded to String Compress Attribute Type How diverse is the set of data being loaded to the String Compress attribute?

To combat the issue of slow loading, there are also multiple methods:

  • If the computer has many cores, consider increasing the number of Restpp load handlers.

$ gadmin --config handler      (Increase the number of handlers, then save.)
$ gadmin --config apply
  • Separate ~/tigergraph/kafka from ~/tigergraph/gstore and store them on separate disks.

  • Do distributed loading.

  • Do offline batch loading.

  • Combine many small files into one larger file (see the sketch after this list).
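
For the last item, combining many small CSV files into one larger file can be as simple as a concatenation, assuming the files share the same column layout and contain no header rows (strip headers first if they do). The paths below are hypothetical:

$ cat /data/small_files/*.csv > /data/combined.csv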

Loading Hangs

When a loading job seems to be stuck, here are things to check for:

  • GPE is DOWN You can check the status of GPE with the command gadmin status gpe. If GPE is down, you can find the relevant logs with gadmin log -v gpe. (A quick run-through of the checks in this list follows below.)

  • Memory is full Run this command to check memory usage on the system: free -g

  • Disk is full Check disk usage on the system: df -lh

  • Kafka is DOWN You can check the status of Kafka with the command gadmin status kafka. If it is down, take a look at the log with vim ~/tigergraph/kafka/kafka.out.

  • Multiple Loading Jobs By default, the Kafka loader is configured to allow a single loading job. If you execute multiple loading jobs at once, they will run sequentially.
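
The checks in this list can be run back to back in under a minute:

$ gadmin status gpe                            (Is GPE up?)
$ gadmin status kafka                          (Is Kafka up?)
$ free -g                                      (Is memory full?)
$ df -lh                                       (Is disk full?)
$ tail -n 50 ~/tigergraph/kafka/kafka.out      (Any Kafka errors?)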

Data Not Loaded

If the loading job completes, but data is not loaded, there may be issues with the data source or your loading job. Here are things to check for:

  • Any invalid lines in the data source file. Check the log file for any errors (see the sketch after this list). If an input value does not match the vertex or edge type, the corresponding vertex or edge will not be created.

  • Using quotes in the data file may interfere with the tokenization of elements in the data file. Please check the GSQL Language Reference section under Other Optional LOAD Clauses, and look for the QUOTE parameter to see how to set up your loading job.

  • Your loading job loads edges in the incorrect order. When you defined the graph schema, the FROM and TO vertex order determines how you must write the loading job. If you wrote the loading job with the order reversed, the edges will not be created, which may also affect the population of vertices.
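
For the first item in this list, a quick look at the raw data file often reveals the problem. The path and comma delimiter below are hypothetical; use your own file and separator:

$ head -n 5 /data/my_file.csv                                       (Eyeball a few lines for stray quotes or bad values.)
$ awk -F',' '{print NF}' /data/my_file.csv | sort | uniq -c         (Count columns per line; more than one count means inconsistent rows.)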

Loading Failure

Possible causes of a loading job failure are:

  • Loading job timed out If a loading job hangs for 600 seconds, it will automatically time out.

  • Port Occupied Loading jobs require port 8500. Please ensure that this port is available (see the check after this list).
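
For the port check above, you can confirm whether something is already listening on port 8500 (ss is available on most modern Linux distributions; netstat -tln works similarly):

$ ss -lnt | grep 8500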

Further Debugging

If after taking these actions you cannot solve the issue, please reach out to support@tigergraph.com to request assistance.
