The Troubleshooting Guide teaches you how to check the status of your TigerGraph system and, when needed, how to find the log files to better understand why certain errors are occurring. This section covers log file debugging for data loading and querying.
Before any deeper investigation, always run these general system checks:
$ gadmin status (Make sure all TigerGraph services are UP.)
$ df -lh (Make sure all servers are getting enough disk space.)
$ free -g (Make sure all servers have enough memory.)
$ tsar (Make sure there is no irregular memory usage on the system.)
$ dmesg -T | tail (Make sure there are no Out of Memory, or any other errors.)
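The disk space check above can be scripted so that nearly full filesystems stand out. The following is a minimal sketch and not part of TigerGraph: flag_full_disks is a hypothetical helper name, and the 90% threshold is an arbitrary assumption. It reads df -P style output on standard input and prints each mount point at or above the threshold.

```shell
# Hypothetical helper (not a TigerGraph tool): print mount points whose
# usage is at or above a threshold percentage (default 90, an assumption).
# Reads `df -P` style output on stdin, so it can be tested without a live server.
flag_full_disks() {
  local threshold="${1:-90}"
  awk -v t="$threshold" '
    NR > 1 {                      # skip the df header line
      use = $5
      sub(/%/, "", use)           # "95%" -> "95"
      if (use + 0 >= t + 0) print $6, $5
    }'
}

# Usage on a live server (-P keeps each filesystem on a single line):
#   df -lP | flag_full_disks 90
```

Mount points that contain spaces are not handled by this sketch.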
The following command reveals the location of the log files :
You will be presented with a list of log files. The left side of the resulting file paths is the component for which the respective log file is logging information. The majority of the time, these files will contain what you are looking for. You may notice that there are multiple files for each TigerGraph component.
To diagnose an issue for a given component, check the log file with the .out extension for that component.
$ gadmin log
ADMIN : /home/tigergraph/tigergraph/log/admin/ADMIN#1.out
ADMIN : /home/tigergraph/tigergraph/log/admin/ADMIN.INFO
CTRL : /home/tigergraph/tigergraph/log/controller/CTRL#1.log
CTRL : /home/tigergraph/tigergraph/log/controller/CTRL#1.out
DICT : /home/tigergraph/tigergraph/log/dict/DICT#1.out
DICT : /home/tigergraph/tigergraph/log/dict/DICT.INFO
ETCD : /home/tigergraph/tigergraph/log/etcd/ETCD#1.out
EXE : /home/tigergraph/tigergraph/log/executor/EXE_1.log
EXE : /home/tigergraph/tigergraph/log/executor/EXE_1.out
GPE : /home/tigergraph/tigergraph/log/gpe/GPE_1#1.out
GPE : /home/tigergraph/tigergraph/log/gpe/log.INFO
GSE : /home/tigergraph/tigergraph/log/gse/GSE_1#1.out
GSE : /home/tigergraph/tigergraph/log/gse/log.INFO
GSQL : /home/tigergraph/tigergraph/log/gsql/GSQL#1.out
GSQL : /home/tigergraph/tigergraph/log/gsql/log.INFO
GUI : /home/tigergraph/tigergraph/log/gui/GUI#1.out
IFM : /home/tigergraph/tigergraph/log/informant/IFM#1.log
IFM : /home/tigergraph/tigergraph/log/informant/IFM#1.out
KAFKA : /home/tigergraph/tigergraph/log/kafka/controller.log
KAFKA : /home/tigergraph/tigergraph/log/kafka/kafka-request.log
KAFKA : /home/tigergraph/tigergraph/log/kafka/kafka.log
KAFKA : /home/tigergraph/tigergraph/log/kafka/server.log
KAFKA : /home/tigergraph/tigergraph/log/kafka/state-change.log
KAFKACONN: /home/tigergraph/tigergraph/log/kafkaconn/KAFKACONN#1.out
KAFKACONN: /home/tigergraph/tigergraph/log/kafkaconn/kafkaconn.log
KAFKASTRM-LL: /home/tigergraph/tigergraph/log/kafkastrm-ll/KAFKASTRM-LL_1.out
KAFKASTRM-LL: /home/tigergraph/tigergraph/log/kafkastrm-ll/kafkastrm-ll.log
NGINX : /home/tigergraph/tigergraph/log/nginx/logs/NGINX#1.out
NGINX : /home/tigergraph/tigergraph/log/nginx/logs/error.log
NGINX : /home/tigergraph/tigergraph/log/nginx/logs/nginx.access.log
NGINX : /home/tigergraph/tigergraph/log/nginx/logs/nginx.error.log
RESTPP : /home/tigergraph/tigergraph/log/restpp/RESTPP#1.out
RESTPP : /home/tigergraph/tigergraph/log/restpp/log.INFO
RESTPP-LOADER: /home/tigergraph/tigergraph/log/fileLoader/log.INFO
TS3 : /home/tigergraph/tigergraph/log/ts3/TS3_1.log
TS3 : /home/tigergraph/tigergraph/log/ts3/TS3_1.out
TS3SERV: /home/tigergraph/tigergraph/log/ts3serv/TS3SERV#1.out
ZK : /home/tigergraph/tigergraph/log/zk/ZK#1.out
ZK : /home/tigergraph/tigergraph/log/zk/zookeeper.log
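Because the listing prints one "COMPONENT : path" pair per line, it is easy to filter down to the .out files for a single component. The sketch below assumes exactly that layout; out_logs_for is a hypothetical helper name, not a gadmin feature.

```shell
# Hypothetical helper: reduce `gadmin log` style output (read from stdin)
# to the .out files of one component. Assumes "COMPONENT : /path" lines
# as in the listing above; component matching is case-insensitive.
out_logs_for() {
  local component="$1"
  awk -F' *: *' -v c="$component" '
    toupper($1) == toupper(c) && $2 ~ /\.out$/ { print $2 }'
}

# Usage on a live server:
#   gadmin log | out_logs_for gpe
```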
To aid in the effort of system debugging, there is a tool you can use to collect all relevant log files from around the time of a system malfunction or error. Collection of these files greatly improves the efficiency of the support process, as this minimizes the need to access a customer environment to diagnose issues remotely. This will also avoid delayed restart of services.
Here is the usage of the gcollect command, which collects the relevant information from the TigerGraph servers:
gcollect [Options] COMMAND
Options:
-h, --help: show this help message and exit
-A num, --after-context num: Print num lines of trailing context after each match.
-B num, --before-context num: Print num lines of leading context before each match.
-c, --components gpe,gse,rest: only collect information related to the specified component(s). All by default. Supported components: gpe,gse,gsql,dict,tsar,kafka,zk,rest,nginx,admin,restpp_loader,fab,kafka-stream,kafka-connect
-n, --nodes m1,m2: only search patterns for specified nodes. (only works in together with command "grep")
-s, --start DateTime: logs older than this DateTime will be ignored. Format: 2006-01-02,15:04:05
-e, --end DateTime: logs newer than this DateTime will be ignored. Format: 2006-01-02,15:04:05
-t, --tminus num: only search for logs that are generated in the past num seconds.
-r, --request_id id: only collect information related to the specified request id. Lines match "pattern" will also be printed.
-b, --before num: how long before the query should we start collecting. (in seconds, could ONLY be used with [--request_id] option).
-d, --duration num: how long after the query should we stop collecting. (in seconds, could ONLY be used with [--request_id] option).
-o, --output_dir dir: specify the output directory, "./output" by default. (ALERT: files in this folder will be DELETED.)
-p, --pattern regex: collect lines from logs which match the regular expression. (Could have more than one regex, lines that match any of the regular expressions will be printed.)
-i, --ignore-case: Ignore case distinctions in both the PATTERN and the input files.
-D, --display: Print to screen.
Commands:
grep: search patterns from logs files that have been collected before.
show: show all the requests' id during the specified time window.
collect: collect all the debugging information which satisfy all the requirements specified by Options.
All the log files will be written to the output directory specified when running the gcollect command, and each node has its own subdirectory. Each component will have one or two log files.
# show all requests during last hour
./bin/gcollect -t 3600 show
# collect debug info for a specific request, either contains "RESTPP_2_1.1559075028795" or "error", from [T - 60s] to [T + 120s] (T is the time when the request was issued)
./bin/gcollect -r RESTPP_2_1.1559075028795 -b 60 -d 120 -p "error" collect
# collect debug info during "5/22/2019,18:00" and "5/22/2019,19:00" for all components, either contains "error" or "FAILED", case insensitive.
./bin/gcollect -i -p "error" -p "FAILED" -s "2019-05-22,18:00:00" -e "2019-05-22,19:00:00" collect
# Search for "unknown" from logs files that have been collected before, only search components "admin" and "gpe", case insensitive, print 1 line of trailing context after matching lines, print 2 lines of leading context before matching lines, print to screen.
./bin/gcollect -i -p "unknown" -c admin,gpe -D -A 1 -B 2 grep
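The -s and -e options expect timestamps in the form 2006-01-02,15:04:05. When you only know roughly when an incident happened, a small helper can build that window for you. This is a sketch under stated assumptions: gcollect_window is a hypothetical name, it relies on GNU date (the -d "@epoch" form), and it formats times in the shell's current time zone, which is assumed to match the time zone used in the logs.

```shell
# Hypothetical helper: print "-s START -e END" arguments for gcollect,
# covering [center - before, center + after], all given in seconds.
# Requires GNU date; uses the current TZ setting.
gcollect_window() {
  local center="$1" before="$2" after="$3"
  local fmt='+%Y-%m-%d,%H:%M:%S'
  printf '%s %s %s %s\n' \
    -s "$(date -d "@$((center - before))" "$fmt")" \
    -e "$(date -d "@$((center + after))" "$fmt")"
}

# Usage: collect from 10 minutes before to 5 minutes after a known epoch time.
#   ./bin/gcollect $(gcollect_window 1558548000 600 300) collect
```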
The installation will quit and output a message if any dependency packages are missing. Please run bash install_tools.sh to install all missing packages. You will need an internet connection to install the missing dependencies.
The /home directory requires at least 200MB of space, or the installation will fail with an out of disk message. This space is needed only temporarily during installation; the files will be moved to the root directory once installation is complete.
The /tmp directory requires at least 1GB of space, or the installation will fail with an out of disk message.
The directory in which you choose to install TigerGraph requires at least 20GB of space, otherwise the installation will report the error and exit.
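These space requirements can be checked before starting the installer. The following is a minimal sketch: has_free_mb is a hypothetical helper name, and the installation path in the usage comment is a placeholder. It reads the available space reported by df -Pk for the filesystem that holds the given directory.

```shell
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# MIN_MB megabytes available, according to `df -Pk`.
has_free_mb() {
  local dir="$1" min_mb="$2" avail_kb
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ -n "$avail_kb" ] && [ "$((avail_kb / 1024))" -ge "$min_mb" ]
}

# Usage with the thresholds listed above (the install path is a placeholder):
#   has_free_mb /home 200                  || echo "/home needs at least 200MB free"
#   has_free_mb /tmp 1024                  || echo "/tmp needs at least 1GB free"
#   has_free_mb /path/to/install/dir 20480 || echo "install directory needs at least 20GB free"
```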
If your firewall blocks all ports not defined for use, we recommend opening up internal ports 1000-50000.
If you are using a cloud instance (e.g., Amazon AWS or Microsoft Azure), you will need to configure the firewall rules through the respective consoles.
If you are managing a local machine, you can manage your open ports using the iptables command. Please refer to the example below to help with your firewall configuration.
# iptables help page
$ sudo iptables -h
# This will list your firewall rules
$ sudo iptables -L
# Allow incoming SSH connections to port 22 from the 192.168.0.0 subnet
$ sudo iptables -A INPUT -p tcp --dport 22 -s 192.168.0.0/24 -j ACCEPT
$ sudo iptables -A INPUT -p tcp --dport 22 -s 127.0.0.0/8 -j ACCEPT
$ sudo iptables -A INPUT -p tcp --dport 22 -j DROP
As of v3.0, we can run the installer with the -F flag, which will open tcp ports among the cluster nodes. This will resolve any firewall issues that may block the installation from completing.
To better help you understand the flow of a query within the TigerGraph system, we've provided the diagram below with arrows showing the direction of information flow. We'll walk through the execution of a typical query to show you how to observe the information flow as recorded in the log files.
From calling a query to returning the result, here is how the information flows:
1. Nginx receives the request.
grep <QUERY_NAME> /home/tigergraph/tigergraph/log/nginx/logs/nginx.access.log
2. Nginx sends the request to Restpp.
grep <QUERY_NAME> /home/tigergraph/tigergraph/log/restpp/log.INFO
3. Restpp sends an ID translation task to GSE and a query request to GPE.
4. GSE sends the translated ID to GPE, and the GPE starts to process the query.
5. GPE sends the query result to Restpp, and sends a translation task to GSE, which then sends the translation result to Restpp.
grep <REQUEST_ID> /home/tigergraph/tigergraph/log/gpe/log.INFO
grep <REQUEST_ID> /home/tigergraph/tigergraph/log/gse/log.INFO
6. Restpp sends the result back to Nginx.
grep <REQUEST_ID> /home/tigergraph/tigergraph/log/restpp/log.INFO
7. Nginx sends the response.
grep <QUERY_NAME> /home/tigergraph/tigergraph/log/nginx/logs/nginx.access.log
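The per-component greps for a request ID can be combined into one pass that labels each matching line with the component it came from. This is only a sketch: trace_request is a hypothetical helper name, and the default log root is taken from the paths shown above and may differ on your installation.

```shell
# Hypothetical helper: follow one request ID through the RESTPP, GPE and
# GSE logs in that order, prefixing each matching line with its component.
trace_request() {
  local id="$1" root="${2:-/home/tigergraph/tigergraph/log}"
  local comp file
  for comp in restpp gpe gse; do
    file="$root/$comp/log.INFO"
    [ -r "$file" ] || continue                 # skip logs that are absent on this node
    grep -F -- "$id" "$file" | sed "s|^|[$comp] |"
  done
}

# Usage:
#   trace_request RESTPP_2_1.1559075028795
```

The ID is matched as a fixed string (grep -F), so dots in the request ID are not treated as wildcards.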
Check recently executed queries:
$ grep UDF:: /home/tigergraph/tigergraph/log/gpe/log.INFO | tail -n 50
Get the number of queries executed recently:
$ grep UDF::End /home/tigergraph/tigergraph/log/gpe/log.INFO | wc -l
Grep distributed query log:
$ grep "Action done" /home/tigergraph/tigergraph/log/gpe/log.INFO | tail -n 50
Grep logs from all servers:
$ grun all "grep UDF:: /home/tigergraph/tigergraph/log/gpe/log.INFO | tail -n 50"
Multiple situations can lead to slower than expected query performance:
When a query begins to use too much memory, the engine will start to put data onto the disk, and memory swapping will also kick in. Use the Linux command free -g to check available memory and swap status. To combat this, you can either optimize the data structure used within the query or increase the physical memory size on the machine.
GSQL Logic Usually, a single server machine can process up to 20 million edges per second. If the actual number of vertices or edges processed per second is much lower, it is most often due to inefficient query logic. That is, the query logic is not following the natural execution of GSQL. You will need to optimize your query to tune the performance.
When the query writes the result to the local disk, the disk IO may be the bottleneck for the query's performance. Disk performance can be checked with this Linux command: sar 1 10
If you are writing (PRINT) one line at a time and there are many lines, storing the data in one data structure before printing may improve the query performance.
Huge JSON Response If the JSON response size of a query is too massive, it may take longer to compose and transfer the JSON result than to actually traverse the graph. To see if this is the cause, check the GPE log.INFO file. If the query execution is already completed in GPE but has not been returned, and CPU usage is at about 200%, this is the most probable cause. If possible, please reduce the size of the JSON being printed.
Memory Leak This is a very rare issue. The query will progressively become slower and slower, while GPE's memory usage increases over time. If you experience these symptoms on your system, please report this to the TigerGraph team.
Network Issues When there are network issues during communication between servers, the query can be slowed down drastically. To identify that this is the issue, you can check the CPU usage of your system along with the GPE log.INFO file. If the CPU usage stays at a very low level and GPE keeps printing ??? , this means network IO is very high.
Frequent Data Ingestion in Small Batches Small batches of data can increase the data loading overhead and query processing workload. Please increase the batch size to prevent this issue.
When a query hangs, or seems to run forever, it can be attributed to these possibilities :
Services are down
Please check that TigerGraph services are online and running. Run gadmin status and check the logs for any issues that the status check reveals.
Query infinite loop
To verify this is the issue, check the GPE log.INFO file to see if graph iteration log lines are continuing to be produced. If they are, and the edgeMaps log the same number of edges every few iterations, you have an infinite loop in your query.
If this is the case, please restart GPE to stop the query: gadmin restart gpe -y
Then refine your query and make sure every loop in it has an exit condition that can actually be reached.
Query is still running, it is just slow If you have a very large graph, please be patient. Ensure that there is no infinite loop in your query, and refer to the slow query performance section for possible causes.
GraphStudio Error If you are running the query from GraphStudio, the loading bar may continue spinning as if the query has not finished running. You can right-click the page and select inspect->console (in the Google Chrome browser) and try to find any suspicious errors there.
If a query runs and does not return a result, it could be due to two reasons:
1. Data is not loaded. From the Load Data page in GraphStudio, you can check the number of loaded vertices and edges, as well as the number of each vertex or edge type. Please ensure that all the vertices and edges needed for the query are loaded.
2. Properties are not loaded. The number of vertices and edges traversed can be observed in the GPE log.INFO file. If for one of the iterations you see activated 0 vertices, this means no target vertex satisfied your searching condition. For example, the query can fail to pass a WHERE clause or a HAVING clause. If you see 0 vertex reduces while the edge map number is not 0, that means that all edges have been filtered out by the WHERE clause, and that no vertices have entered into the POST-ACCUM phase. If you see more than 0 vertex reduces, but activated 0 vertices, this means all the vertices were filtered out by the HAVING clause.
To confirm the reasoning within the log file, use GraphStudio to pick a few vertices or edges that should have satisfied the conditions and check their attributes for any unexpected errors.
Query Installation may fail for a handful of reasons. If a query fails to install, please check the GSQL log file. The default location for the GSQL log is here :
Scroll down to the last error in the log; it will identify what went wrong. This will show you any query errors that could be causing the failed installation. If you have created a user-defined function, you could potentially have a C++ compilation error.
The following example shows that the system's free memory is 69%:
I0520 23:40:09.845811 7828 gsystem.cpp:622] System_GSystem|GSystemWatcher|Health|ProcMaxGB|0|ProcAlertGB|0|CurrentGB|1|SysMinFreePct|10|SysAlertFreePct|30|FreePct|69
log:W0312 02:10:57.839139 15171 scheduler.cpp:116] System Memory in Critical state. Aborted.. Aborting.
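The health line above is a pipe-separated list of key and value pairs, so the free-memory figure can be extracted mechanically. The sketch below is illustrative only: free_pct_status is a hypothetical helper name, and the way it interprets the thresholds (below SysAlertFreePct means ALERT, below SysMinFreePct means CRITICAL) is an assumption based on the field names, not documented behavior.

```shell
# Hypothetical helper: read log lines on stdin and, for each
# GSystemWatcher health line, print a state and the FreePct value.
# The threshold interpretation is an assumption (see the text above).
free_pct_status() {
  awk -F'|' '
    /GSystemWatcher\|Health/ {
      pct = ""; min_pct = ""; alert_pct = ""
      for (i = 1; i < NF; i++) {
        if ($i == "SysMinFreePct")   min_pct   = $(i + 1)
        if ($i == "SysAlertFreePct") alert_pct = $(i + 1)
        if ($i == "FreePct")         pct       = $(i + 1)
      }
      if (pct == "") next
      state = "OK"
      if (pct + 0 < alert_pct + 0) state = "ALERT"
      if (pct + 0 < min_pct + 0)   state = "CRITICAL"
      print state, pct
    }'
}

# Usage:
#   grep GSystemWatcher /home/tigergraph/tigergraph/log/gpe/log.INFO | tail -n 1 | free_pct_status
```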
Using GraphStudio, you are able to see, from a high-level, a number of errors that may have occurred during the loading. This is accessible from the Load Data page. Click on one of your data sources, then click on the second tab of the graph statistics chart. There, you will be able to see the status of the data source loading, number of loaded lines, number of lines missing, and lines that may have an incorrect number of columns. (Refer to picture below.)
If you see there are a number of issues from the GraphStudio Load Data page, you can dive deeper to find the cause of the issue by examining the log files. Check the loading log located here:
Open up the latest .log file and you will be able to see details about each data source. The picture below is an example of a correctly loaded data file.
Here is an example of a loading job with errors :
From this log entry, you are able to see the errors being marked as lines with invalid attributes. The log will provide you the line number from the data source which contains the loading error, along with the attribute it was attempting to load to.
Normally, a single server running TigerGraph will be able to load from 100k to 1000k lines per second, or 100GB to 200GB of data per hour. This can be impacted by any of the following factors:
Loading Logic How many vertices/edges are generated from each line loaded?
Data Format Is the data formatted as JSON or CSV? Are multi-level delimiters in use? Does the loading job intensively use temp_tables?
Hardware Configuration Is the machine set up with HDD or SSD? How many CPU cores are available on this machine?
Network Issue Is this machine doing local loading or remote POST loading? Any network connectivity issues?
Size of Files How large are the files being loaded? Many small files may decrease the performance of the loading job.
High Cardinality Values Being Loaded to String Compress Attribute Type How diverse is the set of data being loaded to the String Compress attribute?
To combat the issue of slow loading, there are also multiple methods:
If the computer has many cores, consider increasing the number of Restpp load handlers.
$ gadmin config entry RESTPP.Factory.HandlerCount
increase the number of handlers
save
$ gadmin --config apply
~/tigergraph/gstore and store them on separate disks.
Do distributed loading.
Do offline batch loading.
Combine many small files into one larger file.
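As an illustration of the last suggestion, many small CSV files can be concatenated into one file before loading. The sketch below assumes every input file has the same columns and begins with an identical header line; merge_csv is a hypothetical helper name.

```shell
# Hypothetical helper: merge CSV files into OUT, keeping only the header
# of the first input. Assumes each input starts with the same header line.
# usage: merge_csv OUT IN1 IN2 ...
merge_csv() {
  local out="$1"
  shift
  # FNR restarts at 1 for each file; NR keeps counting across files,
  # so "FNR == 1 && NR != 1" matches the header of every file but the first.
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@" > "$out"
}

# Usage (write the output outside the input glob so it is not re-read):
#   merge_csv /data/merged/person.csv /data/parts/person_*.csv
```

If your files have no header line, plain cat is enough.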
When a loading job seems to be stuck, here are things to check for :
GPE is DOWN
You can check the status of GPE with this command :
gadmin status gpe
If GPE is down, you can find the logs necessary with this command :
gadmin log -v gpe
Memory is full
Run this command to check memory usage on the system :
Disk is full
Check disk usage on the system :
Kafka is DOWN
You can check the status of Kafka with this command :
gadmin status kafka
If it is down, take a look at the log with this command :
Multiple Loading Jobs By default, the Kafka loader is configured to allow a single loading job. If you execute multiple loading jobs at once, they will run sequentially.
If the loading job completes, but data is not loaded, there may be issues with the data source or your loading job. Here are things to check for:
Any invalid lines in the data source file. Check the log file for any errors. If an input value does not match the vertex or edge type, the corresponding vertex or edge will not be created.
Using quotes in the data file may cause interference with the tokenization of elements in the data file. Please check the GSQL Language Reference section under Other Optional LOAD Clauses. Look for the QUOTE parameter to see how you should set up your loading job.
Your loading job loads edges in the incorrect order. When you defined the graph schema, the from and to vertex order will affect the way you write the loading job. If you wrote the loading job in reversed order, the edges will not be created, possibly also affecting the population of vertices.
If you know what data you expect to see (number of vertices and edges, and attribute values), but the loaded data does not meet your expectations, there are a number of possible causes to investigate:
First, check the logs for important clues.
Are you reaching and reading all the data sources (paths and permissions)?
Is the data mapping correct?
Are your data fields correct? In particular, check data types. For strings, check for unwanted extra strings. Leading spaces are not removed unless you apply an optional token function to trim the extra spaces.
Do you have duplicate IDs, resulting in the same vertex or edge being loaded more than once? Is this intended or unintended? TigerGraph's default loading semantics is UPSERT. Check the loading documentation to make sure you understand the semantics in detail.
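To check a data file for duplicate IDs before loading, you can count how often each value appears in the ID column. This is a rough sketch: duplicate_ids is a hypothetical helper name, and it assumes a header line, a single-character separator, and no quoted fields that contain the separator.

```shell
# Hypothetical helper: list ID values that occur more than once in a
# delimited file, with their counts, sorted by ID.
# usage: duplicate_ids FILE [COLUMN] [SEPARATOR]
duplicate_ids() {
  local file="$1" col="${2:-1}" sep="${3:-,}"
  awk -F"$sep" -v c="$col" '
    NR > 1 { n[$c]++ }                          # skip the header line
    END { for (k in n) if (n[k] > 1) print k, n[k] }' "$file" | sort
}

# Usage: IDs in the first column of a comma-separated file.
#   duplicate_ids person.csv 1 ,
```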
Possible causes of a loading job failure are:
Loading job timed out If a loading job hangs for 600 seconds, it will automatically time out.
Port Occupied Loading jobs require port 8500. Please ensure that this port is open.
This section covers only the debugging of schema change jobs. For more information about schema changes, please read the Modifying a Graph Schema page.
Understanding what happens behind the scenes during a schema change.
DSC (Dynamic Schema Change) Drain - Stops the flow of traffic to RESTPP and GPE. If GPE receives a DRAIN command, it will wait 1 minute for existing running queries to finish. If the queries do not finish within this time, the DRAIN step will fail, causing the schema change to fail.
DSC Validation - Verification that no queries are still running.
DSC Apply - Actual step where the schema is being changed.
DSC Resume - Traffic resumes after schema change is completed. Resume will automatically happen if a schema change fails. RESTPP comes back online. All buffered query requests will go through after RESTPP resumes, and will use the new updated schema.
Failure when creating a graph
Global Schema Change Failure
Local Schema Change Failure
Dropping a graph fails
If GPE or RESTPP fail to start due to YAML error, please report this to TigerGraph.
If you encounter a failure, please take a look at the GSQL log file: gadmin log gsql. Please look for these error codes:
Error code 8 - The engine is not ready for the snapshot. Either the pre-check failed or snapshot was stopped. The system is in critical non-auto recoverable error state. Manual resolution is required. Please contact TigerGraph support.
Error code 310 - Schema change job failed and the proposed change has not taken effect. This is the normal failure error code. Please see next section for failure reasons.
Another schema change or a loading job is running. This will cause the schema change to fail right away.
GPE is busy. Potential reasons include :
Long running query.
Loading job is running.
Rebuild process is taking a long time.
Service is down. (RESTPP/GPE/GSE)
Cluster system clocks are not in sync. Schema change job will think the request is stale, causing this partition's schema change to fail.
Config Error. If the system is shrunk manually, schema change will fail.
You will need to check the logs in this order : GSQL log, admin_server log, service log.
Admin_server log files can be found here: ~/tigergraph/log/admin/. You will want to take a look at the INFO file.
The service log is the log of each respective service. gadmin log <service_name> will show you the location of these log files.
$ grep DSC ~/tigergraph/log/admin/INFO.20181011-101419.98774
I1015 12:04:14.707512 116664 gsql_service.cpp:534] Notify RESTPP DSCDrain successfully.
I1015 12:04:15.765108 116664 gsql_service.cpp:534] Notify GPE DSCDrain successfully.
I1015 12:04:16.788666 116664 gsql_service.cpp:534] Notify GPE DSCValidation successfully.
I1015 12:04:17.805620 116664 gsql_service.cpp:534] Notify GSE DSCValidation successfully.
I1015 12:04:18.832386 116664 gsql_service.cpp:534] Notify GPE DSCApply successfully.
I1015 12:04:21.270011 116664 gsql_service.cpp:534] Notify RESTPP DSCApply successfully.
I1015 12:04:21.692147 116664 gsql_service.cpp:534] Notify GSE DSCApply successfully.
E1107 14:13:03.625350 98794 gsql_service.cpp:529] Failed to notify RESTPP with command: DSCDrain. rc: kTimeout. Now trying to send Resume command to recover.
E1107 14:13:03.625562 98794 gsql_service.cpp:344] DSC failed at Drain stage, rc: kTimeout
E1107 14:14:03.814132 98794 gsql_service.cpp:513] Failed to notify RESTPP with command: DSCResume. rc: kTimeout
In this case, we see that RESTPP failed at the DRAIN stage. We need to first look at whether RESTPP services are all up. Then, verify that the time of each machine is the same. If all these are fine, we need to look at RESTPP log to see why it fails. Again, use the "DSC" keyword to navigate the log.
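To pull just the failing stage and return code out of a long admin log, you can match the "DSC failed at ... stage, rc: ..." line shown above. This is a minimal sketch that assumes the message keeps exactly that wording; dsc_failures is a hypothetical helper name.

```shell
# Hypothetical helper: print "STAGE RC" for every "DSC failed at" line.
# Reads the named files, or stdin when no file is given.
dsc_failures() {
  sed -n 's/.*DSC failed at \([A-Za-z]*\) stage, rc: \([A-Za-z]*\).*/\1 \2/p' "$@"
}

# Usage (the INFO file name is the one from the example above):
#   dsc_failures ~/tigergraph/log/admin/INFO.20181011-101419.98774
```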
To check the status of GSE, and all other processes, run gadmin status to show the status of key TigerGraph processes. As with all other processes, you can find the log file locations for GSE with the gadmin log command. Refer to the Location of Log Files for more information about which files to check.
$ gadmin log gse
GSE : /home/tigergraph/tigergraph/log/gse/GSE_1#1.out
GSE : /home/tigergraph/tigergraph/log/gse/log.INFO
If the GSE process fails to start, it is usually due to a license issue. Please check these factors:
gadmin status license This command will show you the expiration date of your license.
Single Node License on a Cluster If you are on a TigerGraph cluster, but using a license key intended for a single machine, this will cause issues. Please check with your point of contact to see which license type you have.
Graph Size Exceeds License Limit Two cases may apply for this reason. The first reason is you have multiple graphs but your license only allows for a single graph. The second reason is that your graph size exceeds the memory size that was agreed upon for the license. Please check with your point of contact to verify this information.
Usually in this state, GSE is warming up. This process can take quite some time depending on the size of your graph.
<INCLUDE PROCESS NAME SHOWING CPU USAGE TO VERIFY THE "WARM UP" STATE>
GSE crashes are likely due to an Out Of Memory issue. Use the dmesg -T command to check for any errors.
If your system has unexpectedly high memory usage, here are possible causes :
Length of ID strings is too long GSE will automatically deny IDs with a length longer than 16k. Memory issues could also arise if an ID string is too long ( > 500). One proposed solution to this is to hash the string.
Too Many Vertex Types Check the number of unique vertex types in your graph schema. If your graph schema requires more than 200 unique vertex types, please contact TigerGraph support.
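As an illustration of the hashing suggestion for long ID strings, the IDs can be replaced by a fixed-length digest before loading. This is only a sketch: hash_id is a hypothetical helper name, and the choice of SHA-256 (via the sha256sum utility from GNU coreutils) is an assumption, not a TigerGraph recommendation.

```shell
# Hypothetical helper: map an arbitrarily long ID string to a
# 64-character hexadecimal SHA-256 digest. Requires sha256sum.
hash_id() {
  printf '%s' "$1" | sha256sum | cut -d' ' -f1
}

# Usage:
#   hash_id "some-very-long-identifier-string"
```

Hashed IDs are no longer human-readable, so the same function must be applied to any ID you later use to look up a vertex.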
If your browser crashes or freezes (shown below), please refresh your browser.
If you suspect GraphStudio has crashed, first check gadmin status to verify all the components are in good shape. Two known causes of GraphStudio crashes are:
Huge JSON response User-written queries often return very large JSON responses. There is a JSON size limiter, but this could still potentially cause an issue. This issue can be mitigated by editing the maximum response size in this configuration entry:
$ gadmin config entry GUI.RESTPPResponseSizeLimit
GUI.RESTPPResponseSizeLimit [ 33554432 ]: The RESTPP response size limit bytes.
✔ New: 33554432
The default size is 33554432, so you would increase this value.
Very Dense Graph Visualization On the Explore Graph page, the "Show All Paths" query on a very dense graph is known to cause a crash.
To find the location of GraphStudio log files, use this command:
$ gadmin log gui
GUI : /home/tigergraph/tigergraph/log/gui/GUI#1.out
Enabling GraphStudio DEBUG mode will print out more information to the log files. To enable DEBUG mode, edit the following configuration entry:
$ gadmin config entry GUI.BasicConfig.Env
GUI.BasicConfig.Env [ ]: The runtime environment variables, separated by ';'
✔ New: DEBUG=true;
After editing the entry, run gadmin restart gui -y to restart the GraphStudio service. Follow the log file to see what is happening:
tail -f /home/tigergraph/tigergraph/log/gui/GUI#1.out
Repeat the error-inducing operations in GraphStudio and view the logs.
There is a list of known GraphStudio issues here.
If after taking these actions you cannot solve the issue, please reach out to email@example.com to request assistance.