High Availability (HA) Overview

A TigerGraph system with High Availability (HA) is a cluster of server nodes that uses replication to provide continuous operation of most services in the event that one or more nodes are offline.

For example, an application’s predefined queries will continue to run, and file-based loading will continue with files resident on any of the working nodes.

Incorporating HA can yield a number of other system benefits, such as:

  • The query workload is distributed across all replicas.

    • This includes directing long-running queries to a specific replica with the GSQL-REPLICA header (see the sketch after this list).

  • Data loading operations are distributed across all nodes.

  • Individual nodes can fail without impacting query workloads.

    • If a query does fail during a node failure, the system adjusts to accommodate the failed node, which typically takes up to 30 seconds. Adopting client-side retry logic to cover this window is highly recommended (see the sketch after this list).

  • Individual nodes can be re-paved without impacting query workloads.

Re-paving a node is a planned process in which a node is intentionally taken offline for maintenance or updates and then brought back online for service.
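The following sketch illustrates both the GSQL-REPLICA header and client-side retry logic, using curl with a hypothetical host, graph, and query name; the replica number, retry count, and wait time are assumptions to adapt to your deployment:

for attempt in 1 2 3 4 5; do
  curl -s --fail -H "GSQL-REPLICA: 2" \
    "http://tigergraph-host:9000/query/MyGraph/myQuery?limit=10" && break
  sleep 5   # wait out the cluster's adjustment window between attempts
done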

HA Considerations

TigerGraph HA provides continuous operation of some, but not all, services. Please note the following exceptions and consider taking additional steps to maintain continuous operation or to restore service as quickly as possible.

If an HA system is operating with a failed node, unless the node is recovered, replaced, or the system is reconfigured to exclude that node, the following services are limited or unavailable:
  • A data partition slated for connector-based loading, such as from S3 files or via Kafka, cannot be loaded.

  • New queries cannot be installed.

    However, new interpreted queries and any existing queries can still be run.

  • Schema changes are not allowed.

  • Backup and export operations are not available and will be rejected.
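While a node is down, you can check which services are affected from any healthy node, for example with the verbose status command:

gadmin status -v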

As a workaround, if the failed node cannot be recovered (e.g., a hardware issue), full operation can be restored temporarily by removing the failed node. For example, a 5x2 cluster with one node removed would become a 4x2 + 1, where the 1 is a data partition that is no longer replicated.
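For instance, assuming the failed node is named m5 with IP address 10.0.0.5 (hypothetical values), the removal would be:

gadmin cluster remove m5:10.0.0.5

See Removal of Failed Nodes below for detailed instructions.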

3.9.2 and Below

In addition to the considerations above, in versions 3.9.2 and below, users cannot run GSQL queries while even a single node is down in a High Availability cluster.

In this case, as with the other considerations above, the failed node needs to be removed from the cluster via:

gadmin cluster remove <node_name>:<node_ip_address>

This issue is no longer present in versions 3.9.3 and 3.10.0.

High Availability Cluster Configuration

Here you will find detailed information about terminology, system requirements, and how to configure an HA cluster.

High Availability Support for GSQL Server

Learn how TigerGraph incorporates built-in HA for all critical internal components.

High Availability Support for Application Server

Here you will find detailed information about how TigerGraph supports native HA functionality for its application server, which serves the APIs for TigerGraph’s GUIs: GraphStudio and Admin Portal.

Cluster Commands

Here platform owners can learn advanced Linux commands that simplify platform operations and aid debugging on HA clusters.

Removal of Failed Nodes

Here you will find detailed instructions for removing a failed node.

File and Kafka Loader HA with Auto-Restart

Loading jobs for Kafka and local files will automatically restart the loader process if it exits unexpectedly and then continue to import data for the loading job.

This functionality is enabled by default. Users can disable it by setting LoaderRetryMax=0 through gadmin config set RESTPP.BasicConfig.Env. If not set, LoaderRetryMax defaults to 20, meaning a loading job will be restarted at most 20 times after failures.
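For example, to disable the auto-restart behavior (a minimal sketch; RESTPP.BasicConfig.Env holds environment variable settings, so check the current value first and append LoaderRetryMax=0 to it rather than overwriting any existing entries):

gadmin config get RESTPP.BasicConfig.Env
gadmin config set RESTPP.BasicConfig.Env "LoaderRetryMax=0"
gadmin config apply -y
gadmin restart restpp -y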