High Availability (HA) Overview
A TigerGraph system with High Availability (HA) is a cluster of server nodes that uses replication to provide continuous operation of most services in the event that one or more nodes are offline.
For example, an application’s predefined queries will continue to run and file-based loading will continue with files resident at any of the working nodes.
Incorporating HA can yield a number of other system benefits, such as:
- The query workload is distributed across all replicas.
  - This includes replicas used for long-running queries or targeted with the GSQL-REPLICA header (see the sketch following this list).
- Data loading operations are distributed across all nodes.
- Individual nodes can fail without impacting query workloads.
  - If a query does fail during a node failure, the system adjusts to accommodate the failed node (typically within 30 seconds). Adopting client-side retry logic is highly recommended (see the sketch following this list).
- Individual nodes can be re-paved without impacting query workloads.
Re-paving a node is an intentional, offline process: the node is taken offline for maintenance or updates and then brought back online for service.
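To illustrate the client-side retry recommendation, here is a minimal sketch that runs an installed query through the REST++ endpoint and retries on failure. It assumes the default REST++ port of 9000; the host name (tg-node1), graph name (MyGraph), query name (myQuery), and replica number are hypothetical placeholders to replace with your own values.

#!/usr/bin/env bash
# Minimal client-side retry sketch for an installed query (hypothetical names).
URL="http://tg-node1:9000/query/MyGraph/myQuery"
for attempt in 1 2 3; do
  # The optional GSQL-REPLICA header asks GSQL to route the query to a specific replica.
  if curl --silent --fail --max-time 120 -H "GSQL-REPLICA: 2" "$URL"; then
    exit 0                      # query succeeded
  fi
  echo "Attempt $attempt failed; waiting for the cluster to adjust..." >&2
  sleep 30                      # the system may take up to ~30 seconds to accommodate a failed node
done
exit 1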
HA Considerations
TigerGraph HA provides continuous operation of some but not all services. Please note the following exceptions and consider your options for taking additional steps to maintain continuous operation or to restore service as quickly as possible.
- A data partition slated for connector-based loading, such as loading S3 files or streaming via Kafka, cannot be loaded.
- New queries cannot be installed. However, new interpreted queries and any existing queries can still be executed.
- Schema changes are not allowed.
- Backup and export operations are not available and will be rejected.
- If the primary node is offline, access to GraphStudio is interrupted, but it resumes once the primary node is back online.
As a workaround, if the failed node cannot be recovered (e.g. due to a hardware issue), full operation can be restored temporarily by removing the failing node. For example, a 5x2 cluster with one node removed becomes a 4x2 + 1 cluster, where the 1 is a data partition that is no longer replicated.
3.9.2 and Below
In addition to the considerations above, in versions 3.9.2 and below, users will not be able to run a GSQL query when a single node is down in a High Availability cluster.
In this case, as with the other considerations above, the failed node must be removed from the cluster via:
gadmin cluster remove <node_name>:<node_ip_address>
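For illustration only, here is the same command with hypothetical values; the node name m3 and IP address 10.0.0.3 are placeholders, and you should first confirm which node is down, for example with gadmin status:

gadmin status -v
gadmin cluster remove m3:10.0.0.3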
This issue is no longer present in versions 3.9.3 and above.
High Availability Cluster Configuration
Here you will find detailed information about terminology, system requirements, and how to configure an HA cluster.
High Availability Support for GSQL Server
Learn how TigerGraph incorporates built-in HA for all critical internal components.
High Availability Support for Application Server
Here you will find detailed information about how TigerGraph supports native HA functionality for its application server, which serves the APIs for TigerGraph’s GUIs: GraphStudio and Admin Portal.
Cluster Commands
Here platform owners can learn advanced Linux commands that simplify platform operations and can be used when debugging HA clusters.
Removal of Failed Nodes
Here you will find detailed instructions for removing a failed node.