HA Cluster Configuration

A TigerGraph system with High Availability (HA) is a cluster of server machines which uses replication to provide continuous service when one or more servers are not available or when some service components fail.

TigerGraph HA service provides load balancing when all components are operational, as well as automatic failover in the event of a service disruption.

The following terms describe the size and dimension of a HA cluster:

replication factor

Total number of instances of your data.

partitioning factor

Number of machines across which one copy of the data is distributed. In practice, the partitioning factor of a cluster is decided by dividing the total number of machines in the cluster by the replication factor.

cluster size

Total number of machines in the cluster.

If a HA cluster has a replication factor of 2, the cluster maintains two instances of the data, stored on separate machines.

If a HA cluster has a partitioning factor of 2, each instance of the data is distributed across two machines.

Replication and distribution both increase query throughput and greater system resiliency. There is no upper limit for either partitioning factor or replication factor.

For example, the following cluster has a replication factor of 2 and a cluster size of 10, which produces a partitioning factor of \(10 / 2 = 5\).

Replication factor vs. partitioning factor
Figure 1. Replication factor vs. partitioning factor

System Requirements

  • A highly available cluster has a cluster size of at least 4 and a replication factor of at least 2. A highly available cluster can withstand the failure of any node in the cluster.

  • If you choose to resize a cluster after installation, ensure every node across the cluster is running the same version of TigerGraph.

Configuring HA settings

There are two ways of configuring HA settings for a cluster:

Configure HA settings during installation

Configuring a HA cluster is part of platform installation.

During TigerGraph platform installation, specify the replication factor. The default value for replication factor is 1, which means there is no HA setup for the cluster. The user does not explicitly set the partitioning factor. Instead, the TigerGraph system will set the partitioning factor through the following formula:

partitioning factor = (number of machines / replication factor)

If the division does not produce an integer, some machines will be left unused. For example, if you install a 7-node cluster with a replication factor of 2, the resulting configuration is 2-way HA for a database with a partitioning factor of 3. One machine is left unused.

Resize a running cluster

In addition to configuring HA settings during installation, you can also resize a running cluster.

Cluster resizing allows you to perform the following:

  • Expand a cluster

  • Shrink a cluster

  • Repartition a cluster

During cluster resizing, you should expect a brief amount of cluster downtime, during which your cluster will not be able to respond to requests. The specific amount of downtime varies depending on your cluster size, machine memory and CPU, etc. For reference:

Table 1. Expansion downtime - Example 1
Before expansion After expansion

Replication factor

2

2

Partitioning factor

2

4

Data volume (per copy)

100 GB

Instance type

GCP EC2 with 8 vCPUs and 32 GB memory

Downtime

About 4 minutes

Table 2. Expansion downtime - Example 2
Before expansion After expansion

Replication factor

2

4

Partitioning factor

4

4

Data volume (per copy)

500 GB

Instance type

GCP EC2 with 8 vCPUs and 32 GB memory

Downtime

About 9 minutes