Disaster Recovery Support

Overview:

TigerGraph architecture is built with no Single Point of Failure (SPOF). This provides fault tolerance built at each component level. Any component or server failure is handled seamlessly by TigerGraph's Continuous Availability. However, there are situations where the failures span the entire cluster due to loss of data center or any other catastrophic event, Continuous Availability will not be sufficient. Typically, such an event would be defined as ‘Disaster’. Customers would need a Disaster Recovery plan to get services back up in the event of a disaster.

Cross-Region Replication (CRR) is a new feature that will allow users to keep two or more TigerGraph clusters in different data centers or regions in sync.

For customers, Cross-region replication will help deliver on the following business goals:

  • Disaster Recovery: Support Disaster Recovery functionality with the use of a dedicated remote cluster

  • Enhanced Availability: Enhance Inter-cluster data availability by synchronizing data using Read Replicas across two clusters

  • Enhanced Performance: If the customer application is spread over different regions, CRR can take advantage of data locality to avoid network latency.

  • Improved System Load-balancing: CRR allows you to distribute computation load evenly across two clusters if the same data sets are accessed in both clusters.

  • Data Residency Compliance: Cross-Region replication allows you to replicate data between different data centers or Regions to satisfy compliance requirements. Additionally, this feature can be used to set up clusters in the same region to satisfy more stringent Data sovereignty or localization business requirements.

Besides Disaster recovery and enhanced business continuity, this will enable forward-thinking customers to set up the clusters as part of Blue/Green deployment purposes for agile upgrades.

Design:

Disaster Recovery support will include complete native support for all Data and Metadata replication including Automated schema changes, User management, and Query management.

Cross-region replication will be delivered in two phases:

  • Phase1: Cross-region replication support for data from Primary to DR cluster. Metadata operations will not be supported. Phase 1 will be delivered in TigerGraph 3.1.

  • Phase2: Complete native support for all Data and Metadata replication including Automated schema changes, User and Query management. Phase 2 will be delivered in TigerGraph 3.2.

User Impact and Changes:

Primary and StandBy clusters need to have the same number of partitions. However, the clusters can have different numbers of replicas. Also, the clusters can be in the same region or data center.

Primary Cluster setup:

There are no configuration changes required for the primary cluster. This feature is designed not to impact the primary cluster operations in any way. However, the primary cluster should be running on TigerGraph Version 3.1.

DR Cluster setup

The remote cluster needs to be set up to be used as a Disaster Recovery cluster. The following configurations should be set up by the operations team to enable synchronization of data between primary and remote clusters. As mentioned before, there is no need to change any cluster configurations for the Primary cluster.

gadmin config set System.CrossRegionReplication.Enabled true
gadmin config set System.CrossRegionReplication.PrimaryKafkaIPs <primary_ips> // IP lists, comma separated
gadmin config set System.CrossRegionReplication.PrimaryKafkaPort <kafka_port>
gadmin config set System.CrossRegionReplication.GpeTopicPrefix Primary
gadmin config apply -y
gadmin init kafka -y
gadmin restart -y

Metadata Operations

As explained in the design overview, all the data loaded in the Primary cluster will be copied and loaded into the DR cluster automatically. In TigerGraph 3.1, users will also have to manually perform all the metadata operations. The Metadata operations include Schema change, Installation of new queries, and User Management operations.

With respect to Schema change, users will have to perform all the Schema change operations on the DR cluster in the same order after successfully applying schema change in the primary cluster . Without applying the corresponding schema change in DR cluster, data updates will pause in the DR cluster. Or if wrong schema change (or wrong order) is performed in the DR cluster, there will be data inconsistency issues resulting in loss of cluster services.

In TigerGraph 3.2, these Metadata commands will be automatically replicated from the primary cluster to the DR cluster.

Failover

In the event of catastrophic failure that has impacted the full cluster due to Data Center or Region failure, the customer can take steps to failover to the DR cluster. This is a manual process. Users will have to make the following configuration changes to upgrade the DR cluster to become the primary cluster.

gadmin config set System.CrossRegionReplication.Enabled false
gadmin config set System.CrossRegionReplication.PrimaryKafkaIPs
gadmin config set System.CrossRegionReplication.PrimaryKafkaPort
gadmin config set System.CrossRegionReplication.GpeTopicPrefix Primary gadmin config apply -y
gadmin restart -y

If we want to set up a new DR cluster over the upgraded primary cluster:

gadmin config set System.CrossRegionReplication.Enabled true
gadmin config set System.CrossRegionReplication.PrimaryKafkaIPs <primary_ips> // IP lists, comma separated
gadmin config set System.CrossRegionReplication.PrimaryKafkaPort <kafka_port>
gadmin config set System.CrossRegionReplication.GpeTopicPrefix Primary.Primary // Yes P.P
gadmin config apply -y
gadmin init kafka -y
gadmin restart -y