TigerGraph's architecture has no single point of failure (SPOF): fault tolerance is built in at each component level, and any component or server failure is handled seamlessly by TigerGraph's Continuous Availability.
However, when a failure spans the entire cluster, for example through the loss of a data center or another catastrophic event, Continuous Availability is not sufficient. Such an event is typically defined as a disaster, and customers need a disaster recovery (DR) plan to get services back up when one occurs.
Cross-Region Replication (CRR) is a new feature that will allow users to keep two or more TigerGraph clusters in different data centers or regions in sync.
For customers, cross-region replication will help deliver on the following business goals:
Disaster Recovery: Support disaster recovery with a dedicated remote cluster.
Enhanced Availability: Enhance inter-cluster data availability by synchronizing data across two clusters using read replicas.
Enhanced Performance: If the customer application is spread over different regions, CRR can take advantage of data locality to avoid network latency.
Improved System Load-balancing: CRR allows you to distribute computation load evenly across two clusters if the same data sets are accessed in both clusters.
Data Residency Compliance: CRR allows you to replicate data between different data centers or regions to satisfy compliance requirements. It can also be used to set up clusters in the same region to satisfy more stringent data sovereignty or localization requirements.
Beyond disaster recovery and enhanced business continuity, CRR also lets forward-thinking customers set up clusters for Blue/Green deployments to support agile upgrades.
Disaster Recovery support will include complete native support for all Data and Metadata replication including Automated schema changes, User management, and Query management.
Cross-region replication will be delivered in two phases:
Phase 1: Cross-region replication support for data from the primary cluster to the DR cluster. Metadata operations will not be supported. Phase 1 will be delivered in TigerGraph 3.1.
Phase 2: Complete native support for all data and metadata replication, including automated schema changes, user management, and query management. Phase 2 will be delivered in TigerGraph 3.2.
To support cross-region replication, primary and standby clusters need to have the same number of partitions. However, the clusters can have different numbers of replicas. Also, the clusters can be in the same region or data center.
The following setup is needed in order to perform a failover in the event of a disaster:
There are no configuration changes required for the primary cluster. This feature is designed not to impact the primary cluster operations in any way. However, the primary cluster should be running on TigerGraph Version 3.1.
The remote cluster needs to be set up to be used as a Disaster Recovery cluster. The following configurations should be set up by the operations team to enable the synchronization of data between primary and remote clusters.
    gadmin config set System.CrossRegionReplication.Enabled true
    # <primary_ips> is a comma-separated list of IPs
    gadmin config set System.CrossRegionReplication.PrimaryKafkaIPs <primary_ips>
    gadmin config set System.CrossRegionReplication.PrimaryKafkaPort <kafka_port>
    gadmin config set System.CrossRegionReplication.GpeTopicPrefix Primary
    gadmin config apply -y
    gadmin init kafka -y
    gadmin restart -y
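
Because these settings are applied with an immediate restart, it can help to review the exact commands before running them on the remote cluster. The following is a minimal sketch, not part of TigerGraph: `print_dr_setup` is a hypothetical helper that only prints the `gadmin` commands listed above with the placeholders filled in, and executes nothing.

```shell
#!/bin/sh
# Hypothetical helper (not a TigerGraph tool): prints the DR-side setup
# commands for review. It does not run gadmin.
#   $1  comma-separated Kafka IPs of the primary cluster
#   $2  Kafka port of the primary cluster
#   $3  GpeTopicPrefix (optional, defaults to "Primary")
print_dr_setup() {
    primary_ips=$1
    kafka_port=$2
    prefix=${3:-Primary}

    # Refuse to print a half-filled command list.
    if [ -z "$primary_ips" ] || [ -z "$kafka_port" ]; then
        echo "usage: print_dr_setup <primary_ips> <kafka_port> [prefix]" >&2
        return 1
    fi

    cat <<EOF
gadmin config set System.CrossRegionReplication.Enabled true
gadmin config set System.CrossRegionReplication.PrimaryKafkaIPs $primary_ips
gadmin config set System.CrossRegionReplication.PrimaryKafkaPort $kafka_port
gadmin config set System.CrossRegionReplication.GpeTopicPrefix $prefix
gadmin config apply -y
gadmin init kafka -y
gadmin restart -y
EOF
}
```

For example, `print_dr_setup 10.0.0.1,10.0.0.2 30002` prints the seven commands with those values substituted. The IPs and port here are made-up example values; use the ones from your own primary cluster.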
All data loaded into the primary cluster is copied and loaded into the DR cluster automatically. In TigerGraph 3.1, however, metadata operations are not replicated, so users must perform them manually on the DR cluster. Metadata operations include schema changes, installation of new queries, and user management.
For schema changes, users must apply every schema change on the DR cluster in the same order, after it has been applied successfully on the primary cluster. Until the corresponding schema change is applied on the DR cluster, data updates on the DR cluster pause. If a wrong schema change is applied, or changes are applied in the wrong order, the DR cluster will have data inconsistencies that result in loss of cluster services.
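
Since ordering mistakes are costly, operators may want a simple check before each manual schema change on the DR cluster. The sketch below assumes a bookkeeping convention that is not part of TigerGraph: the operator keeps one plain-text log per cluster, with one applied schema change name per line, in the order applied. The function name `dr_is_in_order` and the log files are hypothetical.

```shell
#!/bin/sh
# Hypothetical operator-side check (not a TigerGraph feature).
# dr_is_in_order PRIMARY_LOG DR_LOG
# Succeeds (exit 0) when DR_LOG is a line-for-line prefix of PRIMARY_LOG,
# i.e. the DR cluster has applied the same changes in the same order and
# has not applied anything the primary has not.
# Assumes every entry in both logs ends with a newline.
dr_is_in_order() {
    primary_log=$1
    dr_log=$2

    # Number of changes applied on the DR cluster so far.
    # The arithmetic expansion strips any padding wc may add.
    dr_count=$(( $(wc -l < "$dr_log") ))

    # Compare the first dr_count primary entries with the DR log.
    head -n "$dr_count" "$primary_log" | cmp -s - "$dr_log"
}
```

A non-zero exit status means the DR log has diverged from the primary's history and the next schema change should not be applied until the difference is understood.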
If a data center or region failure takes down the entire primary cluster, the customer can initiate a failover to the DR cluster. This is a manual process. Users must make the following configuration changes to promote the DR cluster to primary.
    gadmin config set System.CrossRegionReplication.Enabled false
    gadmin config set System.CrossRegionReplication.PrimaryKafkaIPs
    gadmin config set System.CrossRegionReplication.PrimaryKafkaPort
    gadmin config set System.CrossRegionReplication.GpeTopicPrefix Primary
    gadmin config apply -y
    gadmin restart -y
To set up a new DR cluster for the promoted primary cluster, apply the following configuration on the new DR cluster:
    gadmin config set System.CrossRegionReplication.Enabled true
    # <primary_ips> is a comma-separated list of IPs
    gadmin config set System.CrossRegionReplication.PrimaryKafkaIPs <primary_ips>
    gadmin config set System.CrossRegionReplication.PrimaryKafkaPort <kafka_port>
    # Note the doubled prefix: Primary.Primary
    gadmin config set System.CrossRegionReplication.GpeTopicPrefix Primary.Primary
    gadmin config apply -y
    gadmin init kafka -y
    gadmin restart -y
There is no limit on the number of times a cluster can fail over to another cluster. When designating a new DR cluster, make sure that you set the System.CrossRegionReplication.GpeTopicPrefix parameter correctly by adding an additional .Primary. For example, if your original cluster fails over once and the current cluster's GpeTopicPrefix is Primary, then the new DR cluster needs to have its GpeTopicPrefix set to Primary.Primary. If it needs to fail over again, the new DR cluster needs to have its GpeTopicPrefix set to Primary.Primary.Primary.
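
The prefix rule is easy to get wrong after several failovers, so it may be worth computing the value rather than typing it. The following is a minimal sketch of that rule; `gpe_topic_prefix` is a hypothetical helper, not a `gadmin` command. It takes the number of failovers that have already happened and prints the GpeTopicPrefix for the next DR cluster.

```shell
#!/bin/sh
# Hypothetical helper (not a TigerGraph tool).
# gpe_topic_prefix FAILOVERS
# Prints "Primary" followed by one extra ".Primary" per completed failover.
gpe_topic_prefix() {
    failovers=$1

    # Accept only a non-negative integer.
    case $failovers in
        ''|*[!0-9]*)
            echo "failover count must be a non-negative integer" >&2
            return 1
            ;;
    esac

    prefix=Primary
    i=0
    while [ "$i" -lt "$failovers" ]; do
        prefix="$prefix.Primary"
        i=$((i + 1))
    done
    printf '%s\n' "$prefix"
}
```

With no failovers yet, `gpe_topic_prefix 0` prints `Primary`, the value used for the first DR cluster. After one failover, `gpe_topic_prefix 1` prints `Primary.Primary`.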