Removal of Failed Nodes
If a node fails in a highly available (HA) cluster, as in a hardware failure (replication factor > 1), you can remove the failed node from the cluster while keeping all your data intact. After removal, use the cluster expansion feature to restore your cluster to its original size. Restoring is important as node removal does not redistribute the data in your cluster.
You should only consider removing a node when the node has failed. Removing a node from a cluster does not redistribute the data in the cluster. If all nodes are working in a cluster, and you want to reduce the size of the cluster, shrink the cluster instead.
Remove a single node
If only a single node fails in a HA cluster, you can always remove the node without incurring any data loss.
Prerequisites
-
Your cluster should have a replication factor greater than 1.
-
All services on the rest of the nodes in the cluster except the node to be removed are in
RUNNING
status: You can check service status by runninggadmin status -v
-
If any service is down, try running
gadmin start all --ignore-error
to start them. If you cannot start them, please open a support ticket.
-
Remove multiple nodes
If there are multiple node failures in a cluster, recovering all data might not be possible. However, as long as there remains one complete replica of your data, it is possible to remove the failed nodes and restore the cluster afterwards with all data intact.
Node removal is intended for machine failures. Concurrent failures of multiple nodes is an extremely unlikely event under normal circumstances. Therefore, the default Kafka configurations in TigerGraph do not prepare for multi-node failure.
You can modify the default Kafka configurations to prepare for multi-node failure. However, you must do so before the failures happen.
If you experience multi-node failures without modifying the default configurations and still want to remove the failed nodes, please open a support ticket. |
Prerequisites
-
Your cluster’s Kafka configurations have been modified before the failures occurred.
-
To prepare for 2-node failures, run
gadmin config set Kafka.MinInsyncReplicas 1
to setKafka.MinInsyncReplicas
to 1. -
To prepare for concurrent failures of more nodes, run
gadmin config set Kafka.Replica.TopicReplicaFactor <value>
to satisfy this following condition:-
Replica.TopicReplicaFactor - <number_of_concurrent_failures> >= MinInsyncReplicas
-
-
-
More than half of the nodes in your cluster are still operational.
-
Your cluster has a replication factor greater than 1.
-
All services in your cluster should be in
RUNNING
status. You can check service status by runninggadmin status -v
. If a service is down, and you cannot restart it, please open a support ticket. -
There is at least 1 complete instance of the data. In other words, at least one replica does have a failed node.
Procedure
To remove failed nodes from a cluster, run the following command.
Replace <node_name>
with the names of the nodes and node_ip_address
with the internal IP addresses of the nodes.
Connect the different node name-IP pairs with a comma (,
):
$ gadmin cluster remove <node_name>:<node_ip_address>,<node_name>:<node_ip_address> ...