Removal of Failed Nodes

If a node fails in a highly available (HA) cluster, as in a hardware failure (replication factor > 1), you can remove the failed node from the cluster while keeping all your data intact. After removal, use the cluster expansion feature to restore your cluster to its original size. Restoring is important as node removal does not redistribute the data in your cluster.

You should only consider removing a node when the node has failed. Removing a node from a cluster does not redistribute the data in the cluster. If all nodes are working in a cluster, and you want to reduce the size of the cluster, shrink the cluster instead.

Remove a single node

If only a single node fails in a HA cluster, you can always remove the node without incurring any data loss.

Prerequisites

  • Your cluster should have a replication factor greater than 1.

  • All services on the rest of the nodes in the cluster except the node to be removed are in RUNNING status: You can check service status by running gadmin status -v

    • If any service is down, try running gadmin start all --ignore-error to start them. If you cannot start them, please open a support ticket.

Procedure

To remove a failed node from a cluster, run the following command and replace <node_name> with the name of the node and node_ip_address with the internal IP address of the node:

$ gadmin cluster remove <node_name>:<node_ip_address>

Remove multiple nodes

If there are multiple node failures in a cluster, recovering all data might not be possible. However, as long as there remains one complete replica of your data, it is possible to remove the failed nodes and restore the cluster afterwards with all data intact.

Node removal is intended for machine failures. Concurrent failures of multiple nodes is an extremely unlikely event under normal circumstances. Therefore, the default Kafka configurations in TigerGraph do not prepare for multi-node failure.

You can modify the default Kafka configurations to prepare for multi-node failure. However, you must do so before the failures happen.

If you experience multi-node failures without modifying the default configurations and still want to remove the failed nodes, please open a support ticket.

Prerequisites

  • Your cluster’s Kafka configurations have been modified before the failures occurred.

    • To prepare for 2-node failures, run gadmin config set Kafka.MinInsyncReplicas 1 to set Kafka.MinInsyncReplicas to 1.

    • To prepare for concurrent failures of more nodes, run gadmin config set Kafka.Replica.TopicReplicaFactor <value> to satisfy this following condition:

      • Replica.TopicReplicaFactor - <number_of_concurrent_failures> >= MinInsyncReplicas

  • More than half of the nodes in your cluster are still operational.

  • Your cluster has a replication factor greater than 1.

  • All services in your cluster should be in RUNNING status. You can check service status by running gadmin status -v. If a service is down, and you cannot restart it, please open a support ticket.

  • There is at least 1 complete instance of the data. In other words, at least one replica does have a failed node.

Procedure

To remove failed nodes from a cluster, run the following command.

Replace <node_name> with the names of the nodes and node_ip_address with the internal IP addresses of the nodes. Connect the different node name-IP pairs with a comma (,):

$ gadmin cluster remove <node_name>:<node_ip_address>,<node_name>:<node_ip_address> ...