Thursday, August 27, 2009

Redundant management nodes in MySQL Cluster

Every time I teach the MySQL Cluster architecture, someone inevitably asks: "Isn't the management node (ndb_mgmd) a single point of failure?" The short answer: no. The management node is not a SPOF, because the cluster can continue running without it. However, it is inconvenient to have your management node down, because it handles several things:

  • Provides status information about the cluster and lets you use the ndb_mgm client for maintenance tasks such as taking a hot backup (see the example after this list)
  • Owns the cluster config file (so it must be running for a node to start)
  • Acts as arbitrator in case of a potential split-brain
  • Handles cluster logging
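
For example, checking node status and taking a hot backup are both one-liners with the ndb_mgm client (standard commands; the management server must be reachable):

  # print the status of all cluster nodes
  ndb_mgm -e "SHOW"

  # start an online (hot) backup of the data nodes
  ndb_mgm -e "START BACKUP"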


So while the management node can be down, it is nice to have a redundant one for failover. This is very easy to do:


  1. Add 2 [NDB_MGMD] sections to config.ini:
    [NDB_MGMD]
    #Id is required when defining multiple mgmt nodes
    Id=1
    Hostname=192.168.0.31

    [NDB_MGMD]
    Id=2
    Hostname=192.168.0.32

  2. Change the ndb-connectstring to include the IPs of both management nodes:
    [mysql_cluster]
    ndb-connectstring=192.168.0.31,192.168.0.32

  3. Make sure config.ini is present on both management nodes and that the two copies are identical, then start both ndb_mgmd nodes. (A full sketch of the resulting config and start commands follows this list.)
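
Putting the pieces together, a minimal complete config.ini could look like the sketch below. Only the two [NDB_MGMD] sections come from this setup; the data node and SQL node sections, the .33/.34 IPs, and the DataDir path are illustrative assumptions:

  [NDBD DEFAULT]
  NoOfReplicas=2

  [NDB_MGMD]
  Id=1
  Hostname=192.168.0.31

  [NDB_MGMD]
  Id=2
  Hostname=192.168.0.32

  [NDBD]
  Hostname=192.168.0.33
  DataDir=/var/lib/mysql-cluster

  [NDBD]
  Hostname=192.168.0.34
  DataDir=/var/lib/mysql-cluster

  [MYSQLD]
  [MYSQLD]

Start each management node against its local copy of the file, and point the data nodes at both management hosts:

  # on 192.168.0.31 and 192.168.0.32 (path is an assumption)
  ndb_mgmd -f /var/lib/mysql-cluster/config.ini

  # on each data node host
  ndbd --ndb-connectstring="192.168.0.31,192.168.0.32"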


That's it! The management nodes will act in an active-passive way and fail over as necessary. One caveat: do not run any management node on the same physical host as a data node. If that host fails, the data node and the arbitrator are lost at the same moment, and the surviving data nodes, unable to win arbitration, will shut themselves down rather than risk a split-brain.
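
A quick way to verify the failover: the ndb_mgm client accepts the same two-host connectstring, so it connects to whichever management node is currently alive:

  ndb_mgm --ndb-connectstring="192.168.0.31,192.168.0.32" -e "SHOW"

Stop ndb_mgmd on the first host and run the command again; it should answer via the second management node.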

Comments:

  1. Nice information. Can you tell me how the data nodes are configured here? Assume I have two data nodes; will both of them connect to both management nodes?

    Thanks

  2. The --ndb-connectstring option that the ndb data nodes are started with uses the same format as the ndb-connectstring option in my.cnf, e.g.: ndbd --ndb-connectstring="192.168.0.31,192.168.0.32"

  3. Thanks for your post. Can I ask you some questions about the cluster by email?

  4. Why does a cluster shutdown occur when the data node and the management node are on the same host, say a blade, and the full blade goes down? Is it because of a race condition? Can't the management node fail over first, so that the data node could talk to the second management node?
