Recovering MariaDB Galera Cluster after a Restart

http://galeracluster.com/documentation-webpages/restartingcluster.html

This is one of the most common problems you can run into with a MariaDB Galera Cluster: after the whole cluster is restarted, the nodes do not rejoin. If you look in the log, you may find a line like this:

It may not be safe to bootstrap the cluster from this node. It was not the last one to leave the cluster and may not contain all the updates. To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1.

Read on for the steps I took to recover mine.

First Attempt

If all nodes went down at some point, note which node was the last to go offline. That node must be the first one started, and it must be started with the --wsrep-new-cluster flag. On CentOS 6, and on any Linux that still uses SysV init, run service mysql start --wsrep-new-cluster. On CentOS 7, or any distribution with systemd, you need something like systemctl set-environment _WSREP_NEW_CLUSTER='--wsrep-new-cluster' && systemctl start mariadb && systemctl set-environment _WSREP_NEW_CLUSTER=''. The second set-environment call is important because systemd keeps the variable set until it is cleared: the next time the cluster goes down, this node is unlikely to be the last one offline again, and it is very easy to forget that the flag is still there.
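The systemd sequence above can be wrapped in a small function so the flag is cleared even when the start fails (with the plain && chain, a failed start leaves the variable set). This is only a sketch: bootstrap_node is a hypothetical helper name, and the unit name mariadb is taken from the command above. Recent MariaDB packages also ship a galera_new_cluster wrapper that does much the same thing.

```shell
# Sketch: bootstrap a Galera node under systemd and always clear the flag afterwards.
# bootstrap_node is a hypothetical name; pass the unit name if yours is not "mariadb".
bootstrap_node() {
    unit="${1:-mariadb}"
    # Hand the bootstrap flag to the unit through systemd's manager environment.
    systemctl set-environment _WSREP_NEW_CLUSTER='--wsrep-new-cluster' || return 1
    status=0
    systemctl start "$unit" || status=$?
    # Reset the variable whether or not the start succeeded, so a later
    # plain "systemctl start" does not bootstrap a second cluster by accident.
    systemctl set-environment _WSREP_NEW_CLUSTER=''
    return "$status"
}
```

Run bootstrap_node as root on the node that was the last to go offline, and only on that node.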

Go to the other server and restart MariaDB there. Check the log: you should see rsync messages and a WSREP entry that names the cluster and reports a new node joining it.
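Besides reading the log, you can ask a running node how many members it currently sees. The sketch below reads the wsrep_cluster_size status variable; cluster_size is a hypothetical helper name, and it assumes the mysql client can log in without extra options (for example through ~/.my.cnf).

```shell
# Sketch: print the number of nodes this server currently sees in the cluster.
# Assumes the mysql client can connect without extra options.
cluster_size() {
    # -N drops the header row, -B prints plain tab-separated output.
    mysql -N -B -e "SHOW STATUS LIKE 'wsrep_cluster_size'" | awk '{ print $2 }'
}
```

With two servers, cluster_size should print 2 once the second node has joined.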

If this doesn't fix your cluster, continue with the second attempt.

Second Attempt

Find the file named grastate.dat; it is usually in the /var/lib/mysql directory. This file holds the cluster state information for the node. Edit it on both servers and set the safe_to_bootstrap field to 1. Start the first server (the one from the first attempt). On the second server, run the galera_recovery script, which recovers the position the node needs to start from. It will output something like WSREP: Recovered position: <uuid>:<seqno>. Then start MariaDB with that position, using the --wsrep_start_position flag.
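The manual steps in this attempt can be sketched as three small functions: one to read a field from the state file, one to flip safe_to_bootstrap while keeping a backup, and one to pull the position out of a recovery log line. The function names are hypothetical, and the parsing assumes the usual "key: value" layout of the state file and a log line of the form shown above.

```shell
# Print one field (uuid, seqno or safe_to_bootstrap) from a Galera state file.
# $1 = field name, $2 = path to the state file
grastate_field() {
    awk -v key="$1:" '$1 == key { print $2 }' "$2"
}

# Set safe_to_bootstrap to 1 in place, keeping the original as <file>.bak.
# $1 = path to the state file
force_safe_to_bootstrap() {
    cp "$1" "$1.bak" &&
    sed 's/^safe_to_bootstrap:.*/safe_to_bootstrap: 1/' "$1.bak" > "$1"
}

# Read log lines on stdin and print the last recovered <uuid>:<seqno>.
recovered_position() {
    sed -n 's/.*Recovered position:\{0,1\} *//p' | tail -n 1
}
```

Comparing grastate_field seqno on each server is also a quick way to see which node holds the most recent state, when the value is not -1. Edit the file only while MariaDB is stopped on that server.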

In my case, I started to see rsync synchronization and after a while, my second server was online.

Big Data? More than 1 GB of data or slow link between Nodes?

So after starting the MariaDB daemon, you see the data syncing. The synchronization even finishes, but then you get an error like this: WSREP: Failed to read uuid:seqno from joiner script. What seems to happen is that systemd's start timeout expires and the parent process is killed. The solution is to extend that timeout. Edit the mariadb.service unit file and at the bottom of it, in the [Service] section, add or edit a line like TimeoutStartSec=9999999999999999min (TimeoutStartSec=0 disables the timeout altogether). This gives the sync enough time to finish. Remember to run systemctl daemon-reload to apply the change.
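As an alternative to editing the packaged unit file, the same setting can live in a systemd drop-in file, which survives package upgrades. The sketch below writes one; write_timeout_dropin and the file name timeout.conf are hypothetical, and the directory you pass in is assumed to be the drop-in directory for your unit (normally /etc/systemd/system/mariadb.service.d).

```shell
# Sketch: write a drop-in that disables the start timeout for the unit.
# $1 = drop-in directory, e.g. /etc/systemd/system/mariadb.service.d
write_timeout_dropin() {
    dir="$1"
    mkdir -p "$dir" || return 1
    # TimeoutStartSec=0 turns the start timeout off, so a long state
    # transfer is not killed halfway through.
    printf '[Service]\nTimeoutStartSec=0\n' > "$dir/timeout.conf"
}
```

After calling it as root, run systemctl daemon-reload as described above so systemd picks up the new file.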

Last Resort

Delete the grastate.dat and ib_log* files on the node that will not join. This forces a full resync. Then just restart MariaDB as usual.
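If you would rather not delete anything outright, a more cautious variant is to move those files out of the data directory instead. This is a sketch of that variant, not the exact step described above: set_aside_state is a hypothetical helper, it assumes the InnoDB log files are named ib_logfile*, and it should only be run while MariaDB is stopped on that node.

```shell
# Sketch: move the Galera state file and InnoDB log files out of the data directory.
# $1 = data directory (e.g. /var/lib/mysql), $2 = backup directory outside of it
set_aside_state() {
    datadir="$1"
    backup="$2"
    mkdir -p "$backup" || return 1
    for f in "$datadir"/grastate.dat "$datadir"/ib_logfile*; do
        # Skip patterns that matched nothing.
        [ -e "$f" ] || continue
        mv "$f" "$backup"/ || return 1
    done
}
```

For example, set_aside_state /var/lib/mysql /root/galera-state-backup, then start MariaDB as usual and let it resync.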

Good luck!
