MySQL NDB Cluster Backup & Restore In An Easy Way

In this post, we will see how easily a user can take an NDB Cluster backup and then restore it. NDB Cluster supports online backups, which are taken while transactions are modifying the data being backed up. In NDB Cluster, each backup captures all of the table content stored in the cluster.


A user can take a backup in the following states:


  • When the cluster is live and fully operational
  • When the cluster is live, but in a degraded state:
    • Some data nodes are down
    • Some data nodes are restarting
  • During read and write transactions

Users can restore backups in the following cluster environments:
  • Restore to the same physical cluster
  • Restore into a different physical cluster
  • Restore into a cluster with a different configuration, e.g. a backup taken from a 4-data-node cluster restored into an 8-data-node cluster
  • Restore into a different cluster version

Backups can be restored flexibly:
  • Restore can be run locally or remotely with respect to the data nodes
  • Restore can be run in parallel across data nodes
  • Can restore a partial set of the tables captured in the backup

Use cases of Backup & Restore:


  • Disaster recovery - setting up a cluster from scratch
  • Setting up NDB Cluster asynchronous replication
  • Recovery from user/DBA accidents such as dropping a table or database, unintended schema changes, etc.
  • During NDB Cluster software upgrade


Limitations:


  • Schemas and table data for tables stored using the NDB Cluster engine are backed up
  • Views, stored procedures, triggers and tables/schemas from other storage engines such as InnoDB are not backed up. Users need to use other MySQL backup tools such as mysqldump/mysqlpump to capture these
  • Only full backups are supported; no incremental or partial backups

NDB Cluster Backup & Restore concept in brief:



In NDB Cluster, tables are horizontally partitioned into a set of partitions, which are then distributed across the data nodes in the cluster. The data nodes are logically grouped into nodegroups. All data nodes in a nodegroup (up to four) contain the same sets of partitions, kept in sync at all times. Different nodegroups contain different sets of partitions. At any time, each partition is logically owned by just one node in one nodegroup, which is responsible for including it in a backup.

When a backup starts, each data node scans the set of table partitions it owns, writing their records to its local disk. At the same time, a log of ongoing changes is also recorded. The scanning and logging are synchronised so that the backup is a snapshot at a single point in time. Data is distributed across all the data nodes, and the backup occurs in parallel across all nodes, so that all data in the cluster is captured. At the end of a backup, each data node has recorded a set of files (*.data, *.ctl, *.log), each containing a subset of cluster data.

During restore, each set of files will be restored [in parallel] to bring the cluster to the snapshot state. The CTL file is used to restore the schema, the DATA file is used to restore most of the data, and the LOG file is used to ensure snapshot consistency.


Let’s look at NDB Cluster backup and restore feature through an example:

To demonstrate this feature, let’s create an NDB Cluster with the environment below.

NDB Cluster 8.0.22 version
  • 2 Management servers
  • 4 Data node servers
  • 2 mysqld servers
  • 6 API nodes
  • NoOfReplicas = 2
If you are wondering how to set up an NDB Cluster, please take a look at my previous blog here

Step 1:

Before we start the cluster, let’s modify the cluster configuration file (config.ini) for backup. When a backup starts, it creates 3 files (BACKUP-backupid.nodeid.Data, BACKUP-backupid.nodeid.ctl, BACKUP-backupid.nodeid.log) under a directory named BACKUP. By default, this BACKUP directory is created under each data node’s data directory. It is advisable to create this BACKUP directory outside the data directory. This can be done by adding the configuration variable ‘BackupDataDir’ to the cluster configuration file, i.e. config.ini

In the below example, I have assigned a path to ‘BackupDataDir‘ in config.ini:
BackupDataDir=/export/home/saroj/mysql-tree/8.0.22/ndbd/node1/data4
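For reference, here is a minimal sketch of how this might look in config.ini (the section layout, host name and other values below are illustrative, not taken from the post; BackupDataDir can be set per data node or once in [ndbd default]):

# config.ini (sketch)
[ndbd default]
NoOfReplicas=2

[ndbd]
# Host name below is a placeholder
HostName=cluster-test03
# Write backup files outside the data directory
BackupDataDir=/export/home/saroj/mysql-tree/8.0.22/ndbd/node1/data4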

Step 2:

Let’s look at the cluster from the management client (bin/ndb_mgm):
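One way to do this from the command line (a minimal sketch; the connect string reuses the management host names that appear later in this post) is the management client’s SHOW command:

shell> bin/ndb_mgm -c cluster-test01:1186,cluster-test02:1186 -e "SHOW"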



Step 3:

As the cluster is up and running, let’s create a database and a table, and run some transactions on it.
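The exact schema used here is only visible in the screenshots, so the statements below are an illustrative sketch (the column definitions are assumed); the important part is ENGINE=NDBCLUSTER, which stores the table in the data nodes:

mysql> CREATE DATABASE test1;
mysql> USE test1;
mysql> CREATE TABLE t1 (
    ->   id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ->   col1 VARCHAR(255),
    ->   col2 INT
    -> ) ENGINE=NDBCLUSTER;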




Let’s insert rows into table ‘t1’, either through SQL or through any other tool. Let’s continue the row insertion through SQL to have a significant amount of data in the cluster.
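One simple way to grow the table through SQL (a sketch based on the assumed schema above) is to seed a few rows and then repeatedly re-insert the table into itself, doubling the row count each time:

mysql> INSERT INTO t1 (col1, col2) VALUES ('row-a', 1), ('row-b', 2), ('row-c', 3);
mysql> INSERT INTO t1 (col1, col2) SELECT col1, col2 FROM t1;   -- repeat until the table is big enough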


Let’s check the row count of table ‘t1’. From the below image, we can see that table ‘t1’ has 396120 rows in it.
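The count shown in that screenshot comes from a simple query such as:

mysql> SELECT COUNT(*) FROM t1;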


Step 4:

Now issue a backup command from the management client (bin/ndb_mgm) while some transactions on table ‘t1’ are going on. We will delete rows from table ‘t1’ and issue a backup command in parallel.
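The actual delete statement is only shown in the screenshots; one way to generate this kind of concurrent write activity (a sketch, with the row limit chosen to match the before/after counts reported in this post) is:

mysql> DELETE FROM t1 LIMIT 90000;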



While the delete operations are going on, issue a backup command from the management client:
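The backup is started with the START BACKUP command in ndb_mgm. A minimal form is shown below; WAIT COMPLETED is the default wait behaviour and is spelled out only for clarity:

ndb_mgm> START BACKUP WAIT COMPLETED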


Let’s check the new row count of table ‘t1’ after all the delete operations have finished. From the below image, we can see that table ‘t1’ now has 306120 rows.


Let’s look at the files the backup created. As we have assigned a path for the backup files, let’s discuss these files briefly.


From the above image, we can see that, for each backup, one backup directory is created (BACKUP-backupid) and under each backup directory, 3 files are created. These are:

BACKUP-backupid-0.node_id.Data (BACKUP-1-0.1.Data):

The above file contains most of the data stored in the table fragments owned by this node. In the above example, the first 1 is the backup id, 0 is a hardcoded value reserved for future use, and the trailing 1 is the node id of data node 1.

BACKUP-backupid.node_id.ctl (BACKUP-1.1.ctl):


The above file contains the table metadata, i.e. table and index definitions.

BACKUP-backupid.node_id.log (BACKUP-1.1.log):

This file contains all the row changes that happened to the tables while the backup was in progress. These logs are executed during restore, either rolled forward or rolled back, depending on whether the backup was taken with snapshot end or snapshot start.
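Which of the two applies is controlled by the snapshot option passed to START BACKUP (shown here as a sketch; SNAPSHOTEND is the default):

ndb_mgm> START BACKUP SNAPSHOTSTART
    (backup is consistent with the moment the backup started; the log is rolled back during restore)
ndb_mgm> START BACKUP SNAPSHOTEND
    (backup is consistent with the moment the backup completed; the log is rolled forward during restore; this is the default)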

Note:
A user can run the restore from anywhere, i.e. it doesn’t have to be run from any particular data node. ndb_restore is an NDB API client program, so it can run on any host that can access the cluster.

Step 5:

Upon successful completion of a backup, the output looks like the one below:


From the above image, Node 1 is the master node that initiated the backup, node 254 is the management node on which the START BACKUP command was issued, and Backup 1 is the first backup taken. #LogRecords of 30000 indicates that while the backup was in progress some transactions were also running on the same table. #Records shows the number of records captured across the cluster.

The user can also see the backup status from the “cluster log”, as shown below:

2021-01-12 15:00:04 [MgmtSrvr] INFO -- Node 1: Backup 1 started from node 254
2021-01-12 15:01:18 [MgmtSrvr] INFO -- Node 1: Backup 1 started from node 254 completed. StartGCP: 818 StopGCP: 855 #Records: 306967 #LogRecords: 30000 Data: 5950841732 bytes Log: 720032 bytes

So this concludes our NDB Cluster backup procedure.

Step 6:

We will now try to restore the data from the backup taken above. Let’s shut down the cluster, clean up all the files except the backup files, and then start the cluster with the --initial option (i.e. with no data).
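A sketch of that sequence (host names follow the earlier examples; the exact commands and paths depend on your own setup):

shell> bin/ndb_mgm -c cluster-test01:1186 -e "SHUTDOWN"
    (stops all data and management nodes; then remove the old node file systems, keeping the BACKUP-* directories)
shell> bin/ndb_mgmd -f config.ini --reload
shell> bin/ndbmtd -c cluster-test01:1186,cluster-test02:1186 --initial
    (run ndb_mgmd on each management host and ndbmtd --initial on each data node host)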

Let’s restore the backup to a different cluster. From the below image, we can see that the data node ids are different from the cluster where the backup was taken.



Now let’s see whether our database ‘test1’ exists in the cluster after the initial start.
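This can be checked from any SQL node, for example:

mysql> SHOW DATABASES;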


From the above image, we can see that database ‘test1’ is not present. Now let’s start our restore process from backup image #1 (BACKUP-1).

The NDB restore works as follows:


  • It first restores the metadata from the *.ctl file so that all the tables/indexes can be recreated in the database.
  • Then it restores the data files (*.Data), i.e. inserts all the records into the tables in the database.
  • At the end, it executes all the transaction logs (*.log), rolling them forward or back, to make the database consistent.
  • Since the restore would fail while restoring the unique and foreign key constraints taken from the backup image, the user must disable the indexes at the beginning and, once the data restore is finished, rebuild the indexes again.
Step 7:

Let’s start the restoration of the metadata.

The metadata restore, index disabling and data restore can be executed in one go, or they can be done serially. The restore command can be issued from any data node, or from a non-data node as well.

In this example, I am issuing the metadata restore and index disabling from Data Node 1, and only once. Upon successful completion, I will issue the data restore.

Data Node 1:

shell> bin/ndb_restore -n node_id -b backup_id -m --disable-indexes --ndb-connectstring=cluster-test01:1186,cluster-test02:1186 --backup_path=/path/to/backup/directory



-n: node id of the data node from which the backup was taken (not to be confused with the data node id of the new cluster).
-b: backup id (we have taken backup id as ‘1’)
-m: meta data restoration (recreate table/indexes)
--disable-indexes: disable restoration of indexes during restore of data
--ndb-connectstring (-c): connect string pointing to the management nodes of the cluster.
--backup_path: path to the backup directory where backup files exist.

The result of the above metadata restore from data node 1 is shown below:


Let’s start the data restore on the Data Node 1. 

Data Node 1:
shell> bin/ndb_restore -n node_id -b backup_id -r --ndb-connectstring=cluster-test01:1186,cluster-test02:1186 --backup_path=/path/to/backup/directory

Below, I have captured the logs from the data restore run, both as it started and at the end.


From the above image, we can see that the restore was successful. Restore skips restoration of system table data. The system tables referred to here are tables used internally by NDB Cluster, so they should not be overwritten by the data from a backup. Backup data is restored in fragments, so whenever a fragment is found, ndb_restore checks whether it belongs to a system table. If it does, ndb_restore skips restoring it and prints a 'Skipping fragment' log message.

Let’s finish the remaining data restores from the other data nodes. These data restores can be run in parallel to minimise the restore time. Here, we don’t have to pass -m or --disable-indexes to the restore command again, as this needs to be done only once. The first restore has already created the tables, indexes, etc., so there is no need to recreate them; doing so would also fail. Once all the data has been restored into the table(s), we will enable the indexes and constraints again using the --rebuild-indexes option. Note that rebuilding the indexes and constraints like this ensures that they are fully consistent when the restore completes.


Data Node 2:
shell> bin/ndb_restore -n node_id -b backup_id -r --ndb-connectstring=cluster-test01:1186,cluster-test02:1186 --backup_path=/path/to/backup/directory

Data Node 3:
shell> bin/ndb_restore -n node_id -b backup_id -r --ndb-connectstring=cluster-test01:1186,cluster-test02:1186 --backup_path=/path/to/backup/directory

Data Node 4:
shell> bin/ndb_restore -n node_id -b backup_id -r --ndb-connectstring=cluster-test01:1186,cluster-test02:1186 --backup_path=/path/to/backup/directory


ndb_restore is an NDB API client program, so it needs API slots to connect to the cluster. Since we initiated 3 ndb_restore programs in parallel from the data nodes with ids 4, 5 and 6, we can see from the below image that ndb_restore took API node ids 47, 48 and 49.
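Those free slots come from [api]/[mysqld] sections in config.ini that are not tied to a running SQL node. A sketch of how spare slots are typically provisioned is shown below (the number of slots and host names here are illustrative; the post’s actual configuration is not shown):

# Slots for the two SQL nodes (host names are placeholders)
[mysqld]
HostName=cluster-test05
[mysqld]
HostName=cluster-test06

# Empty [api] sections provide free slots that any NDB API client,
# including ndb_restore, can pick up
[api]
[api]
[api]
[api]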


Let’s see the results from the remaining data nodes.


Since all the ndb_restore programs finished successfully, we can see that the API node ids they had taken to connect to the cluster have been released.


The last step is to rebuild the indexes. This can also be done from any data node or from a non-data node, but only once.

Data Node 1:
shell> bin/ndb_restore -n node_id -b backup_id --rebuild-indexes --ndb-connectstring=cluster-test01:1186,cluster-test02:1186 --backup_path=/path/to/backup/directory

--rebuild-indexes: It enables rebuilding of ordered indexes and foreign key constraints.




Step 8:

So we have finished our restoration steps. Let’s check the database, the table, the row count in the table, etc.
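A quick verification from an SQL node might look like this (the expected count comes from Step 4):

mysql> SHOW DATABASES;
mysql> SELECT COUNT(*) FROM test1.t1;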



So database ‘test1’ has been recreated.


Now we can see that table ‘t1’ has been created and the row count is 306120, which matches our backup image (see Step 4).

So this concludes our NDB Cluster backup and restore feature. There are many more options a user can pass to both the backup (START BACKUP) and restore (ndb_restore) programs based on their requirements. In the above example, I have selected the minimum basic options a user might need for backup and restore. For more information on these options, please refer to the NDB Cluster reference manual here.
