Migrate from MongoDB to HAQM DocumentDB using the offline method

HAQM DocumentDB (with MongoDB compatibility) is a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. The HAQM DocumentDB Migration Guide outlines three primary approaches for migrating from MongoDB to HAQM DocumentDB: offline, online, and hybrid.

The offline migration approach is the fastest and simplest of the three, but it incurs the longest period of downtime. This approach is a good choice for proofs of concept, development and test workloads, and production workloads for which downtime is not of primary concern. In this first of a three-part series on migration, I use the offline approach to migrate data from a MongoDB replica set on HAQM EC2 to an HAQM DocumentDB cluster.

Offline migration overview

The following diagram illustrates an offline migration from MongoDB to HAQM DocumentDB.

This approach has five basic steps:

Stop application writes to the source MongoDB deployment.
Dump indexes and data to an EC2 instance using the mongodump tool.
(Optional) Restore the indexes to the HAQM DocumentDB cluster using the HAQM DocumentDB Index Tool.
Restore the data to the HAQM DocumentDB cluster using the mongorestore tool.
Change the connection string in the application to point to the HAQM DocumentDB cluster.

Preparing for migration

To perform this offline migration, I need the following three components:

A source MongoDB deployment
An EC2 instance for exporting and importing data
A target HAQM DocumentDB cluster

Before migrating to the HAQM DocumentDB cluster, I will stop application writes to the source MongoDB deployment. This step is required to ensure that no data changes in my migration source while migrating to the DocumentDB cluster. The source MongoDB deployment is a replica set deployed on HAQM EC2. To minimize the impact of the migration to any workload on this replica set, I export the data from a secondary instance.

Note: If your MongoDB source is using a version of MongoDB earlier than 3.6, you must upgrade your source deployment and your application drivers. They must be compatible with MongoDB 3.6 at a minimum to use HAQM DocumentDB.

You can determine the version of your source deployment by running the following in the MongoDB shell:

rs0:PRIMARY> db.version()
3.6.9
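The version check above can also be scripted. The following is a minimal sketch, assuming GNU sort (for the -V version-sort option) is available on the migration instance; the function name version_at_least is hypothetical and not part of any MongoDB tool.

```shell
# Minimum MongoDB version the note above requires for HAQM DocumentDB.
MIN_VERSION="3.6"

# Succeeds (exit 0) when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# 3.6.9 is the value db.version() returned in the example above.
# On a live system the value could come from:
#   mongo --quiet --eval 'db.version()' ...
if version_at_least "3.6.9" "$MIN_VERSION"; then
    echo "source version is new enough"
else
    echo "upgrade the source deployment first"
fi
```

A version sort is used rather than a plain string comparison so that a release such as 3.10 would rank above 3.6.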

Using the HAQM DocumentDB console, I create an HAQM DocumentDB cluster as the migration target, as shown following.

The time it takes to restore the data is in part determined by the size of the target cluster’s primary instance. To achieve the highest import throughput, I create a single r5.24xlarge instance, the largest size supported by HAQM DocumentDB in this AWS Region. Smaller instance sizes also work, but they might require more time to import the data. After my data is migrated, I can change my primary instance to a different instance size as needed. I can then add additional read replicas for read scaling and high availability.

The last component is the EC2 instance on which I will run the export and import processes (the migration instance). A key consideration is to ensure that my migration instance’s HAQM EBS volume is large enough to hold my exported data. You can obtain a rough estimate of a database’s size in bytes by running the db.stats() command in the mongo shell, and looking at the value of storageSize.
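To turn that estimate into a quick check, a sketch like the following compares the estimated size against the free space on the volume. The function names and the 20 percent safety margin are assumptions made for illustration. A margin is worth having because data that is compressed on disk in the source can take more room once dumped as uncompressed BSON.

```shell
# Succeeds when the available bytes ($2) cover the estimated dump size ($1)
# plus a safety margin in percent ($3, default 20; the default is an assumption).
has_enough_space() {
    needed=$1
    available=$2
    margin=${3:-20}
    required=$(( needed + needed * margin / 100 ))
    [ "$available" -ge "$required" ]
}

# Prints the free space, in bytes, on the file system that holds directory $1.
# `df -Pk` reports 1024-byte blocks in POSIX format; column 4 is "Available".
available_bytes() {
    df -Pk "$1" | awk 'NR == 2 { printf "%.0f\n", $4 * 1024 }'
}
```

For example, with the storageSize value in a shell variable named storage_size (a hypothetical name), has_enough_space "$storage_size" "$(available_bytes .)" succeeds only when the current volume has room for the dump plus the margin.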

The migration instance needs the mongo shell, along with the mongodump and mongorestore tools. At a minimum, I need to install the mongodb-org-shell and mongodb-org-tools packages. (See the MongoDB documentation for instructions.)

Because HAQM DocumentDB uses Transport Layer Security (TLS) encryption by default, I must also download the HAQM RDS certificate authority (CA) file to use the mongo shell to connect:

[ec2]$ curl -O http://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem

(You can also disable TLS. For more information, see Encrypting Connections Using TLS in the HAQM DocumentDB Developer Guide.)
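A failed download can leave behind an empty file or an error page instead of certificates. As a small sanity check, the sketch below counts the PEM certificate headers in the downloaded file. The function name count_pem_certs is hypothetical, and the check only confirms that the file looks like a PEM bundle; it does not validate the certificates themselves.

```shell
# Prints how many certificates the PEM file $1 appears to contain,
# by counting "BEGIN CERTIFICATE" header lines.
# The `--` stops grep from treating the leading dashes as options.
count_pem_certs() {
    grep -c -- '-----BEGIN CERTIFICATE-----' "$1"
}
```

Running count_pem_certs rds-combined-ca-bundle.pem should print a number greater than zero if the download succeeded.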

After installing the shell and tools, I ensure that the migration instance can communicate with both the source instance and the target HAQM DocumentDB cluster. I do this by connecting to each and running a ping command as follows.

To connect to the source replica set instance:

[ec2]$ mongo --host my-secondary-hostname \
--username myuser --password mypassword
…
rs0:PRIMARY> db.runCommand('ping')
{ "ok" : 1 }

To connect to the HAQM DocumentDB cluster:

[ec2]$ mongo --ssl --host docdb-cluster-endpoint \
--sslCAFile rds-combined-ca-bundle.pem --username myuser \
--password mypassword
…
rs0:PRIMARY> db.runCommand('ping')
{ "ok" : 1 }

If I had trouble connecting to either my source instance or my HAQM DocumentDB cluster, I would check the security group configuration. I would ensure that the EC2 instance has permission to connect to both on the MongoDB port in use (27017 by default). For additional troubleshooting options, see the HAQM DocumentDB documentation.
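Before digging into security groups, it can help to confirm whether the port is reachable at all. The following sketch assumes that bash and the coreutils timeout command are installed on the migration instance. The function name port_open and the 5-second limit are assumptions, and the check tests only TCP reachability, not authentication or TLS.

```shell
# Succeeds when a TCP connection to host $1 on port $2 (default 27017)
# opens within 5 seconds. Uses bash's /dev/tcp pseudo-device, so the
# connection attempt runs inside an explicit `bash -c`.
port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "${2:-27017}" 2>/dev/null
}
```

With the placeholder host names used in this post, port_open my-secondary-hostname and port_open docdb-cluster-endpoint 27017 each return success only if the migration instance can open the connection. A failure here points at network or security group configuration rather than credentials.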

Dumping the data using mongodump

With connectivity established, I can now export the data and indexes to the EC2 migration instance using the mongodump tool. I set the --readPreference option to secondary to force the dump to connect to a secondary replica set member. This step reduces the potential impact of the mongodump on the source deployment. To use the --readPreference option, I must connect to the replica set member using the form replicaSetName/replicaSetMember:

[ec2]$ mongodump --host rs0/myhost --username user \
--password password --db books --authenticationDatabase admin \
--readPreference secondary
2019-03-19T00:16:57.095+0000	writing books.j to
2019-03-19T00:16:57.095+0000	writing books.a to
2019-03-19T00:16:57.424+0000	done dumping books.j (100000 documents)
2019-03-19T00:16:57.445+0000	done dumping books.a (100000 documents)
…

The time it takes the data to export depends on the size of the source dataset, the speed of the network between the migration instance and the source, and the migration instance’s resources.
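The "done dumping" lines in the mongodump output record how many documents were exported from each collection, which is useful later for comparing against the restored data. The sketch below extracts those counts from a saved log. It assumes that the log lines have the format shown in the output above; the function name dump_counts and the file name dump.log are hypothetical.

```shell
# Reads mongodump log lines on standard input and prints one
# "<namespace> <document count>" pair per dumped collection, sorted by name.
# Lines that are not "done dumping" messages are ignored.
dump_counts() {
    sed -n 's/.*done dumping \([^ ]*\) (\([0-9]*\) documents*).*/\1 \2/p' | sort
}
```

mongodump writes its progress messages to standard error, so appending 2> dump.log to the mongodump command saves them, and dump_counts < dump.log then lists the counts. For the sample output above, this prints books.a 100000 and books.j 100000.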

Restoring indexes using the HAQM DocumentDB Index Tool

Although it’s not required for an offline migration, the HAQM DocumentDB Index Tool allows me to check the dumped indexes for compatibility and pre-create the indexes on the target HAQM DocumentDB cluster. Pre-creating indexes can dramatically reduce the overall restore time because the indexes can be populated in parallel while restoring, rather than serially after data is restored with mongorestore.

You can obtain this tool by cloning the HAQM DocumentDB Tools GitHub repository and following the instructions in README.md.

After installing the HAQM DocumentDB Index Tool, I can use it to verify that no index definition incompatibilities exist:

[ec2]$ python migrationtools/documentdb_index_tool.py --show-issues --dir <dump_dir>

Now I can create the indexes in the target HAQM DocumentDB cluster using the index tool:

[ec2]$ python migrationtools/documentdb_index_tool.py --restore-indexes --dir <dump_dir> --host docdb-cluster-endpoint --tls --tls-ca-file rds-combined-ca-bundle.pem --username myuser --password mypassword

Restoring data to the cluster using mongorestore

With the indexes pre-created, I restore the exported data to my target HAQM DocumentDB cluster using the mongorestore tool. I can use mongorestore to parallelize imports with the --numInsertionWorkersPerCollection option. Setting this option to the number of vCPUs on my HAQM DocumentDB cluster’s primary instance is a good place to start. This cluster has a primary instance size of r5.24xlarge, which has 96 vCPUs, so I use the value 96. Because I pre-created my indexes with the HAQM DocumentDB Index Tool, I pass the --noIndexRestore option so that I don’t try to build indexes twice:

[ec2]$ mongorestore --host docdb-cluster-endpoint --ssl --sslCAFile rds-combined-ca-bundle.pem --username myuser --password mypassword --numInsertionWorkersPerCollection 96 --noIndexRestore <dump_dir>

Note: If I had performed a full mongodump (that is, if I hadn’t used the --db option to specify a database to dump), I would need to remove the admin directory from the resulting dump directory. Otherwise I would encounter an error when attempting to restore to HAQM DocumentDB.
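One cautious way to handle the admin directory is to move it aside rather than delete it, so the original dump stays recoverable. The following is a sketch under that assumption; the function name exclude_admin_dir and the ".admin-excluded" suffix are hypothetical choices.

```shell
# Moves the admin database directory out of the mongodump output
# directory $1 into a sibling directory named "<dump_dir>.admin-excluded".
# Does nothing if the dump contains no admin directory.
exclude_admin_dir() {
    dump_dir=${1%/}   # strip one trailing slash, if present
    if [ -d "$dump_dir/admin" ]; then
        mv "$dump_dir/admin" "$dump_dir.admin-excluded"
        echo "moved $dump_dir/admin to $dump_dir.admin-excluded"
    else
        echo "no admin directory in $dump_dir"
    fi
}
```

Calling exclude_admin_dir with the dump directory before running mongorestore leaves only the application databases in place.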

Pointing to the HAQM DocumentDB cluster

After my data restore is complete, I am ready to change my application’s database connection string to use my HAQM DocumentDB cluster. For more information, see Working with HAQM DocumentDB Endpoints in the HAQM DocumentDB Developer Guide.
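As an illustration of what the new connection string might look like, the sketch below assembles a MongoDB URI from its parts. The function name docdb_uri is hypothetical, and the ssl=true and replicaSet=rs0 options are assumptions based on the TLS default and the rs0 prompt shown earlier. The exact options depend on the application driver, and many drivers also need the CA file configured separately.

```shell
# Prints a MongoDB connection URI for user $1, password $2, host $3,
# and port $4 (default 27017). The query options are assumptions; adjust
# them for the driver in use. The password is not URL-encoded here, so
# special characters in it would need encoding first.
docdb_uri() {
    printf 'mongodb://%s:%s@%s:%s/?ssl=true&replicaSet=rs0\n' \
        "$1" "$2" "$3" "${4:-27017}"
}
```

With the placeholder values from this post, docdb_uri myuser mypassword docdb-cluster-endpoint prints mongodb://myuser:mypassword@docdb-cluster-endpoint:27017/?ssl=true&replicaSet=rs0.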

Summary

In this first of a three-part series, I described the basic steps in the offline approach for migrating data from MongoDB to HAQM DocumentDB. You can find information on additional migration approaches, along with other considerations when migrating to HAQM DocumentDB, in the HAQM DocumentDB Migration Guide.

About the Author

Jeff Duffy is a NoSQL Specialist SA at HAQM Web Services, focusing on HAQM DocumentDB.

AWS Database Blog