AWS for Games Blog
Game Developer’s Guide to HAQM DocumentDB (with MongoDB compatibility) Part Three: Operation Best Practices
Introduction
Continuing our discussion of HAQM DocumentDB best practices from part two, this post focuses on data protection, scaling, monitoring, and cost optimization.
Protecting data
To protect your data stored in HAQM DocumentDB, enable the storage encryption option during cluster creation. Once enabled, encryption applies cluster-wide to all instances, logs, automated backups, and snapshots. HAQM DocumentDB handles encryption and decryption of your data transparently, with minimal impact on performance. You can use the default AWS managed key or bring your own AWS KMS key to encrypt your data.
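As a minimal sketch (the cluster name, credentials, and KMS key ARN are placeholders), enabling storage encryption at creation time with boto3 looks roughly like this:

```python
import boto3

docdb = boto3.client("docdb")

# Hypothetical identifiers. Omit KmsKeyId to use the default AWS managed key;
# encryption cannot be toggled after the cluster is created.
docdb.create_db_cluster(
    DBClusterIdentifier="game-cluster",
    Engine="docdb",
    MasterUsername="adminuser",
    MasterUserPassword="REPLACE_ME",
    StorageEncrypted=True,
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/your-key-id",
)
```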
HAQM DocumentDB supports role-based access control (RBAC), which should be used to enforce least privilege for read-only access to databases or collections, and for multi-tenant application designs. The article Introducing role-based access control for HAQM DocumentDB (with MongoDB compatibility) explains the concepts and capabilities of RBAC in HAQM DocumentDB.
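To illustrate least privilege, here is a hedged sketch of creating a read-only user with pymongo; the endpoint, credentials, and database names are placeholders:

```python
from pymongo import MongoClient

# Hypothetical endpoint and credentials. In HAQM DocumentDB, users are created
# in the admin database; the built-in "read" role scopes this user to
# read-only access on a single database.
client = MongoClient(
    "mongodb://adminuser:REPLACE_ME@mycluster.cluster-xxxxxxxx.us-east-1.docdb.amazonaws.com:27017/"
    "?tls=true&tlsCAFile=global-bundle.pem&retryWrites=false"
)
client["admin"].command(
    "createUser",
    "leaderboard_reader",
    pwd="REPLACE_ME",
    roles=[{"role": "read", "db": "game"}],
)
```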
Using AWS Secrets Manager, you can retrieve HAQM DocumentDB passwords programmatically and automatically rotate them to replace hardcoded credentials in your code as explained in How to rotate HAQM DocumentDB and HAQM Redshift credentials in AWS Secrets Manager.
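For example, instead of hardcoding credentials, an application can fetch them at startup. In this sketch the secret name is a placeholder, and the secret is assumed to store a JSON document with "username" and "password" keys:

```python
import json

import boto3

# Hypothetical secret name; rotation managed by Secrets Manager means the
# application always reads the current credentials.
secrets = boto3.client("secretsmanager")
secret = json.loads(
    secrets.get_secret_value(SecretId="prod/game/docdb")["SecretString"]
)
username, password = secret["username"], secret["password"]
```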
AWS CloudTrail integrates with HAQM DocumentDB, providing a record of database actions taken by users, roles, or other AWS services, and capturing all HAQM DocumentDB API calls. See Logging HAQM DocumentDB API Calls with AWS CloudTrail for details. In addition, you can opt in to the auditing feature of HAQM DocumentDB to record Data Definition Language (DDL), Data Manipulation Language (DML), authentication, authorization, and user management events to HAQM CloudWatch Logs in JSON document format. See Auditing HAQM DocumentDB Events for details.
Scaling
As discussed in part one, HAQM DocumentDB supports both vertical and horizontal scaling. Here we’ll dive deeper into read consistency and read preferences, scaling read and write traffic, dynamic read preferences, and asymmetric workloads.
Let’s start with read consistency and read preferences. Reads from the primary instance are strongly consistent, providing read-after-write consistency. Reads from replicas, on the other hand, are eventually consistent; replica lag is typically less than 100 ms. To scale your reads, use the replica instances by setting a readPreference of secondaryPreferred. With this setting, if a replica instance goes down, the read request is directed to the next available replica, and if no replicas are available, reads fall back to the primary instance.
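As a minimal sketch (the endpoint, credentials, and CA bundle path are placeholders), a pymongo connection that spreads reads across replicas might look like this:

```python
from pymongo import MongoClient

# Hypothetical cluster endpoint and credentials. readPreference=secondaryPreferred
# routes reads to replicas when available, falling back to the primary if none
# are reachable.
client = MongoClient(
    "mongodb://gameuser:REPLACE_ME@mycluster.cluster-xxxxxxxx.us-east-1.docdb.amazonaws.com:27017/"
    "?tls=true&tlsCAFile=global-bundle.pem"
    "&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false"
)

db = client["game"]
print(db["players"].find_one({"playerId": "p123"}))  # served by a replica when possible
```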
Now that we have a general understanding of how read consistency and preferences work, let’s talk about scaling out read traffic. Take the example of a three-instance cluster (primary and two read replicas) that needs to support higher read throughput. We can simply add read replicas, up to 15 per cluster. If your read preference is secondaryPreferred, the driver uses the new replicas automatically. Sometimes you may want to keep that default but override it on a per-query basis for stronger consistency. For example, while your default may point to a secondaryPreferred replica, you can construct a query that overrides that and reads from the primary instead.
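A per-query override in pymongo might look like the following sketch, which assumes the client from the previous example:

```python
from pymongo import ReadPreference

# with_options() overrides the connection-level read preference for this
# collection handle only; the query below is served by the primary, giving
# read-after-write consistency for just this call.
players_primary = db["players"].with_options(read_preference=ReadPreference.PRIMARY)
print(players_primary.find_one({"playerId": "p123"}))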
While it’s generally recommended to use homogeneous instances in a cluster, asymmetric workloads can justify a mixed configuration. For example, if your workload has on-demand or periodic analytic needs, such as running a report every month, you can right-size the cluster for the online workload and add a new replica for analytics. Run the reporting queries against the new replica, and when you are done, shut the instance down to save on cost.
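As a hedged sketch (cluster and instance names are placeholders), adding a temporary analytics replica with boto3 could look like this. Setting the lowest failover priority keeps the oversized node from ever becoming primary:

```python
import boto3

docdb = boto3.client("docdb")

# PromotionTier=15 is the lowest failover priority, so this analytics replica
# is the last candidate to be promoted to primary.
docdb.create_db_instance(
    DBInstanceIdentifier="game-cluster-analytics",
    DBInstanceClass="db.r6g.4xlarge",
    Engine="docdb",
    DBClusterIdentifier="game-cluster",
    PromotionTier=15,
)

# When the reporting run is finished, remove the instance to stop paying for it:
# docdb.delete_db_instance(DBInstanceIdentifier="game-cluster-analytics")
```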
Thus far we’ve talked about scaling reads, so let’s focus on scaling strategies for writes. Say your write throughput has increased, or you expect it to. The cluster currently runs three r6g.large instances, but we want to move to r6g.4xlarge instances. First, add three new r6g.4xlarge instances to the cluster, and to be safe, give the larger instances a higher failover priority (a lower promotion tier value). Note that when instances share the same promotion tier, HAQM DocumentDB favors the larger instance when selecting a new primary. Now for the magic: invoke a manual failover, and one of the larger instances will be promoted to primary. Once that occurs, you can safely delete the smaller r6g.large instances.
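The failover step can be triggered with boto3, as in this sketch (identifiers are placeholders):

```python
import boto3

docdb = boto3.client("docdb")

# Trigger a failover and promote a specific r6g.4xlarge replica to primary.
# In-flight connections are briefly dropped, so run this during a
# low-traffic window.
docdb.failover_db_cluster(
    DBClusterIdentifier="game-cluster",
    TargetDBInstanceIdentifier="game-cluster-4xl-1",
)
```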
Lastly, let’s finish off the topic of scaling with storage and I/O. Both storage and I/O scale automatically in HAQM DocumentDB. Storage grows in 10 GiB increments up to 128 TiB. If you are migrating workloads to HAQM DocumentDB and intend to delete some data, such as historical data that’s no longer needed, do so before you migrate to reduce cost.
Monitoring
Monitoring is rich in detail and features for many AWS services, and HAQM DocumentDB is no different. Let’s take a high-level look at the main areas that cover the breadth of monitoring for this service.
HAQM CloudWatch
HAQM DocumentDB publishes more than 50 operational CloudWatch metrics that you can monitor. CloudWatch lets you set alarms and send notifications when metrics cross predetermined thresholds. These metrics provide information about instances (BufferCacheHitRatio, FreeableMemory, and others), clusters (DBClusterReplicaLagMaximum, VolumeWriteIOPs, and others), storage (VolumeBytesUsed), and backups (SnapshotStorageUsed, TotalBackupStorageBilled, and others).
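As a sketch of an alarm on one of these metrics (the alarm name, cluster identifier, SNS topic ARN, and threshold are placeholders you would tune to your workload):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when maximum replica lag stays above 200 ms for three consecutive
# one-minute periods, notifying an SNS topic.
cloudwatch.put_metric_alarm(
    AlarmName="game-cluster-replica-lag",
    Namespace="AWS/DocDB",
    MetricName="DBClusterReplicaLagMaximum",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "game-cluster"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=200.0,  # milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:docdb-alerts"],
)
```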
If you wish to go deeper on HAQM DocumentDB CloudWatch metrics, Monitoring metrics and setting up alarms on your HAQM DocumentDB clusters covers it in great detail.
Profiler and Auditing
What is the profiler? The profiler helps you identify slow queries, discover opportunities for new indexes, and detect optimizations to existing indexes that improve query performance. You define a threshold, and any operation that takes longer than that threshold to run is logged to CloudWatch Logs. Use these logs to identify queries that are not using an index, or that are using indexes that are not optimal. Enabling the profiler also helps you troubleshoot slow-running queries. Read Profiling HAQM DocumentDB Operations for details.
What is auditing? Auditing allows you to log certain events that occur in HAQM DocumentDB. The events can be DDL events (authorization, managing users, creating indexes, etc.) and DML events (create, read, update, and delete). Since audit logs are written to CloudWatch Logs, you can alarm on specific operations that occur, for example, 10 incorrect login attempts in a single minute. Read Introducing DML auditing for HAQM DocumentDB for details.
The profiler and auditing are disabled by default, so you must activate and configure them before use. Use the following links to the developer guide for information on how to activate them; a scripted sketch follows the links.
How to start the profiler?
How to activate auditing?
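For reference, here is a hedged sketch of activating both features with boto3. The parameter group and cluster names are placeholders, and it assumes a custom cluster parameter group is already attached to the cluster:

```python
import boto3

docdb = boto3.client("docdb")

# Turn on the profiler (logging operations slower than 100 ms) and auditing
# in a custom cluster parameter group.
docdb.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="game-cluster-params",
    Parameters=[
        {"ParameterName": "profiler", "ParameterValue": "enabled", "ApplyMethod": "immediate"},
        {"ParameterName": "profiler_threshold_ms", "ParameterValue": "100", "ApplyMethod": "immediate"},
        {"ParameterName": "audit_logs", "ParameterValue": "enabled", "ApplyMethod": "immediate"},
    ],
)

# Export both log types so the records actually reach CloudWatch Logs.
docdb.modify_db_cluster(
    DBClusterIdentifier="game-cluster",
    CloudwatchLogsExportConfiguration={"EnableLogTypes": ["profiler", "audit"]},
)
```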
Performance Insights
Performance Insights adds to the existing HAQM DocumentDB monitoring features to illustrate your cluster’s performance and help you analyze any issues that affect it. With the Performance Insights dashboard, you can visualize the database load and filter it by waits, query statements, hosts, or applications. Performance Insights is included with HAQM DocumentDB instances and stores 7 days of performance history in a rolling window at no additional cost. Performance Insights is disabled by default and can be enabled per instance. For a quick demo, the video Getting started with HAQM DocumentDB Observability and Monitoring covers this feature.
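Since the feature is per instance, a hedged sketch of enabling it with boto3 (the instance name is a placeholder) might look like this:

```python
import boto3

docdb = boto3.client("docdb")

# Enable Performance Insights on a single instance; repeat for each instance
# you want covered by the dashboard.
docdb.modify_db_instance(
    DBInstanceIdentifier="game-cluster-primary",
    EnablePerformanceInsights=True,
)
```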
Event Subscriptions
Lastly, you may want to subscribe to events that occur within HAQM DocumentDB. HAQM DocumentDB groups these events into categories that you can subscribe to so that you can be notified when an event in that category occurs. What are the different categories of events generated by HAQM DocumentDB? Cluster, instance, cluster snapshot, and parameter group events. You can easily subscribe to all events in a category from all resources (for example all cluster events for all clusters) or a specific event in a category from a specific resource (for example failover events for the primary instance in the production cluster). These events are generated by default so no additional HAQM DocumentDB configuration is required. Read Using HAQM DocumentDB Event Subscriptions for details.
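As a sketch of the specific-event case (subscription name, SNS topic, and cluster identifier are placeholders):

```python
import boto3

docdb = boto3.client("docdb")

# Subscribe an SNS topic to failover events on one cluster. Omit SourceIds and
# EventCategories to receive every event of the source type instead.
docdb.create_event_subscription(
    SubscriptionName="game-cluster-failover-events",
    SnsTopicArn="arn:aws:sns:us-east-1:123456789012:docdb-alerts",
    SourceType="db-cluster",
    EventCategories=["failover"],
    SourceIds=["game-cluster"],
)
```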
Optimizing cost
HAQM DocumentDB provides per second billing for instances, with a 10-minute minimum billing period. To proactively manage the spend of your HAQM DocumentDB clusters, create billing alerts at thresholds of 50 percent and 75 percent of your expected bill for the month. Read Create a billing alarm for more information about creating billing alerts. You can track your costs at a granular level and understand how your costs are being allocated across resources by tagging your clusters and instances. To learn how to track the instance, storage, IOPS, and backup storage costs for your HAQM DocumentDB clusters, read Using cost allocation tags with HAQM DocumentDB.
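Tagging a cluster for cost allocation can be scripted, as in this sketch (the ARN and tag values are placeholders; HAQM DocumentDB resources use rds-prefixed ARNs):

```python
import boto3

docdb = boto3.client("docdb")

# Cost allocation tags must also be activated in the Billing console before
# they appear in cost reports.
docdb.add_tags_to_resource(
    ResourceName="arn:aws:rds:us-east-1:123456789012:cluster:game-cluster",
    Tags=[
        {"Key": "team", "Value": "game-backend"},
        {"Key": "environment", "Value": "production"},
    ],
)
```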
To help manage costs for non-production environments, you should temporarily stop all the instances in your cluster (for up to seven days) when they aren’t needed, and restart them to resume your work. While your cluster is stopped, you are only charged for storage, manual snapshots, and automated backup storage within your specified retention window. You aren’t charged for any instance hours. See Stopping and Starting an HAQM DocumentDB Cluster for details.
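A sketch of scripting this for a dev/test cluster (the cluster name is a placeholder):

```python
import boto3

docdb = boto3.client("docdb")

# Stop a dev/test cluster overnight or on weekends; note that a stopped
# cluster restarts automatically after seven days.
docdb.stop_db_cluster(DBClusterIdentifier="game-cluster-dev")

# ...and bring it back when work resumes.
docdb.start_db_cluster(DBClusterIdentifier="game-cluster-dev")
```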
Compute and storage are decoupled in HAQM DocumentDB by design. Data is replicated six ways across three Availability Zones, providing highly durable storage regardless of the number of instances in the cluster. We recommend three or more instances for production clusters to achieve high availability. For non-production environments, you can reduce cost by running a single instance if downtime can be tolerated.
Both the Time to Live (TTL) indexes and change streams features incur additional I/Os when data is read, inserted, updated, and deleted. If you are not using these features in your application, deactivate them to reduce costs.
As your data grows, you should consider implementing an appropriate data archival strategy to maintain only operational data in your cluster and archive infrequently accessed data to other low-cost storage options such as HAQM Simple Storage Service (HAQM S3). Read Optimize data archival costs in HAQM DocumentDB using rolling collections to understand how to implement data archival strategy for HAQM DocumentDB.
As we round the corner on optimizing cost, let’s take a look at backups. First off, you cannot turn off backups. The minimum backup retention period is one day and the maximum is thirty-five days. As with other familiar backup best practices, we recommend setting the retention period based on your Recovery Point Objective (RPO). The backup window you define is also a useful trigger for other automated processes, such as building development and test clusters from the snapshots created. Lastly, it can take up to five minutes for live data to reach the backup, which means you can restore to any point in time within your backup retention period, up to roughly five minutes before the present.
Examples
To help you get started, we have included links to code samples. These include how to connect to HAQM DocumentDB in multiple popular programming languages, code from blogs, and AWS Lambda samples.
Connecting Programmatically to HAQM DocumentDB
Lambda function samples
Code samples from other blogs
Main HAQM DocumentDB samples GitHub repository
Summary
In this post, we’ve discussed important best practices for managing and optimizing HAQM DocumentDB clusters. Gaming customers are adopting HAQM DocumentDB to simplify the design and management of the backend database engine that powers their amazing games. By applying the best practices outlined in this post, you’ll have a head start on a successful implementation of HAQM DocumentDB.
To learn more about HAQM DocumentDB, you can follow our videos for HAQM DocumentDB Insider Hour, where you can find detailed information on the well-architected lens, workshops, migration, and new launches. There are also HAQM DocumentDB tools on GitHub that you can use for your evaluation and migration efforts.