Anomaly Detection in AWS Lambda using HAQM DevOps Guru’s ML-powered insights

Critical business applications are monitored in order to prevent anomalies from negatively impacting their operational performance and availability. HAQM DevOps Guru is a Machine Learning (ML) powered solution that aids operations by detecting anomalous behavior and providing insights and recommendations for how to address the root cause before it impacts the customer.

This post demonstrates how HAQM DevOps Guru detects an anomaly introduced by a critical AWS Lambda function deployment, and how its recommendations help you remediate that behavior.

Solution Overview

HAQM DevOps Guru lets you monitor resources at the Region level or at the AWS CloudFormation stack level. This post will demonstrate how to deploy an AWS Serverless Application Model (AWS SAM) stack, and then enable HAQM DevOps Guru to monitor the stack.

You will utilize the following services:

AWS Lambda
HAQM EventBridge
HAQM DevOps Guru

The architecture diagram shows an AWS SAM stack containing AWS Lambda and HAQM EventBridge resources, as well as HAQM DevOps Guru monitoring the resources in the AWS SAM stack.

Figure 1: HAQM DevOps Guru monitoring the resources in an AWS SAM stack

This post simulates a real-world scenario where an anomaly is introduced in the AWS Lambda function in the form of latency. While the AWS Lambda function execution time is within its timeout threshold, it is not at optimal performance. This anomalous execution time can result in larger compute times and costs. Furthermore, this post demonstrates how HAQM DevOps Guru identifies this anomaly and provides recommendations for remediation.

Here is an overview of the steps that we will conduct:

First, we will deploy the AWS SAM stack containing a healthy AWS Lambda function with an HAQM EventBridge rule to invoke it on a regular basis.
We will enable HAQM DevOps Guru to monitor the stack, which will show the AWS Lambda function as healthy.
After waiting for a period of time, we will make changes to the AWS Lambda function in order to introduce an anomaly and redeploy the AWS SAM stack. This anomaly will be identified by HAQM DevOps Guru, which will mark the AWS Lambda function as unhealthy, provide insights into the anomaly, and provide remediation recommendations.
After making the changes recommended by HAQM DevOps Guru, we will redeploy the stack and observe HAQM DevOps Guru marking the AWS Lambda function healthy again.

This post also explores Provisioned Concurrency for AWS Lambda functions and the best practice of reusing initialized variables across warm starts.

Pricing

Before beginning, note the costs associated with each resource. The AWS Lambda function will incur a fee based on the number of requests and duration, while HAQM EventBridge is free. With HAQM DevOps Guru, you only pay for the data analyzed. There is no upfront cost or commitment. Learn more about the pricing per resource here.

Prerequisites

To complete this post, you need the following prerequisites:

An AWS account. For this post, we utilize the account number 111111111111. We will conduct AWS Serverless Application Model (AWS SAM) stack operations and monitoring in this account.
Access to your local terminal with the AWS SAM command line interface (CLI) installed.
Access to your local terminal with the git CLI.
AWS credentials for enabling the AWS SAM CLI to make calls to AWS Services on your behalf. In this post, AWS SAM needs access to AWS CloudFormation.
An Integrated Development Environment (IDE) of choice installed on your local machine.

Getting Started

We will set up an application stack in our AWS account that contains an AWS Lambda function and an HAQM EventBridge rule. The rule will regularly trigger the AWS Lambda function, which simulates a high-traffic application. To get started, please follow the instructions below:

In your local terminal, clone the amazon-devopsguru-samples repository.

git clone http://github.com/aws-samples/amazon-devopsguru-samples.git

In your IDE of choice, open the amazon-devopsguru-samples repository.
In your terminal, change directories into the repository’s subfolder amazon-devopsguru-samples/generate-lambda-devopsguru-insights.

cd amazon-devopsguru-samples/generate-lambda-devopsguru-insights

Utilize the SAM CLI to conduct a guided deployment of lambda-template.yaml.

sam deploy --guided --template lambda-template.yaml
    Stack Name [sam-app]: DevOpsGuru-Sample-AnomalousLambda-Stack
    AWS Region [us-east-1]: us-east-1
    # Shows the resource changes to be deployed and requires a 'y' to initiate the deployment
    Confirm changes before deploy [y/N]: y
    # SAM needs permission to create roles to connect to the resources in your template
    Allow SAM CLI IAM role creation [Y/n]: y
    Save arguments to configuration file [Y/n]: y
    SAM configuration file [samconfig.toml]: samconfig.toml
    SAM configuration environment [default]: default

You should see a success message in your terminal, such as:

Successfully created/updated stack - DevOpsGuru-Sample-AnomalousLambda-Stack in us-east-1.

Enabling HAQM DevOps Guru

Now that we have deployed our application stack, we can enable HAQM DevOps Guru.

Log in to your AWS Account.
Navigate to the HAQM DevOps Guru service page.
Click “Get started”.
In the “HAQM DevOps Guru analysis coverage” section, select “Choose later”, then click “Enable”.

HAQM DevOps Guru analysis coverage menu which asks which AWS resources to analyze. The “Choose later” option is selected.

Figure 2.1: HAQM DevOps Guru analysis coverage menu

On the left-hand menu, select “Settings”
In the “DevOps Guru analysis coverage” section, click on “Manage”.
Select the “Analyze all AWS resources in the specified CloudFormation stacks in this Region” radio button.
The stack created in the previous section should appear. Select it, click “Save”, and then “Confirm”.

HAQM DevOps Guru analysis coverage menu which asks which AWS resources to analyze. The “Analyze all AWS resources in the specified CloudFormation stacks in this Region” option is selected and CloudFormation stacks are displayed to choose from.

Figure 2.2: HAQM DevOps Guru analysis coverage resource selection
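The console steps above can also be scripted. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; it uses the DevOps Guru `UpdateResourceCollection` API to add a CloudFormation stack to the analysis coverage. The helper names `build_coverage_request` and `enable_stack_coverage` are hypothetical and are not part of the sample repository.

```python
def build_coverage_request(stack_names):
    """Build the parameters for the DevOps Guru UpdateResourceCollection call.

    Hypothetical helper for illustration; it only assembles a dictionary.
    """
    if not stack_names:
        raise ValueError("at least one CloudFormation stack name is required")
    return {
        "Action": "ADD",  # "REMOVE" would take the stacks out of coverage
        "ResourceCollection": {
            "CloudFormation": {"StackNames": list(stack_names)}
        },
    }


def enable_stack_coverage(stack_names, region="us-east-1"):
    """Call DevOps Guru with the request built above (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("devops-guru", region_name=region)
    return client.update_resource_collection(**build_coverage_request(stack_names))
```

Calling `enable_stack_coverage(["DevOpsGuru-Sample-AnomalousLambda-Stack"])` would be the scripted counterpart of selecting the stack and clicking "Save". Keeping the request builder separate from the API call makes it possible to inspect the parameters without touching an AWS account.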

Before moving on to the next section, we must allow HAQM DevOps Guru to baseline the resources and benchmark the application’s normal behavior. For our serverless stack with two resources, we recommend waiting two hours before carrying out the next steps. When enabled in a production environment, depending upon the number of resources selected for monitoring, it can take up to 24 hours for HAQM DevOps Guru to complete baselining.

Once baselining is complete, the HAQM DevOps Guru dashboard, an overview of the health of your resources, will display the application stack, DevOpsGuru-Sample-AnomalousLambda-Stack, and mark it as healthy, shown below.

HAQM DevOps Guru Dashboard displays the system health summary and system health overview of each CloudFormation stack. The DevOpsGuru-Sample-AnomalousLambda-Stack is marked as healthy with 0 reactive insights and 0 proactive insights.

Figure 2.3: HAQM DevOps Guru Healthy Dashboard

Enabling SNS

If you would like to set up notifications upon the detection of an anomaly by HAQM DevOps Guru, then please follow these additional instructions.

HAQM DevOps Guru Specify an SNS topic menu which enables notifications for important DevOps Guru events. No SNS topics are currently configured.

Figure 3: HAQM DevOps Guru Specify an SNS topic
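As an alternative to the console, a notification channel can be registered programmatically. This is a rough sketch, assuming boto3 and an existing SNS topic; it relies on the DevOps Guru `AddNotificationChannel` API. The helper names, the topic name used in the usage note, and the loose ARN pattern are assumptions made for illustration only.

```python
import re

# Loose shape check for an SNS topic ARN. This pattern is an assumption for
# illustration, not an official validator.
_SNS_ARN = re.compile(r"^arn:aws[a-z-]*:sns:[a-z0-9-]+:\d{12}:[A-Za-z0-9_-]+(\.fifo)?$")


def build_notification_request(topic_arn):
    """Build the parameters for the DevOps Guru AddNotificationChannel call."""
    if not _SNS_ARN.match(topic_arn):
        raise ValueError(f"does not look like an SNS topic ARN: {topic_arn!r}")
    return {"Config": {"Sns": {"TopicArn": topic_arn}}}


def add_sns_channel(topic_arn, region="us-east-1"):
    """Register the SNS topic with DevOps Guru (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("devops-guru", region_name=region)
    return client.add_notification_channel(**build_notification_request(topic_arn))
```

For example, `add_sns_channel("arn:aws:sns:us-east-1:111111111111:devops-guru-alerts")` would register a hypothetical topic named `devops-guru-alerts` in the account used throughout this post.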

Invoking an Anomaly

Once HAQM DevOps Guru has identified the stack as healthy, we will update the AWS Lambda function with suboptimal code. This simulates a change to a critical business application that causes anomalous performance.

Open the amazon-devopsguru-samples repository in your IDE.
Open the file generate-lambda-devopsguru-insights/lambda-code.py
Uncomment lines 7-8 and save the file. These lines of code will produce an anomaly due to the function’s increased runtime.
Deploy these updates to your stack by running:

cd generate-lambda-devopsguru-insights
sam deploy --template lambda-template.yaml --stack-name DevOpsGuru-Sample-AnomalousLambda-Stack

Anomaly Overview

Shortly after, HAQM DevOps Guru will generate a reactive insight from the sample stack. This insight contains recommendations, metrics, and events related to anomalous behavior. View the unhealthy stack status in the Dashboard.

HAQM DevOps Guru Dashboard displays the system health summary and system health overview of each CloudFormation stack. The DevOpsGuru-Sample-AnomalousLambda-Stack is marked as unhealthy with 1 reactive insights and 0 proactive insights.

Figure 4.1: HAQM DevOps Guru Unhealthy Dashboard
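The same status can be checked without the console. The sketch below assumes boto3 and uses the DevOps Guru `ListInsights` API with an ongoing, reactive status filter. The function names are hypothetical, and pagination (`NextToken`) is deliberately ignored to keep the example short.

```python
def summarize_reactive_insights(response):
    """Reduce a ListInsights response to the fields of interest.

    Hypothetical helper; it only reads keys from the response dictionary.
    """
    summaries = []
    for insight in response.get("ReactiveInsights", []):
        stacks = (
            insight.get("ResourceCollection", {})
            .get("CloudFormation", {})
            .get("StackNames", [])
        )
        summaries.append(
            {
                "name": insight.get("Name"),
                "severity": insight.get("Severity"),
                "status": insight.get("Status"),
                "stacks": stacks,
            }
        )
    return summaries


def fetch_ongoing_reactive_insights(region="us-east-1"):
    """Query DevOps Guru for ongoing reactive insights (needs AWS credentials)."""
    import boto3  # imported lazily so the summarizer works without boto3

    client = boto3.client("devops-guru", region_name=region)
    response = client.list_insights(StatusFilter={"Ongoing": {"Type": "REACTIVE"}})
    return summarize_reactive_insights(response)
```

An empty list from `fetch_ongoing_reactive_insights()` would correspond to a healthy dashboard; one entry listing the sample stack would correspond to the unhealthy state shown above.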

By clicking on the “Ongoing reactive insight” within the tile, you will be brought to the Insight Details page. This page contains an array of useful information to help you understand and address anomalous behavior.

Insight overview

Utilize this section to get a high-level overview of the insight. You can see that the status of the insight is ongoing, 1 AWS CloudFormation stack is affected, the insight started on Sept-08-2021, it does not have an end time, and it was last updated on Sept-08-2021.

HAQM DevOps Guru Insight Details page has multiple information sections. The Insight overview is the first section which displays the status is ongoing, there is 1 affected stack, the start time and last updated time. The end time is empty as the insight is ongoing.

Figure 4.2: HAQM DevOps Guru Ongoing Reactive Insight Overview

Aggregated metrics

The Aggregated metrics tab displays metrics related to the insight. The table is grouped by AWS CloudFormation stacks and subsequent resources that created the metrics. In this example, the insight was a product of an anomaly in the “duration p50” metric generated by the “DevOpsGuruSample-AnomalousLambda” AWS Lambda function.

The AWS Lambda duration metric is evaluated here with a percentile statistic, which excludes the outlier values that skew the average and maximum statistics. The p50 statistic is the median and is typically a good middle estimate: 50% of the observed durations are above the p50 value and 50% are below it.
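To see why a median is more robust than an average, consider the short sketch below. The duration values are invented for illustration, and the standard-library median is used as a stand-in for p50; CloudWatch may compute percentiles differently over large data sets.

```python
import statistics

# Hypothetical invocation durations in milliseconds; one slow outlier.
durations_ms = [110, 120, 115, 125, 118, 2900]

mean_ms = statistics.mean(durations_ms)    # pulled upward by the outlier
p50_ms = statistics.median(durations_ms)   # the middle of the distribution

print(f"mean: {mean_ms:.1f} ms, p50: {p50_ms} ms")
```

A single slow invocation drags the mean far above what most requests experienced, while the p50 stays close to the typical duration. A sustained rise in p50 therefore indicates that most invocations, not just a few, have become slower.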

The red lines on the timeline indicate spans of time when the “duration p50” metric emitted unusual values. Click the red line in the timeline in order to view detailed information.

Choose View in CloudWatch to see how the metric looks in the CloudWatch console. For more information, see Statistics and Dimensions in the HAQM CloudWatch User Guide.
Hover over the graph in order to view details about the anomalous metric data and when it occurred.
Choose the box with the downward arrow to download a PNG image of the graph.

HAQM DevOps Guru Insight Details page contains aggregated metrics. The Duration p50 metric is selected and displayed in graph form.

Figure 4.3: HAQM DevOps Guru Ongoing Reactive Insight Aggregated Metrics

Graphed anomalies

The Graphed anomalies tab displays detailed graphs for each of the insight’s anomalies. Because our insight consisted of a single anomaly, there is one tile with details about unusual behavior detected in related metrics.

Choose View all statistics and dimensions in order to see details about the anomaly. In the window that opens, you can:
Choose View in CloudWatch in order to see how the metric looks in the CloudWatch console.
Hover over the graph to view details about the anomalous metric data and when it occurred.
Choose Statistics or Dimension in order to customize the graph’s display. For more information, see Statistics and Dimensions in the HAQM CloudWatch User Guide.

HAQM DevOps Guru Insight Details page contains Graphed anomalies. The p50 metric of the AWS/Lambda duration is displayed in graph form.

Figure 4.4: HAQM DevOps Guru Ongoing Reactive Insight Graphed Anomaly

Related events

In Related events, view AWS CloudTrail events related to your insight. These events help understand, diagnose, and address the underlying cause of the anomalous behavior. In this example, the events are:

CreateFunction – when we created and deployed the AWS SAM template containing our AWS Lambda function.
CreateChangeSet – when we pushed updates to our stack via the AWS SAM CLI.
UpdateFunctionCode – when the AWS Lambda function code was updated.

Continuation of figure 4.4

Figure 4.5: HAQM DevOps Guru Ongoing Reactive Insight Related Events

Recommendations

The final section in the Insight Details page is Recommendations. Here you can view suggestions that might help you resolve the underlying problem. When HAQM DevOps Guru detects anomalous behavior, it attempts to create recommendations. An insight might contain one, multiple, or zero recommendations.

In this example, the HAQM DevOps Guru recommendation matches the best resolution to our problem: provisioned concurrency.

HAQM DevOps Guru Insight Details page contains Recommendations. The suggested recommendation is to configure provisioned concurrency for the AWS Lambda.

Figure 4.6: HAQM DevOps Guru Ongoing Reactive Insight Recommendations

Understanding what happened

HAQM DevOps Guru recommends enabling Provisioned Concurrency for the AWS Lambda function in order to help it scale better when responding to concurrent requests. Provisioned Concurrency keeps functions initialized by creating the requested number of execution environments in advance, so that they can respond to invocations without an initialization delay. This is a suggested best practice when building high-traffic applications, such as the one that this sample is mimicking.
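For readers who prefer an API view of this setting, here is a minimal sketch that assumes boto3 and uses the Lambda `PutProvisionedConcurrencyConfig` API. Provisioned Concurrency is configured on a published version or an alias rather than on `$LATEST`. The helper names and the alias `live` in the usage note are hypothetical; the walkthrough itself enables the setting through the AWS SAM template.

```python
def build_provisioned_concurrency_request(function_name, qualifier, executions):
    """Build the parameters for Lambda's PutProvisionedConcurrencyConfig call.

    Hypothetical helper for illustration; it only assembles a dictionary.
    """
    if qualifier == "$LATEST":
        raise ValueError("use a published version or an alias, not $LATEST")
    if executions < 1:
        raise ValueError("executions must be a positive integer")
    return {
        "FunctionName": function_name,
        "Qualifier": qualifier,
        "ProvisionedConcurrentExecutions": executions,
    }


def enable_provisioned_concurrency(function_name, qualifier, executions,
                                   region="us-east-1"):
    """Apply the configuration through the Lambda API (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("lambda", region_name=region)
    request = build_provisioned_concurrency_request(function_name, qualifier, executions)
    return client.put_provisioned_concurrency_config(**request)
```

For instance, `enable_provisioned_concurrency("DevOpsGuruSample-AnomalousLambda", "live", 1)` would keep one execution environment initialized for a hypothetical alias named `live`.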

In the anomalous AWS Lambda function, we have sample code that is causing delays. This is analogous to application initialization logic within the handler function. It is a best practice for this logic to live outside of the handler function. Because we are mimicking a high-traffic application, the expectation is to receive a large number of concurrent requests. Therefore, it may be advisable to turn on Provisioned Concurrency for the AWS Lambda function. For Provisioned Concurrency pricing, refer to the AWS Lambda Pricing page.

Resolving the Anomaly

To resolve the sample application’s anomaly, we will update the AWS Lambda function code and enable provisioned concurrency for the AWS Lambda infrastructure.

Open the sample repository in your IDE.
Open the file generate-lambda-devopsguru-insights/lambda-code.py.
Move lines 7-8, the code forcing the AWS Lambda function to respond slowly, above the lambda_handler function definition.
Save the file.
Open the file generate-lambda-devopsguru-insights/lambda-template.yaml.
Uncomment lines 15-17, the code enabling provisioned concurrency in the sample AWS Lambda function.
Save the file.
Deploy these updates to your stack.

cd generate-lambda-devopsguru-insights 
sam deploy --template lambda-template.yaml --stack-name DevOpsGuru-Sample-AnomalousLambda-Stack

After completing these steps, the duration P50 metric will emit more typical results, thereby causing HAQM DevOps Guru to recognize the anomaly as fixed, and then close the reactive insight as shown below.

HAQM DevOps Guru Insight Summary page displays the reactive insight has been closed.

Figure 5: HAQM DevOps Guru Closed Reactive Insight

Clean Up

When you are finished walking through this post, you will have multiple test resources in your AWS account that should be cleaned up or un-provisioned in order to avoid incurring any further charges.

Open the sample repository in your IDE.
Run the below AWS SAM CLI command to delete the sample stack.

cd generate-lambda-devopsguru-insights 
sam delete --stack-name DevOpsGuru-Sample-AnomalousLambda-Stack

Conclusion

As seen in the example above, HAQM DevOps Guru can detect anomalous behavior in an AWS Lambda function, tie it to relevant events that introduced that anomaly, and provide recommendations for remediation by using its pre-trained ML models. All of this was possible by simply enabling HAQM DevOps Guru to monitor the resources with minimal configuration changes and no previous ML expertise. Start using HAQM DevOps Guru today.

AWS DevOps & Developer Productivity Blog