AWS Database Blog
Serverless Scaling for Ingesting, Aggregating, and Visualizing Apache Logs with HAQM Kinesis Firehose, AWS Lambda, and HAQM Elasticsearch Service
September 8, 2021: HAQM Elasticsearch Service has been renamed to HAQM OpenSearch Service. See details.
February 9, 2024: HAQM Kinesis Data Firehose has been renamed to HAQM Data Firehose. Read the AWS What’s New post to learn more.
Pubali Sen and Shankar Ramachandran are solutions architects at HAQM Web Services.
In 2016, AWS introduced the EKK stack (HAQM Elasticsearch Service, HAQM Kinesis, and Kibana, an open-source visualization tool from Elastic) as an alternative to the ELK stack (Elasticsearch, the open-source data processing pipeline Logstash, and Kibana) for ingesting and visualizing Apache logs. One of the main features of the original EKK stack is that the data transformation is handled via the HAQM Kinesis Agent. In this post, we describe how to optimize the EKK solution by handling the data transformation in HAQM Kinesis Firehose through AWS Lambda.
In the ELK stack, a Logstash cluster handles the parsing of the Apache logs. However, that cluster must be designed, scaled, and maintained by the user, which requires significant heavy lifting. The EKK solution eliminates this work with HAQM Kinesis Firehose, AWS Lambda, and HAQM Elasticsearch Service (HAQM ES).
Solution overview
Let’s look at the components and architecture of the EKK optimized solution.
HAQM Kinesis Firehose
HAQM Kinesis Firehose provides the easiest way to load streaming data into AWS. In this solution, Firehose helps capture and automatically load the streaming log data to HAQM ES, and backs it up in HAQM S3. For more information about Firehose, see What is HAQM Kinesis Firehose?
AWS Lambda
AWS Lambda lets you run code without provisioning or managing servers. It automatically scales your application by running code in response to each trigger. Your code runs in parallel and processes each trigger individually, scaling precisely with the size of the workload. In the EKK solution, HAQM Kinesis Firehose invokes the Lambda function to transform incoming source data and deliver the transformed data to the managed HAQM ES cluster. For more information about AWS Lambda, see the AWS Lambda documentation.
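The contract between Firehose and the transformation Lambda function can be sketched as follows. Firehose delivers a batch of base64-encoded records, and the function must return each record with its `recordId`, a `result` status (`Ok`, `Dropped`, or `ProcessingFailed`), and the re-encoded data. The transformation itself below (uppercasing) is only a placeholder, not the blueprint's actual logic:

```python
import base64

def lambda_handler(event, context):
    """Handle a batch of Firehose records. Each incoming record carries
    base64-encoded data; each outgoing record must echo the recordId,
    report a result status, and return base64-encoded transformed data."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        # Placeholder transformation; a real blueprint parses the payload here.
        transformed = payload.upper()
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```

Records marked `Dropped` are removed from the stream, and `ProcessingFailed` records are delivered to the configured S3 error prefix.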
HAQM Elasticsearch Service
HAQM ES is a popular search and analytics engine that provides real-time application monitoring and log and clickstream analytics. In this solution, the Apache logs are stored and indexed in HAQM ES. As a managed service, HAQM ES is easy to deploy, operate, and scale in the AWS Cloud. Using a managed service eliminates administrative overhead, including patch management, failure detection, node replacement, backups, and monitoring. Because HAQM ES includes built-in integration with Kibana, it eliminates having to install and configure that platform—simplifying your process even more. For information about HAQM ES, see What Is HAQM Elasticsearch Service?
HAQM Kinesis Data Generator
This solution uses the HAQM Kinesis Data Generator (KDG) to produce the Apache access logs. The KDG makes it easy to simulate Apache access logs and demonstrate the processing pipeline and scalability of the solution.
Architecture
The following diagram shows the architecture of the EKK optimized stack.
Configuring the EKK optimized stack
This section describes the steps for setting up the EKK optimized solution. (Feel free to set it up in any Region you want.)
Create the AWS Lambda function for data transformation
Firehose provides the following Lambda blueprints that you can use to create a Lambda function for data transformation:
- General Firehose Processing: Contains the data transformation and status model. Use this blueprint for any custom transformation logic.
- Apache Log to JSON: Parses and converts Apache log lines to JSON objects, using predefined JSON field names.
- Apache Log to CSV: Parses and converts Apache log lines to CSV format.
- Syslog to JSON: Parses and converts Syslog lines to JSON objects, using predefined JSON field names.
- Syslog to CSV: Parses and converts Syslog lines to CSV format.
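The core of the Apache Log to JSON blueprint is a parse of the Apache common log format into named fields. The blueprint's actual field names may differ; this sketch illustrates the idea with one regular expression:

```python
import re

# Apache common log format, e.g.:
# 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<response>\d{3}) (?P<bytes>\S+)'
)

def apache_line_to_json(line):
    """Parse one Apache common-log line into a dict of named fields,
    or return None if the line does not match the expected format."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None
```

A line that fails to parse would map to the `ProcessingFailed` status in the Firehose record contract.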
In the AWS Lambda console, create a new Lambda function by selecting the kinesis-firehose-apachelog-to-json blueprint. Set the timeout for the Lambda function to one minute.
Attach HAQM ES and HAQM CloudWatch Logs full access policies to the Lambda function's execution role. The HAQM ES permission allows the Lambda function to write logs to the HAQM ES cluster, and the CloudWatch Logs access helps you monitor the Lambda function.
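If you prefer the AWS CLI, the attachment can be sketched as below. The role name is hypothetical (substitute the execution role of your function); the two policy ARNs are AWS managed policies:

```shell
# Role name is an example; use your Lambda function's execution role.
aws iam attach-role-policy \
  --role-name lambda-apachelog-transform-role \
  --policy-arn arn:aws:iam::aws:policy/HAQMESFullAccess

aws iam attach-role-policy \
  --role-name lambda-apachelog-transform-role \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess
```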
Set up the Elasticsearch cluster
Create the HAQM ES domain in the HAQM ES console or by using the create-elasticsearch-domain command in the AWS CLI.
This example uses the following configuration:
- Domain Name: LogESCluster
- Elasticsearch Version: 1
- Instance Count: 2
- Instance type: medium.elasticsearch
- Enable dedicated master: true
- Enable zone awareness: true
Other settings are left as default.
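A CLI sketch of this configuration is shown below, with some assumptions: HAQM ES domain names must be lowercase, so `LogESCluster` becomes `logescluster`; `t2.medium.elasticsearch` stands in for the medium instance type; and the Elasticsearch version and dedicated-master settings are illustrative, so match them to what you chose in the console:

```shell
# Domain name, version, and instance types are illustrative assumptions.
aws es create-elasticsearch-domain \
  --domain-name logescluster \
  --elasticsearch-version 5.1 \
  --elasticsearch-cluster-config \
    InstanceType=t2.medium.elasticsearch,InstanceCount=2,DedicatedMasterEnabled=true,DedicatedMasterType=t2.medium.elasticsearch,DedicatedMasterCount=3,ZoneAwarenessEnabled=true
```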
Set up the Firehose delivery stream and link the Lambda function
In the Firehose console, create a new delivery stream with HAQM ES as the destination. In the Configuration section, enable data transformation, and choose the Lambda function that you created earlier from the blueprint.
For detailed steps in this process, see Create a Firehose Delivery Stream to HAQM Elasticsearch Service.
Create an HAQM Cognito user and sign in to the KDG
Before you can send data to HAQM Kinesis, you must create an HAQM Cognito user in your AWS account with permissions to access HAQM Kinesis. To simplify this process, a Lambda function and an AWS CloudFormation template are provided to create the user and assign just enough permissions to use the KDG. To learn more, see the KDG Help page.
You can create the CloudFormation stack by clicking the following link. This link takes you to the AWS CloudFormation console and starts the stack creation wizard.
Create an HAQM Cognito user with AWS CloudFormation
Provide a username and password for the user that you will use to sign in to the HAQM Kinesis Data Generator. Accept the defaults for the other options.
After you create the CloudFormation stack, you must use a special URL to access the HAQM Kinesis Data Generator. AWS CloudFormation creates this URL as part of the stack generation. To get the URL, choose the CloudFormation stack, and then choose the Outputs tab as shown in the following screenshot. Bookmark this URL in your browser for easy access to the KDG.
Click the URL and sign in to the KDG using the username and password you provided while setting up the CloudFormation stack.
Set up the KDG record template for the Apache access logs
The HAQM Kinesis Data Generator can generate records using random data based on a template that you provide. The record template can be of any format: JSON, CSV, or unstructured. The KDG creates a unique record from the template by replacing the template placeholders with generated data.
The following shows the template for the Apache logs:
{{internet.ip}} - - [{{date.now("DD/MMM/YYYY:HH:mm:ss ZZ")}}] "GET /index.html HTTP/1.1" 200 104 "-" "ELB-HealthChecker/1.0"
In the KDG, set Records per Second to 100.
To start the data streaming, choose Send Data to HAQM Kinesis.
Data transformation
The following shows the Apache logs before transformation:
The following shows the Apache logs after transformation:
File backup in HAQM S3
The raw files are backed up in HAQM S3. The following shows the HAQM S3 console.
Monitoring and troubleshooting the Firehose delivery stream
The HAQM Kinesis Firehose console helps you monitor and troubleshoot the data delivery and data transformation.
The following screenshot shows the details of the log delivery to HAQM S3 and HAQM ES. Notice that there were errors while delivering the logs to the Elasticsearch cluster.
Setting up Kibana for the Elasticsearch cluster
HAQM ES provides a default installation of Kibana with every HAQM ES domain. You can find the Kibana endpoint on your domain dashboard in the HAQM ES console. (You can restrict HAQM ES to an IP-based access policy.)
In Kibana, for the Index name or pattern, type logdiscovery. This is the name of the HAQM ES index that you created for the web server access logs.
In the Time-field name list, choose @timestamp_utc.
On the Kibana console, choose the Discover tab on the left side to view the Apache access logs.
Conclusion
This post described how to transform and ship Apache logs using a serverless architecture. Firehose and AWS Lambda automatically scale up or down based on the rate at which your application generates logs. You can increase the records per second in the HAQM Kinesis Data Generator to easily test the end-to-end scalability of this solution. This solution addresses the challenges of the Logstash approach: hard-to-manage scaling and tedious cluster management.
To learn more about scaling HAQM ES clusters, see the HAQM Elasticsearch Service Developer Guide.
Managed services like HAQM Kinesis Firehose, AWS Lambda, and HAQM ES simplify provisioning and managing a log aggregation system. The ability to run transformations and add any custom transformation logic against your streaming log data using AWS Lambda further strengthens the case for using an EKK optimized stack.