AWS Public Sector Blog
AWS transforms DC police department’s data processing system by utilizing Spark on Lambda
The Metropolitan Police Department of Washington DC (DC-MPD) is one of the ten largest police agencies in the United States, serving as the primary law enforcement entity for the District of Columbia. Embracing cutting-edge technology, DC-MPD has integrated evidence analysis techniques with state-of-the-art information technology to advance crime-mapping, real-time crime statistics dashboards, and summary statistics for a holistic view of crime dynamics. This modern approach is further complemented by a community policing philosophy, which emphasizes building strong partnerships between the police force and the residents of the district.
At the heart of DC-MPD’s operations lies a well-established data pipeline that efficiently handles a vast array of information sourced from over 400 datasets derived from the Mark43 records management system. The extraction and management of this data are facilitated by solutions offered by Amazon Web Services (AWS), including AWS Database Migration Service (AWS DMS), which enables seamless data extraction and storage in an Amazon Simple Storage Service (Amazon S3) raw bucket.
However, despite these advancements, the department encountered significant challenges, including duplicate values within the staging bucket, inadequate error handling within Lambda functions, and occasional disruptions in the curated bucket.
Solution overview
To address these challenges, DC-MPD collaborated with AWS to design and implement a system that focused on enhancing their existing extract, transform, load (ETL) pipeline. The primary objective was to eliminate duplicate records, establish robust error detection mechanisms, and improve the overall error handling and orchestration of their data pipeline.
By leveraging AWS technologies such as Spark on AWS Lambda, AWS Glue, and AWS Step Functions, DC-MPD successfully transformed raw data into open table formats, orchestrated the pipeline with Step Functions, and enabled users to access datasets through Amazon Athena.
The solution achieved several key objectives, including the creation of the necessary Amazon S3 buckets for input, output, and error handling; the implementation of an AWS Glue job for data format conversion; and a table in Amazon DynamoDB for managing data processing state. By enhancing the architecture, the team created a more streamlined data processing workflow that could adapt to various operational scenarios.
The data processing workflow leverages several AWS services to manage and process data based on specific conditions and events.
Data ingestion into HAQM S3
Incoming data files are uploaded to an Amazon S3 bucket, organized by database- and table-specific folders. This structure helps maintain logical segmentation for processing.
Event-based processing via AWS Step Functions
When a new file arrives in Amazon S3, an Amazon EventBridge rule triggers an AWS Step Functions state machine. The Step Functions workflow then determines whether the file represents a full load or an incremental load, based on its naming convention.
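To make this routing concrete, here is a minimal sketch of how the database, table, and load type could be derived from an object key. The folder layout and the LOAD* prefix (the default AWS DMS full-load naming) are assumptions for illustration; the post does not publish DC-MPD's exact convention.

```python
# Hypothetical key layout: <database>/<table>/<file>, for example:
#   mark43/incident_reports/LOAD00000001.csv        (full load)
#   mark43/incident_reports/20240115-093000123.csv  (incremental/CDC)
def classify_s3_object(key: str) -> dict:
    """Derive database, table, and load type from an S3 object key."""
    parts = key.strip("/").split("/")
    if len(parts) < 3:
        raise ValueError(f"Unexpected key layout: {key}")
    database, table, file_name = parts[0], parts[1], parts[-1]
    # AWS DMS names full-load files LOAD*; CDC files carry a timestamp instead.
    load_type = "FULL" if file_name.upper().startswith("LOAD") else "INCREMENTAL"
    return {"database": database, "table": table, "file": file_name, "load_type": load_type}
```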
Tracking pipeline state in Amazon DynamoDB
To manage the workflow effectively, the state machine references an Amazon DynamoDB table (a minimal sketch of reading and locking this table follows the list below). The table maintains:
- Job Flags: Each folder or data group has a job flag—G (process with AWS Glue), L (process with AWS Lambda), or P (pause).
- File History: Names and timestamps of the most recently processed file and any failed file are recorded.
- Failure Handling: If a file fails to process, the pipeline sets its job flag to P to pause any further operations, allowing time for investigation and remediation.
- Job References: The table also stores resource references such as the Amazon Resource Names (ARNs) of the AWS Glue job or Lambda function used for processing.
- Lock Management: DynamoDB is also used to implement a locking mechanism, ensuring only one transaction can process at a time per table.
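As referenced above, a hedged sketch of reading the job flag and acquiring the per-table lock with boto3 might look like the following. The table name, key schema, and attribute names are hypothetical, not DC-MPD's actual schema.

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
control_table = dynamodb.Table("etl-pipeline-control")  # hypothetical table name

def get_job_config(database: str, table: str) -> dict:
    """Fetch the job flag (G, L, or P) and resource reference for a data group."""
    item = control_table.get_item(Key={"pk": f"{database}#{table}"}).get("Item", {})
    return {
        "job_flag": item.get("job_flag", "P"),
        "job_arn": item.get("job_arn"),
        "last_processed_file": item.get("last_processed_file"),
    }

def acquire_lock(database: str, table: str) -> bool:
    """Conditionally set a lock attribute so only one transaction runs per table."""
    try:
        control_table.update_item(
            Key={"pk": f"{database}#{table}"},
            UpdateExpression="SET locked = :t",
            ConditionExpression="attribute_not_exists(locked) OR locked = :f",
            ExpressionAttributeValues={":t": True, ":f": False},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another execution holds the lock
        raise
```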
Handling failure and pausing executions
In the event of a failure, AWS Step Functions automatically updates the job flag in the DynamoDB table to P, preventing additional incremental loads from being processed. This pause mechanism preserves data integrity and allows operations teams to investigate the root cause.
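Under the same hypothetical control-table schema sketched above, the failure path could be a single update that flips the job flag to P and records the failed file and timestamp for investigation:

```python
from datetime import datetime, timezone
import boto3

control_table = boto3.resource("dynamodb").Table("etl-pipeline-control")  # hypothetical name

def mark_failed(database: str, table: str, file_name: str) -> None:
    """Pause the data group (flag P) and record which file failed, for later remediation."""
    control_table.update_item(
        Key={"pk": f"{database}#{table}"},
        UpdateExpression="SET job_flag = :p, failed_file = :f, failed_at = :ts",
        ExpressionAttributeValues={
            ":p": "P",
            ":f": file_name,
            ":ts": datetime.now(timezone.utc).isoformat(),
        },
    )
```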
Full load vs. incremental load processing
- Full Load: If the state machine detects a full-load file, the workflow pauses, and the corresponding incremental files are transferred to an Amazon Simple Queue Service (Amazon SQS) queue. This allows the team to handle large-scale ingestion or backfill operations after the full load completes.
- Incremental Load: For incremental load files, the Step Functions workflow checks the folder’s job flag. Depending on whether it is set to G or L, AWS Glue or AWS Lambda is invoked respectively. This approach optimizes processing based on file size and complexity.
AWS Glue (G flag) is used for large files or complex ETL tasks, leveraging its scalability and batch processing capabilities. While Glue incurs costs based on Data Processing Units (DPUs), it is cost-efficient for high-throughput, resource-intensive jobs.
AWS Lambda (L flag) is invoked for smaller files or lightweight transformations, utilizing its cost-effective pay-per-use model. Lambda’s 15-minute runtime limit and pricing based on execution time and requests make it ideal for smaller datasets and short-lived tasks. However, if a full load is in progress, incoming incremental files are automatically sent to an HAQM SQS queue. These queued files are processed in order after the full load completes, following the standard G or L flag logic once resumed.
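A sketch of this branching logic, written as a small Python router rather than the actual Step Functions state machine, is shown below. All resource names are placeholders, and note that AWS Glue jobs are started by name while Lambda functions are invoked by name or ARN.

```python
import json
import boto3

glue = boto3.client("glue")
lambda_client = boto3.client("lambda")
sqs = boto3.client("sqs")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/incremental-backlog"  # placeholder

def route_file(record: dict, config: dict, full_load_in_progress: bool) -> str:
    """Route an incremental file to SQS, Glue, Lambda, or pause, based on the job flag."""
    flag = config["job_flag"]
    if full_load_in_progress:
        # Queue incremental files until the full load finishes, then replay them in order.
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(record))
        return "QUEUED"
    if flag == "P":
        return "PAUSED"  # a prior failure is under investigation
    if flag == "G":
        # For Glue, the stored reference acts as the job name.
        glue.start_job_run(JobName=config["job_arn"], Arguments={"--input_key": record["file"]})
        return "GLUE_STARTED"
    if flag == "L":
        lambda_client.invoke(
            FunctionName=config["job_arn"],
            InvocationType="Event",  # asynchronous invocation
            Payload=json.dumps(record).encode(),
        )
        return "LAMBDA_INVOKED"
    raise ValueError(f"Unknown job flag: {flag}")
```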
Orchestration and monitoring
After triggering AWS Glue or AWS Lambda, Step Functions monitors the job’s execution status. Once processing succeeds, the output file is moved to a staging Amazon S3 bucket for further processing. This design provides a serverless architecture for orchestrating both incremental and full-load pipelines, with a mechanism for failure detection and pause/resume, ensuring data quality and operational oversight.
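In practice this monitoring is handled by Step Functions itself (for example, through its synchronous AWS Glue job integration), but a plain boto3 equivalent illustrates the idea; bucket names and keys are placeholders:

```python
import time
import boto3

glue = boto3.client("glue")
s3 = boto3.client("s3")

def wait_and_stage(job_name: str, run_id: str, output_key: str) -> None:
    """Poll a Glue job run and, on success, promote the output file to the staging bucket."""
    while True:
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            break
        time.sleep(30)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Glue job {job_name} run {run_id} ended in state {state}")
    # Placeholder bucket names; the post describes moving successful output to a staging bucket.
    s3.copy_object(
        Bucket="dcmpd-staging-bucket",
        Key=output_key,
        CopySource={"Bucket": "dcmpd-output-bucket", "Key": output_key},
    )
```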
Unlocking benefits by utilizing AWS technologies for data processing
- Performance: A key advantage of this ETL pipeline is its high throughput and near real-time processing speed. By leveraging an open table format and running Spark on AWS Lambda, the pipeline processed 1 million records in under 90 seconds, ensuring rapid data transformation (a minimal transformation sketch follows this list). Because the serverless architecture automatically scales to match demand, the pipeline maintains consistent performance, enabling DC-MPD to process large datasets in a timely manner.
- Scalability: The scalable architecture designed for DC-MPD allows for efficient handling of large datasets and data processing tasks. Utilizing Spark on AWS Lambda, the system dynamically triggers Spark jobs in response to new data arrivals in the Amazon S3 input bucket, which means it can scale efficiently with the volume of incoming data. The orchestration of tasks through AWS Step Functions ensures data processing occurs in the correct sequence, enabling seamless integration with existing systems while accommodating varying data loads, whether incremental or full. This flexibility to scale and manage data processing tasks in real time makes the solution ideal for a law enforcement agency, where timely access to data can be critical for effective crime-fighting efforts.
- Cost optimization: Cost efficiency is a key advantage of the AWS-based ETL pipeline. By utilizing serverless services like AWS Lambda, AWS Glue, and AWS Step Functions, DC-MPD minimizes the costs associated with infrastructure management. Serverless computing allows the department to pay only for the compute resources consumed during data processing tasks, as opposed to maintaining always-on servers. With the integration of Amazon EventBridge to trigger data processing only when new data is uploaded to the Amazon S3 bucket, the agency incurs costs only when necessary, maximizing resource utilization and minimizing waste. The cost-optimized architecture not only supports DC-MPD’s operational needs, but also enables the department to allocate resources effectively, enhancing overall efficiency without compromising performance.
- Flexibility: The newly designed pipeline offers significant flexibility, enabling DC-MPD to adapt to various processing conditions as dictated by real-time data states. The use of Amazon DynamoDB for mapping source folders to their corresponding Lambda functions allows for dynamic execution of data processing tasks based on job flags. This means that, depending on the current needs or the nature of the incoming data, the pipeline can invoke either a Glue job or a Lambda function. Moreover, the error handling mechanisms are robust, with any failed processes routed to a dedicated Amazon S3 failed bucket for investigation and reprocessing. This architecture not only allows for error isolation, but also facilitates continual improvements to data processing workflows, making it highly adaptable to changing operational requirements.
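As referenced in the performance item above, the transformation work runs as PySpark inside Lambda. A minimal sketch of that kind of job, reading raw CSV, dropping duplicates (one of the original pain points), and writing a columnar output, is shown below; the paths, schema handling, and the choice of plain Parquet rather than a specific open table format are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spark on AWS Lambda runs a standard local-mode SparkSession inside the function's container.
spark = (
    SparkSession.builder
    .appName("dcmpd-incremental-transform")
    .getOrCreate()
)

raw_path = "s3a://dcmpd-raw-bucket/mark43/incident_reports/"          # placeholder input prefix
staging_path = "s3a://dcmpd-staging-bucket/mark43/incident_reports/"  # placeholder output prefix

df = (
    spark.read.option("header", "true").csv(raw_path)
    # Drop exact duplicates, one of the issues observed in the original staging bucket.
    .dropDuplicates()
    .withColumn("ingested_at", F.current_timestamp())
)

# Writing Parquet keeps the example generic; the production pipeline targets an open table format.
df.write.mode("append").parquet(staging_path)
```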
Conclusion
By utilizing a comprehensive suite of AWS services to optimize its ETL pipeline, the Metropolitan Police Department of Washington DC (DC-MPD) has achieved efficient data transformation, enhanced error handling, and streamlined orchestration. This solution positions DC-MPD at the forefront of technological advancements in law enforcement data management and strengthens the department’s capabilities to enhance public safety outcomes.