AWS for Industries
WirelessCar builds automotive data lake solution using HAQM S3, HAQM Redshift
WirelessCar worked alongside HAQM Web Services (AWS) to commoditize connected vehicle services and turn its data into insights, digital services, and revenue. Connected vehicle services in every vehicle, not just premium vehicles, are becoming the new normal. Vehicles produce huge amounts of sensor and user pattern data, and users choosing a vehicle are shifting their focus from vehicle hardware to software-defined experiences. Original equipment manufacturers (OEMs) therefore need new business models that make connected vehicles a profitable business, and they need to turn data into insights that enable personalized new services for better user experience, safety, and loyalty. A first step in this direction is to build a data lake that is compliant with the European Union’s General Data Protection Regulation (GDPR).
Alongside AWS, WirelessCar built a data lake solution using numerous AWS services to bring relevant data out of team silos while respecting consumer privacy and OEM data separation. These services include HAQM Simple Storage Service (HAQM S3), object storage built to retrieve any amount of data from anywhere; HAQM Redshift, which helps companies accelerate their time to insights with fast, easy, and secure cloud data warehousing at scale; HAQM Kinesis Data Firehose, which lets users load real-time streams into data lakes, warehouses, and analytics services; and streams from HAQM DynamoDB, a fast, flexible NoSQL database service for single-digit millisecond performance at any scale. The solution collects OEM data from millions of vehicles in per-tenant HAQM S3 buckets and processes it using AWS Lambda, a serverless, event-driven compute service that lets users run code for virtually any type of application or backend service without provisioning or managing servers, and AWS Fargate, a serverless compute service for containers. User data access, OEM data separation, and GDPR compliance are managed with the per-tenant HAQM S3 buckets; AWS Identity and Access Management (AWS IAM), which provides fine-grained access control across all of AWS; and Active Directory Connector (AD Connector), a directory gateway that lets users redirect directory requests to their on-premises Microsoft Active Directory without caching any information in the cloud.
One major challenge is establishing data collection processes and pipelines from OEM programs because each OEM provides different services that are based on a different technology stack and have been developed by WirelessCar for 20 years. A data lake is used for the analysis of global services provided by WirelessCar to OEMs, connected vehicle analytics on an aggregated level, domain- and region-specific insights, and metrics such as the cost of WirelessCar services per vehicle per year, vehicle diagnostic errors, and forecasting, which were previously tracked in ad hoc Excel spreadsheets. Based on this data lake, we built intelligent dashboards for exploratory analysis, and we are building services based on artificial intelligence (AI) and machine learning (ML) for advanced analytics for OEMs and solution users. These insights and data-driven new services will create value for OEMs and solution users and create new revenue sources and business models from data collected from vehicles.
This blog post shows an example setup for a data lake solution for OEMs and automotive data. It highlights architecture, use cases, and a solution to build an automotive data lake. Further, it paves a path to build a dashboard and AI/ML–based services for OEMs and solution users.
Overview of solution
WirelessCar has multiple solutions developed during its 20 years of experience in the industry. Teams working for different OEMs perform continuous refactoring such as switching from SQL databases to NoSQL alternatives like HAQM DynamoDB. WirelessCar has hundreds of AWS accounts throughout its organization, with different teams operating their services in separated accounts to apply least-privilege access and limit the scope of impact of issues. The data lake solution had to provide significant freedom for teams to refactor and control their own accounts and, at the same time, provide central access to datasets. This solution also needed to find processes and pipelines to collect data from a diverse set of solutions.
WirelessCar opted for using HAQM S3 buckets with write-only permissions for teams. These buckets provide a high-capacity, virtually infinite input buffer for the data lake. The raw data typically arrives as JSON Lines or comma-separated values (CSV) files, either uncompressed or in compressed formats. By granting write access to AWS IAM roles in the source accounts, we give the teams the ability to stream data into HAQM S3 in a single destination data lake account with the method of their choice, for example, HAQM Kinesis Data Firehose, HAQM S3 replication, direct HAQM S3 PUTs, or HAQM DynamoDB streams. Independent teams are provided per-tenant buckets into which they can stream data. Once the data is ingested, automatic jobs move the input files to a separate archive bucket with suitable HAQM S3 storage class transitions. AWS Fargate tasks initiated hourly and AWS Lambda functions cleanse data in HAQM S3, create datasets, and load them into HAQM Redshift in tables suitable for efficient queries. Once in HAQM Redshift, datasets are normally transformed to a structured columnar format. Because WirelessCar operates a multitenant environment, we have chosen to have multiple HAQM Redshift clusters to separate costs and backups for different OEMs. Datasets in HAQM Redshift are queried by admins (the data management team), data scientists, and dashboard readers using HAQM QuickSight, a popular cloud-native, serverless business intelligence service.
This way raw vehicle data is collected from many OEM programs, cleansed, and archived, and datasets are loaded in HAQM Redshift to be queried with data security measures.
Solution
Data lake provisioning
The WirelessCar data management team set up the data lake using AWS CloudFormation templates. AWS CloudFormation speeds up cloud provisioning with infrastructure as code. The templates cover the HAQM Redshift clusters, the HAQM S3 input and archive buckets, and AWS Lambda. For each tenant, WirelessCar OEM program teams request a data ingestion HAQM S3 bucket and the HAQM Resource Name (ARN) of a role with write-only access to that bucket. Each OEM has a unique tenant for each of its brands.
Breaking data silos with data ingestion pipelines
Each OEM team inside WirelessCar pushes data from its own account, using the provided role ARN, into a provisioned HAQM S3 bucket in the central data lake account. Depending on the data source, different methods of writing data to HAQM S3 are used. A small HAQM DynamoDB table could be exported in its entirety with the HAQM DynamoDB-to-HAQM S3 export feature. A larger HAQM DynamoDB table had its change stream continually written by HAQM Kinesis Data Firehose as compressed, partitioned chunks in HAQM S3. The method of writing data to an HAQM S3 bucket in the data lake account was up to the producers; the data management team provided guidance and template solutions. HAQM S3 works alongside AWS Managed Services, which helps users operate their AWS infrastructure more efficiently and securely, to write data. This helped WirelessCar to break data silos and gather data from multiple sources in its data lake account.
Cross-account AWS IAM roles introduce one particular complication. An HAQM S3 bucket policy that names an AWS IAM role by its ARN in the Principal element does not survive the deletion and re-creation of that role, even under an identical ARN, because the saved policy is bound to the role’s unique internal ID rather than to the ARN string. While this is a design choice that makes sense in the general case, it puts an operational constraint between the two parties that is not justified in this case. This is solved without compromising security by using a StringEquals condition on the aws:PrincipalArn condition key, which is compared with the caller’s ARN at request time.
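A minimal sketch of a bucket policy that uses such a StringEquals condition on aws:PrincipalArn is shown here, built in Python. The bucket name, account ID, and role name are hypothetical placeholders, and allowing only s3:PutObject is an assumption based on the write-only access described earlier. The Principal element names the source account rather than the role, and the condition narrows access to the one role ARN.

```python
import json


def write_only_bucket_policy(bucket: str, source_account_id: str, role_name: str) -> dict:
    """Build a bucket policy that lets one cross-account role write objects.

    The role is matched by its ARN string in a condition, not in the Principal
    element, so the policy keeps working if the role is deleted and re-created.
    """
    role_arn = f"arn:aws:iam::{source_account_id}:role/{role_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowTenantRoleWriteOnly",
                "Effect": "Allow",
                # Trust the source account as a whole ...
                "Principal": {"AWS": f"arn:aws:iam::{source_account_id}:root"},
                # Assumption: write-only means PutObject and nothing else.
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # ... but only when the caller is exactly this role.
                "Condition": {"StringEquals": {"aws:PrincipalArn": role_arn}},
            }
        ],
    }


if __name__ == "__main__":
    # All three values are hypothetical placeholders.
    policy = write_only_bucket_policy(
        "example-tenant-input", "111122223333", "example-ingest-role"
    )
    print(json.dumps(policy, indent=2))
```

The role in the source account still needs an identity policy that allows the same action on the bucket; the bucket policy alone is not sufficient for cross-account access.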
The intermediate storage in HAQM S3 offers decoupling between data producers and the data lake. Input data is generally well structured, because it has been sent from vehicles and processed in an OEM tenant account. The data is placed in HAQM S3 in JSON, CSV, or a batched and compressed format. Depending on the type of data, some anonymization or pseudonymization is already applied by the source OEM programs using AWS Lambda transformation in HAQM Kinesis Data Firehose. Most of the data is delivered incrementally, because HAQM Kinesis Data Firehose will automatically split it into suitable chunks.
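As an illustration of pseudonymization in a Firehose transformation, the following sketch shows an AWS Lambda handler that follows the Firehose data transformation contract: base64-encoded records come in, and each is returned with its recordId, a result status, and re-encoded data. The field name `vin`, the hard-coded key, and the choice of HMAC-SHA256 are assumptions made for this example, not details of the WirelessCar implementation.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical secret; in practice it would be loaded from a secrets store.
PSEUDONYM_KEY = b"example-secret-key"
# Hypothetical name of the field that holds the vehicle identifier.
ID_FIELD = "vin"


def pseudonymize(value: str) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256, hex encoded)."""
    return hmac.new(PSEUDONYM_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def lambda_handler(event, context):
    """Firehose data transformation: pseudonymize ID_FIELD in each JSON record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if ID_FIELD in payload:
                payload[ID_FIELD] = pseudonymize(str(payload[ID_FIELD]))
            # Newline-terminate so the delivered S3 object is valid JSON Lines.
            data = (json.dumps(payload) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("ascii"),
            })
        except (ValueError, TypeError):
            # Unparseable records are returned unchanged and flagged as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

A keyed hash keeps the same vehicle mapped to the same pseudonym across records, so joins and aggregations still work downstream, while the original identifier cannot be recovered without the key.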
Once the data is ingested, the source files are transferred to a separate archival HAQM S3 bucket. The archival HAQM S3 bucket makes it simple to replay data deliveries during testing or refer to the unmodified source data for troubleshooting purposes. It also means that the input HAQM S3 bucket will always remain empty, except for files that are just about to be picked up for processing.
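The transfer to the archive bucket can be sketched as a copy followed by a delete, because S3 has no native move operation. The function below is an assumption about how such a step could look, not the WirelessCar job itself; it takes any client object that offers boto3-style `copy_object` and `delete_object` methods, and keeping the same key in the archive bucket is an illustrative choice.

```python
def archive_object(s3_client, input_bucket: str, archive_bucket: str, key: str) -> None:
    """Copy one processed input file to the archive bucket, then delete the original.

    The delete runs only after the copy has returned, so a failed copy
    (which raises an exception) leaves the input file in place for a retry.
    """
    s3_client.copy_object(
        Bucket=archive_bucket,
        Key=key,
        CopySource={"Bucket": input_bucket, "Key": key},
    )
    s3_client.delete_object(Bucket=input_bucket, Key=key)
```

With boto3 installed, a call might look like `archive_object(boto3.client("s3"), "example-input", "example-archive", "trips/part-0001.json.gz")`, where the bucket names and key are hypothetical. A single `copy_object` call is limited to objects of up to 5 GB; larger files would need a multipart copy.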
Data cleansing
AWS Lambda and AWS Fargate are used to cleanse data and load it into HAQM Redshift for querying. HAQM S3 initiates AWS Lambda for data processing, and AWS Fargate batch job processing is initiated every 15 minutes. In the data processing step, input data files from HAQM S3 are processed and loaded into HAQM Redshift. The HAQM Redshift COPY command efficiently ingests even large files directly from HAQM S3 into HAQM Redshift.
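To make the COPY step concrete, the following sketch assembles a COPY statement for gzip-compressed JSON Lines files. The table name, S3 prefix, and role ARN are hypothetical, and the `JSON 'auto'` option assumes that the JSON field names match the table's column names.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str, gzip: bool = True) -> str:
    """Return a Redshift COPY statement that loads JSON Lines files from S3.

    The arguments are interpolated directly into the SQL text, so they are
    assumed to come from trusted configuration, never from user input.
    """
    parts = [
        f"COPY {table}",
        f"FROM '{s3_uri}'",          # a prefix loads every file underneath it
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS JSON 'auto'",     # map JSON fields to columns by name
    ]
    if gzip:
        parts.append("GZIP")
    return "\n".join(parts) + ";"


if __name__ == "__main__":
    # Hypothetical table, prefix, and role.
    print(build_copy_statement(
        "staging.trips",
        "s3://example-tenant-input/trips/",
        "arn:aws:iam::111122223333:role/example-redshift-copy-role",
    ))
```

Pointing COPY at a prefix rather than a single file lets the cluster load the files in parallel, which is part of why the command handles large deliveries well.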
AWS Glue, a simple, scalable, and serverless data integration service, or even direct analytical queries against production databases would lower the latency of data access. Following best practices, however, WirelessCar decided not to query live production databases, because doing so can hamper workload performance. HAQM S3 Object Lambda helps WirelessCar to reduce the latency until data is ready for consumption in HAQM Redshift to the order of minutes, which is reasonable for WirelessCar use cases.
DBT, a data build tool, gives analytics engineers the ability to transform data in their warehouses by simply writing select statements. DBT handles turning these select statements into tables and views. It runs on an automatic schedule in AWS Fargate, performing certain tasks at frequent intervals and larger updates nightly. The transformations include simple filtering views that remove bad data and aggregating views or tables that reduce the number of rows. In certain cases, sensitive source data such as geospatial information is masked or reduced in precision using DBT views. The final layer exposes the datasets in structured columnar format. HAQM Redshift user-defined functions (UDFs) let HAQM Redshift invoke AWS Lambda from SQL queries. This gives WirelessCar the ability to write certain business logic functions in whichever language is preferred (for example, Python, which is well adapted for data science purposes) and to enrich geospatial data with, for example, geogrid identifiers or lookups using HAQM Location Service, which lets users securely and easily add location data to applications. Together, AWS Lambda, AWS Fargate, the DBT tool, and UDFs help WirelessCar to process data ingested in HAQM S3 buckets and create columnar datasets in HAQM Redshift for data consumption.
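As a sketch of a Lambda-backed UDF, the handler below maps latitude and longitude pairs to a coarse grid cell identifier. It follows the Redshift Lambda UDF contract, in which the event carries one argument list per row under `arguments` and the function returns a JSON document with `success` and `results`. The grid scheme, the cell size, and the function names are assumptions made for this example.

```python
import json
import math

# Hypothetical grid resolution in degrees (about 1 km north to south).
CELL_SIZE_DEG = 0.01


def grid_id(lat, lon):
    """Return a coarse grid cell identifier, or None for missing coordinates."""
    if lat is None or lon is None:
        return None
    return f"{math.floor(lat / CELL_SIZE_DEG)}_{math.floor(lon / CELL_SIZE_DEG)}"


def lambda_handler(event, context):
    """Redshift Lambda UDF entry point: one result per input row, in order."""
    try:
        results = [grid_id(lat, lon) for lat, lon in event["arguments"]]
        return json.dumps(
            {"success": True, "num_records": len(results), "results": results}
        )
    except Exception as exc:
        # On failure, Redshift expects success=false plus an error message.
        return json.dumps({"success": False, "error_msg": str(exc)})
```

On the Redshift side, such a function would be registered with CREATE EXTERNAL FUNCTION and then called like any scalar function in a query or a DBT view. Because Redshift batches many rows into one invocation, the handler processes the whole `arguments` list in a single pass.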
Data protection
Because WirelessCar operates cloud solutions for multiple OEMs, it is important to separate storage costs and backups per OEM. Therefore, separate HAQM Redshift clusters are used for different OEMs. Within an OEM, schemas separate the different tenants (car brands). Because some of the HAQM Redshift clusters do not need to run continuously, WirelessCar plans to use HAQM Redshift Serverless, which helps users get insights from data in seconds without having to manage data warehouse infrastructure.
The GDPR and the California Consumer Privacy Act are regulatory compliance requirements for dealing with user data. Anonymization of data helps facilitate compliance. When personal data is processed, it must be deleted when required. To avoid manual processes, the WirelessCar data management team consumes the HAQM DynamoDB stream from the source database and replays the operations in HAQM Redshift, so that any data deleted in the source system is also removed from HAQM Redshift. This approach covers user-level actions, like explicitly deleting a trip; batch jobs, like removing all trips for a vehicle; and time to live (TTL) events, like removing trips that should no longer be stored.
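The replay of deletions can be sketched as a translation from DynamoDB stream records to SQL statements. The function below turns each REMOVE record into a parameterized DELETE. It assumes that the Redshift table uses the same column names as the DynamoDB key attributes and that statements run through a driver with `%s` placeholders; the table and attribute names in the usage example are hypothetical.

```python
def delete_statements(records, table: str):
    """Translate DynamoDB stream REMOVE records into parameterized SQL DELETEs.

    Returns a list of (sql, params) tuples. Other event types are skipped here.
    """
    statements = []
    for record in records:
        if record.get("eventName") != "REMOVE":
            continue
        keys = record["dynamodb"]["Keys"]
        columns, params = [], []
        for name in sorted(keys):
            # Column names go into the SQL text, so reject anything unusual.
            if not name.isidentifier():
                raise ValueError(f"unexpected key attribute name: {name!r}")
            # Values are type-tagged, e.g. {"S": "trip-1"}; unwrap the single value.
            (value,) = keys[name].values()
            columns.append(name)
            params.append(value)
        where = " AND ".join(f"{column} = %s" for column in columns)
        statements.append((f"DELETE FROM {table} WHERE {where}", tuple(params)))
    return statements


if __name__ == "__main__":
    sample = [{
        "eventName": "REMOVE",
        "dynamodb": {"Keys": {"vehicle_id": {"S": "v1"}, "trip_id": {"S": "t1"}}},
    }]
    for sql, params in delete_statements(sample, "trips"):
        print(sql, params)
```

User deletions, batch deletions, and TTL expirations all arrive on the stream as REMOVE records, which is why a single code path can cover the three cases. Numeric key values arrive as strings under the `N` type tag and may need casting before use.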
Data access
Data in HAQM Redshift is accessed through HAQM QuickSight, which is used to build visualization dashboards that turn data into insights. WirelessCar uses Active Directory integration to manage access permissions to datasets, which helps in meeting regulatory compliance requirements. Data is exported periodically from HAQM Redshift to HAQM S3 for short-term analysis by a limited number of data scientists. To facilitate compliance with regulation, this HAQM S3 data is automatically deleted using HAQM S3 Lifecycle rules.
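Automatic deletion of the exported data can be expressed as an S3 Lifecycle configuration with an expiration rule. The sketch below builds such a configuration in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The prefix and the retention period are hypothetical, because the actual retention is not stated here.

```python
def expiration_lifecycle(prefix: str, days: int) -> dict:
    """Build an S3 Lifecycle configuration that expires objects under a prefix."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",
                "Filter": {"Prefix": prefix},   # only objects under this prefix
                "Status": "Enabled",
                "Expiration": {"Days": days},   # counted from object creation
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical prefix and retention period.
    print(expiration_lifecycle("exports/", 30))
```

With boto3 installed, the result could be applied with `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket="example-export-bucket", LifecycleConfiguration=expiration_lifecycle("exports/", 30))`, where the bucket name is again a placeholder.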
To avoid central bottlenecks, the data management team is intentionally kept small, with 3-4 people actively working on the data lake. In contrast, WirelessCar has dozens of teams serving multiple car makers. The data lake solution was set up in 2021, and all OEM solutions now have one or more data streams ingesting data in the data lake. It is our intention to continue this work in 2022 and gather an even greater number of datasets to permit innovation across previously existing data silos.
For the future, we plan to make use of HAQM Redshift Serverless because it helps us to scale up the number of clusters used for cost and backup separation without increasing our fixed costs. As our data volumes grow, it is our intention to shift data out of HAQM Redshift storage and seamlessly query HAQM S3 using HAQM Redshift Spectrum, which lets users query data directly from files on HAQM S3, for longer time series. This data lake is used for creating exploratory dashboard visualizations and developing AI/ML–based services for OEMs and solution users.
Conclusion
WirelessCar is collecting data across all OEM programs and creating a data lake that complies with data protection regulations. This data lake is used for exploratory dashboard analysis and for creating new AI/ML-based services for connected mobility. Please reach out to us with your questions or adopt the WirelessCar data lake solution for your workloads. We will share more about building dashboard services for connected mobility in our next blog post.