AWS Big Data Blog
Category: AWS Step Functions
How the Georgia Data Analytics Center built a cloud analytics solution from scratch with the AWS Data Lab
This is a guest post by Kanti Chalasani, Division Director at Georgia Data Analytics Center (GDAC). GDAC is housed within the Georgia Office of Planning and Budget to facilitate governed data sharing between various state agencies and departments. The Office of Planning and Budget (OPB) established the Georgia Data Analytics Center (GDAC) with the intent […]
ETL orchestration using the HAQM Redshift Data API and AWS Step Functions with AWS SDK integration
Serverless orchestration architectures for extract, transform, and load (ETL) applications are becoming popular with many customers. These applications offer greater extensibility and simplicity, making ETL pipelines easier to maintain. A primary benefit of this architecture is that we simplify an existing ETL pipeline with AWS Step Functions and directly call the HAQM Redshift […]
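The integration pattern that post describes, a Step Functions task calling the HAQM Redshift Data API directly through an AWS SDK service integration, can be sketched roughly as follows. This is a minimal illustration rather than the post's actual pipeline; the cluster identifier, database, user, SQL, and IAM role ARN are all placeholders.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# State machine definition: a single Task state that uses the AWS SDK
# integration for the Redshift Data API (no Lambda function in between).
definition = {
    "Comment": "Run one ETL SQL statement on HAQM Redshift via the Data API",
    "StartAt": "ExecuteStatement",
    "States": {
        "ExecuteStatement": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:redshiftdata:executeStatement",
            "Parameters": {
                "ClusterIdentifier": "my-redshift-cluster",  # placeholder
                "Database": "dev",                           # placeholder
                "DbUser": "awsuser",                         # placeholder
                "Sql": "CALL etl.load_daily_sales();",       # placeholder stored procedure
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="redshift-data-api-etl",                                        # placeholder
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRedshiftRole",  # placeholder
    definition=json.dumps(definition),
)
```

Because the Data API call is asynchronous, a real pipeline would typically follow this task with a polling state on DescribeStatement or a callback before moving to the next ETL stage.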
Doing more with less: Moving from transactional to stateful batch processing
HAQM processes hundreds of millions of financial transactions each day, including accounts receivable, accounts payable, royalties, amortizations, and remittances, from over a hundred different business entities. All of this data is sent to the eCommerce Financial Integration (eCFI) systems, where it is recorded in the subledger. Ensuring complete financial reconciliation at this scale is critical […]
Build and orchestrate ETL pipelines using HAQM Athena and AWS Step Functions
Extract, transform, and load (ETL) is the process of reading source data, applying transformation rules to this data, and loading it into the target structures. ETL is performed for various reasons. Sometimes ETL helps align source data to target data structures, whereas other times ETL is done to derive business value by cleansing, standardizing, combining, […]
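As a rough sketch of the kind of step such a pipeline orchestrates, the following starts an Athena CTAS query that reads a raw table and writes a cleansed Parquet table. The database, table, bucket, and workgroup names are illustrative placeholders, not values from the post.

```python
import boto3

athena = boto3.client("athena")

# A CTAS query as a simple "transform and load" step: read from a raw table,
# filter out bad rows, and write a Parquet table to a curated location.
response = athena.start_query_execution(
    QueryString="""
        CREATE TABLE curated.daily_sales
        WITH (format = 'PARQUET',
              external_location = 's3://example-bucket/curated/daily_sales/')
        AS SELECT order_id, CAST(order_ts AS date) AS order_date, amount
        FROM raw.sales
        WHERE amount IS NOT NULL
    """,
    QueryExecutionContext={"Database": "raw"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    WorkGroup="primary",
)
print(response["QueryExecutionId"])
```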
Prepare, transform, and orchestrate your data using AWS Glue DataBrew, AWS Glue ETL, and AWS Step Functions
Data volumes in organizations are increasing at an unprecedented rate, exploding from terabytes to petabytes and in some cases exabytes. As data volume increases, it attracts more and more users and applications to use the data in many different ways, sometimes referred to as data gravity. As data gravity increases, we need to find tools and […]
Centralize feature engineering with AWS Step Functions and AWS Glue DataBrew
One of the key phases of a machine learning (ML) workflow is data preprocessing, which involves cleaning, exploring, and transforming the data. AWS Glue DataBrew, announced at AWS re:Invent 2020, is a visual data preparation tool that enables you to develop common data preparation steps without having to write any code or install anything. In this […]
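As a minimal sketch of how a DataBrew job fits into such a workflow, the snippet below starts an existing recipe job and polls it to completion with boto3, which is roughly what a Step Functions task would do on your behalf. The job name is a placeholder, and the post's actual state machine may differ.

```python
import time
import boto3

databrew = boto3.client("databrew")

# Start an existing DataBrew recipe job (the name is a placeholder) ...
run_id = databrew.start_job_run(Name="feature-engineering-job")["RunId"]

# ... and poll until the run reaches a terminal state.
while True:
    run = databrew.describe_job_run(Name="feature-engineering-job", RunId=run_id)
    if run["State"] not in ("STARTING", "RUNNING"):
        break
    time.sleep(30)

print(run["State"])  # e.g. SUCCEEDED or FAILED
```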
Orchestrate an HAQM EMR on HAQM EKS Spark job with AWS Step Functions
At re:Invent 2020, we announced the general availability of HAQM EMR on HAQM EKS, a new deployment option for HAQM EMR that allows you to automate the provisioning and management of open-source big data frameworks on HAQM Elastic Kubernetes Service (HAQM EKS). With HAQM EMR on EKS, you can now run Spark applications alongside other […]
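A Step Functions task can submit such a Spark job by calling the same StartJobRun API that the sketch below invokes directly with boto3. The virtual cluster ID, execution role ARN, and script location are placeholders, not values from the post.

```python
import boto3

emr_containers = boto3.client("emr-containers")

# Submit a PySpark job to an EMR on EKS virtual cluster.
response = emr_containers.start_job_run(
    name="sample-spark-job",                        # placeholder job name
    virtualClusterId="abcdefghijklmnop0123456789",  # placeholder virtual cluster ID
    executionRoleArn="arn:aws:iam::123456789012:role/EMRContainersJobRole",  # placeholder
    releaseLabel="emr-6.2.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://example-bucket/scripts/etl_job.py",  # placeholder PySpark script
            "sparkSubmitParameters": "--conf spark.executor.instances=2",
        }
    },
)
print(response["id"])  # the job run ID
```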
Building complex workflows with HAQM MWAA, AWS Step Functions, AWS Glue, and HAQM EMR
HAQM Managed Workflows for Apache Airflow (HAQM MWAA) is a fully managed service that makes it easy to run open-source versions of Apache Airflow on AWS and build workflows to run your extract, transform, and load (ETL) jobs and data pipelines. You can use AWS Step Functions as a serverless function orchestrator to build scalable […]
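As a minimal sketch of the hand-off between the two orchestrators, the DAG below uses the HAQM provider package for Apache Airflow to start a Step Functions execution from HAQM MWAA. The state machine ARN is a placeholder, and the post's actual DAG is more involved.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.step_function import (
    StepFunctionStartExecutionOperator,
)

with DAG(
    dag_id="trigger_step_functions_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Kick off an existing Step Functions state machine; the ARN is a placeholder.
    start_etl = StepFunctionStartExecutionOperator(
        task_id="start_etl_state_machine",
        state_machine_arn="arn:aws:states:us-east-1:123456789012:stateMachine:etl-pipeline",
        state_machine_input={"triggered_by": "mwaa"},
    )
```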
Multi-tenant processing pipelines with AWS DMS, AWS Step Functions, and Apache Hudi on HAQM EMR
Large enterprises often deliver software offerings to multiple customers by giving each customer a dedicated and isolated environment (a software offering composed of multiple single-tenant environments). Because the data is spread across various independent systems, large enterprises are looking for ways to simplify data processing pipelines. To address this, you can create data lakes to bring […]
Automating EMR workloads using AWS Step Functions
HAQM EMR allows you to process vast amounts of data quickly and cost-effectively at scale. Using open-source tools such as Apache Spark, Apache Hive, and Presto, and coupled with the scalable storage of HAQM Simple Storage Service (HAQM S3), HAQM EMR gives analytical teams the engines and elasticity to run petabyte-scale analysis for a fraction […]
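A common way to automate such a workload is a state machine that submits a Spark step to an existing cluster through the Step Functions EMR service integration and waits for it to finish. The sketch below creates such a state machine with boto3; the cluster ID, role ARN, and script path are placeholders rather than values from the post.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# One Task state that adds a Spark step to an existing EMR cluster and waits
# for it to complete (the .sync suffix makes the task block until the step ends).
definition = {
    "StartAt": "SubmitSparkStep",
    "States": {
        "SubmitSparkStep": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId": "j-EXAMPLECLUSTER",  # placeholder cluster ID
                "Step": {
                    "Name": "daily-spark-job",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": [
                            "spark-submit", "--deploy-mode", "cluster",
                            "s3://example-bucket/jobs/daily_job.py",  # placeholder script
                        ],
                    },
                },
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="emr-spark-step",                                          # placeholder
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEMRRole",  # placeholder
    definition=json.dumps(definition),
)
```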