AWS Big Data Blog
Category: Serverless
Event-driven refresh of SPICE datasets in HAQM QuickSight
Businesses are increasingly harnessing data to improve business outcomes. To enable this transformation to a data-driven business, customers are bringing together data from structured and unstructured sources into a data lake. Then they use business intelligence (BI) tools, such as HAQM QuickSight, to unlock insights from this data. To provide fast access to datasets, […]
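The full post builds the event-driven refresh with serverless components; as a rough, hedged sketch of the core idea rather than the post's exact solution, the Python snippet below shows an AWS Lambda handler that starts a SPICE ingestion through the QuickSight CreateIngestion API whenever an upstream event fires. The account ID, dataset ID, and event wiring are placeholders.

```python
# Hypothetical Lambda handler: kick off a SPICE refresh when an upstream event
# (for example, an S3 upload or a Glue job success notification) invokes it.
# The dataset ID and account ID below are placeholders, not values from the post.
import time
import boto3

quicksight = boto3.client("quicksight")

DATASET_ID = "my-spice-dataset-id"   # placeholder QuickSight dataset ID
ACCOUNT_ID = "111122223333"          # placeholder AWS account ID


def lambda_handler(event, context):
    # Use a timestamp-based ingestion ID so each refresh request is unique.
    ingestion_id = f"event-driven-refresh-{int(time.time())}"
    response = quicksight.create_ingestion(
        AwsAccountId=ACCOUNT_ID,
        DataSetId=DATASET_ID,
        IngestionId=ingestion_id,
    )
    # CreateIngestion returns immediately; the refresh runs asynchronously in QuickSight.
    return {"ingestionId": ingestion_id, "status": response["IngestionStatus"]}
```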
Making ETL easier with AWS Glue Studio
AWS Glue Studio is an easy-to-use graphical interface that speeds up the process of authoring, running, and monitoring extract, transform, and load (ETL) jobs in AWS Glue. The visual interface allows those who don’t know Apache Spark to design jobs without coding experience and accelerates the process for those who do. AWS Glue Studio was […]
Building an AWS Glue ETL pipeline locally without an AWS account
This blog was last reviewed in May 2022. If you’re new to AWS Glue and looking to understand its transformation capabilities without incurring an added expense, or if you’re simply wondering whether AWS Glue ETL is the right tool for your use case and want a holistic view of AWS Glue ETL functions, then please continue […]
Developing AWS Glue ETL jobs locally using a container
April 2024: This post was reviewed for accuracy. AWS Glue 1.0 is deprecated; refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container for the latest solution. March 2022: Newer versions of the product are now available for use with this post. AWS Glue is a fully managed extract, […]
How Aruba Networks built a cost analysis solution using AWS Glue, HAQM Redshift, and HAQM QuickSight
February 2023 Update: Console access to the AWS Data Pipeline service will be removed on April 30, 2023. On this date, you will no longer be able to access AWS Data Pipeline through the console. You will continue to have access to AWS Data Pipeline through the command line interface and API. Please note that […]
Optimize Python ETL by extending Pandas with AWS Data Wrangler
April 2024: This post was reviewed for accuracy. Developing extract, transform, and load (ETL) data pipelines is one of the most time-consuming steps to keep data lakes, data warehouses, and databases up to date and ready to provide business insights. You can categorize these pipelines into distributed and non-distributed, and the choice of one or […]
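The post covers extending Pandas with AWS Data Wrangler (now the AWS SDK for pandas); as a hedged sketch of that style of non-distributed workflow rather than the post's own code, the snippet below reads raw CSV data from S3 into a Pandas DataFrame, applies an in-memory transform, and writes partitioned Parquet back to S3 registered in the AWS Glue Data Catalog. Bucket, database, and table names are assumptions.

```python
# Minimal awswrangler (AWS SDK for pandas) sketch: read raw CSV from S3 into a
# Pandas DataFrame, transform it, and write it back as partitioned Parquet
# registered in the AWS Glue Data Catalog. All names below are placeholders.
import awswrangler as wr
import pandas as pd

# Read raw data from S3 directly into Pandas.
df = wr.s3.read_csv(path="s3://my-raw-bucket/sales/2024/")  # placeholder path

# Non-distributed, in-memory transform with plain Pandas.
df["order_date"] = pd.to_datetime(df["order_date"])
df["year"] = df["order_date"].dt.year

# Write curated Parquet and register/update the table in the Glue Data Catalog.
wr.s3.to_parquet(
    df=df,
    path="s3://my-curated-bucket/sales/",   # placeholder path
    dataset=True,
    database="analytics",                   # placeholder Glue database
    table="sales_curated",                  # placeholder Glue table
    partition_cols=["year"],
)
```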
Stream Twitter data into HAQM Redshift using HAQM MSK and AWS Glue streaming ETL
This post demonstrates how customers, system integrator (SI) partners, and developers can use the serverless streaming ETL capabilities of AWS Glue with HAQM Managed Streaming for Kafka (HAQM MSK) to stream data to a data warehouse such as HAQM Redshift. We also show you how to view Twitter streaming data on HAQM QuickSight via HAQM Redshift.
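As a hedged sketch of this pattern rather than the post's actual script, the snippet below shows a Glue streaming job that reads an HAQM MSK topic registered in the Glue Data Catalog and writes each micro-batch to HAQM Redshift through a Glue connection. The database, table, connection, and S3 path names are placeholders.

```python
# Hedged sketch of a Glue streaming ETL job: read a Kafka (HAQM MSK) topic that is
# registered in the Glue Data Catalog and micro-batch the records into HAQM Redshift.
# Catalog names, the Redshift connection, and S3 paths are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Streaming source: an MSK topic cataloged as a table in the Glue Data Catalog.
stream_df = glue_context.create_data_frame.from_catalog(
    database="streaming_db",            # placeholder database
    table_name="tweets_topic",          # placeholder table
    additional_options={"startingOffsets": "latest"},
)


def process_batch(batch_df, batch_id):
    # Convert each micro-batch to a DynamicFrame and append it to Redshift.
    if batch_df.count() > 0:
        dyf = DynamicFrame.fromDF(batch_df, glue_context, "batch")
        glue_context.write_dynamic_frame.from_jdbc_conf(
            frame=dyf,
            catalog_connection="redshift-connection",   # placeholder Glue connection
            connection_options={"dbtable": "public.tweets", "database": "dev"},
            redshift_tmp_dir="s3://my-temp-bucket/redshift-staging/",  # placeholder
        )


glue_context.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": "s3://my-temp-bucket/checkpoints/",  # placeholder
    },
)
job.commit()
```

forEachBatch turns the stream into a series of small batch writes, which is what allows a batch-oriented sink like HAQM Redshift to be loaded continuously from Kafka.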
How Wind Mobility built a serverless data architecture
We parse through millions of scooter and user events generated daily (over 300 events per second) to extract actionable insight. We selected AWS Glue to perform this task. Our primary ETL job reads the newly added raw event data from HAQM S3, processes it using Apache Spark, and writes the results to our HAQM Redshift data warehouse. AWS Glue plays a critical role in our ability to scale on demand. After careful evaluation and testing, we concluded that AWS Glue ETL jobs meet all our needs and free us from procuring and managing infrastructure.
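As an illustrative, hedged sketch of a job with this shape (not Wind Mobility's actual code), the snippet below reads newly arrived JSON events from S3, reshapes them with a Glue mapping, and writes the result to HAQM Redshift; all paths, columns, and the connection name are assumptions.

```python
# Hedged sketch of a Glue batch ETL job of the shape described above: read raw
# JSON events from S3, transform them with Spark/Glue, and write the result to
# HAQM Redshift. Paths, columns, and connection names are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw scooter/user events from the landing prefix in S3. With job bookmarks
# enabled, the transformation_ctx lets reruns pick up only newly added files.
raw_events = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-raw-bucket/events/"]},  # placeholder
    format="json",
    transformation_ctx="raw_events",
)

# Keep and rename only the fields the warehouse model needs (placeholder columns).
mapped = ApplyMapping.apply(
    frame=raw_events,
    mappings=[
        ("event_id", "string", "event_id", "string"),
        ("scooter_id", "string", "scooter_id", "string"),
        ("event_type", "string", "event_type", "string"),
        ("ts", "string", "event_time", "timestamp"),
    ],
)

# Append the processed events to a Redshift table through a Glue connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-connection",   # placeholder Glue connection
    connection_options={"dbtable": "public.events", "database": "analytics"},
    redshift_tmp_dir="s3://my-temp-bucket/redshift-staging/",  # placeholder
)
job.commit()
```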
Process data with varying data ingestion frequencies using AWS Glue job bookmarks
We often have data processing requirements in which we need to merge multiple datasets with varying data ingestion frequencies. Some of these datasets are ingested one time in full, received infrequently, and always used in their entirety, whereas other datasets are incremental, received at certain intervals, and joined with the full datasets to generate output. To address this requirement, this post demonstrates how to build an extract, transform, and load (ETL) pipeline using AWS Glue.
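As a hedged sketch of the pattern rather than the post's exact pipeline, the snippet below reads the incremental dataset with a transformation_ctx so that job bookmarks pick up only newly ingested data, joins it with a full reference dataset, and writes the merged output. Bookmarks are enabled with the --job-bookmark-option job-bookmark-enable job parameter, and all catalog and path names are placeholders.

```python
# Hedged sketch of merging a bookmarked incremental dataset with a full dataset.
# Job bookmarks track what has already been processed per transformation_ctx and
# are turned on with the job argument --job-bookmark-option job-bookmark-enable.
# Database, table, and path names are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Incremental dataset: the transformation_ctx lets the bookmark skip data
# that earlier runs of this job have already processed.
incremental = glue_context.create_dynamic_frame.from_catalog(
    database="etl_db",                     # placeholder database
    table_name="orders_incremental",       # placeholder table
    transformation_ctx="orders_incremental",
)

# Full dataset: ingested once and always read in its entirety (no bookmark needed).
full = glue_context.create_dynamic_frame.from_catalog(
    database="etl_db",                     # placeholder database
    table_name="products_full",            # placeholder table
)

# Join the new increments against the full reference data and write the output.
merged = Join.apply(incremental, full, "product_id", "product_id")
glue_context.write_dynamic_frame.from_options(
    frame=merged,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/merged/"},  # placeholder
    format="parquet",
)
job.commit()
```

Note that job.commit() is what persists the bookmark state; without it, the next run would reprocess the same incremental data.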
Extend your HAQM Redshift Data Warehouse to your Data Lake
HAQM Redshift is a fast, fully managed, cloud-native data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence tools. Many companies today are using HAQM Redshift to analyze data and perform various transformations on the data. However, as data continues to grow and become […]