AWS Big Data Blog
Category: Serverless
Performing data transformations using Snowflake and AWS Glue
May 2022: This post was reviewed for accuracy. In the connected world, data is getting generated from many different sources in a wide variety of data formats. Enterprises are looking for tools to ingest from these evolving data sources as well as programmatically customize the ingested data to meet their data analytics needs. You also need […]
Building AWS Glue Spark ETL jobs by bringing your own JDBC drivers for Amazon RDS
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. AWS Glue has native connectors to connect to supported data sources either on AWS or elsewhere using JDBC drivers. Additionally, AWS Glue now enables you to bring your own JDBC drivers […]
Developing, testing, and deploying custom connectors for your data stores with AWS Glue
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue already integrates with various popular data stores such as Amazon Redshift, RDS, MongoDB, and Amazon S3. Organizations continue to evolve and use a variety of data stores that best fit […]
Migrating data from Google BigQuery to Amazon S3 using AWS Glue custom connectors
July 2022: This post was reviewed and updated to include a new data point on the effective runtime with the latest version, explaining Glue 3.0 and autoscaling. October 2024: In Glue 4.0 we have introduced a native and managed connector for Google BigQuery. You can follow the instructions in the blog post Unlock scalable analytics with […]
Building AWS Glue Spark ETL jobs using Amazon DocumentDB (with MongoDB compatibility) and MongoDB
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. AWS Glue has native connectors to connect to supported data sources on AWS or elsewhere using JDBC drivers. Additionally, AWS Glue now supports reading and writing to Amazon DocumentDB (with MongoDB […]
Writing to Apache Hudi tables using AWS Glue Custom Connector
December 2022: This post was reviewed for accuracy. In today’s world, most organizations have to tackle the 3 V’s of variety, volume and velocity of big data. In this blog post, we talk about dealing with the variety and volume aspects of big data. The challenge of dealing with the variety involves processing data from […]
Validate, evolve, and control schemas in Amazon MSK and Amazon Kinesis Data Streams with AWS Glue Schema Registry
August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. Data streaming technologies like Apache Kafka and Amazon Kinesis Data Streams capture and distribute data generated by thousands or millions of applications, websites, or machines. These technologies […]
Building complex workflows with Amazon MWAA, AWS Step Functions, AWS Glue, and Amazon EMR
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a fully managed service that makes it easy to run open-source versions of Apache Airflow on AWS and build workflows to run your extract, transform, and load (ETL) jobs and data pipelines. You can use AWS Step Functions as a serverless function orchestrator to build scalable […]
Estimating scoring probabilities by preparing soccer matches data with AWS Glue DataBrew
In soccer (or football outside of the US), players decide to take shots when they think they can score. But how do they make that determination vs. when to pass or dribble? In a fraction of a second, in motion, while chased from multiple directions by other professional athletes, they think about their distance from […]
Orchestrating an AWS Glue DataBrew job and Amazon Athena query with AWS Step Functions
As data volumes grow across the industry, big data analytics is becoming a common requirement in data analytics and machine learning (ML) use cases. Also, as we start building complex data engineering or data analytics pipelines, we look for a simpler orchestration mechanism with graphical user interface-based ETL (extract, transform, load) tools. Recently, AWS […]