AWS Big Data Blog
Category: Amazon EMR
Dream11’s journey to building their Data Highway on AWS
This is a guest post co-authored by Pradip Thoke of Dream11. In their own words, “Dream11, the flagship brand of Dream Sports, is India’s biggest fantasy sports platform, with more than 100 million users. We have infused the latest technologies of analytics, machine learning, social networks, and media technologies to enhance our users’ experience. Dream11 […]
Amazon EMR Studio (Preview): A new notebook-first IDE experience with Amazon EMR
We’re happy to announce Amazon EMR Studio (Preview), an integrated development environment (IDE) that makes it easy for data scientists and data engineers to develop, visualize, and debug applications written in R, Python, Scala, and PySpark. EMR Studio provides fully managed Jupyter notebooks and tools like Spark UI and YARN Timeline Service to simplify debugging. […]
How the Allen Institute uses Amazon EMR and AWS Step Functions to process extremely wide transcriptomic datasets
This is a guest post by Gautham Acharya, Software Engineer III at the Allen Institute for Brain Science, in partnership with AWS Data Lab Solutions Architect Ranjit Rajan, and AWS Sr. Enterprise Account Executive Arif Khan. The human brain is one of the most complex structures in the universe. Billions of neurons and trillions of […]
Amazon EMR now provides up to 30% lower cost and up to 15% improved performance for Spark workloads on Graviton2-based instances
Amazon EMR now supports M6g, C6g and R6g instances with Amazon EMR versions 6.1.0, 5.31.0 and later. These instances are powered by AWS Graviton2 processors that are custom designed by AWS using 64-bit Arm Neoverse cores to deliver the best price performance for cloud workloads running in Amazon Elastic Compute Cloud (Amazon EC2). On Graviton2 […]
Data preprocessing for machine learning on Amazon EMR made easy with AWS Glue DataBrew
The machine learning (ML) lifecycle consists of several key phases: data collection, data preparation, feature engineering, model training, model evaluation, and model deployment. The data preparation and feature engineering phases ensure an ML model is given high-quality data that is relevant to the model’s purpose. Because most raw datasets require multiple cleaning steps (such as […]
Accessing and visualizing external tables in an Apache Hive metastore with Amazon Athena and Amazon QuickSight
Many organizations have an Apache Hive metastore that stores the schemas for their data lake. You can use Amazon Athena due to its serverless nature; Athena makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. You may also want to reliably query the rich datasets in the lake, with their schemas […]
Orchestrating analytics jobs by running Amazon EMR Notebooks programmatically
Amazon EMR is a big data service offered by AWS to run Apache Spark and other open-source applications on AWS in a cost-effective manner. Amazon EMR Notebooks is a managed environment based on Jupyter Notebook that allows data scientists, analysts, and developers to prepare and visualize data, collaborate with peers, build applications, and perform interactive […]
How the ZS COVID-19 Intelligence Engine helps Pharma & Med device manufacturers understand local healthcare needs & gaps at scale
This post is co-written by Parijat Sharma: Principal, Strategy & Transformation, Wenhao Xia: Manager, Data Science, Vineeth Sandadi: Manager, Business Consulting from ZS Associates, Inc, Arianna Tousi: Strategy, Insights and Planning Consultant from ZS, Gopi Vikranth: Associate Principal from ZS. In their own words, “We’re passionately committed to helping our clients and their customers thrive, […]
Optimizing Amazon EMR for resilience and cost with capacity-optimized Spot Instances
Amazon EMR now supports the capacity-optimized allocation strategy for Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances for launching Spot Instances from the most available Spot Instance capacity pools by analyzing capacity metrics in real time. You can now specify up to 15 instance types in your EMR task instance fleet configuration. This provides Amazon […]
Apply record level changes from relational databases to Amazon S3 data lake using Apache Hudi on Amazon EMR and AWS Database Migration Service
Data lakes give organizations the ability to harness data from multiple sources in less time. Users across different roles are now empowered to collaborate and analyze data in different ways, leading to better, faster decision-making. Amazon Simple Storage Service (Amazon S3) is the highly performant object storage service for structured and unstructured data and the […]