AWS Machine Learning Blog

Category: HAQM SageMaker Data Wrangler

Access Snowflake data using OAuth-based authentication in HAQM SageMaker Data Wrangler

In this post, we show how to configure a new OAuth-based authentication feature for using Snowflake in HAQM SageMaker Data Wrangler. Snowflake is a cloud data platform that provides data solutions for data warehousing to data science. Snowflake is an AWS Partner with multiple AWS accreditations, including AWS competencies in machine learning (ML), retail, and […]

Build custom code libraries for your HAQM SageMaker Data Wrangler Flows using AWS Code Commit

As organizations grow in size and scale, the complexities of running workloads increase, and the need to develop and operationalize processes and workflows becomes critical. Therefore, organizations have adopted technology best practices, including microservice architecture, MLOps, DevOps, and more, to improve delivery time, reduce defects, and increase employee productivity. This post introduces a best practice […]

Accelerate time to insight with HAQM SageMaker Data Wrangler and the power of Apache Hive

HAQM SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in HAQM SageMaker Studio. Data Wrangler enables you to access data from a wide variety of popular sources (HAQM S3, HAQM Athena, HAQM Redshift, HAQM EMR and Snowflake) and over 40 other third-party sources. […]

Achieve rapid time-to-value business outcomes with faster ML model training using HAQM SageMaker Canvas

Machine learning (ML) can help companies make better business decisions through advanced analytics. Companies across industries apply ML to use cases such as predicting customer churn, demand forecasting, credit scoring, predicting late shipments, and improving manufacturing quality. In this blog post, we’ll look at how HAQM SageMaker Canvas delivers faster and more accurate model training times enabling […]

Extract non-PHI data from HAQM HealthLake, reduce complexity, and increase cost efficiency with HAQM Athena and HAQM SageMaker Canvas

In today’s highly competitive market, performing data analytics using machine learning (ML) models has become a necessity for organizations. It enables them to unlock the value of their data, identify trends, patterns, and predictions, and differentiate themselves from their competitors. For example, in the healthcare industry, ML-driven analytics can be used for diagnostic assistance and […]

Introducing HAQM SageMaker Data Wrangler’s new embedded visualizations

Manually inspecting data quality and cleaning data is a painful and time-consuming process that can take a huge chunk of a data scientist’s time on a project. According to a 2020 survey of data scientists conducted by Anaconda, data scientists spend approximately 66% of their time on data preparation and analysis tasks, including loading (19%), cleaning (26%), […]

Prepare data from HAQM EMR for machine learning using HAQM SageMaker Data Wrangler

Data preparation is a principal component of machine learning (ML) pipelines. In fact, it is estimated that data professionals spend about 80 percent of their time on data preparation. In this intensive competitive market, teams want to analyze data and extract more meaningful insights quickly. Customers are adopting more efficient and visual ways to build […]

Interactive data prep widget for notebooks powered by HAQM SageMaker Data Wrangler

According to a 2020 survey of data scientists conducted by Anaconda, data preparation is one of the critical steps in machine learning (ML) and data analytics workflows, and often very time consuming for data scientists. Data scientists spend about 66% of their time on data preparation and analysis tasks, including loading (19%), cleaning (26%), and […]

How to schedule jobs and parameterize your datasets in HAQM SageMaker Data Wrangler

Data is transforming every field and every business. However, with data growing faster than most companies can keep track of, collecting data and getting value out of that data is a challenging thing to do. A modern data strategy can help you create better business outcomes with data. AWS provides the most complete set of […]

Detect multicollinearity, target leakage, and feature correlation with HAQM SageMaker Data Wrangler

In machine learning (ML), data quality has direct impact on model quality. This is why data scientists and data engineers spend significant amount of time perfecting training datasets. Nevertheless, no dataset is perfect—there are trade-offs to the preprocessing techniques such as oversampling, normalization, and imputation. Also, mistakes and errors could creep in at various stages […]