AWS Machine Learning Blog

Category: Amazon SageMaker Data Wrangler

Use Amazon SageMaker Data Wrangler in Amazon SageMaker Studio with a default lifecycle configuration

If you use the default lifecycle configuration for your domain or user profile in Amazon SageMaker Studio and use Amazon SageMaker Data Wrangler for data preparation, then this post is for you. In this post, we show how you can create a Data Wrangler flow and use it for data preparation in a Studio environment […]

Import data from cross-account Amazon Redshift in Amazon SageMaker Data Wrangler for exploratory data analysis and data preparation

Organizations moving towards a data-driven culture embrace the use of data and machine learning (ML) in decision-making. To make ML-based decisions from data, you need your data available, accessible, clean, and in the right format to train ML models. Organizations with a multi-account architecture want to avoid situations where they must extract data from one […]

Prepare data faster with PySpark and Altair code snippets in Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler is a purpose-built data aggregation and preparation tool for machine learning (ML). It allows you to use a visual interface to access data and perform exploratory data analysis (EDA) and feature engineering. The EDA feature comes with built-in data analysis capabilities for charts (such as scatter plot or histogram) and time-saving […]

Unified data preparation and model training with Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot – Part 1

September 2023: This post was reviewed and updated for accuracy. Data fuels machine learning (ML); the quality of data has a direct impact on the quality of ML models. Therefore, improving data quality and employing the right feature engineering techniques are critical to creating accurate ML models. ML practitioners often tediously iterate on feature engineering, […]

Easily create and store features in Amazon SageMaker without code

Data scientists and machine learning (ML) engineers often prepare their data before building ML models. Data preparation typically includes data preprocessing and feature engineering. You preprocess data by transforming data into the right shape and quality for training, and you engineer features by selecting, transforming, and creating variables when building a predictive model. Amazon SageMaker […]

Create train, test, and validation splits on your data for machine learning with Amazon SageMaker Data Wrangler

In this post, we talk about how to split a machine learning (ML) dataset into train, test, and validation datasets with Amazon SageMaker Data Wrangler so you can easily split your datasets with minimal to no code. Data used for ML is typically split into the following datasets: Training – Used to train an algorithm […]

Build a risk management machine learning workflow on Amazon SageMaker with no code

Since the global financial crisis, risk management has taken a major role in shaping decision-making for banks, including predicting loan status for potential customers. This is often a data-intensive exercise that requires machine learning (ML). However, not all organizations have the data science resources and expertise to build a risk management ML workflow. Amazon SageMaker […]

Process larger and wider datasets with Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler reduces the time to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio. Data Wrangler can simplify your data preparation and feature engineering processes and help you with data selection, cleaning, exploration, and visualization. Data Wrangler has over 300 built-in transforms written in PySpark, […]

Pandas user-defined functions are now available in Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler reduces the time to aggregate and prepare data for machine learning (ML) from weeks to minutes. With Data Wrangler, you can select and query data with just a few clicks, quickly transform data with over 300 built-in data transformations, and understand your data with built-in visualizations without writing any code. Additionally, […]

Create random and stratified samples of data with Amazon SageMaker Data Wrangler

In this post, we walk you through two sampling techniques in Amazon SageMaker Data Wrangler so you can quickly create processing workflows for your data. We cover both random sampling and stratified sampling techniques to help you sample your data based on your specific requirements. Data Wrangler reduces the time it takes to aggregate and […]