AWS Machine Learning Blog

Using R with HAQM SageMaker

July 2022: This post was reviewed and updated for relevancy and accuracy, with an updated AWS CloudFormation template.

December 2020: This post was updated with the changes required for HAQM SageMaker SDK v2.

This blog post describes how to train, deploy, and retrieve predictions from a machine learning (ML) model using HAQM SageMaker and R. The model predicts abalone age, as measured by the number of rings in the shell. We use the reticulate package as an R interface to the HAQM SageMaker Python SDK to make API calls to HAQM SageMaker. The reticulate package translates between R and Python objects, and HAQM SageMaker provides a serverless data science environment to train and deploy ML models at scale.
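As a quick illustration of that translation, separate from the walkthrough that follows, reticulate converts between the two languages' native types automatically. This minimal sketch shows a named R list becoming a Python dict and back:

library(reticulate)
# A named R list converts to a Python dict
py_obj <- r_to_py(list(rings = 10L, sex = 'F'))
class(py_obj)
# ...and converts back to a named R list
r_obj <- py_to_r(py_obj)
str(r_obj)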

To follow along with this blog post, you should have a basic understanding of R and be familiar with the following tidyverse packages: dplyr, readr, stringr, and ggplot2. You run the code in RStudio, an integrated development environment (IDE) for working with R; specifically, we use the fully managed RStudio on HAQM SageMaker.

Launching AWS CloudFormation

Please refer to Get started with RStudio on HAQM SageMaker, which details the steps to create a SageMaker domain with RStudio. Use the following AWS CloudFormation stack to create a domain with a user profile:

Launching this stack creates the following resources:

  • A SageMaker domain with RStudio
  • A SageMaker RStudio user profile
  • An IAM service role for SageMaker RStudio domain execution
  • An IAM service role for SageMaker RStudio user profile execution

After you launch the stack, follow these steps to configure and connect to RStudio:

  1. On the Select template page, choose Next.
  2. On the Specify stack details page, in the Stack name section, enter a name.
  3. Leave Execution Role Arn blank unless you already have the required role created.
  4. For the Vpc Id parameter, select your VPC.
  5. For the Subnet Id(s) parameter, select your subnets.
  6. For the App Network Access Type parameter, select either PublicInternetOnly or VpcOnly.
  7. For the Security Group(s) parameter, select your security groups.
  8. Leave Domain Execution Role Arn blank unless you already have the required role created.
  9. Leave the remaining required parameters as is.
  10. There are also optional parameters, which this walkthrough doesn't use:
    1. Customer managed CMK
    2. RStudio Connect URL
    3. RStudio package manager URL
    4. Three RStudio custom images
  11. At the bottom of the Specify stack details page, choose Next.
  12. On the Configure stack options page, choose Next.
  13. On the Review page, select the I acknowledge that AWS CloudFormation might create IAM resources with custom names check box, and then choose Create stack.

Once the stack status is CREATE_COMPLETE, navigate to the HAQM SageMaker Control Panel and launch the RStudio app for rstudio-user.

On the RStudio Workbench launcher page, start a new R session using the RSession Base image.

Reticulating the HAQM SageMaker Python SDK

First, load the reticulate library and import the sagemaker Python module. Once the module is loaded, use the $ notation in R instead of the . notation in Python to view the available classes.

Use the Session class, as shown in the following image.

The Session class manages interactions with the HAQM SageMaker API and with the other boto3 resources that SageMaker uses, such as HAQM S3.

To view the objects available to the Session class, use the $ notation, as shown in the following image.
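You can also list those members programmatically from R. A minimal sketch using reticulate's py_list_attributes() helper:

library(reticulate)
sagemaker <- import('sagemaker')
# List the classes and functions exposed by the sagemaker module
py_list_attributes(sagemaker)
# Inspect the members of the Session class
py_list_attributes(sagemaker$Session)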

Creating and accessing the data storage

Let’s create an HAQM Simple Storage Service (HAQM S3) bucket for your data. You will need the IAM role that allows HAQM SageMaker to access the bucket.

Specify the HAQM S3 bucket to store the training data, the model’s binary file, and output from the training job:

library(reticulate)
sagemaker <- import('sagemaker')
# Create a SageMaker session and a default S3 bucket for this project
session <- sagemaker$Session()
bucket <- session$default_bucket()
# Retrieve the IAM execution role attached to this RStudio session
role_arn <- sagemaker$get_execution_role()

Note:

  • You do not need to install Miniconda. Type n when you are prompted.
  • The default_bucket function creates a unique HAQM S3 bucket with the following name: sagemaker-<aws-region-name>-<aws-account-number>.

Downloading and processing the dataset

The model uses the abalone dataset from the UCI Machine Learning Repository. First, download the data and start the exploratory data analysis. Use tidyverse packages to read the data, plot the data, and transform the data into the format that the HAQM SageMaker algorithm expects:

library(readr)
# Download the dataset from the public SageMaker sample files bucket
data_file <- 's3://sagemaker-sample-files/datasets/tabular/uci_abalone/abalone.csv'
work_dir <- getwd()
system(paste('aws s3 cp', data_file, work_dir))
# The file has no header row, so supply the column names explicitly
column_names <- c('sex', 'length', 'diameter', 'height', 'whole_weight', 'shucked_weight', 'viscera_weight', 'shell_weight', 'rings')
abalone <- read_csv(file = file.path(work_dir, 'abalone.csv'), col_names = column_names)
head(abalone)

The preceding image shows that sex should be a factor data type but is currently a character data type (F is female, M is male, and I is infant). Change sex to a factor and view a statistical summary of the dataset:

abalone$sex <- as.factor(abalone$sex)
summary(abalone)

The summary shows that the minimum value for height is 0.
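If you prefer a tabular check before plotting, a quick dplyr filter surfaces the zero-height rows (a minimal sketch; dplyr is loaded again later in the walkthrough):

library(dplyr)
# Show the observations with a height of 0
abalone %>% filter(height == 0)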

Visually explore which abalones have height equal to 0 by plotting the relationship between rings and height for each value of sex:

library(ggplot2)
ggplot(abalone, aes(x = height, y = rings, color = sex)) + geom_point() + geom_jitter()

The plot shows multiple outliers: two infant abalones with a height of 0 and a few female and male abalones with greater heights than the rest. Let’s filter out the two infant abalones with a height of 0.

library(dplyr)
abalone <- abalone %>%
  filter(height != 0)

Preparing the dataset for model training

The model needs three datasets: one each for training, testing, and validation. First, convert sex into a dummy variable and move the target, rings, to the first column. HAQM SageMaker algorithms require the target to be in the first column of the dataset.

# Encode sex as three integer indicator (dummy) columns, then drop the original
abalone <- abalone %>%
  mutate(female = as.integer(ifelse(sex == 'F', 1, 0)),
         male = as.integer(ifelse(sex == 'M', 1, 0)),
         infant = as.integer(ifelse(sex == 'I', 1, 0))) %>%
  select(-sex)
# Move the target, rings, to the first column
abalone <- abalone %>%
  select(rings:infant, length:shell_weight)
head(abalone)

This code produces a dataset like the following:

Next, sample 70% of the data for training the ML algorithm. Split the remaining 30% into two halves, one for testing and one for validation:

# Randomly sample 70% of the rows for training
abalone_train <- abalone %>%
  sample_frac(size = 0.7)
# Keep the remaining 30%, then split it evenly into test and validation sets
abalone <- anti_join(abalone, abalone_train)
abalone_test <- abalone %>%
  sample_frac(size = 0.5)
abalone_valid <- anti_join(abalone, abalone_test)
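Because sample_frac draws rows at random, the exact membership of each set differs from run to run. If you want to confirm the split proportions, a quick check is:

# Roughly 70/15/15 of the filtered dataset
nrow(abalone_train)
nrow(abalone_test)
nrow(abalone_valid)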

Upload the training and validation data to HAQM S3 so that you can train the model. First, write the training and validation datasets to the local filesystem in .csv format:

write_csv(abalone_train, 'abalone_train.csv', col_names = FALSE)
write_csv(abalone_valid, 'abalone_valid.csv', col_names = FALSE)

Second, upload the two datasets to the HAQM S3 bucket under the data key prefix:

s3_train <- session$upload_data(path = 'abalone_train.csv', 
                                bucket = bucket, 
                                key_prefix = 'data')
s3_valid <- session$upload_data(path = 'abalone_valid.csv', 
                                bucket = bucket, 
                                key_prefix = 'data')

Finally, define the HAQM S3 input types for the HAQM SageMaker algorithm:

s3_train_input <- sagemaker$TrainingInput(s3_data = s3_train,
                                          content_type = 'csv')
s3_valid_input <- sagemaker$TrainingInput(s3_data = s3_valid,
                                          content_type = 'csv')

Training the model

HAQM SageMaker algorithms are available via Docker containers. To train an XGBoost model, specify the training container image in HAQM Elastic Container Registry (HAQM ECR) for your AWS Region.

container <- sagemaker$image_uris$retrieve(framework = 'xgboost',
                                           region = session$boto_region_name,
                                           version = 'latest')

Define an HAQM SageMaker Estimator, which can train any supplied algorithm that has been containerized with Docker. When creating the Estimator, use the following arguments:

  • image_uri – The container image to use for training
  • role – The HAQM SageMaker service role that you created
  • instance_count – The number of HAQM EC2 instances to use for training
  • instance_type – The type of HAQM EC2 instance to use for training
  • volume_size – The size in GB of the HAQM Elastic Block Store (HAQM EBS) volume to use for storing input data during training
  • max_run – The timeout in seconds for training
  • input_mode – The input mode that the algorithm supports
  • output_path – The HAQM S3 location for saving the training results (model artifacts and output files)
  • output_kms_key – The AWS Key Management Service (AWS KMS) key for encrypting the training output
  • base_job_name – The prefix for the name of the training job
  • sagemaker_session – The Session object that manages interactions with the HAQM SageMaker API

s3_output <- paste0('s3://', bucket, '/output')
estimator <- sagemaker$estimator$Estimator(image_uri = container,
                                           role = role_arn,
                                           instance_count = 1L,
                                           instance_type = 'ml.m5.large',
                                           volume_size = 30L,
                                           max_run = 3600L,
                                           input_mode = 'File',
                                           output_path = s3_output,
                                           output_kms_key = NULL,
                                           base_job_name = NULL,
                                           sagemaker_session = session)

Note

The equivalent to None in Python is NULL in R.

Specify the XGBoost hyperparameters and fit the model. Set the number of rounds for training to 100, which is the default value when using the XGBoost library outside of HAQM SageMaker. Also specify the input data and a job name based on the current timestamp:

estimator$set_hyperparameters(num_round = 100L)
job_name <- paste('sagemaker-train-xgboost', format(Sys.time(), '%H-%M-%S'), sep = '-')
input_data <- list('train' = s3_train_input,
                   'validation' = s3_valid_input)
estimator$fit(inputs = input_data,
              job_name = job_name)

Once training has finished, HAQM SageMaker copies the model binary (a gzip tarball) to the specified HAQM S3 output location. Get the full HAQM S3 path with this command:

estimator$model_data
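If you want to inspect the artifact locally, you can copy it down with the same AWS CLI pattern used earlier; the tarball contains the trained XGBoost booster (a small sketch):

# Copy the model artifact (model.tar.gz) to the working directory
system(paste('aws s3 cp', estimator$model_data, work_dir))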

Deploying the model

HAQM SageMaker lets you deploy your model by providing an endpoint that consumers can invoke through a simple, secure HTTPS API call.

Let’s deploy our trained model to an ml.t2.medium instance. For more information, see HAQM SageMaker ML Instance Types.

model_endpoint <- estimator$deploy(initial_instance_count = 1L,
                                   instance_type = 'ml.t2.medium')
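The deploy call returns a Predictor object. Its generated endpoint name, which you can use later to look up or invoke the endpoint outside this session, is available as an attribute:

# The generated name of the deployed endpoint
model_endpoint$endpoint_name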

Generating predictions with the model

Use the test data to generate predictions. Specify CSVSerializer for the endpoint so that the test matrix is serialized into comma-separated text for the request:

model_endpoint$serializer <- sagemaker$serializers$CSVSerializer()

Remove the target column and convert the first 500 observations to a matrix with no column names:

abalone_test <- abalone_test[-1]
num_predict_rows <- 500
test_sample <- as.matrix(abalone_test[1:num_predict_rows, ])
dimnames(test_sample)[[2]] <- NULL

Note

We predict 500 observations at a time so that the request payload stays within the endpoint's invocation size limit.

Generate predictions from the endpoint and convert the returned comma-separated string:

library(stringr)
predictions <- model_endpoint$predict(test_sample)
predictions <- str_split(predictions, pattern = ',', simplify = TRUE)
predictions <- as.numeric(predictions)
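As the earlier note mentioned, a single request is limited in size. If you ever need predictions for more rows than one call allows, a simple pattern is to send the data in chunks and combine the results. The helper below is an illustrative sketch, not part of the SDK:

library(stringr)
predict_in_chunks <- function(endpoint, data, chunk_size = 500) {
  # Split the row indices into groups of at most chunk_size rows
  groups <- split(seq_len(nrow(data)), ceiling(seq_len(nrow(data)) / chunk_size))
  unlist(lapply(groups, function(i) {
    response <- endpoint$predict(data[i, , drop = FALSE])
    as.numeric(str_split(response, pattern = ',', simplify = TRUE))
  }))
}
# Example: predictions <- predict_in_chunks(model_endpoint, test_sample)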

Column-bind the predicted rings to the test data:

abalone_test <- cbind(predicted_rings = predictions, 
                      abalone_test[1:num_predict_rows, ])
head(abalone_test)    

The predicted ages (number of shell rings) look like this:

Deleting the endpoint

When you’re done with the model, delete the endpoint to avoid incurring deployment costs:

model_endpoint$delete_endpoint()
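Deleting the endpoint doesn't remove the model resource that deployment created. In SDK v2, the Predictor also exposes delete_model(); a small cleanup sketch, worth verifying against your SDK version:

# Also delete the model resource created during deployment
model_endpoint$delete_model()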

Conclusion

In this blog post, you learned how to build and deploy an ML model by using HAQM SageMaker with R. Typically, you execute this workflow with Python, but we showed how you could also do it with R.


About the Author

Ryan Garner is a Data Scientist with AWS Professional Services. He is passionate about helping AWS customers use R to solve their Data Science and Machine Learning problems.