Coding with R on HAQM SageMaker notebook instances
Many AWS customers already use the popular open-source statistical computing and graphics software environment R for big data analytics and data science. HAQM SageMaker is a fully managed service that lets you build, train, and deploy machine learning (ML) models quickly. HAQM SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models. In August 2019, HAQM SageMaker announced the availability of the pre-installed R kernel in all Regions. This capability is available out-of-the-box and comes with the reticulate library pre-installed, which offers an R interface for the HAQM SageMaker Python SDK so you can invoke Python modules from within an R script.
This post describes how to train, deploy, and retrieve predictions from an ML model using R on HAQM SageMaker notebook instances. The model predicts abalone age as measured by the number of rings in the shell. You use the reticulate package as an R interface to the HAQM SageMaker Python SDK to make API calls to HAQM SageMaker. The reticulate package translates between R and Python objects, and HAQM SageMaker provides a serverless data science environment to train and deploy ML models at scale.
To follow this post, you should have a basic understanding of R and be familiar with the following tidyverse packages: dplyr, readr, stringr, and ggplot2.
Creating an HAQM SageMaker notebook instance with the R kernel
To create an HAQM SageMaker Jupyter notebook instance with the R kernel, complete the following steps:
- On the HAQM SageMaker console, create a notebook instance with the instance type and storage size of your choice. Select an Identity and Access Management (IAM) role that allows you to run HAQM SageMaker and grants access to the HAQM Simple Storage Service (HAQM S3) bucket you need for your project. You can also select a VPC, subnets, and Git repositories, if any. For more information, see Creating IAM Roles.
- When the status of the notebook is InService, choose Open Jupyter.
- In the Jupyter environment, from the New drop-down menu, choose R.
The R kernel in HAQM SageMaker is built using the IRKernel package and comes with over 140 standard packages. For more information about creating a custom R environment for HAQM SageMaker Jupyter notebook instances, see Creating a persistent custom R environment for HAQM SageMaker.
When you create the new notebook, you should see the R logo in the upper-right corner of the notebook environment, with R listed as the kernel beneath it. This indicates that HAQM SageMaker has successfully launched the R kernel for this notebook.
End-to-end ML with R on HAQM SageMaker
The sample notebook in this post is available on the Using R with HAQM SageMaker GitHub repo.
Load the reticulate library and import the sagemaker Python module. See the following code:
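A minimal sketch of this step:

```r
# Load reticulate and import the SageMaker Python SDK as an R object
library(reticulate)
sagemaker <- import('sagemaker')
```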
After the module loads, use the $ notation in R instead of the . notation in Python to access the available classes.
Creating and accessing the data storage
The Session class provides operations for working with the AWS resources that HAQM SageMaker uses through boto3, including HAQM S3, HAQM SageMaker, and HAQM SageMaker Runtime.
For this use case, you create an S3 bucket using the default bucket for HAQM SageMaker. The default_bucket function creates a unique S3 bucket with the following name: sagemaker-<aws-region-name>-<aws-account-number>. See the following code:
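A sketch; note that sagemaker$Session() is the R spelling of Python's sagemaker.Session():

```r
# Create a SageMaker session object
session <- sagemaker$Session()

# Create (or reuse) the default bucket for this account and Region
bucket <- session$default_bucket()
```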
Specify the IAM role’s ARN to allow HAQM SageMaker to access the S3 bucket. You can use the same IAM role used to create this notebook. See the following code:
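On a notebook instance, you can retrieve the role attached to the notebook instead of hard-coding an ARN:

```r
# Look up the IAM role this notebook instance is running with
role_arn <- sagemaker$get_execution_role()
```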
Downloading and processing the dataset
The model uses the Abalone dataset from the UCI Machine Learning Repository. Download the data and start the exploratory data analysis. Use tidyverse packages to read, plot, and transform the data into an ML format for HAQM SageMaker. See the following code:
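A sketch of the download and load steps, assuming the standard UCI hosting location (the file ships without a header row, so the column names come from the UCI documentation):

```r
library(readr)

# Assumed UCI URL for the Abalone data
data_file <- 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'

# Read the headerless CSV, then attach the documented column names
abalone <- read_csv(file = data_file, col_names = FALSE)
names(abalone) <- c('sex', 'length', 'diameter', 'height', 'whole_weight',
                    'shucked_weight', 'viscera_weight', 'shell_weight', 'rings')
head(abalone)
```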
The following table summarizes the output.
The output shows that sex should be a factor but is currently a character data type (F is female, M is male, and I is infant). Change sex to a factor and view the statistical summary of the dataset with the following code:
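For example, with dplyr:

```r
library(dplyr)

# Recode sex from character to factor, then summarize all columns
abalone <- abalone %>%
  mutate(sex = as.factor(sex))
summary(abalone)
```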
The following screenshot shows the output of this code snippet, which provides the statistical summary of the abalone dataframe.
The summary shows that the minimum value for height is 0. You can visually explore which abalones have a height of 0 by plotting the relationship between rings and height for each value of sex, using the following code and the ggplot2 library:
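A sketch of the plot:

```r
library(ggplot2)

# rings vs. height, colored by sex; jitter points to reduce overplotting
ggplot(abalone, aes(x = height, y = rings, color = sex)) +
  geom_point() +
  geom_jitter()
```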
The following graph shows the plotted data.
The plot shows multiple outliers: two infant abalones with a height of 0 and a few female and male abalones with greater heights than the rest. To filter out the two infant abalones with a height of 0, enter the following code:
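For example:

```r
library(dplyr)

# Keep only observations with a plausible (nonzero) height
abalone <- abalone %>%
  filter(height != 0)
```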
Preparing the dataset for model training
The model needs three datasets: training, testing, and validation. Complete the following steps:
- Convert sex into dummy variables and move the target, rings, to the first column:
The HAQM SageMaker algorithm requires the target to be in the first column of the dataset.
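One way to do this with dplyr (the dummy-column names here are illustrative):

```r
library(dplyr)

# One-hot encode sex into indicator columns, drop the original column,
# and move the target (rings) to the front
abalone <- abalone %>%
  mutate(female = as.integer(sex == 'F'),
         male   = as.integer(sex == 'M'),
         infant = as.integer(sex == 'I')) %>%
  select(-sex) %>%
  select(rings, everything())
head(abalone)
```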
The following table summarizes the output data.
- Sample 70% of the data for training the ML algorithm and split the remaining 30% into two halves, one for testing and one for validation:
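A sketch using dplyr's sample_frac and anti_join:

```r
library(dplyr)

# Hold out 70% of rows for training
abalone_train <- abalone %>% sample_frac(size = 0.7)

# Remove the training rows, then split the remainder 50/50
abalone <- anti_join(abalone, abalone_train)
abalone_test <- abalone %>% sample_frac(size = 0.5)
abalone_valid <- anti_join(abalone, abalone_test)
```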
You can now upload the training and validation data to HAQM S3 so you can train the model. Note that for CSV training, the XGBoost algorithm assumes the target variable is in the first column and that the CSV has no header record. For CSV inference, the algorithm assumes the input has no label column. Accordingly, the following code does not save the column names to the CSV files.
- Write the training and validation datasets to the local file system in .csv format:
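For example (the file names are illustrative):

```r
library(readr)

# No header row, per the XGBoost CSV convention described above
write_csv(abalone_train, 'abalone_train.csv', col_names = FALSE)
write_csv(abalone_valid, 'abalone_valid.csv', col_names = FALSE)
```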
- Upload the two datasets to the S3 bucket into the data key:
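A sketch using the session object created earlier:

```r
# Upload both files under the data/ prefix of the default bucket
s3_train <- session$upload_data(path = 'abalone_train.csv',
                                bucket = bucket,
                                key_prefix = 'data')
s3_valid <- session$upload_data(path = 'abalone_valid.csv',
                                bucket = bucket,
                                key_prefix = 'data')
```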
- Define the HAQM S3 input types for the HAQM SageMaker algorithm:
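With version 1 of the SageMaker Python SDK (whose argument names this post uses), a sketch looks like the following:

```r
# Wrap the S3 locations as training inputs with CSV content type
s3_train_input <- sagemaker$s3_input(s3_data = s3_train, content_type = 'csv')
s3_valid_input <- sagemaker$s3_input(s3_data = s3_valid, content_type = 'csv')
```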
Training the model
HAQM SageMaker's built-in algorithms are packaged in Docker containers. To train an XGBoost model, complete the following steps:
- Specify the training containers in HAQM Elastic Container Registry (HAQM ECR) for your Region. See the following code:
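A sketch using SDK v1's registry helper; the :latest image tag is an assumption:

```r
# Resolve the Region-specific ECR registry for the built-in XGBoost image
registry <- sagemaker$amazon$amazon_estimator$registry(
  session$boto_region_name, algorithm = 'xgboost')
container <- paste0(registry, '/xgboost:latest')
```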
- Define an HAQM SageMaker Estimator, which can train any supplied algorithm that has been containerized with Docker. When creating the Estimator, use the following arguments:
- image_name – The container image to use for training
- role – The HAQM SageMaker service role
- train_instance_count – The number of EC2 instances to use for training
- train_instance_type – The type of EC2 instance to use for training
- train_volume_size – The size in GB of the HAQM Elastic Block Store (HAQM EBS) volume for storing input data during training
- train_max_run – The timeout in seconds for training
- input_mode – The input mode that the algorithm supports
- output_path – The HAQM S3 location for saving the training results (model artifacts and output files)
- output_kms_key – The AWS Key Management Service (AWS KMS) key for encrypting the training output
- base_job_name – The prefix for the name of the training job
- sagemaker_session – The Session object that manages interactions with the HAQM SageMaker API
The equivalent of None in Python is NULL in R.
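Putting the arguments together; the instance type, volume size, and timeout below are illustrative choices, and the L suffix makes reticulate pass the values as Python integers:

```r
# S3 prefix where SageMaker writes the model artifact
s3_output <- paste0('s3://', bucket, '/output')

estimator <- sagemaker$estimator$Estimator(image_name = container,
                                           role = role_arn,
                                           train_instance_count = 1L,
                                           train_instance_type = 'ml.m5.large',
                                           train_volume_size = 30L,
                                           train_max_run = 3600L,
                                           input_mode = 'File',
                                           output_path = s3_output,
                                           output_kms_key = NULL,
                                           base_job_name = NULL,
                                           sagemaker_session = NULL)
```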
- Specify the XGBoost hyperparameters and fit the model.
- Set the number of rounds for training to 100, which is the default value when using the XGBoost library outside of HAQM SageMaker.
- Specify the input data and job name based on the current timestamp.
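A sketch of these steps (the job-name prefix is illustrative):

```r
# num_round is the only required hyperparameter for built-in XGBoost
estimator$set_hyperparameters(num_round = 100L)

# Timestamped job name so repeated runs don't collide
job_name <- paste('xgboost-abalone',
                  format(Sys.time(), '%Y-%m-%d-%H-%M-%S'), sep = '-')

# Point the train and validation channels at the S3 inputs defined earlier
input_data <- list('train' = s3_train_input,
                   'validation' = s3_valid_input)

estimator$fit(inputs = input_data, job_name = job_name)
```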
When training is complete, HAQM SageMaker copies the model binary (a gzip tarball) to the specified HAQM S3 output location. Get the full HAQM S3 path with the following code:
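For example:

```r
# S3 URI of the gzipped model tarball
estimator$model_data
```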
Deploying the model
HAQM SageMaker lets you deploy your model by providing an endpoint that you can invoke with a secure, simple API call over HTTPS. To deploy your trained model to an ml.t2.medium instance, enter the following code:
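A sketch (again, 1L passes an integer to the Python API):

```r
# Create a real-time endpoint backed by one ml.t2.medium instance
model_endpoint <- estimator$deploy(initial_instance_count = 1L,
                                   instance_type = 'ml.t2.medium')
```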
Generating predictions with the model
You can now use the test data to generate predictions. Complete the following steps:
- Configure the endpoint to accept comma-separated text by specifying text/csv as the content type and csv_serializer as the serializer. (A consolidated code sketch covering all four steps follows this list.)
- Remove the target column and convert the first 500 observations to a matrix with no column names.

This post uses 500 observations so that each request stays within the endpoint's payload size limit.

- Generate predictions from the endpoint and convert the returned comma-separated string.
- Column-bind the predicted rings to the test data.
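A consolidated sketch of the four steps above, assuming SDK v1's csv_serializer and content_type conventions:

```r
library(dplyr)
library(stringr)

# 1. Send requests as CSV text
model_endpoint$content_type <- 'text/csv'
model_endpoint$serializer <- sagemaker$predictor$csv_serializer

# 2. Drop the target and take the first 500 rows as an unnamed matrix
abalone_test <- abalone_test %>% select(-rings)
num_predict_rows <- 500
test_sample <- as.matrix(abalone_test[1:num_predict_rows, ])
dimnames(test_sample)[[2]] <- NULL

# 3. Invoke the endpoint and parse the comma-separated response
predictions <- model_endpoint$predict(test_sample)
predictions <- as.numeric(str_split(predictions, pattern = ',', simplify = TRUE))

# 4. Column-bind the predictions to the matching test rows
abalone_test <- cbind(predicted_rings = predictions,
                      abalone_test[1:num_predict_rows, ])
head(abalone_test)
```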
The following table shows the output of the code, which adds the predicted_rings column to the abalone_test table. Your output will differ from this table because the train/validation/test split performed in the "Preparing the dataset for model training" section is random.
Deleting the endpoint
When you’re done with the model, delete the endpoint to avoid incurring deployment costs. See the following code:
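A sketch:

```r
# Delete the hosted endpoint; model_endpoint$endpoint holds its name in SDK v1
session$delete_endpoint(model_endpoint$endpoint)
```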
Conclusion
This post walked you through an end-to-end ML project: collecting and processing data, training a model, deploying the model as an endpoint, and making inferences with the deployed model. For more information about creating a custom R environment for HAQM SageMaker Jupyter notebook instances, see Creating a persistent custom R environment for HAQM SageMaker. For example notebooks for R on HAQM SageMaker, see the HAQM SageMaker examples GitHub repository. For more details on ways to use HAQM SageMaker features from R, see the R User Guide to HAQM SageMaker in the developer guide. In addition, visit the AWS Machine Learning Blog to read the latest news and updates about HAQM SageMaker and other AWS AI and ML services.
About the author
Nick Minaie is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect, helping customers on their journey to well-architected machine learning solutions at scale. In his spare time, Nick enjoys abstract painting and loves to explore nature.