AWS Machine Learning Blog
Creating a persistent custom R environment for HAQM SageMaker
HAQM SageMaker is a fully managed service that allows you to build, train, and deploy machine learning (ML) models quickly. HAQM SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models. In August 2019, HAQM SageMaker announced the availability of the pre-installed R kernel in all Regions. This capability is available out of the box and comes with the reticulate library pre-installed. This library offers an R interface for the HAQM SageMaker Python SDK, which enables you to invoke Python modules from within an R script.
This post discusses how to create a custom R environment (kernel) in HAQM SageMaker on top of the built-in R kernel and how to persist that between sessions. The post explains how to install a new package in the R environment, how this new environment can be saved on HAQM Simple Storage Service (HAQM S3), and how you can use it to create new HAQM SageMaker instances using the HAQM SageMaker lifecycle configuration. The post also includes bash scripts that you can use for lifecycle configurations when creating or starting an HAQM SageMaker notebook instance.
Background
The R kernel in HAQM SageMaker is built using the IRKernel package, which installs a kernel with the name ir and a display name of R in a Jupyter environment.
You can manage this environment by using Conda, and install specific packages and dependencies. However, by default, an R kernel installed from a notebook instance doesn’t persist to other notebook instance sessions. Every time you stop and start an HAQM SageMaker instance, the R kernel returns to its default environment.
This post walks you through the process of installing R packages in HAQM SageMaker using the following sources:
- Anaconda Cloud
- CRAN
- GitHub
After you create your environment, you save it on the instance’s HAQM Elastic Block Store (HAQM EBS) storage to make it persistent. You can also store this environment on HAQM S3 and use it to build custom R environments for new HAQM SageMaker instances. For more information, see Customize a Notebook Instance Using a Lifecycle Configuration Script.
Creating an HAQM SageMaker notebook instance with the R kernel
To create an HAQM SageMaker notebook instance with the R kernel, complete the following steps:
- Create a notebook instance.
- When the instance status shows as In Service, open Jupyter.
- From the New drop-down menu, choose R.
When the new notebook opens, you should see the R logo in the upper-right corner of the notebook space.
For more details about creating an HAQM SageMaker notebook instance with R kernel, visit the coding with R on HAQM SageMaker notebook instances blog post.
Installing packages in the HAQM SageMaker R kernel
The HAQM SageMaker R kernel comes with over 140 standard packages. To get the list of these installed packages, you can run the following script in a SageMaker notebook instance with R kernel:
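In a notebook cell, base R's installed.packages() returns this list. As a minimal sketch, the same check can be run from the Jupyter terminal. It assumes that the Rscript found on the PATH belongs to the R installation behind the notebook kernel, which this post does not confirm.

```shell
#!/bin/bash
# Print the name of every R package visible to the R installation on PATH.
# Assumption: `Rscript` resolves to the same R that backs the notebook kernel.
R_EXPR='cat(rownames(installed.packages()), sep = "\n")'

if command -v Rscript >/dev/null 2>&1; then
    Rscript -e "$R_EXPR"
else
    echo "Rscript not found on PATH; activate the R environment first." >&2
fi
```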
If you need to install additional packages, you can install from Anaconda Cloud, a CRAN archive, or directly from GitHub.
Installing from Anaconda Cloud
The preferred method for installing R packages is to install the package from the Anaconda Cloud repository. This method gives you access to different channels (such as R and Conda Forge), which allows you to install specific versions of the package. If you’re doing this in HAQM SageMaker using the R kernel, use the system() command to submit the conda install command.
If you’re installing this in the HAQM SageMaker Jupyter bash terminal, you can just use conda install as follows:
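A minimal sketch of the terminal command, assuming the built-in R kernel lives in a Conda environment named R (an assumption, not stated in this post) and using r-rjava, the Anaconda Cloud name of the rJava package:

```shell
#!/bin/bash
# Install rJava into the R kernel's Conda environment from the Jupyter terminal.
# Assumption: the built-in R kernel's Conda environment is named "R".
ENV_NAME="R"
CHANNEL="conda-forge"   # set to "r" to install from the r channel instead
PACKAGE="r-rjava"       # Anaconda Cloud name of the rJava package

if command -v conda >/dev/null 2>&1; then
    conda install --yes -n "$ENV_NAME" -c "$CHANNEL" "$PACKAGE"
else
    echo "conda not found; run this in the SageMaker Jupyter terminal." >&2
fi
```

The channel is kept in a variable because, as discussed in this section, the channel determines which package version gets installed.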
But in an HAQM SageMaker notebook that uses the R kernel, wrap the same command in system() and enter the following code:
The preceding code uses the conda-forge channel, which installs rJava version 0.9_12 (at the time this blog post was published). However, if you use the following code (which uses the r channel), it installs version 0.9_11 (at the time this blog post was published):
To search for the specific package name and choose the correct channel for your version, visit the Anaconda Cloud website and search for the package. R packages are named in “r-<package_name>” format.
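The naming rule can also be checked from the terminal. The following sketch uses a hypothetical helper, conda_r_name, that applies the "r-" prefix and lowercases the name (lowercasing is an assumption based on how Conda packages are usually named), then asks each channel which versions it offers:

```shell
#!/bin/bash
# Map an R package name to its Anaconda Cloud name ("r-" prefix, lowercase).
# conda_r_name is a hypothetical helper written for this example.
conda_r_name() {
    printf 'r-%s\n' "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
}

PACKAGE="$(conda_r_name rJava)"
echo "$PACKAGE"   # prints r-rjava

# List the versions each channel offers, if conda is available.
if command -v conda >/dev/null 2>&1; then
    for channel in conda-forge r; do
        conda search --override-channels -c "$channel" "$PACKAGE" || true
    done
fi
```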
Conda is the preferred method for installing packages, and Anaconda Cloud is the preferred archive because it provides access to the most stable versions of Conda environments.
Installing from the CRAN archive
As an alternative to Anaconda, you can use the Comprehensive R Archive Network (CRAN) archive. The CRAN archive is a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. You can use this archive to install packages in R using install.packages(). This installs the latest version of the package. See the following code:
Import that package to your R code with the following code:
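The install and import steps can be sketched from the Jupyter terminal as follows. The package mlbench and the mirror URL are illustrative choices, not taken from this post, and the sketch assumes that the Rscript on the PATH belongs to the R kernel's environment. In a notebook cell, the equivalent is install.packages() followed by library().

```shell
#!/bin/bash
# Install a package from CRAN and then load it.
# Assumptions: `Rscript` on PATH belongs to the R kernel's environment, and
# "mlbench" is only an illustrative package choice.
PACKAGE="mlbench"
CRAN_MIRROR="https://cloud.r-project.org"

R_EXPR="install.packages('$PACKAGE', repos = '$CRAN_MIRROR', dependencies = TRUE); library($PACKAGE)"

if command -v Rscript >/dev/null 2>&1; then
    Rscript -e "$R_EXPR"
else
    echo "Rscript not found on PATH; activate the R environment first." >&2
fi
```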
HAQM SageMaker instances use HAQM Linux AMI, which is a distribution that evolved from Red Hat Enterprise Linux (RHEL) and CentOS. It’s available for use within HAQM Elastic Compute Cloud (HAQM EC2) instances that run HAQM SageMaker. If you’re planning to install packages directly from the source, make sure you select the right operating system. You can check the operating system with the following script in the HAQM SageMaker Jupyter bash terminal:
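A minimal sketch of that check, which reads /etc/os-release (the standard location for this information on most Linux distributions) and falls back gracefully if the file is missing:

```shell
#!/bin/bash
# Print the operating system details of the instance.
if [ -r /etc/os-release ]; then
    cat /etc/os-release
    # Read only the NAME field, in a subshell so no variables leak out.
    OS_NAME="$(. /etc/os-release && printf '%s' "${NAME:-unknown}")"
else
    OS_NAME="unknown"
fi
echo "Detected OS: $OS_NAME"
```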
The output looks like this (at the time of publication):
Installing from GitHub
You can also use devtools and install_github to get the content directly from the package developer’s repository. See the following code:
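As a hedged sketch, the call can be issued from the terminal with the repository passed as an argument. The repository is deliberately left as a placeholder in the form <owner>/<repository>, github_install_expr is a hypothetical helper, and the sketch assumes devtools is already installed in the R environment on the PATH.

```shell
#!/bin/bash
# Install an R package from a GitHub repository with devtools.
# REPO is a placeholder: pass "<owner>/<repository>" as the first argument.
REPO="${1:-}"

# Build the R expression for a given repository (hypothetical helper).
github_install_expr() {
    printf 'devtools::install_github("%s")' "$1"
}

if [ -z "$REPO" ]; then
    echo "usage: $0 <owner>/<repository>" >&2
elif command -v Rscript >/dev/null 2>&1; then
    Rscript -e "$(github_install_expr "$REPO")"
else
    echo "Rscript not found on PATH; activate the R environment first." >&2
fi
```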
This installs the package and its dependencies. However, this isn’t the preferred method for installing packages in HAQM SageMaker.
Persisting the custom R environment between sessions
By default, HAQM SageMaker launches the base R kernel every time you stop and start an HAQM SageMaker instance. Any additional packages you install are lost when you stop the instance, and you have to reinstall the packages when you start the instance again. This is time-consuming and cumbersome. The solution is to save the environment on the EBS storage of the instance and link it to a custom R kernel upon startup using the HAQM SageMaker lifecycle configuration script. For more information, see Customize a Notebook Instance Using a Lifecycle Configuration Script.
This section outlines the steps to make your custom R environment persistent.
Saving the environment on HAQM SageMaker EBS
You first need to save the environment on the instance’s EBS storage by cloning the environment. You can run the following script in the HAQM SageMaker Jupyter bash terminal:
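A minimal sketch of the clone step, assuming the built-in environment is named R and that ~/SageMaker is the EBS-backed folder of the notebook instance (both are assumptions about the instance layout):

```shell
#!/bin/bash
# Clone the built-in R environment onto the instance's persistent EBS volume.
# Assumptions: the built-in Conda environment is named "R", and ~/SageMaker
# is the EBS-backed folder that survives a stop and start.
SOURCE_ENV="R"
TARGET_DIR="$HOME/SageMaker/envs/custom-r"

if command -v conda >/dev/null 2>&1; then
    conda create --yes --prefix "$TARGET_DIR" --clone "$SOURCE_ENV"
else
    echo "conda not found; run this in the SageMaker Jupyter terminal." >&2
fi
```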
This creates an envs/custom-r folder under the HAQM SageMaker folder on your instance EBS, which you have access to. See the following screenshot.
If you want to use this custom environment in the same HAQM SageMaker instance later (and not in a different instance), you can skip to the Lifecycle configuration to start the instance with the custom R environment step in this blog post.
Saving the environment to HAQM S3 to create new HAQM SageMaker instances
To use the custom R environment repeatedly when creating new HAQM SageMaker instances (for example, for your development team), save the environment to HAQM S3 as a .zip file and download it to the instance at the Create step. You can run the following script in the HAQM SageMaker Jupyter bash terminal:
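A sketch of the archive-and-upload step. The bucket name your-bucket-name and the file name custom_r.zip are placeholders, and the notebook's IAM role is assumed to have write access to the bucket.

```shell
#!/bin/bash
# Archive the saved environment and upload it to S3.
# BUCKET and the archive name are placeholders for this example.
BUCKET="your-bucket-name"
ENVS_DIR="$HOME/SageMaker/envs"
ZIP_PATH="$HOME/SageMaker/custom_r.zip"

if [ -d "$ENVS_DIR" ] && command -v zip >/dev/null 2>&1 && command -v aws >/dev/null 2>&1; then
    # zip stores the absolute path without its leading slash, so entries
    # look like home/ec2-user/SageMaker/envs/...
    zip -q -r "$ZIP_PATH" "$ENVS_DIR"
    aws s3 cp "$ZIP_PATH" "s3://$BUCKET/"
else
    echo "Needs an existing $ENVS_DIR plus the zip and aws commands." >&2
fi
```

Because the archive keeps the full path of the envs folder, unzipping it elsewhere recreates that folder hierarchy, which has to be flattened again when the archive is restored.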
Lifecycle configuration to create new instances with the custom R environment
To create a new instance and use the custom environment in that instance, you need to bring the .zip environment from HAQM S3 to the instance. You can do this automatically on the HAQM SageMaker console with the lifecycle configuration script. This script downloads the .zip file from HAQM S3 to the /SageMaker/ folder on the instance’s EBS, unzips the file, recreates the /envs/ folder, and removes the redundant folders.
- On the HAQM SageMaker console, under Notebook, choose Lifecycle configurations.
- Select Create Configuration.
- Name it Custom-R-Env.
- On the Create notebook tab, enter the following script.
- Press Create Configuration.
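The on-create script described above can be sketched as follows. The sketch writes a candidate script to a file and syntax-checks it with bash -n, so it can be reviewed before it is pasted into the Create notebook tab. The bucket name and archive name are placeholders, and the paths assume the archive was built from /home/ec2-user/SageMaker/envs.

```shell
#!/bin/bash
# Write a candidate on-create lifecycle script to a file and syntax-check it.
# "your-bucket-name" and "custom_r.zip" are placeholders for this example.
SCRIPT_FILE="${TMPDIR:-/tmp}/on-create.sh"

cat > "$SCRIPT_FILE" <<'SCRIPT'
#!/bin/bash
set -e
# Lifecycle scripts run as root; switch to ec2-user so file ownership matches.
sudo -u ec2-user -i <<'EOF'
aws s3 cp s3://your-bucket-name/custom_r.zip ~/SageMaker/
unzip -q ~/SageMaker/custom_r.zip -d ~/SageMaker/
# The archive unpacks to ~/SageMaker/home/ec2-user/SageMaker/envs; move it up.
mv ~/SageMaker/home/ec2-user/SageMaker/envs ~/SageMaker/envs
rm -rf ~/SageMaker/home
rm ~/SageMaker/custom_r.zip
EOF
SCRIPT

bash -n "$SCRIPT_FILE" && echo "$SCRIPT_FILE parses cleanly"
```

The instance's IAM role needs read access to the bucket for the download to succeed.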
Lifecycle configuration to start the instance with the custom R environment
This step is the same whether you created the custom R environment in the same instance and cloned it to the ./envs/ folder or downloaded the .zip file from HAQM S3 while creating the instance.
This script creates a symbolic link between the ./envs/ folder (which contains the custom R environment) and the Anaconda custom-r environment. This allows the environment to be listed under the kernels in HAQM SageMaker.
- On the HAQM SageMaker console, under Notebook, choose Lifecycle configurations.
- Select Create Configuration.
- Name it Custom-R-Env (if you have already created the configuration in the previous step, you can select the configuration from the list and choose Edit).
- On the Start notebook tab, enter the following script:
- Press Create Configuration (or Update if you are editing an existing configuration).
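A sketch of the on-start script, again written to a file and syntax-checked before use. It assumes that Anaconda is installed under /home/ec2-user/anaconda3 on the instance and that the environment was saved to /home/ec2-user/SageMaker/envs/custom-r; adjust both paths if your instance differs.

```shell
#!/bin/bash
# Write a candidate on-start lifecycle script to a file and syntax-check it.
# Assumption: Anaconda lives in /home/ec2-user/anaconda3 on the instance.
SCRIPT_FILE="${TMPDIR:-/tmp}/on-start.sh"

cat > "$SCRIPT_FILE" <<'SCRIPT'
#!/bin/bash
set -e
sudo -u ec2-user -i <<'EOF'
# Expose the saved environment to Conda under the name custom-r.
ln -s /home/ec2-user/SageMaker/envs/custom-r /home/ec2-user/anaconda3/envs/custom-r
EOF
SCRIPT

bash -n "$SCRIPT_FILE" && echo "$SCRIPT_FILE parses cleanly"
```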
Assigning the lifecycle configuration to an HAQM SageMaker instance
You can assign a lifecycle configuration when creating a notebook instance. For more information, see Customize a Notebook Instance Using a Lifecycle Configuration Script.
To create a notebook with your lifecycle configuration (Custom-R-Env), you need to assign the script to the notebook under the Additional Configuration section. All other steps are the same as creating any HAQM SageMaker instance.
Using the custom R environment
If you’re opening your existing instance where you created the custom environment, you should see your existing files and code, as well as the /envs/ folder.
However, if you’re creating a new instance and used the lifecycle script to bring the environment from HAQM S3, complete the following steps:
- When your instance status shows as In Service, open Jupyter. You should see an /envs/ folder in your HAQM SageMaker files. That is your custom environment.
- From the New drop-down menu, choose conda_r_custom-r.
You now have a notebook with your custom R environment. When in your notebook, you should see the R logo in the upper-right corner of the Jupyter environment, which indicates the kernel is an R kernel, and the name of your kernel should be conda_r_custom-r. To test the environment, import one of the libraries that you included in the custom environment (for example, rJava).
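In a notebook cell, the test is a single library() call. As a rough sketch, the same check can be run from the Jupyter terminal against the environment's own Rscript. The sketch assumes that the environment was saved to ~/SageMaker/envs/custom-r and that its Rscript runs without first activating the environment, which may not hold for packages that need extra system settings.

```shell
#!/bin/bash
# Check from the terminal that a package loads in the custom environment.
# Assumption: the environment was saved to ~/SageMaker/envs/custom-r, so its
# own Rscript sits in the bin/ folder of that prefix.
RSCRIPT="$HOME/SageMaker/envs/custom-r/bin/Rscript"
PACKAGE="rJava"

if [ -x "$RSCRIPT" ]; then
    "$RSCRIPT" -e "library($PACKAGE); cat('$PACKAGE loaded\n')"
else
    echo "No Rscript at $RSCRIPT; is the custom environment in place?" >&2
fi
```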
Your custom R environment is now up and running in the instance, and you can program in R using the reticulate package.
Conclusion
This post walked you through creating a custom, persistent R environment for HAQM SageMaker notebook instances. For example notebooks for R on HAQM SageMaker, see the HAQM SageMaker examples GitHub repository. For more details about creating an HAQM SageMaker notebook instance with R kernel, visit the coding with R on HAQM SageMaker notebook instances blog post. See the R User Guide to HAQM SageMaker in the developer guide for more details on ways of using HAQM SageMaker features from R. In addition, for more resources to further your experience with HAQM SageMaker, see the AWS Machine Learning Blog.
About the author
Nick Minaie is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solution Architect, helping customers on their journey to well-architected machine learning solutions at scale. In his spare time, Nick enjoys abstract painting and loves to explore nature.