Easily monitor and visualize metrics while training models on HAQM SageMaker
Data scientists and developers can now quickly and easily access, monitor, and visualize metrics that are computed while training machine learning models on HAQM SageMaker. You can specify the metrics you want to track by using the AWS Management Console for HAQM SageMaker or the HAQM SageMaker Python SDK APIs. After model training starts, HAQM SageMaker automatically monitors and streams the specified metrics in real time to the HAQM CloudWatch console, where you can visualize time-series curves such as loss curves and accuracy curves. You can also access the metrics programmatically using the HAQM SageMaker Python SDK APIs.
Model training is an iterative process of teaching a model to make predictions by presenting examples from a training dataset. Typically, a training algorithm computes several metrics, such as training loss and prediction accuracy, that help diagnose whether the model is learning well and will generalize well for making predictions on unseen data. This diagnosis is especially helpful when you are tuning your model’s hyperparameters or evaluating whether your model is a candidate for deployment to production.
Now let’s dive into a few examples so you can see how to monitor and visualize these metrics on HAQM SageMaker.
HAQM SageMaker algorithms provide built-in support for metrics
All HAQM SageMaker built-in algorithms automatically compute and emit a variety of model training, evaluation, and validation metrics. For example, the HAQM SageMaker Object2Vec algorithm emits the validation:cross_entropy metric. Object2Vec is a supervised learning algorithm that can learn low-dimensional dense embeddings of high-dimensional objects such as words, phrases, and sentences, and it also learns how similar two embeddings are in vector space. This technique has applications such as assessing whether a given pair of sentences is semantically similar. The validation:cross_entropy metric emitted by the algorithm measures the extent to which the predictions made by the model diverge from the actual labels in the validation dataset. If the model is learning well, the cross_entropy should decrease as model training progresses.
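For reference, the standard cross-entropy for a single validation example takes the following form (the algorithm’s exact formulation may differ slightly; this is the textbook definition):

```latex
% Cross-entropy for one example over K classes:
% y_k is the one-hot true label, \hat{y}_k the predicted probability of class k.
H(y, \hat{y}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k
```

The loss is minimized when the predicted probabilities concentrate on the true label, which is why a steadily decreasing curve indicates that the model is learning.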
Now let’s walk through the AWS Management Console step by step. We’ll also show you how to use the code snippets from the sample notebook for training an HAQM SageMaker Object2Vec model.
Step 1: Start the training job on HAQM SageMaker
The sample notebook has step-by-step instructions for creating the training job. You can find all the metrics emitted by the training algorithm in the AWS Management Console. Open the HAQM SageMaker console and choose Training jobs in the left navigation pane. Then choose the training job name to open the details page for the training job.
On the training job details page, scroll down to the Metrics section to find all the metrics published by the training algorithm to your HAQM CloudWatch Logs and HAQM CloudWatch Metrics streams. You can use the regex patterns that you see next to each metric to quickly parse and filter the metric values from your HAQM CloudWatch Log files created by HAQM SageMaker.
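For illustration, here is a minimal sketch of applying such a regex with Python. Both the log line and the pattern are hypothetical; in practice, use the exact pattern shown next to the metric in the console:

```python
import re

# A hypothetical log line of the kind a training algorithm writes to CloudWatch Logs.
log_line = "#quality_metric: host=algo-1, epoch=5, validation cross_entropy <loss>=0.4711"

# A hypothetical regex; copy the real pattern from the Metrics section of the console.
pattern = r"validation cross_entropy <loss>=([0-9\.]+)"

match = re.search(pattern, log_line)
if match:
    print(float(match.group(1)))  # prints 0.4711
```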
In the next step we’ll show you how you can avoid doing the manual parsing from log files, and monitor the metric directly on your HAQM CloudWatch metrics dashboard.
Step 2: Visit the HAQM CloudWatch metrics dashboard to monitor and visualize the metrics
The training job details page now has a direct link to the HAQM CloudWatch metrics dashboard for the metrics emitted by the training algorithm.
Choose the link to go to your HAQM CloudWatch metrics dashboard. Use this dashboard to select the validation:cross_entropy metric for graphing and visualization.
Step 3: Using HAQM SageMaker Python SDK APIs to visualize metrics
You can also visualize the metrics inline in your HAQM SageMaker Jupyter notebooks using the HAQM SageMaker Python SDK APIs. Here is a sample code snippet.
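The snippet below is a minimal sketch using the SDK’s TrainingJobAnalytics class, which pulls the metric values into a pandas DataFrame for plotting. The training job name is a hypothetical placeholder:

```python
# In a Jupyter notebook, run %matplotlib inline first to render the plot inline.
from sagemaker.analytics import TrainingJobAnalytics

# Hypothetical training job name; substitute the name of your own job.
training_job_name = "object2vec-2019-01-01-00-00-00"

# Fetch the metric values that SageMaker published to CloudWatch.
metrics_df = TrainingJobAnalytics(
    training_job_name=training_job_name,
    metric_names=["validation:cross_entropy"],
).dataframe()

# Plot the metric as a time-series curve.
metrics_df.plot(x="timestamp", y="value", legend=False, title="validation:cross_entropy")
```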
Step 4: Using the DescribeTrainingJob API action
In addition to visualizing the running value of the metric, you can also access the final value of the metric using the DescribeTrainingJob API action.
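For example, here is a minimal sketch using boto3; the final value of each metric is returned in the FinalMetricDataList field of the response (the training job name is again a hypothetical placeholder):

```python
import boto3

sm_client = boto3.client("sagemaker")

# Hypothetical training job name; substitute the name of your own job.
response = sm_client.describe_training_job(
    TrainingJobName="object2vec-2019-01-01-00-00-00"
)

# Each entry has a MetricName, a Value, and a Timestamp.
for metric in response["FinalMetricDataList"]:
    print(metric["MetricName"], metric["Value"])
```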
Monitoring and visualizing metrics for your own training algorithm
If you are training models on HAQM SageMaker using one of the built-in deep learning framework containers, such as the TensorFlow or PyTorch containers, or running your own algorithm container, you can now easily specify the metrics you want HAQM SageMaker to monitor and publish to your HAQM CloudWatch metrics dashboard.
Using the HAQM SageMaker console
While you are creating your model training job on the console, you can now specify the regex pattern for the metrics that your algorithm or model training script publishes to logs. HAQM SageMaker will automatically parse the metrics from logs and publish them to your HAQM CloudWatch metrics dashboard for graphing and visualization.
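For example, suppose your training script prints a loss value once per epoch. The sketch below uses a hypothetical metric name and log format; any line format works as long as your regex captures the numeric value:

```python
# Stand-in loss values in place of a real training loop.
losses = [0.95, 0.61, 0.42]

for epoch, loss in enumerate(losses):
    # In the console you could register this as, for example:
    #   Metric name: train:loss
    #   Regex:       train_loss = ([0-9\.]+)
    print(f"train_loss = {loss:.4f}")
```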
Using the AWS SDK
You can also add a MetricDefinitions list for the metrics you want to track when creating a training job with the CreateTrainingJob API action.
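Here is a minimal sketch with boto3, reusing the hypothetical train:loss metric from the console example above. All names, ARNs, image URIs, and S3 paths are placeholders:

```python
import boto3

sm_client = boto3.client("sagemaker")

sm_client.create_training_job(
    TrainingJobName="my-training-job",  # hypothetical name
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder ARN
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algorithm:latest",
        "TrainingInputMode": "File",
        # SageMaker applies each regex to the algorithm's logs and publishes the
        # captured value to CloudWatch under the given metric name.
        "MetricDefinitions": [
            {"Name": "train:loss", "Regex": r"train_loss = ([0-9\.]+)"},
        ],
    },
    # InputDataConfig (training channels) omitted for brevity.
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output"},  # placeholder bucket
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 86400},
)
```

The HAQM SageMaker Python SDK exposes the same capability through the metric_definitions parameter of its Estimator classes, so you can pass the same name and regex pairs there instead of calling the API directly.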
Get started with more examples and developer support
Now that you have seen examples of how to monitor and visualize metrics on HAQM SageMaker, you can try out the sample notebook that we mentioned earlier or add metrics visualization to your own training algorithm. You can refer to our developer guide for a complete listing of the metrics computed by the built-in HAQM SageMaker algorithms, or post your questions on our developer forum. Happy modeling!
About the Authors
Sifei Li is a Software Engineer in HAQM AI where she’s working on building HAQM Machine Learning Platforms and was part of the launch team for HAQM SageMaker.
Sumit Thakur is a Senior Product Manager for AWS Machine Learning Platforms where he loves working on products that make it easy for customers to get started with machine learning on cloud. He is product manager for HAQM SageMaker and AWS Deep Learning AMI. In his spare time, he likes connecting with nature and watching sci-fi TV series.
Andrew Packer is a Software Engineer in HAQM AI where he is excited about building scalable, distributed machine learning infrastructure for the masses. In his spare time, he likes playing guitar and exploring the PNW.