AWS Machine Learning Blog

Using A/B testing to measure the efficacy of recommendations generated by HAQM Personalize

Machine learning (ML)-based recommender systems aren’t a new concept, but developing such a system can be a resource-intensive task—from data management during training and inference, to managing scalable real-time ML-based API endpoints. HAQM Personalize allows you to easily add sophisticated personalization capabilities to your applications by using the same ML technology used on HAQM.com for over 20 years. No ML experience required. Customers in industries such as retail, media and entertainment, gaming, travel and hospitality, and others use HAQM Personalize to provide personalized content recommendations to their users. With HAQM Personalize, you can solve the most common use cases: providing users with personalized item recommendations, surfacing similar items, and personalized re-ranking of items.

HAQM Personalize automatically trains ML models from your user-item interactions and provides an API to retrieve personalized recommendations for any user. A frequently asked question is, “How do I compare the performance of recommendations generated by HAQM Personalize to my existing recommendation system?” In this post, we discuss how to perform A/B tests with HAQM Personalize, a common technique for comparing the efficacy of different recommendation strategies.

You can quickly create a real-time recommender system on the AWS Management Console or with the HAQM Personalize API by following these simple steps (a code sketch of the same flow follows the list):

  1. Import your historical user-item interaction data.
  2. Based on your use case, start a training job using an HAQM Personalize ML algorithm (also known as a recipe).
  3. Deploy an HAQM Personalize-managed, real-time recommendations endpoint (also known as a campaign).
  4. Record new user-item interactions in real time by streaming events to an event tracker attached to your HAQM Personalize deployment.
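If you work with the API rather than the console, the following minimal boto3 sketch shows how these four steps could map to HAQM Personalize calls. It assumes the dataset group, interactions dataset, schema, S3 data, and IAM role already exist, and every name and ARN is a placeholder. Each create call is asynchronous, so in practice you wait for the resource to become ACTIVE (for example, by polling the matching describe call) before moving to the next step.

```python
import boto3

personalize = boto3.client("personalize")

# 1. Import your historical user-item interaction data from HAQM S3.
personalize.create_dataset_import_job(
    jobName="interactions-import",
    datasetArn="arn:aws:personalize:us-east-1:111122223333:dataset/demo/INTERACTIONS",
    dataSource={"dataLocation": "s3://my-bucket/interactions.csv"},
    roleArn="arn:aws:iam::111122223333:role/PersonalizeS3Role",
)

# 2. Start a training job (a solution version) with one of the Personalize recipes.
solution = personalize.create_solution(
    name="demo-user-personalization",
    datasetGroupArn="arn:aws:personalize:us-east-1:111122223333:dataset-group/demo",
    recipeArn="arn:aws:personalize:::recipe/aws-user-personalization",
)
solution_version = personalize.create_solution_version(
    solutionArn=solution["solutionArn"]
)

# 3. Deploy a managed, real-time recommendations endpoint (a campaign).
campaign = personalize.create_campaign(
    name="demo-campaign",
    solutionVersionArn=solution_version["solutionVersionArn"],
    minProvisionedTPS=1,
)

# 4. Create an event tracker; its tracking ID is what you use to stream new
#    user-item interactions to HAQM Personalize in real time.
tracker = personalize.create_event_tracker(
    name="demo-tracker",
    datasetGroupArn="arn:aws:personalize:us-east-1:111122223333:dataset-group/demo",
)
print(tracker["trackingId"])
```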

The following diagram represents which tasks HAQM Personalize manages.


Metrics overview

You can measure the performance of ML recommender systems through offline and online metrics. Offline metrics, calculated against historical data, allow you to view the effects of modifying the hyperparameters and algorithms used to train your models. Online metrics are the empirical results observed in your users' interactions with real-time recommendations provided in a live environment.

HAQM Personalize generates offline metrics using test datasets derived from the historical data you provide. These metrics showcase how the model recommendations performed against historical data. The following diagram illustrates a simple example of how HAQM Personalize splits your data at training time.


Consider a training dataset containing 10 users with 10 interactions per user; interactions are represented by circles and ordered from oldest to newest based on their timestamp. In this example, HAQM Personalize uses 90% of the users’ interactions data (blue circles) to train your model, and the remaining 10% for evaluation. For each of the users in the evaluation data subset, 90% of their interaction data (green circles) is used as input for the call to the trained model, and the remaining 10% of their data (orange circle) is compared to the output produced by the model to validate its recommendations. The results are displayed to you as evaluation metrics.
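To make the split concrete, here is a simplified sketch of that holdout logic in Python. It only illustrates the idea described above, not HAQM Personalize's actual implementation, and it assumes interactions are already grouped by user and sorted by timestamp.

```python
def split_for_evaluation(interactions_by_user, user_holdout=0.1, interaction_holdout=0.1):
    """Illustrative 90/10 split: interactions_by_user maps each user ID to a list
    of interactions ordered from oldest to newest."""
    users = sorted(interactions_by_user)
    n_eval_users = max(1, int(len(users) * user_holdout))
    train_users, eval_users = users[:-n_eval_users], users[-n_eval_users:]

    # 90% of users: all of their interactions go into training (blue circles).
    training = {u: interactions_by_user[u] for u in train_users}

    # Remaining 10% of users: the oldest 90% of their interactions (green circles)
    # are fed to the trained model, and the newest 10% (orange circle) are held
    # back to score the model's recommendations.
    model_input, held_out = {}, {}
    for u in eval_users:
        events = interactions_by_user[u]
        n_held = max(1, int(len(events) * interaction_holdout))
        model_input[u] = events[:-n_held]
        held_out[u] = events[-n_held:]
    return training, model_input, held_out
```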

HAQM Personalize produces the following metrics:

  • Coverage – This metric is appropriate if you want to know what percentage of your inventory is being recommended
  • Mean reciprocal rank (at 25) – This metric is appropriate if you’re interested in the single highest ranked recommendation
  • Normalized discounted cumulative gain (at K) – The discounted cumulative gain is a measure of ranking quality; it refers to how well the recommendations are ordered
  • Precision (at K) – This metric is appropriate if you’re interested in how a carousel of size K may perform in front of your users

For more information about how HAQM Personalize calculates these metrics, see Evaluating a Solution Version.
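After a training job (solution version) completes, you can retrieve these metrics on the console or programmatically. The following is a minimal boto3 sketch, with a placeholder solution version ARN:

```python
import boto3

personalize = boto3.client("personalize")

response = personalize.get_solution_metrics(
    solutionVersionArn="arn:aws:personalize:us-east-1:111122223333:solution/demo/version-id"
)

# The response contains the offline metrics described above, such as coverage,
# mean_reciprocal_rank_at_25, precision_at_K, and
# normalized_discounted_cumulative_gain_at_K.
for name, value in response["metrics"].items():
    print(f"{name}: {value:.4f}")
```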

Offline metrics are a great representation of how your hyperparameters and data features influence your model’s performance against historical data. To find empirical evidence of the impact of HAQM Personalize recommendations on your business metrics, such as click-through rate, conversion rate, or revenue, you should test these recommendations in a live environment, getting them in front of your customers. This exercise is important because a seemingly small improvement in these business metrics can translate into a significant increase in your customer engagement, satisfaction, and business outputs, such as revenue.

The following sections include an experimentation methodology and reference architecture in which you can identify the steps required to expose multiple recommendation strategies (for example, HAQM Personalize vs. an existing recommender system) to your users in a randomized fashion and measure the difference in performance in a scientifically sound manner (A/B testing).

Experimentation methodology

The data collected across experiments enables you to measure the efficacy of HAQM Personalize recommendations in terms of business metrics. The following diagram illustrates the experimentation methodology we suggest adhering to.


The process consists of five steps:

  1. Research – The formulation of your questions and the definition of the metrics to improve are solely based on the data you gather before starting your experiment. For example, after exploring your historical data, you might want to understand why you experience shopping cart abandonment or high bounce rates from leads generated by advertising.
  2. Develop a hypothesis – You use the data gathered during the research phase to make observations and develop a cause-and-effect statement. The hypothesis must be quantifiable; for example, providing user personalization through an HAQM Personalize campaign on the shopping cart page will increase the average cart value by 10%.
  3. Create variations based on the hypothesis – The variations of your experiment are based on the hypothesized behavior you're evaluating. A newly created HAQM Personalize campaign can be considered the variation of your experiment when compared against an existing rule-based recommendation system.
  4. Run an experiment – You can use several techniques to test your recommendation system; this post focuses on A/B testing. The metrics data gathered during the experiment helps validate (or invalidate) the hypothesis, for example, a 10% increase in the average cart value over one month after adding HAQM Personalize recommendations to the shopping cart page, compared to the average cart value with the current system's recommendations.
  5. Measure the results – In this step, you determine whether the result is statistically significant enough to draw a conclusion and select the best-performing variation (a simple lift calculation is sketched after this list). Was the increase in your average cart value a result of the randomness of your user testing set, or did the newly created HAQM Personalize campaign influence this increase?
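As a minimal illustration of the last step's measurement (before any significance testing, which is covered in the next section), the following sketch computes the relative lift in average cart value between the control and the variation and compares it with the hypothesized 10% increase; the cart values are placeholder data.

```python
from statistics import mean


def relative_lift(control_cart_values, variation_cart_values):
    """Relative change in average cart value of the variation vs. the control."""
    control_avg = mean(control_cart_values)
    variation_avg = mean(variation_cart_values)
    return (variation_avg - control_avg) / control_avg


# Placeholder data: cart values (in dollars) observed during the experiment.
control = [42.10, 55.00, 38.25, 61.40, 47.80]
variation = [48.90, 59.10, 44.75, 66.00, 51.20]

hypothesized_lift = 0.10  # the 10% increase stated in the hypothesis
lift = relative_lift(control, variation)
print(f"Observed lift: {lift:.1%} (hypothesis: {hypothesized_lift:.0%})")
```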

A/B testing your HAQM Personalize deployment

The following architecture showcases a microservices-based implementation of an A/B test between two HAQM Personalize campaigns. One is trained with one of the recommendation recipes provided by HAQM Personalize, and the other is trained with a variation of this recipe. HAQM Personalize provides various predefined ML algorithms (recipes); HRNN-based recipes enable you to provide personalized user recommendations.


This architecture compares two HAQM Personalize campaigns. You can apply the same logic when comparing an HAQM Personalize campaign against a custom rule-based or ML-based recommender system. For more information about campaigns, see Creating a Campaign.

The basic workflow of this architecture is as follows (a code sketch of the recommendations microservice follows this list):

  1. The web application requests customer recommendations from the recommendations microservice.
  2. The microservice determines if there is an active A/B test. For this post, we assume your testing strategy settings are stored in HAQM DynamoDB.
  3. When the microservice identifies the group your customer belongs to, it resolves the HAQM Personalize campaign endpoint to query for recommendations.
  4. HAQM Personalize campaigns provide the recommendations for your users.
  5. The users interact with their respective group recommendations.
  6. The web application streams user interaction events to HAQM Kinesis Data Streams.
  7. The microservice consumes the Kinesis stream and sends the user interaction events to both HAQM Personalize event trackers. Recording events is an HAQM Personalize feature that collects real-time user interaction data so the service can provide relevant recommendations in real time.
  8. HAQM Kinesis Data Firehose ingests your user-item interactions stream and stores the interactions data in HAQM Simple Storage Service (HAQM S3) to use in future training.
  9. The microservice keeps track of your pre-defined business metrics throughout the experiment.
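The following sketch shows one way such a recommendations microservice could implement steps 2-4 and step 7: it checks for an active experiment, assigns each user deterministically to a group, resolves the matching HAQM Personalize campaign, and forwards interaction events to both event trackers. The DynamoDB table name, its attributes, the campaign ARNs, and the tracking IDs are assumptions for illustration, not a definitive implementation.

```python
import hashlib
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
personalize_runtime = boto3.client("personalize-runtime")
personalize_events = boto3.client("personalize-events")

# Assumed experiment configuration item stored in DynamoDB (step 2), for example:
# {"experimentId": "cart-recs-ab-test", "status": "ACTIVE",
#  "groupA": {"campaignArn": "arn:aws:personalize:...", "trackingId": "..."},
#  "groupB": {"campaignArn": "arn:aws:personalize:...", "trackingId": "..."}}
experiments_table = dynamodb.Table("ab-test-experiments")


def get_active_experiment(experiment_id):
    """Step 2: check whether there is an active A/B test."""
    item = experiments_table.get_item(Key={"experimentId": experiment_id}).get("Item")
    return item if item and item.get("status") == "ACTIVE" else None


def assign_group(user_id, experiment_id):
    """Step 3: deterministic 50/50 split so a user always sees the same variation."""
    digest = hashlib.md5(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return "groupA" if int(digest, 16) % 2 == 0 else "groupB"


def get_recommendations(user_id, experiment):
    """Steps 3-4: resolve the campaign for the user's group and query it."""
    group = assign_group(user_id, experiment["experimentId"])
    response = personalize_runtime.get_recommendations(
        campaignArn=experiment[group]["campaignArn"],
        userId=user_id,
        numResults=10,
    )
    return group, [item["itemId"] for item in response["itemList"]]


def forward_interaction(experiment, user_id, session_id, item_id, event_type="click"):
    """Step 7: send each user interaction event to both event trackers so both
    campaigns keep learning from the same real-time interaction data."""
    event = {
        "eventType": event_type,
        "itemId": item_id,
        "sentAt": datetime.now(timezone.utc),
    }
    for group in ("groupA", "groupB"):
        personalize_events.put_events(
            trackingId=experiment[group]["trackingId"],
            userId=user_id,
            sessionId=session_id,
            eventList=[event],
        )
```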

For instructions on running an A/B test, see the Retail Demo Store Experimentation Workshop section in the GitHub repo.

Tracking well-defined business metrics is a critical task during A/B testing; seeing improvements in these metrics is the true indicator of the efficacy of your HAQM Personalize recommendations. The metrics measured throughout your A/B tests need to be consistent across your variations (groups A and B). For example, an ecommerce site can evaluate the change (increase or decrease) in the click-through rate of a particular widget after adding HAQM Personalize recommendations (group A), compared to the click-through rate obtained using the current rule-based recommendations (group B).

An A/B experiment runs for a defined period, typically dictated by the number of users necessary to reach a statistically significant result. Tools such as Optimizely, AB Tasty, and Evan Miller’s Awesome A/B Tools can help you determine how large your sample size needs to be. A/B tests are usually active across multiple days or even weeks, in order to collect a large enough sample from your userbase. The following graph showcases the feedback loop between testing, adjusting your model, and rolling out new features on success.

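If you prefer to estimate the sample size yourself rather than use one of those tools, the standard power calculation for comparing two proportions is one common approach. The following sketch is illustrative only; the baseline click-through rate, the minimum detectable relative lift, the 5% significance level, and 80% power are assumptions you would replace with your own values.

```python
from math import ceil, sqrt

from scipy.stats import norm


def users_per_group(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # rate we hope to detect in the variation
    p_bar = (p1 + p2) / 2

    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_power = norm.ppf(power)          # critical value for the desired power

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Example: a 4% baseline click-through rate and a 10% relative lift to detect.
print(users_per_group(baseline_rate=0.04, relative_lift=0.10))
```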

For an A/B test to be considered successful, you need to perform a statistical analysis of the data gathered from your population to determine if there is a statistically significant result. This analysis is based on the significance level you set for the experiment; a 5% significance level is considered the industry standard. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference. A lower significance level means that we need stronger evidence for a statistically significant result. For more information about statistical significance, see A Refresher on Statistical Significance.

The next step is to calculate the p-value. The p-value is the probability of observing a result at least as extreme as the one you measured, assuming that the null hypothesis (no difference between the variations) is true. In other words, the p-value quantifies how plausible it is that the observed difference is nothing more than random fluctuation in your sample. For example, imagine we ran an A/A test where we displayed the same variation to two groups of users. After such an experiment, we would expect the metrics results across groups to be very similar but not dramatically different: a p-value greater than your significance level. In an A/B test, we therefore hope to see a p-value that is less than our significance level, so we can conclude that the change in the business metric was caused by the variation rather than by chance. AWS Partners such as Amplitude or Optimizely provide A/B testing tools to facilitate the setup and analysis of your experiments.
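As an illustration of that calculation, the following sketch computes the two-sided p-value for a difference in click-through rate between the two groups using a standard two-proportion z-test; the click and user counts are placeholders, and an experimentation tool or AWS Partner solution would normally run this analysis for you.

```python
from math import sqrt

from scipy.stats import norm


def two_proportion_p_value(clicks_a, users_a, clicks_b, users_b):
    """Two-sided p-value for the difference between two click-through rates."""
    p_a, p_b = clicks_a / users_a, clicks_b / users_b
    p_pooled = (clicks_a + clicks_b) / (users_a + users_b)
    standard_error = sqrt(p_pooled * (1 - p_pooled) * (1 / users_a + 1 / users_b))
    z = (p_a - p_b) / standard_error
    return 2 * norm.sf(abs(z))


# Placeholder results: group A (HAQM Personalize) vs. group B (rule-based system).
p_value = two_proportion_p_value(clicks_a=450, users_a=10000, clicks_b=380, users_b=10000)
significance_level = 0.05
print(f"p-value = {p_value:.4f}, statistically significant: {p_value < significance_level}")
```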

A/B tests are statistical measures of the efficacy of your HAQM Personalize recommendations, allowing you to quantify the impact these recommendations have on your business metrics. Additionally, A/B tests allow you to gather organic user-item interactions that you can use to train subsequent HAQM Personalize implementations. We recommend spending less time on offline tests and getting your HAQM Personalize recommendations in front of your users as quickly as possible. This helps eliminate biases from existing recommender systems in your training dataset, which allows your HAQM Personalize deployments to learn from organic user-item interaction data.

Conclusion

HAQM Personalize is an easy-to-use, highly scalable solution that can help you solve some of the most popular recommendation use cases:

  • Personalized recommendations
  • Similar items recommendations
  • Personalized re-ranking of items

A/B testing provides invaluable information on how your customers interact with your HAQM Personalize recommendations. These results, measured according to well-defined business metrics, give you a sense of the efficacy of these recommendations, along with clues on how to further adjust your training datasets. After you iterate through this process multiple times, you will see an improvement in the metrics that matter most for customer engagement.

If this post helps you or inspires you to use A/B testing to improve your business metrics, please share your thoughts in the comments.

About the Author

Luis Lopez Soria is an AI/ML specialist solutions architect working with the AWS machine learning team. He works with AWS customers to help them adopt machine learning on a large scale. He enjoys playing sports, traveling around the world, and exploring new foods and cultures.