AWS Big Data Blog

How Kaplan, Inc. implemented modern data pipelines using HAQM MWAA and HAQM AppFlow with HAQM Redshift as a data warehouse

This post is co-written with Hemant Aggarwal and Naveen Kambhoji from Kaplan.

Kaplan, Inc. provides individuals, educational institutions, and businesses with a broad array of services, supporting our students and partners to meet their diverse and evolving needs throughout their educational and professional journeys. Our Kaplan culture empowers people to achieve their goals. Committed to fostering a learning culture, Kaplan is changing the face of education.

Kaplan data engineers empower data analytics using HAQM Redshift and Tableau. The infrastructure provides an analytics experience to hundreds of in-house analysts, data scientists, and student-facing frontend specialists. The data engineering team is on a mission to modernize its data integration platform to be agile, adaptive, and straightforward to use. To achieve this, they chose the AWS Cloud and its services. The pipelines being migrated from the existing integration platform to the AWS Cloud draw from a variety of sources, including Oracle, Microsoft SQL Server, MongoDB, HAQM DocumentDB (with MongoDB compatibility), APIs, software as a service (SaaS) applications, and Google Sheets. In terms of scale, at the time of writing, over 250 objects are being pulled from three different Salesforce instances.

In this post, we discuss how the Kaplan data engineering team implemented data integration from the Salesforce application to HAQM Redshift. The solution uses HAQM Simple Storage Service (HAQM S3) as a data lake, HAQM Redshift as a data warehouse, HAQM Managed Workflows for Apache Airflow (HAQM MWAA) as an orchestrator, and Tableau as the presentation layer.

Solution overview

The high-level data flow starts with the source data landing in HAQM S3, which is then integrated into HAQM Redshift using various AWS services. The following diagram illustrates this architecture.

HAQM MWAA is our main tool for data pipeline orchestration and is integrated with other tools for data migration. While searching for a tool to migrate data from a SaaS application like Salesforce to HAQM Redshift, we came across HAQM AppFlow and found it well-suited for our requirement to pull data from Salesforce. Although HAQM AppFlow can migrate data directly from Salesforce to HAQM Redshift, we chose to separate the data ingestion and storage processes in our architecture for the following reasons:

  • We needed to store data in HAQM S3 (data lake) as an archive and a centralized location for our data infrastructure.
  • From a future perspective, there might be scenarios where we need to transform the data before storing it in HAQM Redshift. By storing the data in HAQM S3 as an intermediate step, we can integrate transformation logic as a separate module without impacting the overall data flow significantly.
  • Apache Airflow is the central point in our data infrastructure, and other pipelines are being built using various tools like AWS Glue. HAQM AppFlow is one part of our overall infrastructure, and we wanted to maintain a consistent approach across different data sources and targets.

To accommodate these requirements, we divided the pipeline into two parts:

  • Migrate data from Salesforce to HAQM S3 using HAQM AppFlow
  • Load data from HAQM S3 to HAQM Redshift using HAQM MWAA

This approach allows us to take advantage of the strengths of each service while maintaining flexibility and scalability in our data infrastructure. HAQM AppFlow can handle the first part of the pipeline without the need for any other tool, because it provides functionalities like creating connections to the source and target, scheduling the data flow, and creating filters, and it lets us choose the type of flow (incremental or full load). With this, we were able to migrate the data from Salesforce to an S3 bucket. Afterwards, we created a DAG in HAQM MWAA that runs an HAQM Redshift COPY command on the data stored in HAQM S3 and moves the data into HAQM Redshift.
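To illustrate the second part, the following is a minimal sketch of an MWAA task that issues the COPY command through the HAQM Redshift Data API using boto3. The bucket, cluster, schema, table, role, and secret names are placeholders, not the actual values used in the pipeline, and the real DAG may run the COPY differently.

# Hypothetical sketch of the COPY step, run as an MWAA task. All identifiers
# (bucket, cluster, schema, table, role, and secret names) are placeholders.
import boto3


def copy_salesforce_object_to_redshift():
    redshift_data = boto3.client("redshift-data")

    copy_sql = """
        COPY salesforce.account
        FROM 's3://example-data-lake/salesforce/account/'
        IAM_ROLE 'arn:aws:iam::111122223333:role/example-redshift-copy-role'
        FORMAT AS PARQUET;
    """

    # The Redshift Data API runs the COPY command asynchronously, so the MWAA
    # worker does not need a direct database connection.
    response = redshift_data.execute_statement(
        ClusterIdentifier="example-redshift-cluster",
        Database="analytics",
        SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:example-redshift-creds",
        Sql=copy_sql,
    )
    return response["Id"]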

We faced the following challenges with this approach:

  • To run incremental loads, we had to manually change the filter dates in the HAQM AppFlow flows, which isn’t elegant. We wanted to automate that date filter change.
  • The two parts of the pipeline were not in sync, because there was no way to know whether the first part had completed so that the second part could start. We wanted to automate these steps as well.

Implementing the solution

To automate and resolve the aforementioned challenges, we used HAQM MWAA. We created a DAG that acts as the control center for HAQM AppFlow. We developed an Airflow operator that can perform various HAQM AppFlow functions using the HAQM AppFlow APIs, such as creating, updating, deleting, and starting flows, and this operator is used in the DAG. HAQM AppFlow stores the connection data in an AWS Secrets Manager managed secret with the prefix appflow. The cost of storing the secret is included with the charge for HAQM AppFlow. With this, we were able to run the complete data flow using a single DAG.
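As an illustration, the following is a minimal sketch of such a custom operator built on the boto3 HAQM AppFlow client. The operator name, polling interval, and error handling are hypothetical, and the actual operator covers more of the AppFlow API (creating, updating, and deleting flows as well).

# Hypothetical sketch of a custom operator that starts an HAQM AppFlow flow
# and waits for it to finish; names and polling intervals are illustrative.
import time

import boto3
from airflow.models import BaseOperator


class AppFlowStartAndWaitOperator(BaseOperator):
    def __init__(self, flow_name: str, poll_interval: int = 60, **kwargs):
        super().__init__(**kwargs)
        self.flow_name = flow_name
        self.poll_interval = poll_interval

    def execute(self, context):
        client = boto3.client("appflow")

        # Trigger an on-demand run of the flow.
        execution_id = client.start_flow(flowName=self.flow_name)["executionId"]

        # Poll the flow's execution records until this run succeeds or fails.
        while True:
            records = client.describe_flow_execution_records(flowName=self.flow_name)
            run = next(
                (r for r in records["flowExecutions"] if r["executionId"] == execution_id),
                None,
            )
            if run and run["executionStatus"] == "Successful":
                return execution_id
            if run and run["executionStatus"] == "Error":
                raise RuntimeError(f"AppFlow run {execution_id} failed")
            time.sleep(self.poll_interval)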

The complete data flow consists of the following steps (a simplified DAG sketch follows the list):

  1. Create the flow in HAQM AppFlow using a DAG.
  2. Update the flow with the new filter dates using the DAG.
  3. After updating the flow, the DAG starts the flow.
  4. The DAG waits for the flow to complete by repeatedly checking its status.
  5. A success status indicates that the data has been migrated from Salesforce to HAQM S3.
  6. After the data flow is complete, the DAG calls the COPY command to copy data from HAQM S3 to HAQM Redshift.
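The following is a simplified, hypothetical sketch of such a DAG. The flow name, schedule, filter-property handling, and import paths are placeholders, and the custom operator and COPY helper from the earlier sketches are assumed to be importable.

# Simplified, hypothetical DAG tying the steps together. The flow name,
# schedule, and import paths are placeholders; AppFlowStartAndWaitOperator and
# copy_salesforce_object_to_redshift refer to the earlier sketches.
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

from plugins.appflow_operators import AppFlowStartAndWaitOperator  # earlier sketch
from plugins.redshift_copy import copy_salesforce_object_to_redshift  # earlier sketch

FLOW_NAME = "salesforce-account-to-s3"  # placeholder flow name


def update_filter_dates():
    """Move the flow's date filter forward so the next run is incremental."""
    client = boto3.client("appflow")
    flow = client.describe_flow(flowName=FLOW_NAME)

    tasks = flow["tasks"]
    for task in tasks:
        # Adjust the filter task's date bounds (epoch milliseconds). The exact
        # property keys depend on how the filter was defined in the flow.
        if task["taskType"] == "Filter":
            task["taskProperties"]["UPPER_BOUND"] = str(
                int(datetime.utcnow().timestamp() * 1000)
            )

    # update_flow requires the full flow definition, not only the changed task.
    client.update_flow(
        flowName=FLOW_NAME,
        triggerConfig=flow["triggerConfig"],
        sourceFlowConfig=flow["sourceFlowConfig"],
        destinationFlowConfigList=flow["destinationFlowConfigList"],
        tasks=tasks,
    )


with DAG(
    dag_id="salesforce_to_redshift",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    update_flow = PythonOperator(
        task_id="update_filter_dates",
        python_callable=update_filter_dates,
    )

    run_flow = AppFlowStartAndWaitOperator(
        task_id="run_appflow_flow",
        flow_name=FLOW_NAME,
    )

    load_redshift = PythonOperator(
        task_id="copy_to_redshift",
        python_callable=copy_salesforce_object_to_redshift,
    )

    # Update the filter dates, run the flow, then load the new data into Redshift.
    update_flow >> run_flow >> load_redshift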

This approach helped us resolve the aforementioned issues, and the data pipelines have become more robust, simple to understand, straightforward to use with no manual intervention, and less prone to error because we are controlling everything from a single point (HAQM MWAA). HAQM AppFlow, HAQM S3, and HAQM Redshift are all configured to use encryption to protect the data. We also implemented logging, monitoring, and auditing mechanisms to track the data flow and access using AWS CloudTrail and HAQM CloudWatch. The following figure shows a high-level diagram of the final approach we took.

Conclusion

In this post, we shared how Kaplan’s data engineering team successfully implemented a robust and automated data integration pipeline from Salesforce to HAQM Redshift, using AWS services like HAQM AppFlow, HAQM S3, HAQM Redshift, and HAQM MWAA. By creating a custom Airflow operator to control HAQM AppFlow functionalities, we orchestrated the entire data flow seamlessly within a single DAG. This approach has not only resolved the challenges of incremental data loading and synchronization between different pipeline stages, but has also made the data pipelines more resilient, straightforward to maintain, and less error-prone. We reduced the time to create a pipeline for a new object from an existing instance, or for a new source, by 50%. The approach also removed the complexity of using a delta column to capture incremental data, which helped reduce the cost per table by 80–90% compared to a full load of objects every time.

With this modern data integration platform in place, Kaplan is well-positioned to provide its analysts, data scientists, and student-facing teams with timely and reliable data, empowering them to drive informed decisions and foster a culture of learning and growth.

Try out Airflow with HAQM MWAA and other enhancements to improve your data orchestration pipelines.

For additional details and code examples of HAQM MWAA, refer to the HAQM MWAA User Guide and the HAQM MWAA examples GitHub repo.


About the Authors

Hemant Aggarwal is a senior Data Engineer at Kaplan India Pvt Ltd, helping develop and manage ETL pipelines on AWS and driving process and strategy development for the team.

Naveen Kambhoji is a Senior Manager at Kaplan Inc. He works with data engineers at Kaplan to build data lakes using AWS services, and he is the facilitator for the entire migration process. His passion is building scalable distributed systems for efficiently managing data in the cloud. Outside of work, he enjoys traveling with his family and exploring new places.

Jimy Matthews is an AWS Solutions Architect with expertise in AI/ML. Jimy is based out of Boston and works with enterprise customers as they transform their businesses by adopting the cloud, helping them build efficient and sustainable solutions. He is passionate about his family, cars, and mixed martial arts.