AWS Big Data Blog

Simplifying data processing at Capitec with HAQM Redshift integration for Apache Spark

This post is co-written with Preshen Goobiah and Johan Olivier from Capitec.

Apache Spark is a widely used open source distributed processing system renowned for handling large-scale data workloads. It is frequently used by Spark developers working with HAQM EMR, HAQM SageMaker, AWS Glue, and custom Spark applications.

HAQM Redshift offers seamless integration with Apache Spark, allowing you to easily access your Redshift data on both HAQM Redshift provisioned clusters and HAQM Redshift Serverless. This integration expands the possibilities for AWS analytics and machine learning (ML) solutions, making the data warehouse accessible to a broader range of applications.

With the HAQM Redshift integration for Apache Spark, you can quickly get started and effortlessly develop Spark applications using popular languages like Java, Scala, Python, SQL, and R. Your applications can seamlessly read from and write to your HAQM Redshift data warehouse while maintaining optimal performance and transactional consistency. Additionally, you’ll benefit from performance improvements through pushdown optimizations, further enhancing the efficiency of your operations.

Capitec, South Africa’s largest retail bank with over 21 million retail banking clients, aims to provide simple, affordable, and accessible financial services to help South Africans bank better so that they can live better. In this post, we discuss the successful integration of the open source HAQM Redshift connector by Capitec’s shared services Feature Platform team. By using the HAQM Redshift integration for Apache Spark, developer productivity increased by a factor of 10, feature generation pipelines were streamlined, and data duplication was reduced to zero.

The business opportunity

Across Capitec’s Retail Credit divisions, 19 predictive models are in scope to use 93 features built with AWS Glue. Feature records are enriched with facts and dimensions stored in HAQM Redshift. PySpark was selected to create the features because it offers a fast, distributed, and scalable mechanism to wrangle data from diverse sources.

These production features play a crucial role in enabling real-time fixed-term loan applications, credit card applications, batch monthly credit behavior monitoring, and batch daily salary identification within the business.

The data sourcing problem

To ensure the reliability of PySpark data pipelines, it’s essential to have consistent record-level data from both dimensional and fact tables stored in the Enterprise Data Warehouse (EDW). These tables are then joined with tables from the Enterprise Data Lake (EDL) at runtime.

During feature development, data engineers require a seamless interface to the EDW. This interface allows them to access and integrate the necessary data from the EDW into the data pipelines, enabling efficient development and testing of features.

Previous solution process

In the previous solution, product team data engineers spent 30 minutes per run to manually expose Redshift data to Spark. The steps included the following:

  1. Construct a predicated query in Python.
  2. Submit an UNLOAD query via the HAQM Redshift Data API.
  3. Catalog data in the AWS Glue Data Catalog via the AWS SDK for Pandas using sampling.

This approach posed issues for large datasets, required recurring maintenance from the platform team, and was complex to automate. The sketch that follows illustrates the general shape of this manual flow.
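The following is a simplified, hypothetical sketch of that manual flow, shown only to make the steps concrete. The cluster, bucket, role, database, and table names are illustrative placeholders rather than Capitec's actual values, and error handling and polling for statement completion are omitted.

import boto3
import awswrangler as wr

# 1. Construct a predicated query in Python and wrap it in an UNLOAD statement
query = "select * from edw_core.dimclient where RowIsCurrent = 1"
unload_sql = (
    f"unload ('{query}') "
    "to 's3://example-temp-bucket/dimclient/' "
    "iam_role 'arn:aws:iam::123456789012:role/example-unload-role' "
    "format as parquet"
)

# 2. Submit the UNLOAD via the HAQM Redshift Data API
redshift_data = boto3.client("redshift-data")
redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="example-db",
    SecretArn="arn:aws:secretsmanager:af-south-1:123456789012:secret:example",
    Sql=unload_sql,
)
# In practice, you would poll describe_statement until the UNLOAD finishes

# 3. Catalog the unloaded files in the AWS Glue Data Catalog via the
#    AWS SDK for pandas, inferring the schema from a sample of the files
wr.s3.store_parquet_metadata(
    path="s3://example-temp-bucket/dimclient/",
    database="example_glue_db",
    table="dimclient_unload",
    dataset=True,
    sampling=0.25,
)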

Current solution overview

Capitec was able to resolve these problems with the HAQM Redshift integration for Apache Spark within feature generation pipelines. The architecture is defined in the following diagram.

The workflow includes the following steps:

  1. Internal libraries are installed into the AWS Glue PySpark job via AWS CodeArtifact.
  2. An AWS Glue job retrieves Redshift cluster credentials from AWS Secrets Manager and sets up the HAQM Redshift connection (injecting cluster credentials, unload locations, and file formats) via the shared internal library. The HAQM Redshift integration for Apache Spark also supports using AWS Identity and Access Management (IAM) to retrieve credentials and connect to HAQM Redshift; a brief sketch of this option follows the workflow overview.
  3. The Spark query is translated to an HAQM Redshift optimized query and submitted to the EDW. This is accomplished by the HAQM Redshift integration for Apache Spark.
  4. The EDW dataset is unloaded into a temporary prefix in an HAQM Simple Storage Service (HAQM S3) bucket.
  5. The EDW dataset from the S3 bucket is loaded into Spark executors via the HAQM Redshift integration for Apache Spark.
  6. The EDL dataset is loaded into Spark executors via the AWS Glue Data Catalog.

These components work together to ensure that data engineers and production data pipelines have the necessary tools to implement the HAQM Redshift integration for Apache Spark, run queries, and facilitate the unloading of data from HAQM Redshift to the EDL.
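As noted in step 2, the connector can authenticate with IAM instead of stored database credentials. The following is a minimal sketch of what such a read could look like, assuming an IAM role is associated with the Redshift cluster for S3 access and the job's role can obtain temporary database credentials. The JDBC URL, role ARN, and bucket are illustrative placeholders, and depending on your setup additional connection properties (such as a database user) may be required.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Minimal sketch: reading a Redshift table with IAM-based authentication.
# All identifiers below are illustrative placeholders.
iam_read_config = {
    # "iam" in the JDBC URL tells the Redshift driver to request temporary
    # database credentials rather than use a stored username/password
    "url": "jdbc:redshift:iam://redshifthost:5439/database",
    # IAM role associated with the cluster, used for the UNLOAD to HAQM S3
    "aws_iam_role": "arn:aws:iam::123456789012:role/example-redshift-s3-role",
    "tempdir": "s3://example-redshift-temp-bucket/spark/",
    "unload_s3_format": "PARQUET",
}

d_example = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .options(**iam_read_config)
    .option("dbtable", "edw_core.dimclient")
    .load()
)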

Using the HAQM Redshift integration for Apache Spark in AWS Glue 4.0

In this section, we demonstrate the utility of the HAQM Redshift integration for Apache Spark by using PySpark to enrich a loan application table residing in the S3 data lake with client information from the Redshift data warehouse.

The dimclient table in HAQM Redshift contains the following columns:

  • ClientKey – INT8
  • ClientAltKey – VARCHAR(50)
  • PartyIdentifierNumber – VARCHAR(20)
  • ClientCreateDate – DATE
  • IsCancelled – INT2
  • RowIsCurrent – INT2

The loanapplication table in the AWS Glue Data Catalog contains the following columns:

  • RecordID – BIGINT
  • LogDate – TIMESTAMP
  • PartyIdentifierNumber – STRING

The Redshift table is read via the HAQM Redshift integration for Apache Spark and cached. See the following code:

import json

import boto3
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # or reuse the existing Spark session

secretsmanager_client = boto3.client('secretsmanager')
secret_manager_response = secretsmanager_client.get_secret_value(
    SecretId='string',
    VersionId='string',
    VersionStage='string'
)
# Assumes the secret stores the cluster credentials as JSON with
# "username" and "password" keys (the standard Redshift secret layout)
secret = json.loads(secret_manager_response['SecretString'])
username = secret['username']
password = secret['password']
url = "jdbc:redshift://redshifthost:5439/database?user=" + username + "&password=" + password

read_config = {
    "url": url,
    "tempdir": "s3://<capitec-redshift-temp-bucket>/<uuid>/",
    "unload_s3_format": "PARQUET"
}

d_client = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .options(**read_config)
    .option("query", f"select * from edw_core.dimclient")
    .load()
    .where((F.col("RowIsCurrent") == 1) & (F.col("isCancelled") == 0))
    .select(
        F.col("PartyIdentifierNumber"),
        F.col("ClientCreateDate")
    )
    .cache()
)

Loan application records are read from the S3 data lake and enriched with information from the dimclient table in HAQM Redshift:

import pyspark.sql.functions as F
from awsglue.context import GlueContext
from pyspark import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

push_down_predicate = (
    "meta_extract_start_utc_ms between "
    "'2023-07-12 18:00:00.000000' and "
    "'2023-07-13 06:00:00.000000'"
)

database_name = "loan_application_system"
table_name = "dbo_view_loan_applications"
catalog_id = # Glue Data Catalog ID

# Select only the columns needed for the join
initial_select_cols = [
    "RecordID",
    "LogDate",
    "PartyIdentifierNumber"
]

d_controller = (
    glue_ctx.create_dynamic_frame.from_catalog(
        catalog_id=catalog_id,
        database=database_name,
        table_name=table_name,
        push_down_predicate=push_down_predicate
    )
    .toDF()
    .select(*initial_select_cols)
    .withColumn("LogDate", F.date_format("LogDate", "yyyy-MM-dd").cast("string"))
    .dropDuplicates()
)

# Left Join on PartyIdentifierNumber and enriching the loan application record
d_controller_enriched = d_controller.join(d_client, on=["PartyIdentifierNumber"], how="left").cache()

As a result, the loan application record (from the S3 data lake) is enriched with the ClientCreateDate column (from HAQM Redshift).
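As a quick sanity check (not part of the production pipeline), you can preview a few enriched records to confirm the join behaved as expected:

# Preview a handful of enriched loan application records; ClientCreateDate
# comes from the Redshift dimclient table, the remaining columns from the EDL
(
    d_controller_enriched
    .select("RecordID", "LogDate", "PartyIdentifierNumber", "ClientCreateDate")
    .show(5, truncate=False)
)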

How the HAQM Redshift integration for Apache Spark solves the data sourcing problem

The HAQM Redshift integration for Apache Spark effectively addresses the data sourcing problem through the following mechanisms:

  • Just-in-time reading – The HAQM Redshift integration for Apache Spark connector reads Redshift tables in a just-in-time manner, ensuring the consistency of data and schema. This is particularly valuable for Type 2 slowly changing dimension (SCD) and timespan accumulating snapshot facts. By combining these Redshift tables with the source system AWS Glue Data Catalog tables from the EDL within production PySpark pipelines, the connector enables seamless integration of data from multiple sources while maintaining data integrity.
  • Optimized Redshift queries – The HAQM Redshift integration for Apache Spark plays a crucial role in converting the Spark query plan into an optimized Redshift query. This conversion process simplifies the development experience for the product team by adhering to the data locality principle. The optimized queries use the capabilities and performance optimizations of HAQM Redshift, ensuring efficient data retrieval and processing from HAQM Redshift for the PySpark pipelines. This helps streamline the development process while enhancing the overall performance of the data sourcing operations.

Gaining the best performance

The HAQM Redshift integration for Apache Spark automatically applies predicate and query pushdown to optimize performance. You can also gain performance improvements by using Parquet, the default format used for unloading data with this integration.
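If you want to see how Spark plans a given read, an optional check (not part of the original pipeline) is to print the query plan of a DataFrame built with the connector, for example the d_client DataFrame from earlier; the filter and projection applied before the scan indicate the work that can be pushed toward HAQM Redshift:

# Print the parsed, analyzed, optimized, and physical plans for the read;
# inspecting the scan, filter, and projection nodes shows how much of the
# query is eligible for pushdown rather than post-processing in Spark
d_client.explain(extended=True)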

For additional details and code samples, refer to New – HAQM Redshift Integration with Apache Spark.

Solution benefits

The adoption of the integration yielded several significant benefits for the team:

  • Enhanced developer productivity – The PySpark interface provided by the integration boosted developer productivity by a factor of 10, enabling smoother interaction with HAQM Redshift.
  • Elimination of data duplication – Duplicate and AWS Glue cataloged Redshift tables in the data lake were eliminated, resulting in a more streamlined data environment.
  • Reduced EDW load – The integration facilitated selective data unloading, minimizing the load on the EDW by extracting only the necessary data.

By using the HAQM Redshift integration for Apache Spark, Capitec has paved the way for improved data processing, increased productivity, and a more efficient feature engineering ecosystem.

Conclusion

In this post, we discussed how the Capitec team successfully implemented the HAQM Redshift integration for Apache Spark to simplify their feature computation workflows. They emphasized the importance of utilizing decentralized and modular PySpark data pipelines for creating predictive model features.

Currently, the HAQM Redshift integration for Apache Spark is utilized by 7 production data pipelines and 20 development pipelines, showcasing its effectiveness within Capitec’s environment.

Moving forward, the shared services Feature Platform team at Capitec plans to expand the adoption of the HAQM Redshift integration for Apache Spark in different business areas, aiming to further enhance data processing capabilities and promote efficient feature engineering practices.

For additional information on using the HAQM Redshift integration for Apache Spark, refer to the following resources:


About the Authors

Preshen Goobiah is the Lead Machine Learning Engineer for the Feature Platform at Capitec. He is focused on designing and building Feature Store components for enterprise use. In his spare time, he enjoys reading and traveling.

Johan Olivier is a Senior Machine Learning Engineer for Capitec’s Model Platform. He is an entrepreneur and problem-solving enthusiast. He enjoys music and socializing in his spare time.

Sudipta Bagchi is a Senior Specialist Solutions Architect at HAQM Web Services. He has over 12 years of experience in data and analytics, and helps customers design and build scalable and high-performant analytics solutions. Outside of work, he loves running, traveling, and playing cricket. Connect with him on LinkedIn.

Syed Humair is a Senior Analytics Specialist Solutions Architect at HAQM Web Services (AWS). He has over 17 years of experience in enterprise architecture focusing on Data and AI/ML, helping AWS customers globally to address their business and technical requirements. You can connect with him on LinkedIn.

Vuyisa Maswana is a Senior Solutions Architect at AWS, based in Cape Town. Vuyisa has a strong focus on helping customers build technical solutions to solve business problems. He has supported Capitec in their AWS journey since 2019.