Gaming Analytics: Leveraging AWS Glue and HAQM Redshift Spectrum for Player Insights

Introduction

In the dynamic landscape of game development, efficient data management and analysis are pivotal for optimizing player experiences and driving business growth. Game developers and analysts often encounter the challenge of aggregating data from diverse sources, ranging from real-time operational metrics to historical analytical records. To address these challenges, AWS provides a robust suite of services. Analytics services include AWS Glue for data preparation and transformation purposes. Additionally, HAQM Redshift Spectrum supports seamless querying of data across data warehouses and data lakes. This article explores the integration of AWS Glue and HAQM Redshift Spectrum to streamline the process of joining operational and analytical data for gaming analytics. By leveraging these services, game developers can extract valuable insights from disparate data sources while minimizing development effort and operational costs.

Redshift Spectrum and AWS Glue setup requirements

To illustrate this integration, you’ll use HAQM Aurora MySQL-Compatible Edition for operational data and HAQM Redshift for analytical data storage. This scenario involves joining player data from HAQM Aurora MySQL with player statistics stored in HAQM Redshift. Before diving into the implementation, you’ll step-through prerequisite set-up, including the creation of HAQM Virtual Private Cloud (HAQM VPC) endpoints, appropriate AWS Identity and Access Management (IAM) roles, and download of a database driver for connectivity.

Security prerequisites

Add a self-referential rule to the HAQM Aurora MySQL security group.

Create an AWS Glue role to call other AWS services.

Create a Redshift Spectrum role to allow HAQM Redshift to call other AWS services. The HAQM Redshift CREATE EXTERNAL SCHEMA command uses this role.

Add an inline policy to MyRedshiftSpectrumRole to allow actions for HAQM Simple Storage Service (S3), AWS Glue and AWS Lake Formation.
a. Choose the Permissions tab, Add permissions and Create inline policy.
b. Under Specify permissions, toggle JSON and paste the below policy in Policy editor.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get",
        "s3:List"
      ],
      "Resource": ""
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase",
        "glue:DeleteDatabase",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:UpdateDatabase",
        "glue:CreateTable",
        "glue:DeleteTable",
        "glue:BatchDeleteTable",
        "glue:UpdateTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:BatchCreatePartition",
        "glue:CreatePartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition",
        "glue:UpdatePartition",
        "glue:GetPartition",
        "glue:GetPartition",
        "glue:BatchGetPartition",
        "lakeformaDon:GetDataAccess"
      ],
      "Resource": [
        ""
      ]
    }
  ]
}

c. Select Next and Create policy.

Endpoint prerequisites

To allow AWS Glue access to HAQM S3 from within an HAQM VPC, create an HAQM VPC Gateway Endpoint for HAQM S3.

AWS Glue requires an HAQM VPC interface endpoint to utilize a JDBC connection and is needed to activate networking between AWS Glue and HAQM Aurora MySQL.

JDBC driver prerequisite

1. For the final prerequisite step, download and store the latest MySQL driver in HAQM S3. The JDBC connection requires the driver for AWS Glue to crawl the HAQM Aurora MySQL table.

2. To download the connector, choose mysql-connector-j-8.3.0.zip

3. Create an HAQM S3 bucket to host the preceding driver and upload the driver jar to the bucket. No bucket policy is needed for access as the AWS Glue service role created above has necessary read permissions.

Operational and analytics data stores

Queries use an HAQM Aurora MySQL table:

And an HAQM Redshift provisioned cluster table:

With prerequisites in place and data stores defined, let’s revisit our objectives: extracting operational data from the HAQM Aurora MySQL, conducting transformations by eliminating unnecessary columns, and loading the data into HAQM S3 with date format partitions for seamless querying via Redshift Spectrum. The extract, transform, and load (ETL) operations must be repeatable to accommodate recurring reporting needs. AWS Glue provides built-in Apache Spark and Python environments for executing transformation jobs, and it handles data connectors and workflow orchestration capabilities. The strategy requires deploying connectors, crawlers, jobs, and a workflow to prepare the data for integration with Redshift tables.

Data extraction and transformation with AWS Glue

First, define data sources in AWS Glue by creating crawlers. These crawlers will scan the HAQM Aurora MySQL instance and the data stored in HAQM S3, updating the AWS Glue Data Catalog with schema and partition information.2. Create AWS Glue connectors for HAQM Aurora MySQL and HAQM S3 to allow AWS Glue crawlers to connect to the database instance for data extraction and write data to S3. Add the HAQM S3 bucket hosting the MySQL connector jar file and use: com.mysql.cj.jdbc.Driver for ‘Driver class name’.

To create a connector for reading objects in HAQM S3 and updating the AWS Glue Data Catalog with transformed schema and new partitions, create an AWS Glue network connection.

Configure AWS Glue data crawlers

Define two crawlers to extract data from HAQM Aurora MySQL and catalog data written to HAQM S3. The rds_player_db_crawler uses the JDBC connection and identifies the playerDB database and player table as data source.

The s3_player_db_crawler uses the previously created network connection to support crawling of HAQM S3 objects and updating the AWS Glue Data Catalog with table and partition metadata.

Transform data with AWS Glue jobs

1. Next, create an AWS Glue job to transform the operational data extracted from HAQM Aurora MySQL. The job involves dropping redundant columns, formatting data, and writing the transformed data to HAQM S3 in compressed Apache Parquet format. Additionally, you’ll generate a timestamp parameter to facilitate partitioning for optimized query performance and cost efficiency.

2. Create a transform_player_table job using the Apache Spark runtime environment and the Aurora MySQL connection. Portions of transform_player_table were generated using HAQM Q data integration in AWS Glue.

import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from datetime import datetime

# Convert job runtime to timestamp format, format for partition query
job_run_timestamp = datetime.now().strftime("year=%Y/month=%m/day=%d/hour=%H/minute=%M")
runtime_params_string = format(job_run_timestamp.replace("/", ","))

# Set s3 partition path
output_path = f"s3://EXAMPLE-BUCKET/player-db/player/{job_run_timestamp}/"

# Create glue session
glueContext = GlueContext(SparkContext.getOrCreate())

# Get runtime parameters
glue_client = boto3.client("glue")
args = getResolvedOptions(sys.argv, ["JOB_NAME", "WORKFLOW_NAME", "WORKFLOW_RUN_ID"])
workflow_name = args["WORKFLOW_NAME"]
workflow_run_id = args["WORKFLOW_RUN_ID"]
workflow_params = glue_client.get_workflow_run_properties(
    Name=workflow_name, RunId=workflow_run_id
)["RunProperties"]

# Set runtime paramters with formated timestamp
workflow_params["job_run_timestamp"] = runtime_params_string
glue_client.put_workflow_run_properties(
    Name=workflow_name, RunId=workflow_run_id, RunProperties=workflow_params
)

# Create a frame for player table
playerdb_player = glueContext.create_dynamic_frame.from_catalog(
    database="default", table_name="playerdb_player"
)

# Drop unnecessary columns
playerdb_player = playerdb_player.drop(columns=["id"])

# Write frame to s3 in compressed parquet with partitioned path
glueContext.write_dynamic_frame.from_options(
    frame=players,
    connection_type="s3",
    connection_options={"path": output_path, "region": "us-west-2"},
    format="parquet",
    format_options={"compression": "SNAPPY"},
)

The job writes the operational data to HAQM S3 with timestamped prefixes in compressed Parquet format. The HAQM S3 crawler run updates the AWS Glue data catalog with new partition metadata. This partitioning scheme supports reporting requirements and optimizes queries for cost and performance. When querying the external Redshift Spectrum table, constraining by partition will reduce the size of data scanned and costs associated with Redshift Spectrum usage.

AWS Glue workflows for data pipeline orchestration

With the data extraction and transformation processes defined, orchestrate the workflow using AWS Glue workflows (orchestration). By creating a workflow, you automate the execution of AWS Glue crawlers and job, ensuring a seamless and repeatable process for preparing data for analysis.

Redshift Spectrum setup

Before querying the data, set up Redshift Spectrum to access data stored in HAQM S3. This involves creating an external schema in HAQM Redshift that mirrors the schema of the transformed data stored in HAQM S3.

1. The external schema command requires a schema name, a data catalog name, region, and the ‘MySpectrumRole’ created in the prerequisite steps.

CREATE EXTERNAL SCHEMA IF NOT EXISTS player_stats_s
FROM DATA CATALOG DATABASE 'default'
IAM_ROLE 'arn:aws:iam::111111111111:role/MySpectrumRole'
REGION 'us-west-2'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

For the last Redshift Spectrum step, GRANT USAGE on the external schema to required users.

GRANT USAGE on SCHEMA player_stats_s to PUBLIC

Query data with Redshift Spectrum

Once the integration is complete, query the transformed dataset using HAQM Redshift’s SQL capabilities. By leveraging Redshift Spectrum, we can query data stored in HAQM S3 alongside data in an HAQM Redshift cluster, supporting powerful analytics and reporting capabilities.

Select all available partitions:

SELECT schemaname, tablename, VALUES, location FROM svv_external_partitions;

Select all rows from a specific partition:

SELECT * FROM player_stats_s.playerdb_player
WHERE (year = '2024' and month = '3' and day = '14' and hour = '14' and minute = '0');

Select all rows joining the HAQM Redshift provisioned cluster table and the Redshift Spectrum external table for a specific partition:

SELECT COUNT(*)
FROM playerdb_player
JOIN player_stats_s.playerdb_player on playerdb_player.player_id = player_stats_s.playerdb_player.player_id
WHERE (year = '2024' AND month = '3' AND day = '14' AND hour = '14' AND minute = '0');

With data available from both data sources, we can join HAQM Redshift provisioned cluster tables and Redshift Spectrum external tables. Let’s query to find players and communities with the highest number of seconds played:

SELECT DISTINCT playerdb_player.player_id, player_stats_s.playerdb_player.community_id, playerdb_player.total_seconds_played
FROM playerdb_player
JOIN player_stats_s.playerdb_player on playerdb_player.player_id = player_stats_s.playerdb_player.player_id
WHERE (year = '2024' AND month = '3' AND day = '14' AND hour = '14' AND minute = '0')
ORDER BY playerdb_player.total_seconds_played DESC;

Lastly, communities with the highest payment amounts:

SELECT player_stats_s.playerdb_player.community_id, playerdb_player.player_id, SUM(player_stats.total_payment_amount) AS total_payment_amount
FROM playerdb_player
JOIN player_stats_s.playerdb_player on playerdb_player.player_id = player_stats_s.playerdb_player.player_id
WHERE (year = '2024' AND month = '3' AND day = '14' AND hour = '14' AND minute = '0')
GROUP BY player_stats_s.playerdb_player.community_id
ORDER BY total_payment_amount DESC;

Cleaning up

You’ve now successfully created an AWS Glue workflow to join operational and analytics data. To avoid ongoing charges for resources you created following the steps detailed in this article, you should delete:

HAQM Aurora MySQL database and table use as an operational data source.
The HAQM Simple Storage Service (s3) buckets use for the MySQL JDBC driver and AWS Glue job output location.
The HAQM VPC interface and gateways endpoints.
The AWS Glue crawlers, job and workflow.
The HAQM Redshift provisioned cluster table and Redshift Spectrum external table.

Conclusion

AWS Glue and Redshift Spectrum provide game developers and analysts with a robust platform for combining, transforming, and analyzing data from disparate sources. By automating the extract, transform, and load (ETL) processes with AWS Glue, organizations can optimize costs and operational efficiency. Leveraging the querying capabilities of HAQM Redshift Spectrum, they can also derive actionable insights from their data.

In the fast-paced world of game development, where data-driven decisions are paramount, the integration of AWS Glue and Redshift Spectrum offers a scalable and cost-effective solution. This integration unlocks the full potential of gaming analytics by providing a powerful combination of data processing and querying capabilities. By harnessing the power of these AWS services, game developers can gain deeper insights into player behavior, drive engagement, and ultimately, deliver exceptional gaming experiences.

AWS for Games Blog