HAQM Managed Service for Apache Flink FAQs
General
What is HAQM Managed Service for Apache Flink?
With HAQM Managed Service for Apache Flink, you can transform and analyze streaming data in real time with Apache Flink. Apache Flink is an open source framework and engine for processing data streams. HAQM Managed Service for Apache Flink reduces the complexity of building, managing, and integrating Apache Flink applications with other AWS services.
HAQM Managed Service for Apache Flink takes care of everything required to continuously run streaming applications and scales automatically to match the volume and throughput of your incoming data. With HAQM Managed Service for Apache Flink, there are no servers to manage, there is no minimum fee or setup cost, and you only pay for the resources your streaming applications consume.
What is real-time stream processing and why do I need it?
What can I do with HAQM Managed Service for Apache Flink?
You can use HAQM Managed Service for Apache Flink for many use cases to process data continuously, getting insights in seconds or minutes rather than waiting days or even weeks. HAQM Managed Service for Apache Flink enables you to quickly build end-to-end stream processing applications for log analytics, clickstream analytics, Internet of Things (IoT), ad tech, gaming, and more. The four most common use cases are streaming extract-transform-load (ETL), continuous metric generation, responsive real-time analytics, and interactive querying of data streams.
Streaming ETL
With streaming ETL applications, you can clean, enrich, organize, and transform raw data prior to loading your data lake or data warehouse in real time, reducing or eliminating batch ETL steps. These applications can buffer small records into larger files prior to delivery and perform sophisticated joins across streams and tables. For example, you can build an application that continuously reads IoT sensor data stored in HAQM Managed Streaming for Apache Kafka (HAQM MSK), organizes the data by sensor type, removes duplicate data, normalizes the data per a specified schema, and then delivers the data to HAQM Simple Storage Service (HAQM S3).
Continuous metric generation
With continuous metric generation applications, you can monitor and understand how your data is trending over time. Your applications can aggregate streaming data into critical information and seamlessly integrate it with reporting databases and monitoring services to serve your applications and users in real time. With HAQM Managed Service for Apache Flink, you can use Apache Flink code (in Java, Scala, Python, or SQL) to continuously generate time series analytics over time windows. For example, you can build a live leaderboard for a mobile game by computing the top players every minute and then sending the results to HAQM DynamoDB. You can also track the traffic to your website by calculating the number of unique website visitors every 5 minutes and then sending the processed results to HAQM Redshift.
Responsive real-time analytics
Responsive real-time analytics applications send real-time alarms or notifications when certain metrics reach predefined thresholds or, in more advanced cases, when your application detects anomalies using machine learning (ML) algorithms. With these applications, you can respond to changes in your business in real time, such as predicting user abandonment in mobile apps and identifying degraded systems. For example, an application can compute the availability or success rate of a customer-facing API over time and then send results to HAQM CloudWatch. You can build another application to look for events that meet certain criteria, and then automatically notify the right customers using HAQM Kinesis Data Streams and HAQM Simple Notification Service (HAQM SNS).
Interactive analysis of data streams
Interactive analysis lets you explore streaming data in real time. With ad hoc queries or programs, you can inspect streams from HAQM MSK or HAQM Kinesis Data Streams and visualize what data looks like within those streams. For example, you can view how a real-time metric that computes the average over a time window behaves and send the aggregated data to a destination of your choice. Interactive analysis also helps with iterative development of stream processing applications. The queries you build continuously update as new data arrives. With HAQM Managed Service for Apache Flink Studio, you can deploy these queries to run continuously with auto scaling and durable state backups enabled.
Getting started
How do I get started with Apache Flink applications for HAQM Managed Service for Apache Flink?
How do I get started with Apache Beam applications for HAQM Managed Service for Apache Flink?
How do I get started with HAQM Managed Service for Apache Flink Studio?
What are the limits of HAQM Managed Service for Apache Flink?
Does HAQM Managed Service for Apache Flink support schema registration?
Yes, by using Apache Flink DataStream Connectors, HAQM Managed Service for Apache Flink applications can use AWS Glue Schema Registry, a serverless feature of AWS Glue. You can integrate Apache Kafka, HAQM MSK, and HAQM Kinesis Data Streams, as a sink or a source, with your HAQM Managed Service for Apache Flink workloads. Visit the AWS Glue Schema Registry Developer Guide to get started and learn more.
Key concepts
What is an HAQM Managed Service for Apache Flink application?
An application is the HAQM Managed Service for Apache Flink entity that you work with. HAQM Managed Service for Apache Flink applications continuously read and process streaming data in real time. You write application code in an Apache Flink–supported language to process the incoming streaming data and produce output. Then, HAQM Managed Service for Apache Flink writes the output to a configured destination.
Each application consists of three primary components:
- Input: Input is the streaming source for your application. In the input configuration, you map the streaming sources to data streams. Data flows from your data sources into your data streams. You process data from these data streams using your application code, sending processed data to subsequent data streams or destinations. You add inputs inside application code for Apache Flink applications and Studio notebooks and through the API for HAQM Managed Service for Apache Flink applications.
- Application code: Application code is a series of Apache Flink operators that process input and produce output. In its simplest form, application code can be a single Apache Flink operator that reads from a data stream associated with a streaming source and writes to another data stream associated with an output. For a Studio notebook, this could be a simple Flink SQL select query, with the results shown in context within the notebook. You can write Apache Flink code in its supported languages for HAQM Managed Service for Apache Flink applications or Studio notebooks.
- Output: You can then optionally configure an application output to persist data to an external destination. You add these outputs inside application code for HAQM Managed Service for Apache Flink applications and Studio notebooks.
What application code is supported?
Managing applications
How can I monitor the operations and performance of my HAQM Managed Service for Apache Flink applications?
AWS provides various tools that you can use to monitor your HAQM Managed Service for Apache Flink applications, including access to the Flink Dashboard for Apache Flink applications. You can configure some of these tools to do the monitoring for you. For more information about how to monitor your application, explore the following developer guides:
- Monitoring HAQM Managed Service for Apache Flink in the HAQM Managed Service for Apache Flink Developer Guide.
- Monitoring HAQM Managed Service for Apache Flink in the HAQM Managed Service for Apache Flink Studio Developer Guide.
How do I manage and control access to my HAQM Managed Service for Apache Flink applications?
HAQM Managed Service for Apache Flink needs permissions to read records from the streaming data sources you specify in your application. HAQM Managed Service for Apache Flink also needs permissions to write your application output to specified destinations in your application output configuration. You can grant these permissions by creating AWS Identity and Access Management (IAM) roles that HAQM Managed Service for Apache Flink can assume. The permissions you grant to this role determine what HAQM Managed Service for Apache Flink can do when the service assumes the role. For more information, see the following developer guides:
- Granting permissions in the HAQM Managed Service for Apache Flink Developer Guide.
- Granting permissions in the HAQM Managed Service for Apache Flink Studio Developer Guide.
How does HAQM Managed Service for Apache Flink scale my application?
HAQM Managed Service for Apache Flink elastically scales your application to accommodate the data throughput of your source stream and your query complexity for most scenarios. HAQM Managed Service for Apache Flink provisions capacity in the form of HAQM KPUs. One KPU provides you with 1 vCPU and 4 GB memory.
For Apache Flink applications and Studio notebooks, HAQM Managed Service for Apache Flink assigns 50 GB of running application storage per KPU. Your application uses this storage for checkpoints, and it is also available to you as temporary disk. A checkpoint is an up-to-date backup of a running application used to recover immediately from an application disruption. You can also control the parallel execution of your HAQM Managed Service for Apache Flink application tasks (such as reading from a source or executing an operator) using the Parallelism and ParallelismPerKPU parameters in the API. Parallelism defines the number of concurrent instances of a task. All operators, sources, and sinks run with a defined parallelism, which is one by default. ParallelismPerKPU defines the number of parallel tasks that can be scheduled per KPU of your application, which is also one by default. For more information, see Scaling in the HAQM Managed Service for Apache Flink Developer Guide.
What are the best practices for building and managing my HAQM Managed Service for Apache Flink applications?
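As a rough illustration of how the two parameters relate to capacity, the sketch below estimates the number of KPUs and the running application storage for a given configuration. It assumes that the KPU count is Parallelism divided by ParallelismPerKPU, rounded up; the class and method names are hypothetical and are not part of any AWS SDK. Treat it as back-of-the-envelope arithmetic, not as the service's actual scaling logic.

```java
// Sketch only: class and method names are hypothetical, not an AWS API.
final class KpuEstimate {
    // 50 GB of running application storage per KPU, as stated in the FAQ text.
    static final int STORAGE_GB_PER_KPU = 50;

    // Assumption: KPUs = ceil(parallelism / parallelismPerKpu).
    static int processingKpus(int parallelism, int parallelismPerKpu) {
        if (parallelism < 1 || parallelismPerKpu < 1) {
            throw new IllegalArgumentException("both values must be >= 1");
        }
        // Integer ceiling division.
        return (parallelism + parallelismPerKpu - 1) / parallelismPerKpu;
    }

    static int runningStorageGb(int kpus) {
        return kpus * STORAGE_GB_PER_KPU;
    }

    public static void main(String[] args) {
        int kpus = processingKpus(8, 2);
        System.out.println(kpus + " KPUs, " + runningStorageGb(kpus) + " GB running application storage");
        // prints "4 KPUs, 200 GB running application storage"
    }
}
```

With both parameters left at their default of one, the same arithmetic gives 1 KPU and 50 GB.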
For information about best practices for Apache Flink, see the Best Practices section of the HAQM Managed Service for Apache Flink Developer Guide. The section covers best practices for fault tolerance, performance, logging, coding, and more.
For information about best practices for HAQM Managed Service for Apache Flink Studio, see the Best Practices section of the HAQM Managed Service for Apache Flink Studio Developer Guide. In addition to best practices, this section covers samples for SQL, Python, and Scala applications, requirements for deploying your code as a continuously running stream processing application, performance, logging, and more.
Can I access resources behind an HAQM VPC with an HAQM Managed Service for Apache Flink application?
Yes. You can access resources behind an HAQM VPC. You can learn how to configure your application for VPC access in the Using an HAQM VPC section of the HAQM Managed Service for Apache Flink Developer Guide.
Can a single HAQM Managed Service for Apache Flink application have access to multiple VPCs?
Can an HAQM Managed Service for Apache Flink application that’s connected to a VPC access the internet and AWS service endpoints?
HAQM Managed Service for Apache Flink applications and HAQM Managed Service for Apache Flink Studio notebooks that are configured to access resources in a particular VPC do not have access to the internet as a default configuration. You can learn how to configure access to the internet for your application in the Internet and Service Access section of the HAQM Managed Service for Apache Flink Developer Guide.
Pricing and billing
How much does HAQM Managed Service for Apache Flink cost?
With HAQM Managed Service for Apache Flink, you pay only for what you use. There are no resources to provision or upfront costs associated with HAQM Managed Service for Apache Flink.
You are charged an hourly rate based on the number of HAQM KPUs used to run your streaming application. A single KPU is a unit of stream processing capacity comprised of 1 vCPU compute and 4 GB memory. HAQM Managed Service for Apache Flink automatically scales the number of KPUs required by your stream processing application as the demands of memory and compute vary in response to processing complexity and the throughput of streaming data processed.
For Apache Flink and Apache Beam applications, you are charged a single additional KPU per application for application orchestration. Apache Flink and Apache Beam applications are also charged for running application storage and durable application backups. Running application storage is used for stateful processing capabilities in HAQM Managed Service for Apache Flink and charged per GB-month. Durable application backups are optional, charged per GB-month, and provide a point-in-time recovery point for applications.
For HAQM Managed Service for Apache Flink Studio, in development or interactive mode, you are charged an additional KPU for application orchestration and 1 KPU for interactive development. You are also charged for running application storage. You are not charged for durable application backups.
For more pricing information, see the HAQM Managed Service for Apache Flink pricing page.
Am I charged for an HAQM Managed Service for Apache Flink application that is running but not processing any data from the source?
For Apache Flink and Apache Beam applications, you are charged a minimum of 2 KPUs and 50 GB running application storage if your HAQM Managed Service for Apache Flink application is running.
For HAQM Managed Service for Apache Flink Studio notebooks, you are charged a minimum of 3 KPUs and 50 GB running application storage if your application is running.
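The minimums above follow from the charging rules described earlier: one KPU for processing, one additional KPU for orchestration, and, for Studio notebooks, one more KPU for interactive development. The sketch below reproduces that arithmetic. All names are hypothetical, the hourly rate is a placeholder rather than a real price, and the storage calculation assumes that storage is sized from processing KPUs only, which is consistent with the 50 GB minimum quoted here.

```java
// Sketch only: names and rates are hypothetical, not AWS API values or real prices.
final class MinimumChargeSketch {
    static final int STORAGE_GB_PER_KPU = 50;

    // KPUs billed while running: processing KPUs, plus 1 for orchestration,
    // plus 1 more for interactive development in a Studio notebook.
    static int billedKpus(int processingKpus, boolean studioNotebook) {
        return processingKpus + 1 + (studioNotebook ? 1 : 0);
    }

    // Assumption: running application storage is sized from processing KPUs only.
    static int storageGb(int processingKpus) {
        return processingKpus * STORAGE_GB_PER_KPU;
    }

    // KPU charge for a period, given a rate per KPU-hour supplied by the caller.
    static double kpuCharge(int kpus, double hours, double ratePerKpuHour) {
        return kpus * hours * ratePerKpuHour;
    }

    public static void main(String[] args) {
        double placeholderRate = 0.10; // hypothetical; see the pricing page for real rates
        int flinkKpus = billedKpus(1, false);
        int studioKpus = billedKpus(1, true);
        System.out.println("Flink application: " + flinkKpus + " KPUs, " + storageGb(1) + " GB");
        System.out.println("Studio notebook: " + studioKpus + " KPUs, " + storageGb(1) + " GB");
        System.out.println("24 idle hours at the placeholder rate: "
                + kpuCharge(flinkKpus, 24, placeholderRate));
    }
}
```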
Other than HAQM Managed Service for Apache Flink costs, are there any other costs that I might incur?
Is HAQM Managed Service for Apache Flink available in the AWS Free Tier?
Building Apache Flink applications
What is Apache Flink?
Apache Flink is an open source framework and engine for stream and batch data processing. It makes streaming applications easy to build because it provides powerful operators and solves core streaming problems such as duplicate processing. Apache Flink provides data distribution, communication, and fault tolerance for distributed computations over data streams.
How do I develop applications?
You can start by downloading the open source libraries including the AWS SDK, Apache Flink, and connectors for AWS services. Get instructions on how to download the libraries and create your first application in the HAQM Managed Service for Apache Flink Developer Guide.
What does my application code look like?
You write your Apache Flink code using data streams and stream operators. Application data streams are the data structure you perform processing against using your code. Data continuously flows from the sources into application data streams. One or more stream operators are used to define your processing on the application data streams, including transform, partition, aggregate, join, and window. Data streams and operators can be connected in serial and parallel chains. A short example using pseudo code is shown below.
// Pseudo code: KinesisStreamSource, KinesisStreamSink, GameEvent, and UserPerLevel are placeholder names.
DataStream<GameEvent> rawEvents = env.addSource(
    new KinesisStreamSource("input_events"));
DataStream<UserPerLevel> gameStream =
    rawEvents.map(event -> new UserPerLevel(event.gameMetadata.gameId,
        event.gameMetadata.levelId, event.userId));
gameStream.keyBy(event -> event.gameId)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .apply(...);
gameStream.addSink(new KinesisStreamSink("myGameStateStream"));
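To make the keyBy and tumbling window steps in the pseudo code more concrete, here is a minimal plain-Java simulation that runs without Apache Flink. It groups events by game ID and counts them in fixed, non-overlapping one-minute windows. The class, the GameEvent fields, and the use of a timestamp carried on each event (rather than processing time) are assumptions made so that the example is deterministic. It illustrates the idea only and does not use the Flink API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: simulates keyBy + a one-minute tumbling window in plain Java.
final class TumblingWindowSketch {
    static final long WINDOW_MILLIS = 60_000L;

    // Hypothetical event type; only the fields needed for the example.
    static final class GameEvent {
        final String gameId;
        final long timestampMillis;

        GameEvent(String gameId, long timestampMillis) {
            this.gameId = gameId;
            this.timestampMillis = timestampMillis;
        }
    }

    // Returns gameId -> (window start in millis -> number of events in that window).
    static Map<String, Map<Long, Integer>> countPerGamePerMinute(List<GameEvent> events) {
        Map<String, Map<Long, Integer>> counts = new TreeMap<>();
        for (GameEvent e : events) {
            // Each timestamp belongs to exactly one window, so windows never overlap.
            long windowStart = e.timestampMillis - (e.timestampMillis % WINDOW_MILLIS);
            counts.computeIfAbsent(e.gameId, id -> new TreeMap<>())
                  .merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<GameEvent> events = Arrays.asList(
                new GameEvent("g1", 5_000L), new GameEvent("g1", 59_999L),
                new GameEvent("g1", 60_000L), new GameEvent("g2", 61_000L));
        System.out.println(countPerGamePerMinute(events));
        // prints {g1={0=2, 60000=1}, g2={60000=1}}
    }
}
```

In a real streaming job the input is unbounded and the engine emits each window's result when the window closes. The batch loop here only shows how events are assigned to keys and windows.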
How do I use the Apache Flink operators?
Operators take an application data stream as input and send processed data to an application data stream as output. Operators can be connected to build applications with multiple steps and don’t require advanced knowledge of distributed systems to implement and operate.
What operators are supported?
HAQM Managed Service for Apache Flink supports all operators from Apache Flink that can be used to solve a wide variety of use cases including map, KeyBy, aggregations, windows, joins, and more. For example, the map operator allows you to perform arbitrary processing, taking one element from an incoming data stream and producing another element. KeyBy logically organizes data using a specified key so that you can process similar data points together. Aggregations perform processing across multiple keys such as sum, min, and max. Window Join joins two data streams together on a given key and window.
You can build custom operators if these do not meet your needs. Find more examples in the Operators section of the HAQM Managed Service for Apache Flink Developer Guide. You can find a full list of Apache Flink operators in the Apache Flink documentation.
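The Window Join description can be illustrated with a small plain-Java sketch that does not depend on Apache Flink. Two lists stand in for two data streams, and a pair of events is joined only when both share the same key and fall into the same fixed-size window. The Event type, the field names, and the output format are hypothetical, and timestamps are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: joins two in-memory "streams" on key and window in plain Java.
final class WindowJoinSketch {
    // Hypothetical event type used for both inputs.
    static final class Event {
        final String key;
        final long timestampMillis;
        final String value;

        Event(String key, long timestampMillis, String value) {
            this.key = key;
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    // Emits one joined string for each left/right pair with the same key in the same window.
    static List<String> join(List<Event> left, List<Event> right, long windowMillis) {
        List<String> joined = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                boolean sameKey = l.key.equals(r.key);
                // Integer division maps each timestamp to its window index.
                boolean sameWindow =
                        l.timestampMillis / windowMillis == r.timestampMillis / windowMillis;
                if (sameKey && sameWindow) {
                    joined.add(l.key + ":" + l.value + "+" + r.value);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Event> clicks = Arrays.asList(
                new Event("u1", 1_000L, "click"), new Event("u2", 2_000L, "click"));
        List<Event> purchases = Arrays.asList(
                new Event("u1", 5_000L, "buy"), new Event("u2", 70_000L, "buy"));
        System.out.println(join(clicks, purchases, 60_000L));
        // prints [u1:click+buy]
    }
}
```

The u2 events share a key but land in different one-minute windows, so they are not joined.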
What integrations are supported in an HAQM Managed Service for Apache Flink application?
You can set up prebuilt integrations provided by Apache Flink with minimal code or build your own integration to connect to virtually any data source. The open source libraries based on Apache Flink support streaming sources and destinations, or sinks, to process data delivery. This also includes data enrichment support through asynchronous I/O connectors. Some of these connectors include the following:
- Streaming data sources: HAQM Managed Streaming for Apache Kafka (HAQM MSK), HAQM Kinesis Data Streams
- Destinations, or sinks: HAQM Kinesis Data Streams, HAQM Kinesis Data Firehose, HAQM DynamoDB, HAQM Elasticsearch Service, and HAQM S3 (through file sink integrations)
Can HAQM Managed Service for Apache Flink applications replicate data across streams and topics?
Yes. You can use HAQM Managed Service for Apache Flink applications to replicate data between HAQM Kinesis Data Streams, HAQM MSK, and other systems. An example provided in our documentation demonstrates how to read from one HAQM MSK topic and write to another.
Are custom integrations supported?
You can add a source or destination to your application by building upon a set of primitives enabling you to read and write from files, directories, sockets, or anything that you can access over the internet. Apache Flink provides these primitives for data sources and data sinks. The primitives come with configurations like the ability to read and write data continuously or once, asynchronously or synchronously, and much more. For example, you can set up an application to read continuously from HAQM S3 by extending the existing file-based source integration.
What delivery and processing model do HAQM Managed Service for Apache Flink applications provide?
Apache Flink applications in HAQM Managed Service for Apache Flink use an exactly-once delivery model if an application is built using idempotent operators, including sources and sinks. This means the processed data impacts downstream results once and only once.
By default, HAQM Managed Service for Apache Flink applications use the Apache Flink exactly-once semantics. Your application supports exactly-once processing semantics if you design your applications using sources, operators, and sinks that use Apache Flink’s exactly-once semantics.
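To show why idempotent sinks matter here, the following plain-Java sketch models a sink that stores each record under its record ID. If a record is delivered a second time, for example when it is replayed after a recovery, the second write overwrites the first and the stored result does not change. The class and method names are hypothetical, and this illustrates idempotence in general rather than how Apache Flink implements its guarantees.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a sink whose writes are idempotent because they are keyed by record ID.
final class IdempotentSinkSketch {
    private final Map<String, Integer> store = new HashMap<>();

    // Writing the same (recordId, value) again leaves the store unchanged.
    void write(String recordId, int value) {
        store.put(recordId, value);
    }

    int size() {
        return store.size();
    }

    int total() {
        int sum = 0;
        for (int v : store.values()) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        IdempotentSinkSketch sink = new IdempotentSinkSketch();
        sink.write("a", 10);
        sink.write("b", 20);
        sink.write("b", 20); // simulated replay of record "b"
        System.out.println(sink.size() + " records, total " + sink.total());
        // prints "2 records, total 30"
    }
}
```

A sink that appended every delivery would instead report three records and a total of 50 after the replay.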
Do I have access to local storage from my application storage?
How does HAQM Managed Service for Apache Flink automatically back up my application?
HAQM Managed Service for Apache Flink automatically backs up your running application’s state using checkpoints and snapshots. Checkpoints save the current application state and enable HAQM Managed Service for Apache Flink applications to recover the application position to provide the same semantics as a failure-free execution. Checkpoints use running application storage. Checkpoints for Apache Flink applications are provided through Apache Flink’s checkpointing functionality. Snapshots save a point-in-time recovery point for applications and use durable application backups. Snapshots are analogous to Flink savepoints.
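The recovery behavior described above can be sketched in plain Java. In this simplified model, a checkpoint records the application state (a running sum) together with the position in the input. After a simulated failure, the job restores the last checkpoint and reprocesses the input from that position, ending with the same result as a run without failures. All names are hypothetical, and the model leaves out everything that makes real checkpointing hard, such as distributed state and asynchronous snapshots.

```java
// Sketch only: a single-threaded model of checkpoint and restore.
final class CheckpointSketch {
    // A checkpoint captures the state and the input position together.
    static final class Checkpoint {
        final int offset;
        final long sum;

        Checkpoint(int offset, long sum) {
            this.offset = offset;
            this.sum = sum;
        }
    }

    private int offset = 0;
    private long sum = 0;

    // Processes input elements from the current offset up to (not including) upTo.
    void process(int[] input, int upTo) {
        while (offset < upTo && offset < input.length) {
            sum += input[offset];
            offset++;
        }
    }

    Checkpoint checkpoint() {
        return new Checkpoint(offset, sum);
    }

    void restore(Checkpoint c) {
        offset = c.offset;
        sum = c.sum;
    }

    long sum() {
        return sum;
    }

    public static void main(String[] args) {
        int[] input = {1, 2, 3, 4, 5};
        CheckpointSketch job = new CheckpointSketch();
        job.process(input, 3);
        Checkpoint saved = job.checkpoint(); // state after three elements
        job.process(input, 4);               // progress that is lost in the failure
        job.restore(saved);                  // recover from the last checkpoint
        job.process(input, input.length);    // reprocess from the saved position
        System.out.println(job.sum());
        // prints 15, the same as a run without failures
    }
}
```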
What are application snapshots?
What versions of Apache Flink are supported?
To learn more about supported Apache Flink versions, visit the HAQM Managed Service for Apache Flink Release Notes page. This page also includes the versions of Apache Beam, Java, Scala, Python, and AWS SDKs that HAQM Managed Service for Apache Flink supports.
Can HAQM Managed Service for Apache Flink applications run Apache Beam?
Yes, HAQM Managed Service for Apache Flink supports streaming applications built using Apache Beam. You can build Apache Beam streaming applications in Java and run them in different engines and services including using Apache Flink on HAQM Managed Service for Apache Flink. You can find information regarding supported Apache Flink and Apache Beam versions in the HAQM Managed Service for Apache Flink Developer Guide.
Building HAQM Managed Service for Apache Flink Studio applications in a managed notebook
How do I develop a Studio application?
You can start from the HAQM Managed Service for Apache Flink Studio, HAQM Kinesis Data Streams, or HAQM MSK consoles in a few steps to launch a serverless notebook to immediately query data streams and perform interactive data analytics.
Interactive data analytics: You can write code in the notebook in SQL, Python, or Scala to interact with your streaming data, with query response times in seconds. You can use built-in visualizations to explore the data, view real-time insights on your streaming data from within your notebook, and develop stream processing applications powered by Apache Flink.
Once your code is ready to run as a production application, you can transition with a single step to a stream processing application that processes gigabytes of data per second, without servers.
Stream processing application: Once you are ready to promote your code to production, you can build your code by clicking “Deploy as stream processing application” in the notebook interface or issue a single command in the CLI. Studio takes care of all the infrastructure management necessary for you to run your stream processing application at scale, with auto scaling and durable state enabled, just as in an HAQM Managed Service for Apache Flink application.
What does my application code look like?
What SQL operations are supported?
You can perform SQL operations such as the following:
- Scan and filter (SELECT, WHERE)
- Aggregations (GROUP BY, GROUP BY WINDOW, HAVING)
- Set (UNION, UNION ALL, INTERSECT, IN, EXISTS)
- Order (ORDER BY, LIMIT)
- Joins (INNER, OUTER, Timed Window – BETWEEN, AND, Joining with Temporal Tables – tables that track changes over time)
- Top-N
- Deduplication
- Pattern recognition
Some of these queries, such as GROUP BY, OUTER JOIN, and Top-N, are result-updating queries on streaming data, which means that their results are continuously updated as the streaming data is processed. DDL statements, such as CREATE, ALTER, and DROP, are also supported. For a complete list of queries and samples, see the Apache Flink Queries documentation.
How are Python and Scala supported?
Apache Flink’s Table API supports Python and Scala through language integration using Python strings and Scala expressions. The operations supported are very similar to the SQL operations supported, including select, order, group, join, filter, and windowing. A full list of operations and samples is included in our developer guide.
What versions of Apache Flink and Apache Zeppelin are supported?
To learn more about supported Apache Flink versions, visit the HAQM Managed Service for Apache Flink Release Notes page. This page also includes the versions of Apache Zeppelin, Apache Beam, Java, Scala, Python, and AWS SDKs that HAQM Managed Service for Apache Flink supports.
What integrations are supported by default in an HAQM Managed Service for Apache Flink Studio application?
- Data sources: HAQM Managed Streaming for Apache Kafka (HAQM MSK), HAQM Kinesis Data Streams, HAQM S3
- Destinations, or sinks: HAQM MSK, HAQM Kinesis Data Streams, and HAQM S3
Are custom integrations supported?
Service Level Agreement
What does the HAQM Managed Service for Apache Flink SLA guarantee?
How do I know if I qualify for an SLA Service Credit?
You are eligible for an SLA Service Credit for HAQM Managed Service for Apache Flink under the HAQM Managed Service for Apache Flink SLA if more than one Availability Zone in which you are running a task, within the same AWS Region, has a Monthly Uptime Percentage of less than 99.9% during any monthly billing cycle. For full details on all the SLA terms and conditions as well as details on how to submit a claim, visit the HAQM Managed Service for Apache Flink SLA details page.