Wade Waldron

Staff Software Practice Lead

Datastream Programming

Overview

The software landscape has changed. Users are no longer satisfied with waiting for data. They want instant results. Meanwhile, the amount of data being collected has grown to staggering proportions. The result is that many of the batch-driven processes we used to rely on are no longer sufficient. Businesses have been forced to transition to event-driven systems and datastream programming. In this new environment, the tool of choice is increasingly becoming Apache Flink. This video will discuss the nature of the changing software landscape and introduce Apache Flink as a way to bridge the gaps.

Topics:

  • Increases in data collection
  • Changes in user expectations
  • Datastream programming
  • Challenges with streaming data
  • Apache Flink

Resources

Use the promo code FLINKJAVA101 to get $25 of free Confluent Cloud usage


Datastream Programming

Hi, I'm Wade from Confluent. In this video, we're going to discuss some of the evolutions that have occurred in the software industry and how they have led us to datastream programming tools such as Flink.

A Mountain of Data

If we were to peel back the surface of many modern computer systems, we'd find a mountain of data underneath. This can be database tables, but it can also include logs, metrics, and events. However, it wasn't always that way. Go back a decade or even less, and the amount of data was orders of magnitude smaller.

Batch Jobs

In those days, data was often processed by periodic batch jobs. Initially, those jobs might have run once a day or even once a week, but as the amount of data increased, the time intervals needed to shrink in order to avoid costly failures. Imagine if processing a full day of data took 13 hours. That means that if you fail on the 12th hour of processing, there's not enough time left in the day to repeat all of the work. By shortening the cycle and reducing the batch size, we can reduce the impact of the failures. Rather than having to reprocess a full day, maybe we only need to redo an hour. That leaves us plenty of breathing room to process the rest.

User Expectations

But those cycle times, even if they're shortened to hours or even minutes, are still too long. Modern users have been conditioned to expect instant results. If they log into Facebook or TikTok and comment on a video, they don't want to wait an hour for a batch process to pick up their comment. They want to see the results immediately.

Micro-batching

Of course, we could shrink the batch sizes even further. Rather than processing batches every minute or every hour, we could process them every second or even less. This is sometimes known as micro-batching. But it raises an important question: if we keep shrinking our batches to smaller and smaller sizes, eventually we're going to end up with batches that contain only a single record. If single-record batches are the natural endpoint of micro-batching, why wouldn't we skip batching completely and process our records one at a time? This is one of the goals of datastream programming.

Datastream Programming

Rather than dealing with batches of data, datastream programming prefers to process records as they arrive. These records are transmitted in the form of events. They're pushed through a series of operators connected in a directed graph. Each operator performs a transformation on the data and emits the result as a new event. From an outside perspective, the operators can be treated as black boxes with a set of inputs and outputs. One of the advantages of datastream programming is that it tends to be easier to parallelize and distribute. This makes it an excellent candidate for building cloud-native applications. However, it doesn't come without some complexity. Figuring out how to manage and distribute all of the operations can take a lot of expertise, and as the number of operations increases, it only gets more difficult.

Managing State

We also run into challenges managing state across operations. In a perfect world, each event could be treated as an independent unit that doesn't rely on any other events, but that's often not the case. Instead, events are interconnected, and sorting out those connections requires us to keep track of state. We could keep that state in a database, but at scale, this tends to be slow. We often try to speed it up with in-memory caching, but in-memory caches are difficult to distribute effectively.
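Before looking at how Flink addresses these problems, it helps to see the operator model in its simplest possible form. Below is a minimal, framework-free sketch in plain Java; the event types and operator names are invented purely for illustration. Each operator is just a function that transforms one event into a new one, chaining them forms a simple linear graph, and records are processed one at a time as they arrive.

```java
import java.util.List;
import java.util.function.Function;

public class OperatorSketch {

    // An incoming event: a raw comment posted by a user (hypothetical example type).
    record CommentEvent(String user, String text) {}

    // The event emitted downstream after transformation.
    record EnrichedComment(String user, String text, long processedAt) {}

    public static void main(String[] args) {
        // Operator 1: normalize the raw text and emit a new event.
        Function<CommentEvent, CommentEvent> normalize =
            e -> new CommentEvent(e.user(), e.text().trim().toLowerCase());

        // Operator 2: enrich the event with a processing timestamp.
        Function<CommentEvent, EnrichedComment> enrich =
            e -> new EnrichedComment(e.user(), e.text(), System.currentTimeMillis());

        // Connecting the operators forms a (linear) graph of transformations.
        Function<CommentEvent, EnrichedComment> pipeline = normalize.andThen(enrich);

        // Process records one at a time as they "arrive", with no batching involved.
        List<CommentEvent> incoming = List.of(
            new CommentEvent("alice", "  Great video!  "),
            new CommentEvent("bob", "Nice explanation"));
        incoming.forEach(event -> System.out.println(pipeline.apply(event)));
    }
}
```

Of course, this sketch ignores everything that makes streaming hard at scale: there is no parallelism, no distribution across machines, and no state shared between events.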
This is where Flink steps in to help us.

Apache Flink

Flink is a distributed datastream programming engine. It includes facilities for managing and distributing your datastream operations, and it provides built-in support for stateful operations that can be distributed using clearly defined rules. As a result, it has rapidly become one of the key tools for building large-scale streaming platforms. Throughout this course, we will be diving deeper into various aspects of Flink, and we'll see how to implement a simple Flink datastream pipeline using Java (a minimal sketch of such a pipeline appears below). If you aren't already on Confluent Developer, head there now using the link in the video description to access the rest of this course and its hands-on exercises.
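To give a flavor of what that pipeline might look like, here is a minimal sketch of a Flink datastream job in Java. It is an illustration only: it assumes the flink-streaming-java dependency is on the classpath, and the sample data, operator logic, and job name are invented rather than taken from the course exercises.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HelloFlink {
    public static void main(String[] args) throws Exception {
        // The execution environment manages and distributes the operators for us.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("stream", "processing", "with", "flink") // source operator
            .map(word -> word.toUpperCase())                      // transformation operator
            .filter(word -> word.length() > 4)                    // another operator in the graph
            .print();                                             // sink operator

        // Nothing runs until the job graph is submitted for execution.
        env.execute("hello-flink");
    }
}
```

Each chained call adds another operator to the job graph, and Flink takes care of scheduling, parallelizing, and distributing those operators once execute() submits the job.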