Wade Waldron

Staff Software Practice Lead

Datastream Programming

Overview

The software landscape has changed. Users are no longer satisfied with waiting for data. They want instant results. Meanwhile, the amount of data being collected has grown to staggering proportions. The result is that many of the batch-driven processes we used to rely on are no longer sufficient. Businesses have been forced to transition to event-driven systems and datastream programming. In this new environment, the tool of choice is increasingly becoming Apache Flink. This video will discuss the nature of the changing software landscape and introduce Apache Flink as a way to bridge the gaps.

Topics:

  • Increases in data collection
  • Changes in user expectations
  • Datastream programming
  • Challenges with streaming data
  • Apache Flink

Resources

Use the promo code FLINKJAVA101 to get $25 of free Confluent Cloud usage


Datastream Programming

Hi, I'm Wade from Confluent. In this video, we're going to discuss some of the evolutions that have occurred in the software industry and how they have led us to datastream programming tools such as Flink.

A Mountain of Data

If we were to peel back the surface of many modern computer systems, we'd find a mountain of data underneath. This can be database tables, but it can also include logs, metrics, and events. However, it wasn't always that way. Go back a decade or even less, and the amount of data was orders of magnitude smaller.

Batch Jobs

In those days, data was often processed by periodic batch jobs. Initially, those jobs might have run once a day or even once a week, but as the amount of data increased, the time intervals needed to shrink in order to avoid costly failures. Imagine if processing a full day of data took 13 hours. That means that if you fail on the 12th hour of processing, there's not enough time left in the day to repeat all of the work. By shortening the cycle and reducing the batch size, we can reduce the impact of the failures. Rather than having to reprocess a full day, maybe we only need to redo an hour. That leaves us plenty of breathing room to process the rest.

User Expectations

But those cycle times, even if they're shortened to hours or even minutes, are still too long. Modern users have been conditioned to expect instant results. If they log into Facebook or TikTok and comment on a video, they don't want to wait an hour for a batch process to pick up their comment. They want to see the results immediately.

Micro-batching

Of course, we could shrink the batch sizes even further. Rather than processing batches every minute or every hour, we could process them every second or even less. This is sometimes known as micro-batching. But it raises an important question: if we keep shrinking our batches to smaller and smaller sizes, eventually we're going to end up with batches that contain only a single record. If single-record batches are the natural endpoint of micro-batching, why wouldn't we skip batching completely and process our records one at a time? This is one of the goals of datastream programming.

Datastream Programming

Rather than dealing with batches of data, datastream programming prefers to process records as they arrive. These records are transmitted in the form of events. They're pushed through a series of operators connected in a directed graph. Each operator performs a transformation on the data and emits the result as a new event. From an outside perspective, the operators can be treated as black boxes with a set of inputs and outputs. One of the advantages of datastream programming is that it tends to be easier to parallelize and distribute. This makes it an excellent candidate for building cloud-native applications. However, it doesn't come without some complexity. Figuring out how to manage and distribute all of the operations can take a lot of expertise, and as the number of operations increases, it only gets more difficult.

Managing State

We also run into challenges managing state across operations. In a perfect world, each event could be treated as an independent unit that doesn't rely on any other events, but that's often not the case. Instead, events are interconnected, and sorting out those connections requires us to keep track of state. We could keep that state in a database, but at scale, this tends to be slow. We often try to speed it up with in-memory caching, but in-memory caches are difficult to distribute effectively.
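Before looking at how Flink addresses these problems, it helps to see the operator model in its simplest possible form. Below is a minimal, framework-free sketch in plain Java; the event types and operator names are invented purely for illustration. Each operator is just a function that transforms one event into a new one, chaining them forms a simple linear graph, and records are processed one at a time as they arrive.

```java
import java.util.List;
import java.util.function.Function;

public class OperatorSketch {

    // An incoming event: a raw comment posted by a user (hypothetical example type).
    record CommentEvent(String user, String text) {}

    // The event emitted downstream after transformation.
    record EnrichedComment(String user, String text, long processedAt) {}

    public static void main(String[] args) {
        // Operator 1: normalize the raw text and emit a new event.
        Function<CommentEvent, CommentEvent> normalize =
            e -> new CommentEvent(e.user(), e.text().trim().toLowerCase());

        // Operator 2: enrich the event with a processing timestamp.
        Function<CommentEvent, EnrichedComment> enrich =
            e -> new EnrichedComment(e.user(), e.text(), System.currentTimeMillis());

        // Connecting the operators forms a (linear) graph of transformations.
        Function<CommentEvent, EnrichedComment> pipeline = normalize.andThen(enrich);

        // Process records one at a time as they "arrive", with no batching involved.
        List<CommentEvent> incoming = List.of(
            new CommentEvent("alice", "  Great video!  "),
            new CommentEvent("bob", "Nice explanation"));
        incoming.forEach(event -> System.out.println(pipeline.apply(event)));
    }
}
```

Of course, this sketch ignores everything that makes streaming hard at scale: there is no parallelism, no distribution across machines, and no state shared between events.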
This is where Flink steps in to help us.

Apache Flink

Flink is a distributed datastream programming engine. It includes facilities for managing and distributing your datastream operations, and it provides built-in support for stateful operations that can be distributed using clearly defined rules. As a result, it has rapidly become one of the key tools for building large-scale streaming platforms. Throughout this course, we will be diving deeper into various aspects of Flink, and we'll see how to implement a simple Flink datastream pipeline using Java (a minimal sketch of such a pipeline appears below). If you aren't already on Confluent Developer, head there now using the link in the video description to access the rest of this course and its hands-on exercises.
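To give a flavor of what that pipeline might look like, here is a minimal sketch of a Flink datastream job in Java. It is an illustration only: it assumes the flink-streaming-java dependency is on the classpath, and the sample data, operator logic, and job name are invented rather than taken from the course exercises.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HelloFlink {
    public static void main(String[] args) throws Exception {
        // The execution environment manages and distributes the operators for us.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("stream", "processing", "with", "flink") // source operator
            .map(word -> word.toUpperCase())                      // transformation operator
            .filter(word -> word.length() > 4)                    // another operator in the graph
            .print();                                             // sink operator

        // Nothing runs until the job graph is submitted for execution.
        env.execute("hello-flink");
    }
}
```

Each chained call adds another operator to the job graph, and Flink takes care of scheduling, parallelizing, and distributing those operators once execute() submits the job.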