Staff Software Practice Lead
The Table API for Apache Flink provides an easy-to-learn DSL (Domain-Specific Language) for querying data streams. It allows developers to write Flink queries in common languages like Java and Python but leverages an easy-to-understand syntax that resembles SQL.
In Confluent Cloud, the underlying implementation of the Table API differs from what you might expect if you were used to the open-source version. It takes the SQL-like syntax of the Table API and translates it into actual SQL statements. These statements are then executed in Confluent Cloud.
This translation abstracts away some of the complexities of working with languages like Java and Python. Statements are more resilient to breaking changes that result from library or SDK upgrades. Meanwhile, security vulnerabilities in your application's dependencies are far less of a concern for the running statement, since Confluent Cloud never sees that code.
Learn more about how the Flink Table API works with Confluent Cloud in this video.
I've spent the last several weeks looking at Confluent Cloud's support for the Apache Flink Table API, which provides interesting new functionality for developers. For one, we can now write Flink code in languages like Python or Java. That said, it doesn't work quite the way I expected. Let me show you what I mean.
Flink is a stream processing engine that supports two modes of operation. It can handle batch processing or stateful stream processing using the same API. That's pretty great, but, honestly, I'm done with batch processing, and you probably should be too. Let's focus on stream processing.
Stream processing takes an unbounded sequence of events and runs them through a pipeline, potentially emitting new events along the way. Each step can store state that can be retrieved by future events. In the event the stream fails, the state can also be used to recover the stream. This allows Flink to be resilient to failures.
Originally, this was all done using the DataStream API for Flink. It allows you to write stateful functions that operate on individual events and connect them into rich pipelines. However, while the DataStream API is powerful, it isn't very approachable.
But what if we flipped things around and treated a stream of events like a database table? Could we allow developers to use the same tools to access a data stream as they use to read a database? This is the origin of the SQL API for Flink. It uses standard SQL to write queries against data streams by treating them like tables. This helps simplify some of the complexities of stream processing.
Although SQL is familiar, it's a declarative language and lacks features found in imperative languages such as Java or Python.
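The stateful step described above can be sketched in plain Python. This is only an illustration of the idea, not Flink code: the function name, the event shape, and the way saved state is passed back in are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Optional, Tuple


def running_totals(
    events: Iterable[Tuple[str, int]],
    state: Optional[Dict[str, int]] = None,
) -> Iterator[Tuple[str, int]]:
    """Emit a running total per key, one output event per input event."""
    # The state outlives any single event. Passing in a saved copy stands in
    # (hypothetically) for restoring from a checkpoint after a failure.
    totals = defaultdict(int, state or {})
    for key, amount in events:
        totals[key] += amount
        yield key, totals[key]


# A short, finite list stands in for an unbounded stream of events.
orders = [("alice", 5), ("bob", 3), ("alice", 2)]
results = list(running_totals(orders))

# Resuming with previously saved state picks up where the step left off.
resumed = list(running_totals([("alice", 1)], state={"alice": 7, "bob": 3}))
```

Each incoming event reads and updates the stored total for its key, which is the sense in which state "can be retrieved by future events."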
Developers often reach for these languages because they want features like auto-completion, automated testing, complex coding structures such as loops, functions, and recursion, or a robust type system. For example, if you wanted to create a query that touched 500 different columns, in SQL you'd have to write that query manually. But what if you could leverage tools like loops, functions, and variables to generate the query for you?
So, for those developers who are looking for a rich development experience, the Table API was created. It provides a domain-specific language for processing data streams as tables. It resembles SQL and provides access to familiar constructs like select, join, and window, but it's implemented in languages like Java and Python. This means it can leverage common libraries, functions, and programming constructs from those languages. However, it's important to remember that these are not, in fact, database tables. They are potentially infinite streams of data. That has consequences that we'll dive into in a moment.
But before we get to that, I made a promise at the beginning of the video. I said I would explain how the Flink Table API didn't work the way I expected. What did I mean?
When you write Table API code for open-source Flink, it produces a set of low-level Flink instructions. These are run through an optimizer and planner to determine how to execute them in the Flink cluster. This process is largely invisible. You call the flink run command, point it at your code or compiled artifact, and Flink does the rest. The impression is that the Flink cluster is executing your code.
In reality, it's more complicated than that. Portions of your code might be executed locally by the Flink command line interface. It will create the low-level instructions and upload them to the cluster, where they will be executed by Flink. So, Flink isn't necessarily executing your code. Instead, it uses your code to build a set of instructions and executes those. This is the behavior I was expecting.
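Two ideas from this part of the talk, generating a wide query with a loop and code that builds instructions rather than doing the work itself, can be sketched together in plain Python. This is not the Table API: the function, the table name, and the column names are hypothetical, and nothing here is sent to Flink.

```python
from typing import List


def build_sum_query(table: str, columns: List[str]) -> str:
    """Build a SELECT that sums every listed column."""
    # One generated expression per column replaces hundreds of
    # hand-written lines.
    select_list = ",\n  ".join(
        f"SUM(`{column}`) AS `{column}_total`" for column in columns
    )
    return f"SELECT\n  {select_list}\nFROM `{table}`"


# Hypothetical table with 500 numeric columns named metric_0..metric_499.
columns = [f"metric_{i}" for i in range(500)]
query = build_sum_query("sensor_readings", columns)
```

Running this only produces a description of the work. Whatever executes the query later never needs to see the loop that wrote it.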
However, in Confluent Cloud, it works a little bit differently. Remember when I said that the Table API resembles SQL? It turns out that rather than using the Table API to build low-level instructions, we can instead use it to construct an actual SQL statement. This is what the Confluent Cloud Plugin does. When you execute a Table API statement, it is converted to the equivalent SQL. The SQL statement is sent to Confluent Cloud to be executed. Essentially, in Confluent Cloud, those low-level instructions have been replaced by actual SQL code.
It's important to recognize that Confluent Cloud isn't executing your original code. You are responsible for running your application in an appropriate environment. And since Confluent Cloud only sees the SQL code, anything you do in your application must be compatible with an equivalent SQL statement. Otherwise, Confluent Cloud won't be able to run it. (On screen: the DataStream API appears connected to the code, and a line from it leads to a red X next to the SQL code.)
However, this has some significant benefits. Languages like Java and Python are prone to version upgrades that cause breaking changes. I'm sure you have encountered scenarios where an upgrade to a Java or Python library forces you to rewrite swaths of code. Or you attempt to deploy your code and discover it was built with a different version than the host system and won't run as a result. Not to mention, security flaws discovered in common libraries can force you to upgrade your versions, even though you might not be ready.
If Confluent Cloud worked directly with your Java or Python application, you'd be subject to these limitations. However, because it only sees the resulting SQL, you can generate that SQL with whatever library or SDK versions you want. As long as the SQL is compatible, how it's generated doesn't matter. And security flaws, if they existed, could be patched in the underlying SQL engine, rather than forcing you to upgrade your code.
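The translation step described above, where a SQL-like API in a host language is rendered into an actual SQL statement, can be illustrated with a toy builder. This is a minimal sketch and an assumption about the general shape of such a translation. It is not the Confluent Cloud Plugin or the Flink Table API, and the Query class, its methods, and the orders table are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Query:
    """A tiny, hypothetical query builder that renders itself as SQL."""

    table: str
    columns: Tuple[str, ...] = ("*",)
    filters: Tuple[str, ...] = ()

    def select(self, *columns: str) -> "Query":
        # Each call returns a new Query, so calls can be chained.
        return Query(self.table, columns, self.filters)

    def where(self, condition: str) -> "Query":
        return Query(self.table, self.columns, self.filters + (condition,))

    def to_sql(self) -> str:
        # Only this string would ever leave the application.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({f})" for f in self.filters)
        return sql


statement = (
    Query("orders")
    .select("customer_id", "price")
    .where("price > 100")
    .to_sql()
)
```

Because the remote side would receive only the rendered statement, the Python version and libraries used to produce it are invisible to it. The same property means that anything the builder cannot express as SQL has nowhere to go.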
This means your Flink jobs will likely remain compatible for much longer than they would if you had to worry about keeping up with the latest versions. Some of this will change with user-defined functions, but we'll leave those out of the equation for now.
Just remember, although we use the Table API and SQL to process data streams, it's not the same as working with a database table. Database tables are finite. The entire table is available and can be indexed to allow rapid searches. Meanwhile, many queries tend to be read-only. Data streams, on the other hand, are often infinite and are processed one event at a time. We almost always write the results to another data stream, which results in long-running operations.
This has consequences. For example, if you perform a search on a database table, it could use an index to find the result, and if it isn't found, it could return null. Performing the same search on a data stream could mean reading the entire stream from the beginning. Furthermore, if the element isn't found, that doesn't mean it won't show up eventually. So rather than returning a null, the query may not return at all as it waits for future elements to arrive.
So, although we can use similar tools to query both database tables and data streams, it is important to understand that there are some fundamental differences between them. If we recognize and acknowledge those differences, the tools can be quite powerful. But if we ignore them, we might find ourselves chewing through system resources and making an awful mess.
The Table API is a great starting point if you're a developer who wants to get into Flink. It provides both power and flexibility and is still relatively easy to learn. I hope this video was helpful. The team and I are already working on more videos that go into greater depth. Feel free to drop us a comment below if you have specific requests. Thanks for watching.