This video positions Flink SQL as a query processor that executes SQL queries in a way that is similar to how some relational databases perform incremental materialized view maintenance.
For more background on these concepts, see the documentation. To get started using Flink SQL, make your way to the next module.
Flink SQL is a lot like a database, in that:
• it is based on the SQL standard;
• it uses many of the same abstractions, such as tables and catalogs; and
• it provides strong guarantees you can rely on to build applications that will behave correctly.
In this video I want to talk about what Flink SQL is, and why you are likely to find it useful.
I'm going to do this by comparing Flink SQL to something more familiar: namely, SQL databases.
And ultimately, the real question is probably this:
What can you do with Flink SQL, and how does that relate to what you are doing now with databases?
Stick around; I'll answer this question at the end of the video.
SQL is a programming language originally developed in the 1970s for managing and processing data in relational databases.
My perspective on SQL databases comes from more than a decade of experience building things like web apps, microservices, and BI tools using databases like Postgres, MySQL, and Oracle.
However, for the past 8 years I’ve been focused on Apache Flink.
Apache Flink is a framework for doing stream processing, which means it’s software that helps you harness the power of a cluster of computers to process data in real-time, as that data is being produced.
For example, this data might be events describing the orders and payments for goods related to your business, and you may want to process this data in real-time in order to provide your customers with better experiences.
I believe that Flink SQL is the best way to get started with Flink.
And even though Flink SQL is "just an API for building Flink applications", it has a lot in common with SQL databases, so it's going to be interesting to take a close look at this comparison.
Turning our attention to SQL databases: traditionally, database applications have been categorized into one of two domains.
The operational domain is for handling real-time applications such as payment and order processing, and the analytical domain involves processing large volumes of data, typically for business reporting and forecasting.
Databases hold data that is stored in tables, using storage managed by the database.
Tables have names, like this "orders" table, and tables store rows of data organized into columns that have names and data types.
This orders table has columns for the order ID, the customer ID, the price, and the time the order was placed.
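In plain SQL, a table with this shape might be declared something like the following; the exact column names and types aren't reproduced in this transcript, so treat this as an illustrative sketch:

CREATE TABLE orders (
  order_id     BIGINT,
  customer_id  BIGINT,
  price        DECIMAL(10, 2),
  order_time   TIMESTAMP
);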
Databases respond to queries, like this one, and the result of this query is a number telling us how many rows of data are in the orders table right now.
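The on-screen query isn't shown in this transcript, but a representative version looks like this:

SELECT COUNT(*) FROM orders;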
To compute this result, the database will scan through the table, counting the rows.
This is a brute-force, batch-oriented approach that touches all of the relevant data and produces a result.
Here's an interesting variation on that query. Now instead of counting all of the orders, we are asking the database to partition the orders by customer, and count them on a per-customer basis.
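Again as a sketch, using the column names assumed above:

SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;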
Now we can see that those 3 orders are divided across two customers, as shown in the table of results.
The most important component in a database is the query processor, which receives queries, and produces results.
Query processing involves two key phases: first, a planning phase, where the query is parsed and compiled into an execution plan.
In order to create this execution plan, the query planner must have access to metadata about the tables, functions, and other database objects used by the query. This information lives in a catalog.
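You can get a feel for what the catalog knows by asking the database to list and describe its tables. The exact statements vary by dialect, but the MySQL-style forms, which Flink SQL also supports, look like this:

SHOW TABLES;
DESCRIBE orders;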
And in the second phase of query processing, the plan is executed.
This execution phase needs access to the storage used by the input and output tables used by the query. Handling this data storage is the responsibility of a storage engine.
Before connecting this background on SQL databases to what's happening inside of the Flink SQL runtime, we need to look at one more concept from traditional databases.
Building on the observation that the result of a SQL query is itself a table, most SQL databases will let you treat the result of a query as a logical table, called a view. I've found this to be especially convenient when building complex reports on a data warehouse, because you can decompose a complex query into reusable building blocks without having to physically store those intermediate results.
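For example, the per-customer order count from earlier could be wrapped up as a view (the view name here is my own):

CREATE VIEW order_counts_per_customer AS
  SELECT customer_id, COUNT(*) AS order_count
  FROM orders
  GROUP BY customer_id;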
Some databases take this one step further, with a concept called materialized views. A materialized view is like a view, except that the result of the query is materialized, meaning it is stored as a physical table that serves as a cache. This is helpful in situations where a query is going to be used over and over.
Here's the same query we were just looking at, which counts the orders placed by each customer. Each time this query is executed, the database scans the orders table, counting orders.
What we can try instead is to define a materialized view based on this query. The first time this view is accessed, the table will be scanned -- but now the database will cache the result.
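In Postgres-style syntax (materialized view support and syntax vary across databases), that might look like this:

CREATE MATERIALIZED VIEW customer_order_counts AS
  SELECT customer_id, COUNT(*) AS order_count
  FROM orders
  GROUP BY customer_id;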
If this view is accessed again before the underlying table is updated, the database can immediately respond with the cached result.
And as the underlying orders table is modified -- for example, by inserting new orders -- then the materialized view can update the cache accordingly.
The simplest way for a database to refresh a materialized view is to re-execute the query in its entirety after each update.
This is expensive, but straightforward. And it's still worthwhile if the view is read more often than the underlying tables are updated.
This is the complete refresh strategy for materialized view maintenance.
More interesting, though, is the "incremental update" strategy, where the cached results are automatically kept in sync with the underlying tables, and this is done incrementally, thus requiring less total effort than a complete refresh.
So how does this work?
When the materialized view is first created, the view is bootstrapped with a complete refresh, just as before.
But then after each subsequent update to the underlying table, that change to the base table is combined with the previous version of the materialized view to produce an updated view.
In our example that is counting orders for each customer, each time an order is added to the orders table, the counter for that specific customer is incremented in the materialized view. If the orders table is large, this incremental update strategy will be dramatically more efficient than scanning the table each time.
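Conceptually (this isn't how any particular database implements it), the incremental update for a newly inserted order from, say, customer 3069 amounts to an upsert like this Postgres-style statement, assuming customer_id is a unique key in the materialized result:

INSERT INTO customer_order_counts (customer_id, order_count)
VALUES (3069, 1)
ON CONFLICT (customer_id)
DO UPDATE SET order_count = customer_order_counts.order_count + 1;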
Incremental materialized views are complex to implement, and few databases support them. And even when they are available, this feature can come with limitations or caveats. If this is something you have experience with, leave a comment below and tell us about it.
Switching gears now to talk about how Flink SQL works, Flink executes what we in the Flink community describe as continuous queries on dynamic tables, or continuous queries on streams.
This is very similar to incremental updates on materialized views.
When Flink SQL is used in streaming mode, it treats a query like this one as a description of a continuous process — in this case, counting orders for each customer from a never-ending stream of orders.
Flink compiles this query into a new Flink application, and then streams the order events through that application, in much the same way that a database would scan through an orders table to collect results.
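Here's a sketch of what that can look like with open-source Flink SQL; the connector options, topic name, and column names are all illustrative, and in Confluent Cloud the table would already exist, backed by a Kafka topic registered in Schema Registry:

CREATE TABLE orders (
  order_id     BIGINT,
  customer_id  BIGINT,
  price        DECIMAL(10, 2),
  order_time   TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;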
As the first order is processed, the Flink runtime records internally the fact that one order has been processed for customer 3069, and emits an event describing how the end result has now been changed — namely, that an entry for this customer has been inserted with a value of 1.
And for the next order we do the same thing, this time for customer 1834.
With the third order, the effect is a bit different. This time the result is an update of the count for customer 1834 from 1 to 2, expressed as a retraction of the previous entry, along with an update.
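Written in the notation the Flink SQL client uses for changelog rows, where +I is an insert and -U/+U are the before and after images of an update, the output for these three orders looks roughly like this:

+I (customer_id=3069, order_count=1)
+I (customer_id=1834, order_count=1)
-U (customer_id=1834, order_count=1)
+U (customer_id=1834, order_count=2)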
What you're seeing here is that Flink SQL is designed to consume and produce streams, and those streams are, in general, changelog streams.
While Flink SQL addresses the same problem that databases face when they want to incrementally update materialized views, it's important to realize that Flink SQL doesn't actually maintain materialized views internally.
Flink SQL does maintain state, which it keeps in a distributed key/value store. The runtime is optimized to keep as little state as possible while still producing correct results. For a grouped aggregation like this query, that's exactly the same state a materialized view would keep around, but for some queries, Flink can get away with storing less.
I'll explore this topic in more detail in an upcoming video.
Now it's interesting to consider what would happen if Flink were to try to use the complete refresh strategy instead. This would mean that Flink would either have to re-ingest the entire input stream as it processes each event, which would just be ridiculous, or it would have to mirror all of that data into its own state store, which would also be unworkable. This is why Flink SQL always does incremental updates.
So, is Flink SQL a database?
No, but it is a query processor.
The Flink SQL query planner compiles a query into a Flink job, and that job is then executed in a Flink cluster.
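If you're curious, you can ask Flink to show you the plan for a query; in the Flink SQL client this looks like the following, and the output includes both the logical plan and the optimized execution plan (details vary by version):

EXPLAIN
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;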
The details of how Flink executes SQL queries, and how this scales, are quite interesting, and as a developer you'll want to understand how this works. I'll get into this in an upcoming video, so be sure to subscribe and hit the bell icon so you don't miss it.
In Flink, the catalog and storage layers are pluggable.
In Confluent Cloud, the Schema Registry acts as the Flink catalog, and Kora, Confluent's cloud-native Kafka storage engine, supplies the storage layer.
A very important characteristic of a database is that we can rely on it to produce correct results.
Flink SQL can provide either exactly-once or at-least-once guarantees, depending on how Flink is configured, and which external systems are storing the input and output tables.
At-least-once is weaker than exactly-once, and it indicates that duplicates are possible.
Exactly-once means you can expect complete and accurate results: nothing will be missing, nothing will be duplicated. To be more precise, the guarantee Flink makes is that each event will affect the state Flink is managing exactly once, a guarantee that is more accurately captured by the phrase "effectively exactly once".
The state we're talking about is primarily the application state your business logic needs, such as those per-customer order counters we just saw.
For both at-least-once and exactly-once guarantees, Flink requires source streams that can be rewound and replayed, and for the output streams, exactly-once requires transactional writes.
In both cases, something like Apache Kafka, or a file system, will do the job.
Since Confluent Cloud for Apache Flink is based on Kafka, it can always do exactly-once, if that's what you want. Or if you'd rather reduce latency and can live with the limitations of an at-least-once guarantee, all it takes is configuring your Kafka consumers to not use read-committed.
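With the open-source Kafka connector, for example, that's a table option; the 'properties.' prefix passes the setting straight through to the underlying Kafka consumer. This is a sketch, reusing the illustrative table shape from earlier:

CREATE TABLE orders_low_latency (
  order_id     BIGINT,
  customer_id  BIGINT,
  price        DECIMAL(10, 2),
  order_time   TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.isolation.level' = 'read_uncommitted',
  'format' = 'json'
);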
To learn more about how Flink implements exactly-once in the face of failures and restarts, see the video linked in the description below.
In the world of databases, the community has carefully defined what correctness means, and they talk about ACID guarantees.
Comparing exactly-once to ACID is tricky, and could easily be the subject of a much longer video.
But roughly speaking, ACID guarantees give you one thing that's missing from Flink, and that's transactions that operate over multiple keys -- so something like a money transfer from one bank account to another.
For single-key operations, the combination of Flink plus Kafka is just as robust as a database with ACID guarantees.
But a more straightforward comparison is with a distributed key/value store, since that's exactly what Flink is under the hood.
Now before I forget, I want to mention that I've prepared hands-on exercises you can explore to deepen your understanding of the concepts I'm presenting here. Those exercises are on Confluent Developer, and you'll find a link in the description below.
Now it's time to return to the question I asked at the beginning, about what you can do with Flink SQL and how that compares to how SQL databases are used.
Stream processing is what makes Flink special, but Flink is also a capable batch processor.
When used in batch mode, Flink SQL operates in a query/response fashion, just like a database.
The main difference is that with Flink, the underlying data lives outside of Flink in places like Kafka clusters and data lakes, rather than in storage directly managed by the database.
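In open-source Flink's SQL client, for example, switching between the two modes is a single setting (how you select the mode in Confluent Cloud may differ):

SET 'execution.runtime-mode' = 'batch';
-- and back to the default:
SET 'execution.runtime-mode' = 'streaming';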
In streaming mode, Flink SQL executes continuous queries that consume and produce streams -- and in the general case, these streams can be changelog streams.
So with Flink SQL, the queries have streams coming in, and streams going out.
The Flink documentation refers to these streams as "dynamic tables", which helps to make the connection with databases, where the queries have tables as their inputs and outputs.
Flink SQL provides effectively exactly-once guarantees, which is roughly equivalent to ACID guarantees on single-key transactions.
Flink can maintain this exactly-once guarantee at almost any scale you might need, up to clusters with hundreds or thousands of compute nodes.
On the other hand, SQL databases can struggle to provide strict consistency and isolation guarantees as they scale up.
As with SQL databases, most use cases for Flink SQL fall into one of two categories:
first -- reactive, event-driven applications, such as fraud or anomaly detection, and second, building reusable data products. This involves cleaning, enriching, and schematizing data before storing it in a data lake or feature store for use by other applications.
I'm working on more videos about Flink SQL. Please let me know in the comments below what you'd like to see covered.