Course: Apache Flink® 101

Using Kafka with Flink

5 min
David Anderson

Software Practice Lead

Overview

Flink has first-class support for developing applications that use Kafka. This video includes a quick introduction to Kafka, and shows how Kafka can be used with Flink SQL.

Topics:

  • Apache Kafka
  • Kafka Connect, Kafka Streams, ksqlDB, Schema Registry
  • Producers and Consumers
  • Topics and Partitions
  • Kafka Records: Metadata, Header, Key, and Value
  • Using Kafka with Flink SQL

Resources

Use the promo code FLINK101 to get $25 of free Confluent Cloud usage


Using Kafka with Flink

Hi, David from Confluent here to tell you about using Kafka with Flink. Flink includes support for using Kafka as both a source and a sink for your Flink applications. Flink is also interoperable with Kafka Connect, Kafka Streams, ksqlDB, and Schema Registry. To be more precise, while Flink explicitly supports Kafka, it is actually unaware of these other tools in the Kafka ecosystem, but it turns out that doesn't matter. For example, Flink can be used to process data written to Kafka by Kafka Connect or Kafka Streams, so long as Flink can deserialize the events written by those other frameworks. That's generally not a problem, because Flink includes support for many popular formats out of the box, including JSON, Confluent Avro, Debezium, and Protobuf, and you can always supply your own serializers if you need something custom.

In case you're not familiar with Kafka, I'll give you a quick, whirlwind tour of the highlights, starting with a bird's-eye view of a Kafka cluster. Kafka producers publish events to topics, such as the orders topic in this diagram, and Kafka consumers subscribe to topics. In Flink, the Kafka producer is called the KafkaSink, and the consumer is the KafkaSource. For scalability, Kafka topics are partitioned, with each partition being handled by a different server. This diagram shows a topic with three partitions, but most production environments will use many more partitions than this.

Internally, Kafka events have some structure to them. The primary payload of each event is a key/value pair; the key and value are serialized by the producer in some format, such as JSON or Avro. Kafka also writes some metadata into each event, which includes the topic, the partition, the offset of the event within the partition, and the event's timestamp. It's also possible to include additional application-specific metadata in each event in the form of key/value pairs; these are the event headers.

Given that background, how does this all fit into Flink SQL? When you use Kafka with Flink SQL, you map tables onto Kafka topics. Flink doesn't store any of the table data itself. Instead, the Orders table shown here is backed by the orders topic in a Kafka cluster. As events are appended to that orders topic, it is as though they had been appended to the Orders table, and those records will then be processed by whatever SQL query logic you've attached to the table.

Flink needs to know the format of the data you are working with, and you can separately specify the formats used by the key and the value. In this case, both the key and the value are encoded as JSON. Sometimes the information you need isn't in the value part of the Kafka record, but is part of the metadata or the headers instead. That's fine; Flink SQL gives you access to every part of the Kafka record. A common use for this is to map the timestamps in the metadata onto a timestamp column. This involves the special METADATA FROM 'timestamp' syntax shown here, where an order_time column in the Orders table is mapped onto the Kafka timestamp in the metadata part of the Kafka record. The other columns shown here, namely order_id, price, and quantity, all come from the record's value.

In summary, Flink acts as a compute layer for Kafka, powering real-time applications and pipelines, with Kafka providing the core stream data storage layer and interoperability with the many hundreds of other services and tools that work with Kafka.
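To make the table-to-topic mapping concrete, here is a minimal sketch of a table definition like the one described above, using the CREATE TABLE syntax of the Apache Flink Kafka connector. The column names order_id, price, quantity, and order_time and the use of JSON for both key and value come from the video; the data types, the choice of order_id as the record key, and the connector properties such as the bootstrap server address and startup mode are assumptions you would adjust for your own cluster (the hands-on exercise linked below shows the exact setup to use with Confluent Cloud).

    -- Orders table backed by the 'orders' Kafka topic (a sketch: data types
    -- and connector properties are assumptions, not taken from the video).
    CREATE TABLE Orders (
        order_id   STRING,
        price      DECIMAL(10, 2),
        quantity   INT,
        -- map the Kafka record's timestamp metadata onto a regular column
        order_time TIMESTAMP_LTZ(3) METADATA FROM 'timestamp'
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'broker:9092',  -- assumed address
        'scan.startup.mode' = 'earliest-offset',
        'key.format' = 'json',        -- how the record key is (de)serialized
        'key.fields' = 'order_id',    -- which column(s) form the key
        'value.format' = 'json'       -- how the record value is (de)serialized
    );

Once the table is declared, any query against it reads directly from the topic as events arrive, for example:

    -- compute an order total from the value fields alongside the Kafka timestamp
    SELECT order_id, order_time, price * quantity AS order_total
    FROM Orders;

Other parts of the Kafka record, such as the partition, offset, and headers, can be exposed in the same way with additional METADATA columns.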
For more information about using Kafka with Flink, see the hands-on exercise associated with this video on Confluent Developer. There you will find more explanations, working examples, and pointers to the documentation. These exercises take advantage of Confluent Cloud; if you haven't already done so, follow the link to get started, and be sure to use the promo code in the description below. If you aren't already on Confluent Developer, head there now using the link in the video description to access other courses, hands-on exercises, and many other resources for continuing your learning journey.