August 6, 2020 | Episode 113

Apache Kafka 2.6 - Overview of Latest Features, Updates, and KIPs

Tim Berglund (00:00):

Welcome to Streaming Audio, a podcast about Kafka, Confluent, and the cloud. We've got some great stuff for you today. Hope you enjoy.

Tim Berglund (00:12):

Hi. I am Tim Berglund with Confluent. I'm here by Clear Creek in the scenic Guanella Pass in the mountains of Colorado to tell you all about Apache Kafka® 2.6. As usual, I'd like to divide the update into three parts. That's Core Kafka, Kafka Connect, and Kafka Streams, and I want to highlight the most important KIPs that have been merged in for this release. And remember, if you don't know, a KIP is a Kafka Improvement Proposal. That's like a package of change, sometimes small, sometimes big, that gets merged into each new release of Kafka. There are a lot of great KIPs in 2.6, so let's get started. KIP-546 makes it a little bit easier to administer quotas. Now, quotas are fundamentally complex to administer because you've got this combinatorial matrix of user and client, and you can have different quotas at each intersection in that matrix.

Tim Berglund (01:00):

So there can be a lot of things to keep track of. Basically, what this KIP does is add that functionality to the admin client. So it's now part of the native admin API, as you see here, and there's the shell command kafka-client-quotas that you can use to do this from the command line if you don't want to code it from Java directly. KIP-551 exposes some new disk read and write metrics. Disk activity can be a fundamentally limiting factor for bandwidth and for latency. It's a thing you need to keep your eye on, so this just improves our visibility there. As you see, we can see the total number of bytes read and the total number of bytes written by a broker, and that's from all disks that that broker may have to babysit. We don't know how many disks that is for any given broker, but this is an aggregate of all that activity.

Tim Berglund (01:50):

KIP-568 gives us an API for triggering consumer group rebalance. Now, normally, if you're a consumer, you're always part of a group, whether you're a group of one or a group of many; if you're a consumer, you kind of want a hands-off attitude to consumer group rebalance. That is a thing that's supposed to happen for you. It's driven by the client library on the consumer side, but you don't touch it a lot from an API standpoint. That's how it's supposed to be. But sometimes you have to. Fundamental design principle: abstractions always leak. So all that stuff that's abstracted away in the layer below you in the API, now with this KIP you've got a way to reach down there and basically just trigger a rebalance yourself. There's this method you see there called enforceRebalance. There could be some condition in the system that you detect that tells you that you have to trigger a rebalance.

Tim Berglund (02:40):

Now you have an API for doing that. KIP-574 is an improvement to the kafka-configs shell script. Now, this has always let us set individual config parameters, one at a time. So you've got this fairly interesting array of command line switches that lets you pick which config you want to set, and then other switches that let you set that config according to its own particular idiom. But if you wanted to, like, script a bunch of those, maybe, I don't know, have your configuration act like code, you'd have to have a shell script that just called kafka-configs over and over and over again. One might ask, what year is this? Well, the opinion of KIP-574 is that it's this year, it's 2020, configuration might be code. Now we can point to a file that will pull in those configs rather than invoking the config command over and over again,

Tim Berglund (03:29):

just, like, suck in the configs from this file. It's a much easier way to read and write and collaborate around that little bundle of config. So, nice improvement to kafka-configs. KIP-606 adds metadata to metrics reporting. Now, this is a thing that is cross-cutting, right? It affects clients, the broker, Streams, Connect, really all parts of the system, but we put it in Core Kafka 'cause I think that's really where it belongs chiefly. The idea, though, is that when metrics are being reported you might not know where that event is coming from. There could be a lot of things that you're looking at: it could be something from Connect, it could be something from a Streams application. And now with this KIP, that context is reported, and so you don't have to go fishing or find hacks and ways of injecting your own context to kind of back that out. This KIP makes that official, as it should be.
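Editor's sketch for the KIP-574 point above, about pointing kafka-configs at a file instead of calling it once per setting. This is a minimal Python illustration, not something shown in the episode: it only renders a properties file body and builds the argument list for a single invocation, without running anything. The topic name orders, the file name, the broker address, and the two override values are all made-up assumptions; --add-config-file is the switch KIP-574 describes.

```python
from typing import Dict, List


def render_properties(configs: Dict[str, str]) -> str:
    """Render overrides as Java-style .properties text, one key=value per line."""
    return "".join(f"{key}={value}\n" for key, value in sorted(configs.items()))


def kafka_configs_argv(bootstrap: str, topic: str, properties_path: str) -> List[str]:
    """Build (but do not run) one kafka-configs call that reads the whole file."""
    return [
        "kafka-configs.sh",
        "--bootstrap-server", bootstrap,
        "--alter",
        "--entity-type", "topics",
        "--entity-name", topic,
        "--add-config-file", properties_path,
    ]


# Hypothetical overrides for a hypothetical topic named "orders".
overrides = {"retention.ms": "604800000", "cleanup.policy": "delete"}
properties_text = render_properties(overrides)
argv = kafka_configs_argv("localhost:9092", "orders", "orders-overrides.properties")

print(properties_text, end="")
print(" ".join(argv))
```

The file can then live in version control next to the rest of the deployment, which is the "configuration as code" idea from the talk.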

Tim Berglund (04:21):

Now let's talk about some Kafka Connect KIPs. First up is KIP-158. This deals with systems where topic auto creation is enabled. Now, a lot of admins shut that off, right? So if you want a topic to be created, you have to create it through some administrative process. Maybe there's a form, you have to apply and get permission; every organization has its own approach to this. But some clusters have that turned on, and so if Connect is able to auto create a topic when a new connector starts up, or a connector detects some new input and needs to create a new topic, you'd like maybe to apply some configuration to those new topics you create. Now with this KIP, you can put those new configuration parameters in the connector configuration itself, and those will be applied to any new topics that the connector creates.
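Editor's sketch of what "configuration parameters in the connector configuration itself" might look like. This is a minimal Python illustration that assembles a source connector definition as a dictionary and serializes it to JSON. The topic.creation.* key names follow the naming in KIP-158; the connector name, connector class, group alias, topic pattern, and values are hypothetical and not from the episode.

```python
import json

# Hypothetical source connector definition. Only the topic.creation.* keys
# relate to KIP-158; everything else is a placeholder.
connector = {
    "name": "orders-source",
    "config": {
        "connector.class": "com.example.OrdersSourceConnector",  # made-up class
        "tasks.max": "1",
        # Defaults applied to any topic this connector creates.
        "topic.creation.default.replication.factor": "3",
        "topic.creation.default.partitions": "6",
        # An optional named group with its own matching rule and overrides.
        "topic.creation.groups": "compacted",
        "topic.creation.compacted.include": "orders-state-.*",
        "topic.creation.compacted.cleanup.policy": "compact",
    },
}


def topic_creation_settings(config):
    """Return only the settings that govern topics created by the connector."""
    prefix = "topic.creation."
    return {key: value for key, value in config.items() if key.startswith(prefix)}


payload = json.dumps(connector, indent=2)
print(payload)
```

A payload of this shape is what one would typically submit to the Connect REST API when creating the connector.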

Tim Berglund (05:09):

KIP-585 gives us the ability to apply predicates to SMTs. Now, an SMT, if you don't know, is a Single Message Transform, and that's a little function, basically, that you can apply to every message as it passes through Kafka Connect. Whether from a source or a sink perspective, coming in or going out, we can modify that message in some stateless way, a super handy thing to do. And what this KIP lets us do now is say, well, here's some logic that we're going to use to determine whether any given Single Message Transform should be applied to a message. And you're able to define custom predicates; there's an interface that you can implement to do that. The predicates that are included are topic name matches, so if the message is from or to a particular topic, then this SMT kicks in; and has header key, to see whether a particular Kafka message header is present.
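Editor's sketch of the predicate idea. This is a minimal Python illustration with a connector configuration that attaches a topic-name predicate to one SMT, plus a simplified stand-in for the check Connect performs. The transforms.* and predicates.* key layout follows KIP-585; the alias names, field name, and topic pattern are hypothetical. The matching function is an assumption-laden simplification that uses Python regular expressions and a whole-name match rather than Connect's actual Java code.

```python
import re

# Hypothetical connector settings: mask a field, but only on matching topics.
connector_config = {
    "transforms": "maskCard",
    "transforms.maskCard.type": "org.apache.kafka.connect.transforms.MaskField$Value",
    "transforms.maskCard.fields": "card_number",
    "transforms.maskCard.predicate": "isOrdersTopic",
    "predicates": "isOrdersTopic",
    "predicates.isOrdersTopic.type":
        "org.apache.kafka.connect.transforms.predicates.TopicNameMatches",
    "predicates.isOrdersTopic.pattern": "orders-.*",
}


def transform_applies(config, transform, topic):
    """Simplified stand-in for the predicate check: True if the SMT should run."""
    predicate = config.get(f"transforms.{transform}.predicate")
    if predicate is None:
        return True  # no predicate configured, so the SMT always applies
    pattern = config[f"predicates.{predicate}.pattern"]
    matched = re.fullmatch(pattern, topic) is not None
    negate = config.get(f"transforms.{transform}.negate", "false") == "true"
    # With negate set, the SMT runs only when the predicate does NOT match.
    return matched != negate


print(transform_applies(connector_config, "maskCard", "orders-eu"))
print(transform_applies(connector_config, "maskCard", "payments"))
```

Keeping the condition in configuration rather than inside the transform is the design point here: the SMT itself stays a small stateless function.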

Tim Berglund (05:59):

Or, of course, if the record is a tombstone, then we'll apply the SMT or not. Those are the ones that come out of the bag. You can write your own. One note: do be careful with this kind of thing. SMTs are an absolutely essential part of Kafka Connect as a data integration technology, but you don't want to go, like, writing code in your SMTs. If you find yourself doing that or having trouble wrestling with the fact that it's stateless, you want to look to something like ksqlDB or Kafka Streams to actually do stream processing and not do that in Connect. But predicates on SMTs, huge improvement, great thing for a lot of use cases. KIP-610 gives us a little bit more flexibility with the so-called dead letter queues in Kafka Connect. Now previously, if there was a problem processing an SMT, a Single Message Transform, or deserializing on the way out, then you could take that message and automatically write it to a topic you designate, called a dead letter queue, or some people say dead letter topic.

Tim Berglund (06:57):

If there were problems in another part of the sink chain, not in the SMT and not in the deserializing, then the dead letter queue wasn't an option for you. So now, KIP-610 says, any problem dealing with a message in a sink connector, whether it's in those two parts or somewhere else in the chain, you can still write it to a dead letter queue. All right, let's look at some Kafka Streams KIPs. First up is KIP-441, which helps scaling out a Kafka Streams application work a little more smoothly. Now, when you are scaling out a Kafka Streams application, there are kind of two things that are going on. One, there is something very much akin to a consumer group rebalance, where you're taking the partitions that had been assigned to, say, these three nodes and reshuffling them among the new five nodes that you've got in your new Kafka Streams cluster as you're scaling it out.

Tim Berglund (07:47):

So you have to rebalance partitions. You also have to move state, right? Since most Streams operations are stateful, you've got your RocksDB instances or, just in general, your state store holding that state in each node of the Kafka Streams cluster, and that state has to be moved to the new nodes. You've got these two new nodes in this hypothetical three-to-five-node scale-out that we're talking about here. And as those partitions move from the three to the five, the state has to go with them. Previously, before KIP-441, you actually had to pause the availability of the cluster. You had to stop processing messages while you were doing that reprocessing. Now what happens is, while that data is being moved from its previous node to its new node, the previous one is still able to serve requests. So you're still handling traffic, and you don't have to stop stream processing while you're scaling out.

Tim Berglund (08:38):

So, huge benefit to elasticity in Kafka Streams clusters. KIP-447 is an improvement to the scalability of Kafka Streams applications with exactly-once semantics enabled. This is kind of a subtle thing, and I encourage you to read the KIP for the details. But basically, for a Streams application that's dealing with a large number of source partitions on an input topic, there could just be scalability issues there in terms of memory usage in the Streams cluster and things like that. So again, dig into the details here, I'm not going to take you through all of that, but KIP-447 makes that work a little bit better. So if you've got a big input topic and kind of a beefy Streams application, you might see better results now. So check it out. All right, that's all I got for you in this summary. But what do you do if you want to know more? You always read the release notes.

Tim Berglund (09:27):

That's going to have all the changes in 2.6. You won't miss anything there. There'll be a release blog post. That's a good summary that might go into a little more detail on a few KIPs. So you want to check that out too. But no matter what you do, get started, download 2.6, start putting some of these changes into effect, and we always look forward to hearing what you build. So connect with us, however you do that, Twitter, Slack, we'd love to hear from you. Thanks.

And there you have it, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me at @tlberglund on Twitter. That's @tlberglund. Or you can leave a comment on a YouTube video or reach out in community Slack. There's a Slack sign-up link in the show notes if you want to register there. And while you're at it, please subscribe to our YouTube channel and to this podcast wherever fine podcasts are sold. And if you subscribe through iTunes, be sure to leave us a review there. That helps other people discover the podcast, which we think is a good thing. So thanks for your support, and we'll see you next time.

Apache Kafka® 2.6 is out! This release includes progress toward removing the ZooKeeper dependency, new client quota APIs in the admin client, newly exposed disk read and write metrics, and support for Java 14.

In addition, there are improvements to Kafka Connect, such as allowing source connectors to set topic-specific settings for new topics and expanding Connect worker internal topic settings. Kafka 2.6 also augments metrics for Kafka Streams and adds emit-on-change support for Kafka Streams, as well as other updates.
