Do you want to get started using Kafka Connect? That'd be a very reasonable goal. If you did, the person you'd want to ask would be Robin Moffatt. I've got Robin on the show today to help us do just that. Now, Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io. It's a website that's got video courses, executable tutorials, event-driven architecture patterns, long-form articles, all kinds of materials to help you get started using Kafka and Confluent Cloud. And when you do sign up on Confluent Cloud to do some of those exercises, use the code PODCAST100 for an extra hundred dollars of free usage credit. Check it out. Now let's get to Robin.
Hello and welcome to another episode of Streaming Audio. I am your host, Tim Berglund, and I'm joined in the virtual studio by my friend and coworker, Robin Moffatt. Robin, welcome to the show.
Thanks, Tim. Good to be back.
Yes, I should say welcome back. I don't even know how many times you've been on the show. I have a hard time keeping track of things like that, but it's been, I would say a lot, but not, not enough. It's always good to have you here.
It's always fun to chat.
If you're brand new to the show, Robin is a developer advocate here at Confluent. He and I work together on helping people understand Kafka and all parts of the related ecosystem and Confluent Cloud and all that kind of stuff. And in the more likely case that you do know who he is, you know Robin tends to be found talking about Kafka Connect and ksqlDB, making those two things do interesting things, and helping people get spun up on them. And we've recently published some material of Robin's on Kafka Connect, some video content. Now I say it's material of Robin's. If you know the video content, it's got my face, but, you know, dirty secret: I actually don't write all the material that I teach on camera. That was Robin's material. So yeah, Robin, I just wanted to talk in honor of, in promotion of, that course, and in support of the great thing that Kafka Connect is. This is an episode introducing the listener to Kafka Connect and all its finer points. Let's talk about it.
Let's do that. Kafka Connect is a wonderful thing. More people should use it.
Agreed. What is it I'm going to sip my coffee and let you explain what it is.
I shall sip my tea. Kafka Connect is how you do integration, streaming integration, between external systems and Kafka. So you've got the producer and consumer API. Your applications will write data into Kafka and consume data from Kafka. But anytime you've got another system, like a database or a message queue or a flat file, somewhere you want to get data from into Kafka, or from Kafka down to another place, that's where Kafka Connect comes in, because it solves a lot of actually quite tricky integration problems. When you start doing this kind of thing with different sets of technology, you want to do it at scale and reliably and all that kind of stuff. It's, in a sense, trivial to do it bespoke, like one piece into Kafka and one from Kafka to somewhere else. You're like, I'll just write a program to do that. But when you start doing that again and again and again, and people tend to, that's where Kafka Connect comes into its own, because it's solving this problem, and you're reinventing the wheel if you start trying to do that yourself.
Yes. And that reinventing the wheel is a nasty trap because it seems easy. Sort of the basic principle of: I'm going to consume stuff from a Kafka topic and dump it into an S3 bucket. And you know, well, I'm already familiar with that API, and I've got a nice little wrapper here in Java, and I can just kind of do all that, and I'm super efficient with all that. So, you know, why not? I'll just code that up real quick. And Kafka Connect, that's a thing I have to deploy, and I have to fiddle with JSON config, and, like, I'd rather just code it. You want it to-
Sorry, I was just thinking as you describe it like that, it's one of those decisions we always have to make with computers. Because there is stuff we know, and there is the new stuff. And it's like, is this the new stuff I should be learning, or is this the new stuff that's going to distract me from just getting my job done? And it's one of those which path do you take, and Kafka Connect is definitely a path to be treading.
Yes. Yes, because it's a deceptively difficult problem. It's that thing that I just described, you know: I can consume from a topic, and I've got an S3 API that I like and I'm super efficient with, I can write things to buckets. You can, and that's not an advanced development skill, but it's actually hard to do this in general and solve the problem in general. And when you find yourself wanting to write that, wanting to build your own, you start to feel that. Like, you need to ask your crew to lash you to the mast as you sail by. You take a cue from Ulysses: we're going past the sirens, you're not going to be able to control yourself, so somebody needs to restrain you so that you don't write that yourself. Because in its corner cases, and in its full development over the last six years or so, it is actually a complex piece of code.
So I guess tell us about that. It's not built into Kafka. It's not like a thing I can configure on a broker. It has some independent life of its own. So what is that?
Well, it's not on a broker, but it is part of Kafka. So a common misunderstanding people have is that you've got Apache Kafka, I've got my brokers, and then there's this thing called Kafka Connect, as if it were a separate piece of technology. But the Connect API is part of Apache Kafka. As you say, Tim, you deploy it separately. So you've got your brokers that do their brokering and hold the data and receive and distribute the messages. But then you've got your Kafka Connect workers, which are a separate JVM process. If you decide to run it yourself, and you don't have to run it yourself, but if you are going to, it's a JVM process, and you typically run it separately from your brokers in production.
If you're just mucking around with it, it'll just run on your laptop, or in Docker, or whatever. And in terms of actually using it, you just pass it configuration files. So as well as being one of those things where, if you're a coder and you're itching to write some Java or something, you have to be restrained from doing so, Kafka Connect also broadens the audience of people that can actually do this kind of integration. Because if you're from a data engineering background like I am, I don't actually write any Java. I couldn't write some Java even if I wanted to, to get data from here to there. And so you give Kafka Connect a lump of JSON, and it tells Kafka Connect where to get the data from and where to put it. And it makes that kind of integration much more accessible to a much wider base of users.
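To make that concrete, here's roughly what one of those lumps of JSON looks like, as you'd submit it to the Kafka Connect REST API. This is a sketch using the Confluent JDBC source connector as an example; the connection URL, credentials, table, and column names are made up for illustration:

```json
{
  "name": "postgres-orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://postgres:5432/shop",
    "connection.user": "connect",
    "connection.password": "secret",
    "table.whitelist": "orders",
    "topic.prefix": "db-",
    "mode": "incrementing",
    "incrementing.column.name": "order_id"
  }
}
```

That's the whole "program": which connector class to use, where the data lives, and which topics to write to.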
Of course, your Go is decent. So maybe you'd be doing a lot of Googling and be extremely-
I wouldn't even claim decent, but I started mucking around with it last year.
Yeah. So that's key: it's a declarative thing. You configure it, you deploy it. And so rather than data integration being a development exercise, it becomes more of an operational one. There's an infrastructure that gets deployed and configured, and off you go. And, I guess, let me underscore what you said. You know, speaking for myself, most of the Connect I've ever run has just been in Docker Compose. That's the most frequent Connect deployment for me: demoing things, illustrating things, teaching things. And it is a JVM process. So if you're running your own infrastructure, you run it as your own infrastructure, however one does in Kubernetes or whatever you're doing. And of course, with the Confluent Cloud option, you just click on the thing that says Connect, select a connector, say I'd like to add this, type some config in, and then you've got a connector. So that's by far the lowest touch way of running Connect.
Yeah, definitely. And you can also have a hybrid approach and let Confluent Cloud run some of them and run your own ones, if you want to split it like that. So you have that option as well.
Or the production case is in Confluent Cloud, but if you've got a lot of fiddling and debugging and log reading to do because you've got some super finicky system, you can bring the prototype up to speed locally and then migrate it to the cloud, because that's a way more fun sort of operational lifestyle in the long term.
Okay. So tell us about connectors. The whole idea is that I'm not actually writing the code to talk to whatever external system I want. So what are connectors?
So Kafka Connect is like a pluggable framework, and it ships with a couple of example connectors for getting data to and from files, which you shouldn't actually use in production; they're just examples. But when it comes to actually integrating a technology into Kafka, or Kafka to another technology, you'll be reaching for a particular connector plugin, which is a bunch of code that an actual coder has gone and written, but written once and once only, that tells Kafka Connect: here is how you actually talk to this particular technology.
So if you're integrating with Elasticsearch, it understands the Elasticsearch APIs. And it will say, like, here's how I'm going to write messages into these indices. It understands documents and all the Elasticsearch-specific stuff. Or if you're pulling in data from IBM MQ, you can use the connector for IBM MQ, which understands those APIs. So you just drop in these connectors, which are JAR files, into your Kafka Connect workers, and then you configure them using this lump of JSON. So it's this really clever pluggable architecture, which means that if you have a technology for which a connector doesn't exist, which is fairly few and far between because there are so many different connectors, you or someone else could write a connector which describes, once, here's how you talk to this technology, here's how you extract data or put data into it. And then it gets packaged up into a JAR file, which everyone can then use for working with that technology in Kafka Connect.
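For instance, a sink to Elasticsearch like the one Robin describes might be configured with something along these lines. This is a sketch using Confluent's Elasticsearch sink connector; the hostname and topic name are hypothetical:

```json
{
  "name": "elasticsearch-orders-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "connection.url": "http://elasticsearch:9200",
    "topics": "db-orders",
    "key.ignore": "true"
  }
}
```

The connector plugin supplies all the Elasticsearch-specific knowledge; the config just points it at a cluster and some topics.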
Yeah, so don't miss that. They are JAR files, so this is all JVM stuff, and you probably won't have to write a connector. You know how these things go, there's a power law here, where 90% of your integration is a small percentage of the actual things you could integrate with. It's S3, it's Elasticsearch, it's a relational database, and maybe Salesforce or something, I don't know. There's a small number of usual suspects, but then there is a long tail of strange and unusual things. And at this point, most of those things have connectors. So if you write one, you're doing it in Java. It's a really simple API, and any difficulty in writing a connector is going to be difficulty in the external system being a pain in the butt.
And honestly, at this point, it's kind of likely that if you find yourself writing a connector, it's probably something rare and unusual and somewhat unfortunate in its own API. So that could happen, but the Connect part of the API is dead simple. I mean, you couldn't ask for better. And again, if you're running your own, you package it up as a JAR file, dump it into a place that can be class loaded from, and off you go. Of course, you can't do that in Confluent Cloud. In Confluent Cloud, we offer an ever-growing list of managed connectors, pretty substantial at this point, but if you have a custom connector that you need to run, that's one of those cases where you're going to be running your own stuff.
And a related point with connectors: you can run your workers yourself but use managed brokers, like Confluent Cloud, so it doesn't have to be all or nothing. Or if you're in an organization that's got the Kafka brokers over there, you can still run your Kafka Connect workers over here, and, for example, manage those yourself and use the brokers that someone else is managing. As long as the worker can talk to the brokers, you can architect it as appropriate.
And I hesitate to be too prescriptive, but one might even say that you should if you're running your own connect cluster still use managed Kafka if you can. I mean, you want to use as much managed stuff as you can get away with probably as a general principle. Okay. So that's custom connectors, that's connectors in general. Where do I find connectors? Suppose I'm not yet running in the cloud or I just want to know what's out there. I've got my weird thing I want to integrate with, how do I find connectors?
You can head over to Confluent Hub. So confluent.io slash hub, I think. And one of the great things about Kafka in general, and particularly Kafka Connect, is the ecosystem around it. So you've got tons of connectors that Confluent writes, and you've got tons of connectors from the community more broadly. And there's a bunch of different licenses around them and all sorts of different stuff like that. But if you go to Confluent Hub, you'll find most connectors on there. And there are over 200 different plugins that you can search through.
If I'm interested in integrating with InfluxDB, I type it in there, and here's your list of different connectors. There's a client as well, to help you install them. You mentioned classpaths a moment ago, which kind of scared me, Tim. But you've got your worker, and you say, Confluent Hub, install this connector, and it goes and works out where to put it and sets the classpath thing that you talked about. So it does all of that for you, which makes it much easier. And if you have gone your own route and written your own connector, you can submit it to Confluent Hub as well. So it's like a community resource for finding these different connectors.
Yep. And that will put it in a directory that has things that are class loadable and that should strike fear into even experienced Java operators' hearts. That's not stuff you want to mess with if you can avoid it, not as bad as it used to be, but still. [crosstalk 00:15:17]
Sorry, when you do install them, make sure you restart your worker. Because that catches people out a lot. I've got this thing here, but my worker is saying it doesn't have it. So you install it, and then you restart your worker.
The voice of a man who's answered a question about Kafka Connect before.
Yeah, that's bitten me before as well.
Okay so that's connectors. That's where to find them.
Other types of plugins?
Yeah. What else is a plugin?
We have transforms and we have converters. So this is one of those things where, like we were talking earlier about the fork in the road, do I write my own or learn this new thing. And another fork in the road is: there's this Kafka Connect thing, do I just hack away at it and copy and paste a bunch of stuff that I found on Stack Overflow, or do I understand a little bit more about it? And this is something that we cover in the course that we've just published: what you need to understand to have a happy time with it. Things like a broad understanding of which bits you're plugging in, because it helps things run much more smoothly. A connector talks to, like, the source or target technology.
And then we have these things called converters, which are another pluggable thing. So Kafka Connect ships with a bunch, and you can write your own if you want to. But the converters are responsible for the serialization and deserialization of data from a particular connector onto a Kafka topic, or from a Kafka topic out into another connector that's pushing it downstream somewhere. So they are pluggable, and getting those bits set up right is also really important. And then the other plugin I mentioned is single message transforms. You don't have to use them, but they do some cool stuff as well.
I want to dig into SMTs, single message transforms, in a minute, but is there a lot of converter action? It strikes me that there are only a few serialization formats in the world, and those are kind of covered. Is that an interesting place of innovation and community activity, or is it sort of finished?
That's a good question. In the old days, going back a few years, you've got a string converter, you've got a raw JSON converter, and an Avro converter, which integrates with Schema Registry, which we can talk about in a bit if you want to. And then, my memory is not so good, but we added Protobuf and JSON Schema converters as additional ones. I'm sure there are other ways of serializing data that people have come up with that it would be nice if Kafka Connect supported. And this is why it's so great that it's pluggable, because if someone wants to use Protobuf today, they can use it with a connector that was written before the Protobuf converter even existed. Because when you're writing a connector, and the problem is that connectors and converters, I always mix them up when I'm talking, but a connector pulls in data from an external system, and it doesn't serialize the data. It passes it to another part of Kafka Connect, which serializes it, and that's the converter bit.
So the connector is not responsible for serialization. If you're writing a connector which is doing the serialization, you're probably doing it wrong. There are a few edge cases where people do that, but generally it passes just a generic representation of the object, which the converter then serializes to write to the Kafka topic. And the same thing happens in reverse when you're using a sink: the converter deserializes the data and passes a generic representation to the connector to push down to a target system. So the converters are super important. I've not seen a ton of new ones, there probably are some, but this is a great thing, right? Someone in the community could say, I wish we had support for, I don't know what, I'm sure there are other serialization methods, and they could write that and then put it onto Confluent Hub or distribute it. And people could then start to serialize data in that way.
Yeah, so the connector proper: its job is to talk to something, either reading data from or writing data to some external system. On the read, or source, side of things, it does that interfacing and then creates a representation of that data in the Java type system and passes that off to another layer in the Connect APIs: I'm done being a connector, I'll pass this on to the next layer. On the sink side, it will have received that internal Java object and then have to do what it takes to write it to the external system. But the translation between the generalized representation of that data in the Java type system and the bytes that go into Kafka, that's what the converter does. And so it's substantially a solved problem, but one can always have different serialization formats that are custom to your own deployment.
But it's also probably the one thing that you need to wrap your head around when you start using Kafka Connect, and you see tons of people posting problems on Stack Overflow and stuff. The error messages have got a lot better in recent releases, and they kind of point you to where the problem is. But if you imagine a hosepipe system, when you're connecting things together, if you're trying to connect something that doesn't match, it's just going to fail, horribly. So the way in which you're writing it onto the Kafka topic, serializing it, you've got to set the converter the same way when you're reading from the Kafka topic, otherwise things don't match up.
I keep talking about connectors going in and connectors coming out, but a lot of the time it's just one side that Kafka Connect handles. So you've got an application that pushes data onto a Kafka topic, and Kafka Connect reads that data and pushes it down to, like, a document store or something like that. Or the same thing in reverse: we've got a database that's got some data we want to use for an event-driven application we're writing, so Kafka Connect pushes the data onto the topic, which a microservice or whatever then reads directly. So it doesn't have to be Kafka Connect all the way through, but what does have to match all the way through is your serialization methods. They've got to match up.
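The matching-up Robin describes comes down to the converter settings on each side of the pipeline. As a sketch, if data is being written as Avro, both the source connector and anything reading the topic need converter config along these lines (the Schema Registry hostname here is hypothetical):

```json
{
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://schema-registry:8081"
}
```

These can be set at the worker level as defaults and overridden per connector, which is often where the mismatches sneak in.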
Got it, got it. So the third and final pluggable thing: single message transforms.
SMTs. What is an SMT? What isn't an SMT?
So, single message transforms: they transform single messages as they pass through. If you go look at the course, you'll actually see a slide that shows it: you've got the Kafka Connect box, and you've got these three pluggable things within it. In terms of a source, getting data in from another system, you've got your connector, then optionally a single message transform, or multiple ones, and then you've got the converter. On the sink side of things, where you're pushing data out to another system from Kafka, you've got your converter, one or more single message transforms, and then the connector. And so those single message transforms, they transform a message as it passes through that pipeline. So say we're pulling in data from another message queue, for example. You could say, actually, within that payload that we've pulled in, I only want these particular fields; we want to drop a bunch of stuff that perhaps is personally identifiable information.
We don't want that data on our Kafka cluster. We don't want to have it there; it's more trouble than it's worth, and it opens us up to all sorts of regulations. You can use a single message transform to drop those fields at ingest, so they never actually hit the Kafka cluster. And you can change data types, and you can add in fields if you want to, and do all sorts of conversions. And single message transforms are probably the most common type of plugin which people will author, because you can start to write specific business logic into those and then build it into your pipeline. So they're really, really useful. You apply them using configuration, or you can write your own if you want to go that route, using Java, but as an end user, it's just part of the configuration.
So, like, a lump of JSON. It's not particularly elegant to look at, but once you've got your head around it, it's okay to use. So you say: this pipeline here, where we get our message in, we want to add in a field which says this is the source system name, this is the timestamp, whatever we want to do with it. So single message transforms, they're really neat. One note of caution: they don't do everything, and they shouldn't do everything. So if you think, oh, what I want to do is start doing lookups on my data as it comes in, so this field comes in, we want to do a lookup and make a REST call out to another service to enrich that message. You could probably do that, but you almost certainly shouldn't do that. There you start to veer off into the stream processing world proper, with ksqlDB, with Kafka Streams, to do that kind of stuff, aggregations and joins, when things get more complex than just transforming a single message. So the name kind of says it all.
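Both of the examples Robin mentions, dropping PII fields and stamping in a source-system field, map onto transforms that ship with Apache Kafka. A sketch of what that bit of connector config could look like (the field names and values are made up; depending on your Kafka version, the drop-fields setting on ReplaceField is called blacklist or exclude):

```json
{
  "transforms": "dropPII,addSource",
  "transforms.dropPII.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.dropPII.blacklist": "email,phone_number",
  "transforms.addSource.type": "org.apache.kafka.connect.transforms.InsertField$Value",
  "transforms.addSource.static.field": "source_system",
  "transforms.addSource.static.value": "legacy-mq"
}
```

The transforms run in the order listed in the transforms property, one message at a time, as data passes through the worker.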
Yeah. You say, look, we've got this in-memory data grid that's adjacent to the Kafka cluster, I could perform aggregations in there. Again, your crew should tie you to the mast. You're not capable of controlling yourself for the next little bit. That's bad. But the things you do in pipelines with SMTs, they're a super important part of the ecosystem. And custom SMTs, again, you're not a bad person for writing a custom SMT. That API is there. And again-
I think they're potentially under-used as well. Spinning up an entire stream processing cluster just to drop a couple of fields or remap something feels to me like a bit of architectural overkill if you're not running that thing already. If you are, then okay, do it there. But if you've got your Kafka Connect worker and you've got your data passing through and you just want to make simple modifications to it, change the schema, do this, do that, I think single message transforms are a perfect way to do it.
Yes. Where simple, I would say, is defined as stateless. They're fundamentally stateless things, and if you do anything to try to make them stateful, you have departed from the proper purpose of the API and you're taking on a bunch of problems. It's going to be a nightmare. Those problems are solved in ksqlDB, they're solved in Kafka Streams. But that stateless stuff that properly happens before data lands in your cluster, or as it exits, SMTs are great for that. Even custom SMTs: this is a nice little part of the system where Java, or a nearby JVM language, is a great way to put that code in.
Yeah. And you were asking earlier about converters and innovation? I've not seen so much of that, but with SMTs, single message transforms, I definitely have. I did a little video series at the end of 2020, which you can find on YouTube. Shameless plug. But when I was digging into that, there's a bunch which ship with Apache Kafka, there are ones that Confluent provides, and there are really, really interesting ones that come from the community. So there were ones around encryption with keys. And there's this one from Jeremy, which is just this fantastic treasure trove of all these sensible little things, like changing the case of a topic because you need to make it camel case or all lower case. Those little things that make a big difference in a proper deployment, for which spinning up a Kafka Streams or ksqlDB application just doesn't really make so much sense. So there's a lot of really interesting stuff in the community for those.
Yeah. That Twelve Days of SMT series will be linked in the show notes. And Robin just mentioned a few other things. Basically, we're going to link in the show notes to everything Robin thinks you should know about if you're trying to learn Kafka Connect, because here we are talking about it. So there'll be a lot of links, but that series on SMTs in particular, if you need to understand SMTs deeply, you need to watch it. That's, at this point, the definitive reference in the universe. I think maybe a little bit of an overstatement, but you put a lot of work into that, and it's everything all in one place. So I think that's important.
Can we put the most important tip on the table, or pick two or three tips, in conclusion? If you're sending people off on their Connect journey, what should they remember?
That's a good question. So we've not really talked about deployment models and stuff like that. We've talked about Confluent Cloud, and it's got managed connectors. If you are running it yourself, then you have two different ways of doing it, called distributed and standalone. Again, the course that we keep referencing has got an explanation of the differences, but distributed is a much better way to start, because you can do it on a single node. The name is a bit ambiguous, but you can have one instance of a distributed Kafka Connect worker; it's not a scary thing. But you can then scale it out, and then it's fault tolerant and scalable and all this kind of good stuff. Whereas if you go the standalone route, you sometimes have to rebuild parts of your understanding of how things are working.
You can, there's nothing wrong with it, but start off with distributed and your life is generally easier. Make sure you understand what converters are doing. Don't just randomly rejiggle things until they unbreak, because they won't unbreak, it'll just get worse. With converters, you have a key converter and a value converter. With Apache Kafka, messages have keys and values, and you can have different converters for each, which may seem ridiculous but actually does make sense sometimes, to serialize things in different ways. So you've got your key converter, you've got your value converter; make sure they're set correctly to your understanding of what the data actually is. And other than that, have fun with it. Go and check out Confluent Hub, and go and check out all the community contributions as well. Check out the Confluent community forum too; there's a very active section there on Kafka Connect, for advice on which connectors to use, or configuration problems, or all that kind of stuff.
My guest today has been Robin Moffatt, Robin, thanks for being a part of Streaming Audio.
My pleasure as always.
And there you have it. Thanks for listening to this episode. Now, some important details before you go. Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io, a website dedicated to helping you learn Kafka, Confluent, and everything in the broader event streaming ecosystem. We've got free video courses, a library of event-driven architecture design patterns, and executable tutorials covering ksqlDB, Kafka Streams, and core Kafka APIs. There's even an index of episodes of this podcast. And if you take a course on Confluent Developer, you'll have the chance to use Confluent Cloud. When you sign up, use the code PODCAST100 to get an extra hundred dollars of free Confluent Cloud usage. Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me @tlberglund on Twitter.
That's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video if you're watching and not just listening or reach out in our Community Slack or Forum, both are linked in the show notes. And while you're at it, please subscribe to our YouTube channel. And to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcast, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support and we'll see you next time.
Kafka Connect is a streaming integration framework between Apache Kafka® and external systems, such as databases and cloud services. With expertise in ksqlDB and Kafka Connect, Robin Moffatt (Staff Developer Advocate, Confluent) helps and supports the developer community in understanding Kafka and its ecosystem. Recently, Robin authored a Kafka Connect 101 course that will help you understand the basic concepts of Kafka Connect, its key features, and how it works.
What’s Kafka Connect, and how does it work with Kafka and brokers? Robin explains that Kafka Connect is a Kafka API that runs separately from the Kafka brokers, running in its own Java virtual machine (JVM) process known as the Kafka Connect worker. Kafka Connect is essential for streaming data from different sources into Kafka, and from Kafka to various targets. With Connect, you don’t have to write programs in Java; instead, you specify your pipeline using configuration.
As a pluggable framework, Kafka Connect has a broad set of more than 200 different connectors available on Confluent Hub.
Robin and Tim also discuss single message transforms (SMTs), as well as the distributed and standalone deployment modes of Kafka Connect. Tune in to learn more about Kafka Connect, and get a preview of the Kafka Connect 101 course.
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.