VP Developer Relations
In the world of information storage and retrieval, some systems are not Apache Kafka. Sometimes you would like the data in those other systems to get into Kafka topics, and sometimes you would like data in Kafka topics to get into those systems. Kafka Connect, Apache Kafka's integration API, does exactly this.
Kafka Connect is, on the one hand, an ecosystem of pluggable connectors and, on the other, a client application. As a client application, Connect is a server process that runs on hardware independent of the Kafka brokers themselves. It is scalable and fault-tolerant: you can run not just a single Connect worker but a cluster of Connect workers that share the load of moving data between Kafka and external systems. Kafka Connect also abstracts the integration code away from the user, requiring only JSON configuration to run. For example, here’s how you’d stream data from Kafka to Elasticsearch:
{
  "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
  "topics": "my_topic",
  "connection.url": "http://elasticsearch:9200",
  "type.name": "_doc",
  "key.ignore": "true",
  "schema.ignore": "true"
}
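To run a connector with a configuration like this, the JSON is submitted to a Connect worker's REST interface. The Python sketch below builds (but does not send) such a request using only the standard library. It assumes a worker reachable at http://localhost:8083, Connect's default REST port, and uses a hypothetical connector name, elasticsearch-sink. The endpoint used, PUT /connectors/{name}/config, creates the connector if it does not exist and updates its configuration otherwise.

```python
import json
import urllib.request

# Connector configuration from the Elasticsearch example.
ELASTICSEARCH_SINK_CONFIG = {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "my_topic",
    "connection.url": "http://elasticsearch:9200",
    "type.name": "_doc",
    "key.ignore": "true",
    "schema.ignore": "true",
}


def build_connector_request(worker_url, name, config):
    """Build a PUT /connectors/{name}/config request without sending it."""
    return urllib.request.Request(
        url=f"{worker_url.rstrip('/')}/connectors/{name}/config",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumed worker address and hypothetical connector name.
request = build_connector_request(
    "http://localhost:8083", "elasticsearch-sink", ELASTICSEARCH_SINK_CONFIG
)
print(request.get_method(), request.full_url)

# To actually submit it to a running worker, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```

Nothing here is specific to Elasticsearch: swapping in a different connector class and its parameters is all it takes to target another system, which is the point of the declarative approach.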
A Connect worker runs one or more connectors. A connector is a pluggable component that is responsible for interfacing with the external system. A source connector reads data from an external system and produces it to a Kafka topic. A sink connector subscribes to one or more Kafka topics and writes the messages it reads to an external system. Each connector is either a source or a sink connector, but it is worthwhile to remember that the Kafka cluster only sees a producer or a consumer in either case. Everything that is not a broker is a producer or a consumer.
One of the primary advantages of Kafka Connect is its large ecosystem of connectors. The code that moves data to a cloud blob store, writes to Elasticsearch, or inserts records into a relational database is unlikely to vary from one business to the next. Likewise, reading from a relational database, Salesforce, or a legacy HDFS filesystem is the same operation no matter what sort of application does it. You can definitely write this code, but spending your time doing that doesn’t add unique value for your customers or make your business more competitive.
All of these are examples of Kafka connectors available in the Confluent Hub, a curated collection of connectors of all sorts and most importantly, all licenses and levels of support. Some are commercially licensed and some can be used for free. Confluent Hub lets you search for source and sink connectors of all kinds and clearly shows the license of each connector. Of course, connectors need not come from the Hub and can be found on GitHub or elsewhere in the marketplace. And if after all that you still can’t find a connector that does what you need, you can write your own using a fairly simple API.
Now, it might seem straightforward to build this kind of functionality on your own: If an external source system is easy to read from, it would be easy enough to read from it and produce to a destination topic. If an external sink system is easy to write to, it would again be easy enough to consume from a topic and write to that system. But any number of complexities arise, including how to handle failover, horizontally scale, manage commonplace transformation operations on inbound or outbound data, distribute common connector code, configure and operate this through a standard interface, and more.
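One of those complexities, resuming after a failure, shows up even in a trivial hand-rolled integration. The following Python sketch is purely illustrative and is not the Connect API: a hypothetical poller reads rows with increasing IDs from an in-memory stand-in for a source table and checkpoints the last ID it saw. The checkpoint is kept in a plain dict (`checkpoint_store`, a made-up name); in a real deployment it would have to live in durable storage shared by all workers, which is the kind of state management Connect takes care of for its connectors.

```python
# Stand-in for an external source table: (id, payload) rows ordered by id.
SOURCE_ROWS = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]


def poll(checkpoint_store, batch_size):
    """Return up to batch_size rows newer than the checkpointed ID, then advance it."""
    last_id = checkpoint_store.get("last_id", 0)
    batch = [row for row in SOURCE_ROWS if row[0] > last_id][:batch_size]
    if batch:
        # In a real pipeline the rows would be produced to a Kafka topic here,
        # and the checkpoint written only after that succeeds.
        checkpoint_store["last_id"] = batch[-1][0]
    return batch


# The checkpoint lives outside any single worker, so it survives a worker crash.
checkpoint_store = {}

# Worker A reads the first two rows, then "crashes".
first_worker_batch = poll(checkpoint_store, batch_size=2)
# Worker B starts with the same checkpoint store and resumes where A stopped.
second_worker_batch = poll(checkpoint_store, batch_size=10)

print(first_worker_batch)
print(second_worker_batch)
```

If the checkpoint were kept in the worker's own memory instead, the replacement worker would start again from ID 0 and re-read rows 1 and 2.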
Connect seems deceptively simple on its surface, but it is in fact a complex distributed system and plugin ecosystem in its own right. And if that plugin ecosystem happens not to have what you need, the open-source Connect framework makes it simple to build your own connector and inherit all the scalability and fault tolerance properties Connect offers.
For a more detailed introduction to Kafka Connect, check out the Kafka Connect 101 course.
Now it is a fact in the world of information storage and retrieval that some systems are not Kafka. Some say that's an unfortunate fact, but we all agree at minimum that it is true. And sometimes you'd like the data that are in those other systems to get into Kafka topics. And sometimes you'd like the data in Kafka topics to get into those systems. This is the job of Kafka Connect, Kafka's integration API and, really, subsystem. Connect is, on the one hand, an ecosystem of pluggable connectors. On the other hand, it's a client application. That is, to a Kafka cluster, Connect looks like a producer or a consumer or both, because remember, everything that's not a broker is one of those things.
So: ecosystem of pluggable connectors, client application. Let's dig into this. As a client application, Connect is a server process that runs on hardware independent of the Kafka brokers themselves. This is an application running outside the cluster. It is designed to be scalable and fault tolerant, meaning you can run not just one single Connect worker, but you can have a cluster of Connect workers, of individual nodes, individual instances running the Connect process, to share the load of moving data in and out of Kafka, between Kafka and these external systems.
Connect also abstracts a lot of this data integration code proper away from the user, that user in this case being you, and instead requires you just to write some JSON configuration to run a connector. So for example, this would be how you would stream data from Kafka into Elasticsearch. You'd use this little bit of JSON configuration to configure a connector. Would you write the code that subscribes to a topic, gets messages, uses the Elastic API? No, you would not do any of that. That's the business you want to get out of. You want someone, or some subset of the community, to have written that Elasticsearch connector for you. You simply deploy that connector to your Connect cluster.
And then you declaratively announce: this is where the cluster is, this is the topic, and any other parameters that might be involved in that sourcing or sinking of data.
A Connect worker, which is one of these nodes in the Connect cluster, runs one or more connectors. And "connector" kind of gets two different senses here. So a connector is a pluggable component that's responsible for interfacing with that external system. And in its simplest definition, a connector is a JAR file with all of that JVM Connect code in it. So it's that component. Connector also has a runtime sense. So when I've got that little snippet of JSON and I've posted that to the REST endpoint on the Connect cluster and said, okay, run this, well, then that JAR that's deployed there, that kind of physical connector, becomes a runtime connector.
Some key terms. I said these a moment ago, but to be precise: a source connector reads data from an external system and produces it to a Kafka topic. A sink connector subscribes to one or more Kafka topics and then writes those messages, the messages that it reads from Kafka, to an external system. Each connector is either a source or a sink connector, but it's worthwhile to remember that the Kafka cluster only sees a producer or a consumer in either case. That's all you can be to Kafka. So if you're producing, that means you're a source connector. If you're consuming, that means you're a sink connector.
One of the primary advantages of Connect is the gigantic ecosystem of connectors that someone else has written and maybe lots of people have deployed and tested. Writing the code that moves data to a cloud blob store, or writes it to Elasticsearch, or inserts records into a relational database, or whatever: that's code that is, we'll say, unlikely to vary from one business to the next. Everybody does that in the same way. It does not differentiate you to write that code.
Likewise, reading from a relational database, getting messages from salesforce.com, a legacy HDFS file system, or something like that: that's the same operation no matter what sort of application does it. You can definitely write this code if it's fun and nobody's looking, but spending your time doing that doesn't add any kind of unique value to your customers or make your business more uniquely competitive. It's what we call undifferentiated code. And when you can avoid writing that kind of code, you should.
And by the way, all of those examples of connectors that I just gave, those are the kinds of connectors that are available on the Confluent Hub, which is a curated collection of connectors of all sorts and, importantly, of all kinds of licenses as well. Some of them are commercially licensed and maybe even really expensive. Some can be used for free. Some have standard open source licenses. Some have community licenses. It's a variety of things there, and the license is disclosed in every case, so you're able to see what you're getting. In addition, Confluent Hub exposes a search function so that you can search for the kind of connector you're looking for.
And of course, connectors don't need to come from the Hub. It's nice, it's centrally curated, it's searchable. There's a command line tool for downloading things from there and installing them into your Kafka Connect instance. There's lots of convenient tooling around it, but you can just find them on GitHub or elsewhere. Someone could hand you a thumb drive in a parking lot that had a connector on it and you could build it and run it. And I'm sure that would be fine. There are other commercial connectors in the marketplace that aren't on Confluent Hub and that you get through other means. This is really, in the broadest sense, an ecosystem: Connect offers an API, and people code against that API.
And if you can't find a connector on Confluent Hub, on GitHub, generally out in the marketplace, handed to you on a drive in a parking lot, you know, whatever your distribution mechanism is, the API that Connect exposes is frankly pretty friendly. There's not a lot to it. So you can pretty easily write your own connector code. I've just made this case about how you definitely, no matter what, don't want to write this kind of code. Sometimes you have to, right? Maybe your thing doesn't have a connector and you have to do it.
The difficulty, and this point I wanna make carefully, the difficulty of writing a connector is gonna come in that external interface. So if that is painful for some reason, or if there's just something that's fundamentally mismatched about event data and that interface, like that external thing is inherently synchronous somehow. I mean, there are various forms of impedance mismatch that can happen. There are various forms of unpleasant integration APIs in the world. So connectors can give you friction, but the Connect API side of things, let me assure you, is really not that bad.
And you know, when I talk about Connect like this, it seems deceptively simple on the surface. And I've certainly had folks ask me in person when I've been talking about this, you know, why would I use that? Why wouldn't I just write my own? And I say, well, you'll learn. You know, you can kind of tell this is a fairly complex distributed system in its own right. And the value of the plugin ecosystem is difficult to overstate. So all those things, I mean, just to give you a quick idea: say you're reading from a relational database and, you know, you're keeping track of the last ID that you saw. Well, you need to put that last ID somewhere, right? You need to store that state. Where's that gonna go? And what if your worker fails suddenly and a new worker comes up? Where is it gonna go look to get that ID?
There's this kind of distributed state management problem that even in a super trivial connector is obvious. And all these things are things that Connect has solved. It's provided this ecosystem. So all the usual suspects of interfacing that you're gonna have to do, you're gonna have at least one connector to choose from. So you definitely want to rely on that community effect and just the existing functionality and scalability and fault tolerance that Connect gives you.