Kafka Connect is a component of Apache Kafka for performing streaming integration between Kafka and other systems like databases, cloud services, search indexes, file systems, and key-value stores.
If you’re new to Apache Kafka, you can read this beginner’s tutorial to get started.
Kafka Connect makes it easy to stream data from numerous sources into Kafka, and stream data from Kafka to numerous targets. There are hundreds of different connectors available for Kafka Connect, covering popular databases, cloud services, search indexes, file systems, and key-value stores.
You can run Kafka Connect yourself or take advantage of the numerous managed connectors provided in Confluent Cloud for a fully cloud-based integration solution. In addition to managed connectors, Confluent provides fully managed Apache Kafka, Schema Registry, and ksqlDB.
Kafka Connect runs in its own process, separate from the Kafka brokers. It is distributed, scalable, and fault tolerant, just like Kafka itself. Using Kafka Connect requires no programming, because it is driven by JSON configuration alone. This makes it accessible to a wide range of users. In addition to ingesting and egressing data, Kafka Connect can also perform lightweight transformations on the data as it passes through.
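To give a feel for that configuration-driven approach, here is a sketch of a source connector definition. The connector class shown is the Confluent JDBC source connector, and the connection details, table column, and field names are hypothetical placeholders; the `transforms.*` entries show a lightweight transformation (a Single Message Transform) that stamps each record with a static field as it passes through:

```json
{
  "name": "jdbc-source-orders",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db.example.com:5432/orders",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "db-",
    "transforms": "addSource",
    "transforms.addSource.type": "org.apache.kafka.connect.transforms.InsertField$Value",
    "transforms.addSource.static.field": "source_system",
    "transforms.addSource.static.value": "orders-db"
  }
}
```

Submitting a document like this to a Connect worker's REST API is all it takes to start the pipeline; there is no application code to write, build, or deploy.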
Anytime you are looking to stream data into Kafka from another system, or stream data from Kafka to elsewhere, Kafka Connect should be your first port of call. Here are a few common ways Kafka Connect is used:
Kafka Connect can be used to ingest real-time streams of events from a source such as a transactional database and stream them to a target system for analytics. Because Kafka retains data for a configurable period per topic, it is possible to stream the same original data to multiple targets. This could be to use different technologies for different business requirements, or to make the same data available to different areas of a business that each have their own systems in which to hold it.
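That fan-out pattern falls out naturally from Connect's design: each sink connector is an independent consumer of the topic, so two targets are simply two connector definitions. As an illustrative sketch (the topic name, hosts, and bucket are hypothetical; the connector classes are the Confluent Elasticsearch and S3 sinks), both of these read the same `db-orders` topic:

```json
{
  "name": "orders-to-elasticsearch",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "db-orders",
    "connection.url": "http://elasticsearch.example.com:9200"
  }
}

{
  "name": "orders-to-s3",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "db-orders",
    "s3.bucket.name": "orders-archive",
    "s3.region": "us-east-1"
  }
}
```

Each connector tracks its own consumer offsets, so the analytics target and the archive target can be added, removed, or replayed independently without affecting one another.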
In your application, you may create data that you want to write to a target system. This could be a series of logging events to write to a document store or data to persist to a relational database. By writing the data to Kafka and using Kafka Connect to take responsibility for writing that data to the target, you simplify the footprint.
Before the advent of more recent technologies, such as NoSQL stores, event streaming platforms, and microservices, relational databases (RDBMS) were the de facto place to which all data from an application was written. RDBMS still have a hugely important role to play in the systems that we build—but not always. Sometimes we will want to use Kafka as the message broker between independent services as well as the permanent system of record. These two approaches are very different, but unlike technology changes in the past, there is a seamless route from one to the other.
By utilizing change data capture (CDC), it is possible to extract in near real time every INSERT, UPDATE, and even DELETE from a database into a stream of events in Kafka. CDC has a very low impact on the source database, meaning that the existing application can continue running (and requires no changes to be made to it) while new applications can be built, driven by the stream of events captured from a database. When the original application records something in the database—for example, an order gets accepted—any application subscribed to the stream of events in Kafka will be able to take an action based on the events—for example, a new order fulfilment service.
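A common way to do this is with a Debezium source connector running on Kafka Connect. The sketch below shows a Debezium PostgreSQL connector definition; the hostnames and credentials are hypothetical placeholders, and some configuration keys vary between Debezium versions:

```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.example.com",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "cdc_password",
    "database.dbname": "orders",
    "topic.prefix": "shop"
  }
}
```

Debezium reads the database's transaction log rather than querying the tables, which is why the impact on the source database stays low: each committed INSERT, UPDATE, or DELETE appears as an event on a Kafka topic for downstream services to consume.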
Many organizations have data at rest in their databases, such as Postgres, MySQL, or Oracle, and can get value from that existing data using Kafka Connect to turn it into a stream of events. You can see this in action in the streaming pipeline example, driving analytics using the existing data.
Apache Kafka has its own very capable producer and consumer APIs and client libraries available in many languages, including C/C++, Java, Python, and Go. So you would be quite right to wonder why you don’t just write your own code to go and get data from a system and write it to Kafka—doesn’t it make sense to write a quick bit of consumer code to read from a topic and push it to a target system?
The problem is that if you are going to do this properly, then you will realize that you need to cater for failures, for restarts, for logging, for scaling out and back down again elastically, and for running across multiple nodes. That’s before we’ve thought about serialization and data formats. Once you’ve done all of these things, you’ve written something that is probably rather like Kafka Connect, but without the many years of development, testing, production validation, and community. Even if you have built a better mousetrap, is all the time that you’ve spent writing that code resulting in something that significantly differentiates your business from anyone else doing similar integration?
Streaming integration with Kafka is a solved problem. There are perhaps a few edge cases where a bespoke solution is appropriate, but by and large, you’ll find that Kafka Connect should be your first port of call for integration with Kafka.