Danica Fine, Senior Developer Advocate (Presenter)
In this module, you will learn about the distributed and standalone deployment modes for Kafka Connect. Note that deploying Kafka Connect infrastructure is only required if you are running self-managed Kafka Connect; if you use Confluent managed connectors, all infrastructure deployment is taken care of by Confluent.
Now that we have learned a bit about the components within Kafka Connect, let's turn to how we can actually run a connector. When we add a connector instance, we specify its logical configuration; it is physically executed by a thread known as a task. In the diagram above we see two logical connectors, each with one task.
The execution of a connector’s ingest or egress of data can also be parallelized (if the connector supports it). In that case, additional tasks are spawned. This could mean that when ingesting from a database, multiple tables are read at once, or when writing to a target data store, data is read concurrently from multiple partitions of the underlying Kafka topic to increase throughput.
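To make this concrete, here is a sketch of what a connector's logical configuration could look like, using the Confluent JDBC source connector as an example. The connection details and names are placeholders, not values from this course:

    # Hypothetical JDBC source connector configuration (placeholder values)
    name=orders-jdbc-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:postgresql://db.example.com:5432/orders
    topic.prefix=db-orders-
    mode=incrementing
    incrementing.column.name=id
    # Allow up to three tasks, e.g., reading several tables in parallel
    tasks.max=3

Note that tasks.max only sets an upper bound; the connector itself decides how many tasks it can actually make use of.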
But where do the tasks actually run? Kafka Connect runs under the Java virtual machine (JVM) as a process known as a worker. Each worker can execute multiple connectors. When you look to see if Kafka Connect is running, or want to look at its log file, it's the worker process that you're looking at. Tasks are executed by Kafka Connect workers.
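Since each worker exposes Kafka Connect's REST interface (on port 8083 by default), one quick way to check that a worker process is up is to query its root endpoint, which reports the worker's version information. The values shown here are illustrative:

    $ curl http://localhost:8083/
    {"version":"3.6.0","commit":"...","kafka_cluster_id":"..."}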
A Kafka Connect worker can be run in one of two deployment modes: standalone or distributed. The way in which you configure and operate Kafka Connect in these two modes is different and each has its pros and cons.
Despite its name, the distributed deployment mode is equally valid for a single worker deployed in a sandbox or development environment. In this mode, Kafka Connect uses Kafka topics to store state pertaining to connector configuration, connector status, and more. These topics are compacted, meaning that Kafka retains the latest state for each key indefinitely. Connector instances are created and managed via the REST API that Kafka Connect offers.
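As a sketch, a minimal distributed-mode worker configuration looks something like this. The group.id identifies the Connect cluster that the worker joins, and the three storage topics hold the shared state; the topic names and single-broker replication factors below are development-environment assumptions:

    # connect-distributed.properties (illustrative values)
    bootstrap.servers=localhost:9092
    group.id=connect-cluster
    # Compacted topics storing connector configs, offsets, and status
    config.storage.topic=connect-configs
    offset.storage.topic=connect-offsets
    status.storage.topic=connect-status
    # A replication factor of 1 only makes sense for a dev setup
    config.storage.replication.factor=1
    offset.storage.replication.factor=1
    status.storage.replication.factor=1
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter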
The distributed mode is the recommended best practice for most use cases.
Since all offsets, configs, and status information for a distributed-mode cluster are maintained in Kafka topics, you can add additional workers easily: they can read everything they need from Kafka. When you add workers to a Kafka Connect cluster, the tasks are rebalanced across the available workers to distribute the workload. If you decide to scale down your cluster (or even if something outside your control happens and a worker crashes), Kafka Connect will rebalance again to ensure that all the connector tasks are still executed.
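In practice, scaling out can be as simple as starting another worker process on a new machine with the same group.id, pointing at the same Kafka cluster. The script and file paths below assume the Apache Kafka distribution layout:

    # On the new machine, reuse the same worker configuration
    $ bin/connect-distributed.sh config/connect-distributed.properties

The new worker joins the group, and Kafka Connect rebalances the tasks across all available workers.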
The minimum number of workers recommended is two so that you have fault tolerance. But of course, you can add additional workers to the cluster as your throughput needs increase. You can opt to have fewer, bigger clusters of workers, or you may choose to deploy a greater number of smaller clusters in order to physically isolate workloads. Both are valid approaches and are usually dictated by organizational structure and responsibility for the respective pipelines implemented in Kafka Connect.
On the other hand, in standalone mode, the Kafka Connect worker uses files to store its state. Connectors are created from local configuration files, not the REST API. Consequently, you cannot cluster workers together, meaning that you cannot scale for throughput or have fault-tolerant behavior.
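A standalone worker, sketched below, stores its offsets in a local file instead of a Kafka topic, and you pass the connector configuration files on the command line when you start it. The file names and paths are illustrative:

    # connect-standalone.properties (illustrative values)
    bootstrap.servers=localhost:9092
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    # Offsets are kept in a local file rather than a Kafka topic
    offset.storage.file.filename=/tmp/connect.offsets

    # Start the worker with one or more connector config files
    $ bin/connect-standalone.sh connect-standalone.properties local-file-source.properties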
Because there is just a single worker, you know for certain on which machine a connector’s task will be executing (i.e., the machine on which you’ve deployed the standalone worker). This means that standalone mode is appropriate if you have a connector that needs to execute with server locality, for example, reading from files on a particular machine or ingesting data sent to a network port at a fixed address.
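For example, the FileStreamSource connector that ships with Apache Kafka reads from a file on the worker's local filesystem, which only behaves predictably when you know exactly where the task runs. The file and topic names below are placeholders:

    # local-file-source.properties (placeholder values)
    name=local-file-source
    connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
    tasks.max=1
    file=/var/log/app/server.log
    topic=server-logs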
You can satisfy this same requirement using Kafka Connect in the distributed mode with a single-worker Connect cluster. This provides the benefit of having offsets, configs, and status information stored in Kafka topics.
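On such a single-worker distributed cluster, the same hypothetical connector would be created through the REST API rather than a local properties file. The host, port, and connector name here are assumptions:

    # Create (or update) the connector via the worker's REST API
    $ curl -X PUT -H "Content-Type: application/json" \
        http://localhost:8083/connectors/local-file-source/config \
        -d '{
              "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
              "tasks.max": "1",
              "file": "/var/log/app/server.log",
              "topic": "server-logs"
            }'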