Having learned a bit about the components within Kafka Connect, let’s now turn to how we can actually run a connector. When we create a connector, we specify its logical configuration. It’s physically executed by a thread known as a task. Here are two logical connectors, each with one task:
The execution of a connector’s ingest or egress of data can also be parallelized (if the connector supports it). In that case, additional tasks are spawned. This could mean that when ingesting from a database, multiple tables are read at once, or when writing to a target data store, data is read concurrently from multiple partitions to increase throughput.
But where do the tasks actually run? Kafka Connect runs under the Java virtual machine (JVM) as a process known as a worker. Each worker can execute multiple connectors.
A Kafka Connect worker can be run in one of two deployment methods: standalone or distributed. Each has its pros and cons, which we will now discuss. The way in which you configure and operate Kafka Connect in these two modes is different.
Despite its name, this deployment mode is equally valid for a single worker deployed in a sandbox or development environment. In the distributed mode, Kafka Connect uses Kafka topics to store state—configuration, connector status, etc. The topics are configured to retain this information indefinitely, known as compacted topics. Connectors are created and managed via the REST API that Kafka Connect offers.
This means that you can then add additional workers easily, as they can read everything that they need from Kafka. When you add workers from a Kafka Connect cluster, the tasks are rebalanced across the available workers to distribute the workload. If you decide to scale down your cluster (or a worker crashes), Kafka Connect will rebalance again to ensure that all the connector tasks are still executed.
The minimum number of workers recommended is two so that you have fault tolerance. You can add additional workers to the cluster as your throughput needs increase. You can opt to have fewer, bigger clusters of workers, or you may choose to deploy a greater number of smaller clusters in order to physically isolate workloads. Both are valid approaches and usually dictated by organizational structure and responsibility for the respective pipelines implemented in Kafka Connect.
In standalone mode, the Kafka Connect worker uses file storage for its state. Connectors are created from local configuration files, not the REST API.
Consequently, you cannot cluster workers together, meaning that you cannot scale for throughput nor have fault-tolerant behavior.
Because there is no clustering, you can know for certain on which machine a connector’s task will be executing (i.e., the machine on which you’ve deployed the standalone worker). This means that standalone mode is appropriate if you have a connector that needs to execute with server locality, for example, reading from files on a particular machine or ingesting data sent to a network port at a fixed address.
We will only share developer content and updates, including notifications when new content is added. We will never send you sales emails. 🙂 By subscribing, you understand we will process your personal information in accordance with our Privacy Statement.