Course: Building Data Pipelines with Apache Kafka® and Confluent

Ingest Data from Databases into Kafka with Change Data Capture (CDC)

8 min
Tim Berglund, Sr. Director, Developer Advocacy (Course Presenter)
Robin Moffatt, Staff Developer Advocate (Course Author)

Previously, we outlined the data sources that we're going to use in our pipeline and set up a data generator to create the ratings events. Now let's look at how to access our database to get the reference information about the customers who are leaving these ratings.

We'll use something called change data capture (often called CDC), plus snapshots. This approach enables us to capture everything already in the database, along with new changes made to the data. There are two flavors of CDC:

  • Query-based CDC
  • Log-based CDC

Which one you should use depends on several factors. First off, we need to understand a little bit about how each style of CDC works.

Query-Based CDC

Query-based CDC uses a database query to pull new data from the database. The query will include a predicate to identify what has changed. This will be based on a timestamp field or an incrementing identifier column (or both).
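The polling predicate described above can be sketched in a few lines. This is only an illustration using an in-memory SQLite database, not how any particular connector is implemented; the `customers` table, its `id` and `updated_at` columns, and the `fetch_changes` helper are all hypothetical names chosen for the example.

```python
import sqlite3


def fetch_changes(conn, last_ts, last_id):
    """Return rows changed since the (last_ts, last_id) offset, oldest first.

    The predicate combines a timestamp column with an incrementing id so that
    rows sharing the same timestamp are not skipped or re-read.
    """
    cur = conn.execute(
        """
        SELECT id, name, updated_at
        FROM customers
        WHERE updated_at > ?
           OR (updated_at = ? AND id > ?)
        ORDER BY updated_at, id
        """,
        (last_ts, last_ts, last_id),
    )
    return cur.fetchall()


# Hypothetical source table (names are assumptions for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alice", 100), (2, "Bob", 100)],
)

# First poll: nothing has been read yet, so this acts as the snapshot.
offset = (0, 0)
first_poll = fetch_changes(conn, *offset)
offset = (first_poll[-1][2], first_poll[-1][0])  # remember the last row seen

# The application updates one row, inserts another, and deletes a third.
conn.execute("UPDATE customers SET name = 'Alicia', updated_at = 200 WHERE id = 1")
conn.execute("INSERT INTO customers VALUES (3, 'Carol', 200)")
conn.execute("DELETE FROM customers WHERE id = 2")

# Second poll: only rows matching the predicate come back.
second_poll = fetch_changes(conn, *offset)
print(first_poll)
print(second_poll)
```

In this sketch the second poll returns the updated row and the new row. The deleted row simply no longer exists, so a query has nothing to return for it.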

Query-based CDC is provided by the JDBC connector for Kafka Connect, available as a fully managed service in Confluent, or as a self-managed connector.
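As a rough idea of what configuring this looks like for the self-managed connector, the sketch below builds a connector definition as a Python dictionary and prints it as JSON. The connector name, connection details, table, and column names are hypothetical, and the property names should be checked against the JDBC source connector documentation for the version you run; the fully managed connector may use different settings.

```python
import json

# Hypothetical definition for a self-managed JDBC source connector.
# All values below are placeholders chosen for illustration.
jdbc_source_config = {
    "name": "jdbc-source-customers",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:mysql://localhost:3306/demo",  # hypothetical database
        "connection.user": "connect_user",  # hypothetical credentials
        "connection.password": "********",
        "table.whitelist": "customers",  # hypothetical table
        # Use both a timestamp and an incrementing column to detect changes.
        "mode": "timestamp+incrementing",
        "timestamp.column.name": "updated_at",  # hypothetical column
        "incrementing.column.name": "id",  # hypothetical column
        "topic.prefix": "mysql-",  # topic becomes prefix + table name
        "poll.interval.ms": "5000",  # how often to run the query
    },
}

payload = json.dumps(jdbc_source_config, indent=2)
print(payload)
```

A JSON document of this shape is what you would typically submit to the Kafka Connect REST API to create the connector.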

Log-Based CDC

Log-based CDC uses the database's transaction log to extract details of every change made. The transaction log implementation and its specifics vary by database, but all are based on the same principle: every change made in the database is written to its transaction log (known in different implementations as the redo log, binlog, and so on).

The changes written to the transaction log include inserts, updates, and even deletes. So when two rows are written to the database, two entries are added to the transaction log. Those two entries from the transaction log are decoded and the actual data from the database row is written to two new events in Apache Kafka.

One of the several benefits of log-based CDC is that it can capture not just what the table rows look like now, but also what they looked like before they were changed.
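To make the idea concrete, here is a toy model of decoding log entries into change events. It uses an in-memory list as the "transaction log"; the `LogEntry` class, the `decode` function, and the event layout are assumptions made for this sketch and do not reflect any real database's log format or any connector's event schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEntry:
    """One entry in a hypothetical, simplified transaction log."""
    op: str                  # "insert", "update", or "delete"
    key: int                 # primary key of the affected row
    before: Optional[dict]   # row state before the change (None for inserts)
    after: Optional[dict]    # row state after the change (None for deletes)


def decode(entry: LogEntry) -> dict:
    """Turn one log entry into one change event."""
    return {
        "key": entry.key,
        "op": entry.op,
        "before": entry.before,
        "after": entry.after,
    }


# Two inserts, then an update and a delete, in commit order.
transaction_log = [
    LogEntry("insert", 1, None, {"id": 1, "name": "Alice"}),
    LogEntry("insert", 2, None, {"id": 2, "name": "Bob"}),
    LogEntry("update", 1, {"id": 1, "name": "Alice"}, {"id": 1, "name": "Alicia"}),
    LogEntry("delete", 2, {"id": 2, "name": "Bob"}, None),
]

# Each log entry becomes exactly one event.
events = [decode(entry) for entry in transaction_log]
for event in events:
    print(event)
```

Each entry yields one event, the update carries both the old and new row state, and the delete is captured as an event with no "after" state.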

Popular implementations of log-based CDC include the connectors from the Debezium project, which support several databases, including PostgreSQL, MySQL, and SQL Server, and are available fully managed on Confluent. Another is the Confluent Oracle CDC Source Connector.
