Sr. Director, Developer Advocacy (Presenter)
Principal Developer Advocate (Author)
Previously, we outlined the data sources that we're going to use in our pipeline and set up a data generator to create the ratings events. Now let's access our database to get the reference information about the customers who are leaving these ratings.
We'll use something called change data capture (often called CDC), plus snapshots. This approach enables us to capture everything already in the database, along with new changes made to the data. There are two flavors of CDC:
Query-based CDC
Log-based CDC
Which one you should use depends on several factors. First off, we need to understand a little bit about how each style of CDC works.
Query-based CDC uses a database query to pull new data from the database. The query will include a predicate to identify what has changed. This will be based on a timestamp field or an incrementing identifier column (or both).
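To make the predicate idea concrete, here is a minimal sketch of query-based polling using Python's built-in sqlite3 module. The customers table, its updated_at and id columns, and the poll_changes helper are all assumptions made up for illustration; this is not how any particular connector is implemented.

```python
import sqlite3


def poll_changes(conn, last_ts, last_id):
    """Return rows changed since the (last_ts, last_id) high-water mark.

    The predicate combines a timestamp column with an incrementing id so
    that rows sharing the last-seen timestamp are not skipped.
    """
    cur = conn.execute(
        """
        SELECT id, name, updated_at
        FROM customers
        WHERE updated_at > ?
           OR (updated_at = ? AND id > ?)
        ORDER BY updated_at, id
        """,
        (last_ts, last_ts, last_id),
    )
    return cur.fetchall()


# Hypothetical table and data, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ana", 100), (2, "Bo", 105)],
)

# First poll: nothing has been seen yet, so every existing row is returned.
first = poll_changes(conn, 0, 0)
last_ts, last_id = first[-1][2], first[-1][0]

# One row is updated and one is inserted; both carry newer timestamps.
conn.execute("UPDATE customers SET name = 'Ana B', updated_at = 110 WHERE id = 1")
conn.execute("INSERT INTO customers VALUES (3, 'Cy', 112)")
second = poll_changes(conn, last_ts, last_id)
last_ts, last_id = second[-1][2], second[-1][0]

# A deleted row simply disappears from the table, so the next poll is empty.
conn.execute("DELETE FROM customers WHERE id = 2")
third = poll_changes(conn, last_ts, last_id)
```

In this sketch the second poll returns only the updated row and the new row, while the third poll returns nothing: a query can only report rows that still exist in the table.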
Query-based CDC is provided by the JDBC connector for Kafka Connect, available as a fully managed service in Confluent, or as a self-managed connector.
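As a rough illustration, a self-managed JDBC source connector that polls on both a timestamp and an incrementing column might be configured along these lines. The connector name, connection URL, table, column names, and topic prefix are placeholders assumed for this sketch, and the fully managed connector uses its own configuration format, so check the connector documentation before relying on any of it.

```python
import json

# Hypothetical configuration for a self-managed JDBC source connector.
# Every value below is a placeholder; only the property names are meant
# to reflect the connector's documented settings.
connector = {
    "name": "jdbc-customers-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:mysql://localhost:3306/demo",
        "table.whitelist": "customers",
        # Detect changes using both a timestamp and an incrementing id.
        "mode": "timestamp+incrementing",
        "timestamp.column.name": "updated_at",
        "incrementing.column.name": "id",
        "topic.prefix": "mysql-",
    },
}

# JSON payload of the kind accepted by the Kafka Connect REST API.
payload = json.dumps(connector, indent=2)
print(payload)
```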
Log-based CDC uses the database's transaction log to extract details of every change made. The transaction log implementation and its specifics vary by database, but all are based on the same principle: every change made in the database is written to its transaction log (known, depending on the database, as the redo log, binlog, and so on).
The changes written to the transaction log include inserts, updates, and even deletes. So when two rows are written to the database, two entries are added to the transaction log. Those two entries from the transaction log are decoded and the actual data from the database row is written to two new events in Apache Kafka.
One of the several benefits of log-based CDC is that it can capture not just what the table rows look like now, but also what they looked like before they were changed.
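The idea can be sketched with a toy in-memory database that appends every change to a log, plus a decoder that turns log entries into change events. TinyDatabase, ChangeEvent, and decode_log are hypothetical names invented for this sketch; real transaction logs are binary, database-specific structures, and real connectors emit richer events than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeEvent:
    op: str                  # "insert", "update", or "delete"
    before: Optional[dict]   # row state before the change (None for inserts)
    after: Optional[dict]    # row state after the change (None for deletes)


class TinyDatabase:
    """Toy table that records every change in an append-only log."""

    def __init__(self):
        self.rows = {}
        self.log = []

    def insert(self, key, row):
        self.rows[key] = dict(row)
        self.log.append(("insert", key, None, dict(row)))

    def update(self, key, changes):
        before = dict(self.rows[key])
        self.rows[key].update(changes)
        self.log.append(("update", key, before, dict(self.rows[key])))

    def delete(self, key):
        before = self.rows.pop(key)
        self.log.append(("delete", key, before, None))


def decode_log(log, start=0):
    """Decode log entries from position `start` into one event per entry."""
    return [ChangeEvent(op, before, after) for op, _key, before, after in log[start:]]


db = TinyDatabase()
db.insert(1, {"id": 1, "name": "Ana"})
db.insert(2, {"id": 2, "name": "Bo"})
db.update(1, {"name": "Ana B"})
db.delete(2)

events = decode_log(db.log)
for event in events:
    print(event)
```

Four changes produce four events. The update event carries both the old and the new version of the row, and the delete event still carries the row as it looked before it was removed.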
Popular implementations of log-based CDC include the connectors from the Debezium project and the Confluent Oracle CDC Source Connector. The Debezium connectors are available fully managed on Confluent and support several databases, including PostgreSQL, MySQL, and SQL Server.