January 11, 2021 | Episode 138

Change Data Capture and Kafka Connect on Microsoft Azure ft. Abhishek Gupta

Tim Berglund:

I got to sit down with Microsoft Azure Cloud advocate, Abhishek Gupta, to talk about change data capture, Kafka Connect and all things Azure. It's all on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.

Tim Berglund:

Hello and welcome to another episode of Streaming Audio. I am your host, Tim Berglund, and I'm joined here today in the virtual studio, each of us in separate rooms in our homes, miles and miles apart, by Abhishek Gupta. Abhishek is a cloud advocate at Microsoft who focuses his time on Azure, Kafka, databases, Kubernetes, open-source technologies, basically everything a decent person would want to do, I think. That's just a wholesome list. I like it. Abhishek, welcome to the show.

Abhishek Gupta:

Hi, Tim, thanks for having me. Really excited to be on the show.

Tim Berglund:

Awesome. Awesome. It's great to have you. Hey, so I mean, if you listen to the show you know I ask this question a lot just because I want to know, but how did you get to be a cloud advocate at Microsoft? And tell us as much or as little of that story as you want.

Abhishek Gupta:

Absolutely. So actually I tend to get this question quite often. And in fact many developer advocates who tend to have a background in software development also get similar questions, so thanks for bringing this up. I am more than happy to share insights. So personally I've had the good fortune of experiencing a wide variety of roles, ranging from pure development and engineering work to consulting, product management, and now developer advocacy. But the key thing is that the common theme across all these roles has been the fact that most of my time has been spent working on developer aspects, developer concerns. So, for example, if I were to think back to my last role as a product manager, I mostly worked on these developer-focused PaaS, or platform-as-a-service, product [inaudible 00:02:13] serverless application development.

Abhishek Gupta:

But the important thing was that in addition to being a product manager and executing all those bread and butter responsibilities, I was also very closely working on areas, like I said, the developer aspects, right. The developer experience, building end-to-end solutions for customers, developers, technical content, and really advocating to developers for developers, right. And I had the good fortune, like I said, of being a product manager and doing that firsthand. So, yeah, sort of the undercurrent of my roles in my career so far has been sort of working closely with developers and developer products. And when I had this opportunity to convert this into a full-time role with Microsoft, I grabbed it with both hands and fortunately here I am.

Tim Berglund:

Awesome. I'm glad you are. That sounds like a journey that a lot of us who are in developer advocacy have taken.

Abhishek Gupta:

Yeah.

Tim Berglund:

You start as a developer and then you kind of transition into this, "How can I help my fellow developers?" And then you realize, "Oh, wait, I love doing that. I just want to do that all the time." And you get to do it. It's pretty great.

Abhishek Gupta:

Exactly. Yeah.

Tim Berglund:

Now in that introduction I gave, we talked about Kafka and databases. You work with both Kafka and databases, so that implies a few things. There's kind of some usual suspects there, and CDC is one of them. And we've done some episodes on CDC in the past but it's been a while, so if you're a new listener, this could be a new topic to you. But I want to ask you, first of all, what does CDC stand for, and just kind of walk us through the basics of CDC and Kafka. I mean, that's a huge question, so I think that's the only question I'm going to ask you today. You can probably just take the rest of the time to talk about that. I'm going to put myself on mute and... No, tell us what CDC is.

Abhishek Gupta:

Absolutely. So CDC is short for change data capture, right. I'll just take a step back and sort of break this down. If you really think about it, right, what's the typical way you interact with databases, right? So it's primarily through a query language, right, using either that query language directly or via client drivers. So for relational databases, it's SQL; for Cassandra, it's Cassandra Query Language; for MongoDB, it's a different way of interacting with it, and so on and so forth, right.

Abhishek Gupta:

Now, what's relatively well understood, but not always, is the fact that almost all databases have this notion of logs or write-ahead logs, right, and essentially what these are used for is to capture changes to the data within your database. Now, it could be data within your tables or schemas, right, and these changes are captured within these transaction logs or write-ahead logs, even before they're actually committed to the database, right, committed to the disk.

Abhishek Gupta:

Now, different databases, again, have different terminologies, right, so these logs are often called, for example, the oplog in MongoDB, or transaction logs or commit logs. I think it's transaction logs in Postgres and commit logs in Cassandra, or for MySQL developers and DBAs it's the binlog, right. Yeah, so different databases have different terminologies. But essentially if you were to now think about change data capture, now it kind of turns things around, right, when you compare it to the traditional way of querying the database for accessing the data, right. So think of it this way. You don't call the database, it calls you. That's sort of a mental model.

Tim Berglund:

It's like an audition, right.

Abhishek Gupta:

Yeah.

Tim Berglund:

You don't call it, it calls you. Because normally, just before you go there, backing up a step, that write-ahead log, which goes by all those wonderful names just to keep, you know if you're new to this, just to keep you on your toes. It's like in Lord of the Rings how Aragorn had seven different names and if you're trying to keep track of like, "Is that Elessar? Is that him? I don't know. What?" So it's the same thing where there's all these names for the one thing. But the basic idea there is that when you do something to the database, you issue a query with its query language, like you said, that's the thing that you do, you mutate its state. That mutation gets written to the log and then there's a process in the database that sort of deals with that. It doesn't stay in the log. It gets to go into a useful form in the database. But it gets put in the log and then the database takes it from there and updates a collection or a table or a row or whatever.
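
The write-ahead idea described here can be sketched in a few lines of Python. This is a toy model, not how any real database implements its log; the `ToyDatabase` class and its method names are hypothetical and exist only for illustration.

```python
class ToyDatabase:
    """Toy model: every mutation is appended to a log before the table changes."""

    def __init__(self):
        self.log = []      # append-only list of (op, key, value) entries
        self.table = {}    # the "useful form" that queries read from
        self.applied = 0   # index of the next log entry to apply

    def write(self, op, key, value=None):
        # Step 1: record the intended change in the log first.
        self.log.append((op, key, value))
        # Step 2: a separate step moves logged changes into the table.
        self.apply_pending()

    def apply_pending(self):
        while self.applied < len(self.log):
            op, key, value = self.log[self.applied]
            if op == "delete":
                self.table.pop(key, None)
            else:  # "insert" and "update" both set the current value
                self.table[key] = value
            self.applied += 1


db = ToyDatabase()
db.write("insert", "user:1", "Aragorn")
db.write("update", "user:1", "Elessar")
db.write("delete", "user:1")
print(db.table)     # the table only holds current state
print(len(db.log))  # the log still holds every change
```

The table ends up empty, but the log still holds all three changes. That retained history is what a CDC tool reads.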

Abhishek Gupta:

Yeah.

Tim Berglund:

That's that fundamental component of every database. And, as I've argued elsewhere, of every modern software architecture. There's always a log somewhere. But I interrupted you. You've got that log and now you've got this situation in CDC where, "Don't call us, we'll call you."

Abhishek Gupta:

Yeah.

Tim Berglund:

The database calls you. So keep going with that.

Abhishek Gupta:

Exactly. So, like I said, at the end of the day if I were to just try and summarize this, right, so essentially change data capture is a way to detect all these data changes, updates, deletes, which are happening within a database, and then extracting them in a way or in a form wherein they can be actually useful to you. When I say useful, and I'll just talk about this a little more, right. And let me give you an example here, just to understand this a little better.

Abhishek Gupta:

Now, this is going to be Postgres specific, but you can extend this concept to other databases. So, for example, Postgres has this notion of logical decoding, right. Now, this is a specific feature. I think it was in 9.5 and above. Sorry if I get the version wrong. But it has this feature called logical decoding, which is actually meant to extract these transaction logs and process them using an extensible layer called output plugins. So, for example, you can use a plugin called wal2json to be able to consume these change log events within the Postgres transaction logs as JSON. And another output plugin, and this is developed by the Debezium community, it's called decoderbufs. It is just a way of consuming these as protocol buffers, for example, right.
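
To make the wal2json idea concrete, here is a sketch of flattening one decoded message into simple change records. The sample payload is hand-written to resemble wal2json's version 1 output (a `change` array with `kind`, `schema`, `table`, `columnnames`, `columnvalues`, and `oldkeys` for deletes). Treat that field layout as an assumption to check against the plugin version you run; the table and values are made up.

```python
import json

# Hand-written sample resembling wal2json output (assumed layout, made-up data).
SAMPLE = """
{"change": [
  {"kind": "insert", "schema": "public", "table": "orders",
   "columnnames": ["id", "status"], "columnvalues": [42, "NEW"]},
  {"kind": "delete", "schema": "public", "table": "orders",
   "oldkeys": {"keynames": ["id"], "keyvalues": [41]}}
]}
"""

def flatten(message):
    """Turn one decoded transaction into a list of simple change dicts."""
    events = []
    for change in json.loads(message)["change"]:
        if change["kind"] == "delete":
            # Deletes carry only the old key columns.
            keys = change["oldkeys"]
            row = dict(zip(keys["keynames"], keys["keyvalues"]))
        else:
            row = dict(zip(change["columnnames"], change["columnvalues"]))
        events.append({
            "op": change["kind"],
            "table": f'{change["schema"]}.{change["table"]}',
            "row": row,
        })
    return events

for event in flatten(SAMPLE):
    print(event)
```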

Abhishek Gupta:

And not to forget that, by default, Postgres has again from version, I think it's, 10 onwards. I keep forgetting it. So by default it has an output plugin which is shipped with Postgres, and it's called pgoutput. So just to give you an example of how Postgres handles this and how important it is at the end of the day to be able to have these change logs in an intelligible format, in a format where consumers, where applications can actually use it, right. And any discussion of change data capture is incomplete without, not exactly quoting, but at least remembering Martin Kleppmann. It's like turning the database inside out, so to speak, right. So that's where the essence of change data capture is.

Abhishek Gupta:

And if you were to think about how these are typically implemented, right, so you could think about database triggers, for example, to implement a CDC-like solution, which is sort of similar but not exactly the same. So database triggers could do something like that. But typically these write-ahead logs or commit logs are what are used to sort of implement and realize these change data capture scenarios end-to-end, from an end-to-end perspective.

Tim Berglund:

Right. So talk to us about tooling. There are ways to get this done. There are commercial ways to get this done or open-source ways to get this done, and the databases, it sounds like, are making it easier and easier. Like, I actually didn't know that about Postgres and those recent versions, where it'll volunteer to publish its write-ahead log as [dang JSON 00:11:07] messages to [inaudible 00:11:09]. What else can it give you? Right? This is as easy as it could be, at least architecturally. But turning that into a Kafka topic, I'm still going to have to handle some plumbing and do some things, and there are edge cases, so what's that tooling like?

Abhishek Gupta:

Yeah, exactly. So there are a lot of options out there, right. If I were to step back a second and think about where I just left off. So these databases are there. They're making it easier for us to tap into those change events, right, which are happening in real time within the database. Now, what actually makes this useful or powerful is it really depends upon how we are able to leverage it, right. So internally these change data capture mechanisms actually form a foundation for functionalities like recovery and replication, so this is used in MongoDB. I think in Postgres as well. But they sort of form the bedrock, right.

Tim Berglund:

Internally. You mean internally inside a given database, like Mongo.

Abhishek Gupta:

Exactly.

Tim Berglund:

It sort of goes to an internal CDC where it can do the replication to another [crosstalk 00:12:21]

Abhishek Gupta:

Exactly. Exactly.

Tim Berglund:

Yeah, even from leader to follower, I guess. That's what that would be. You're shipping a log from the place where the mutations land to a place where you would like those mutations to be readable later on.

Abhishek Gupta:

Precisely, yes. So if you're talking about MongoDB, the primary node is capturing all these change events in its oplog, and the secondary nodes or the followers are simply reading off of that log and sort of replicating that data within themselves. Yes, exactly how you put it. And like I said, this is for the functionality within the database itself. Now, if you think about application-level concerns, right, typical use cases of this are to be able to create materialized views or transfer data into external systems, refresh caches, build data pipelines, so on and so forth, right.
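
As a rough illustration of the materialized view and cache use cases, the sketch below replays a stream of change events into an in-memory dictionary. The event shape (`op`, `key`, `row`) is hypothetical and chosen for readability; real CDC tools each define their own envelope.

```python
def apply_change(view, event):
    """Apply one change event to an in-memory view keyed by primary key."""
    if event["op"] == "delete":
        view.pop(event["key"], None)
    else:  # insert or update: keep only the latest row
        view[event["key"]] = event["row"]
    return view

# Hypothetical change events, in the order the source database logged them.
events = [
    {"op": "insert", "key": 1, "row": {"name": "widget", "price": 10}},
    {"op": "insert", "key": 2, "row": {"name": "gadget", "price": 25}},
    {"op": "update", "key": 1, "row": {"name": "widget", "price": 12}},
    {"op": "delete", "key": 2},
]

cache = {}
for event in events:
    apply_change(cache, event)

print(cache)
```

Because the events arrive in log order, the view converges on the same state as the source table without ever querying it.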

Abhishek Gupta:

So going back to the point of it is only powerful if you know how to leverage it, right, and if we were to have the capability, and we have the capability of course, to stream these change events into Kafka, then literally the possibilities are endless, because now once we have all these change events within Kafka topics, we can reap all the benefits of Kafka, right, as a scalable platform and its entire ecosystem, including Kafka Streams, Kafka Connect, ksqlDB, right.

Abhishek Gupta:

So essentially, going back to the point which you mentioned, right, at the core of many software pieces is the log, right. So of course I want to mention Jay Kreps and I Heart Logs and how it forms the fundamental... Kafka itself is fundamentally a write-ahead log itself, right, conceptually speaking. So yeah, it's essentially, CDC and change data capture and Kafka, they're like closely related conceptually, and if you combine them it ends up being a powerful solution.

Abhishek Gupta:

Now, to your point of solutions in general, I tend to work a lot with a project called Debezium, which is an open-source platform that sort of helps abstract all this complexity of change data capture and how each individual database actually handles it. Now, at its core, Debezium, you can simply think of it as a bunch of Kafka Connect connectors, right, and each of these connectors is dedicated to handling a single database. So Debezium has support for PostgreSQL, MySQL, SQL Server, MongoDB, and I think Cassandra as well, which I think is a work in progress. It's in an incubating state. And lots of other databases as well.
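
Since Debezium is described here as a set of Kafka Connect connectors, using it mostly comes down to configuration. The sketch below builds a connector definition for Debezium's PostgreSQL connector. The property names follow Debezium's documented configuration as best I recall it, so verify them against the release you run; every value (host, credentials, table names) is a placeholder.

```python
import json

def debezium_postgres_connector(name, host, dbname, user, password, tables):
    """Build a Kafka Connect connector definition for Debezium's Postgres source."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",   # the output plugin that ships with Postgres
            "database.hostname": host,
            "database.port": "5432",
            "database.user": user,
            "database.password": password,
            "database.dbname": dbname,
            # Prefix for the topic names; older Debezium releases use
            # "database.server.name" for this instead.
            "topic.prefix": name,
            "table.include.list": ",".join(tables),
        },
    }

# All values below are placeholders.
definition = debezium_postgres_connector(
    name="inventory", host="localhost", dbname="inventory",
    user="postgres", password="change-me", tables=["public.orders"],
)
print(json.dumps(definition, indent=2))
```

The resulting JSON is the kind of document you would POST to a Kafka Connect worker's `/connectors` REST endpoint to start the connector.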

Abhishek Gupta:

So, yeah, Debezium is this key technology which I have had exposure with mostly and I've worked with. There are other solutions, of course, of change data capture. If you think from an Oracle ecosystem perspective, there's GoldenGate and there are other solutions as well. So change data capture, yes, it's a rich ecosystem and Debezium happens to be in the center of it if you think about open-source, credible open-source community-driven solutions.

Tim Berglund:

And most definitely, just from a community conversation standpoint, when people talk about CDC it's like Debezium is in the same breath. And as you say there are some solutions which are not open-source, like GoldenGate, which is, I don't know what the opposite of open-source is, but I know that's it. And that's a thing in the Oracle world where that's what you're going to do, but outside the Oracle world, usually the conversation starts with Debezium. And there are other technologies. I mean, there are people who are Confluent partners who offer commercial solutions with their own differentiation, so I should just give that notionally some love. But usually we're talking about Debezium.

Abhishek Gupta:

Yeah.

Tim Berglund:

And I like your definition of Debezium as a set of Kafka connectors because I remember when I first encountered it. It wasn't clear to me. I'm like, "What is the thing? Is there like a server process, or what?" I always need to know physically what the thing is. Is it an API? Is it a program I run? Is it a set of concepts? But it's really a set of connectors, and they're connectors that know specifically how to have the database call them in that, "Don't call us, we'll call you" sort of paradigm you laid out.

Tim Berglund:

Because they all have, and I just love when you went through the different names that different databases call their write-ahead log or their commit log, also they all have different APIs for that, and they are at times radically different from one another in the way they work. So Debezium is a nice project that takes a collection of the usual database suspects and implements that API for you, like a connector does, right. The work of a connector is talk to the API that's not Kafka, translate stuff into messages and produce, or consume, translate stuff into the outside format, and then talk to the external API, and that's just what it does.

Abhishek Gupta:

Exactly. Yeah. And I just want to mention one more point here, and this is something which I was sort of confused with initially when I started and when I looked at Debezium in the sense that how it differs from the JDBC world, or strictly speaking, the JDBC Kafka Connect connector, right. So that's actually a question which folks have, I used to have, and people have asked me as well, right. And of course it's relatively simple to understand now, but to put it in context, you know, just to be clear.

Abhishek Gupta:

So the JDBC connector has this query-based approach, right, so it calls the database. It polls the database at a particular frequency, as opposed to Debezium, which is purely based on change data capture, which again, like I said, makes the database call your application, well, in this case it's Debezium, and then it ships these events to Kafka. So that's I think an important distinction to have, especially for beginners if you're just starting out on trying to understand and make sense of these things and this ecosystem in general.

Tim Berglund:

Yeah. And I guess the advantage of the JDBC connector, which, if you're uninitiated, is another Kafka Connect connector that lets you get stuff out of databases and put stuff into databases from Kafka, so there's a source and a sink version, is that it uses JDBC to be the normalizing layer, since every database has got a different wire protocol and blah, blah, blah. JDBC does that and has been doing that for decades. And so you can have one connector with one implementation and let JDBC sort everything out.

Tim Berglund:

Now, there are some trade-offs for that. So maybe, Abhishek, tell us why it's easy and it's kind of ubiquitous and everybody knows JDBC and how to make that connection work. Debezium has maybe some extra stuff to set up. But what am I losing by using the JDBC connector that I gain through CDC proper?

Abhishek Gupta:

Yeah, good question. So like you said, one of the pros is, at least compared to CDC, JDBC is easier to set up, configure and probably administer in the long term. But what you sort of lose out if you were to use it as... You know, one of the things is that you most likely have to fiddle with your data model, right, so add something like a timestamp column so that the connector can compare it and figure out where it is, which records have been processed and which ones it should process in the next iteration and so on and so forth, right. So it needs that column. Mostly it's timestamp, so you have to fiddle with your data model. And all this constant querying, right, it does create a load on your database, and of course the DBAs are not going to be happy if that happens.

Abhishek Gupta:

But on the other hand, if you were to try to please them by, say for example, reducing your polling frequency and thereby reducing the load on your database, then you do have this risk of missing updates, right. So if you poll say every 10 seconds, there are chances that you will miss some. There are chances there, right. And typically with JDBC, I have seen instances where handling deletion is sort of tricky. So these are some of the cons which I have seen in the wild and experienced myself as well. I'm sure there are a few others, and pros as well. But, yeah, there is a balance and you have to think about when to use the CDC approach versus JDBC.
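
The missed-update and missed-delete risks are easy to reproduce with a small simulation. The sketch below uses an in-memory SQLite table and a hypothetical `poll` function that mimics query-based capture with a timestamp column. It is not the JDBC connector's actual implementation, and the integer timestamps are simplified stand-ins for real ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)")

def poll(last_seen):
    """Query-based capture: fetch rows changed since the last poll."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_last_seen = max([last_seen] + [r[2] for r in rows])
    return rows, new_last_seen

# Between two polls, order 1 changes twice and order 2 is created and deleted.
conn.execute("INSERT INTO orders VALUES (1, 'NEW', 1)")
conn.execute("UPDATE orders SET status = 'PAID', updated_at = 2 WHERE id = 1")
conn.execute("UPDATE orders SET status = 'SHIPPED', updated_at = 3 WHERE id = 1")
conn.execute("INSERT INTO orders VALUES (2, 'NEW', 4)")
conn.execute("DELETE FROM orders WHERE id = 2")

rows, last_seen = poll(last_seen=0)
print(rows)       # only the final state of order 1
print(last_seen)
```

The poll returns a single row. The NEW and PAID states of order 1 and the entire lifetime of order 2 are invisible to it, whereas log-based capture would have seen all five changes.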

Tim Berglund:

Right. And sometimes missing those updates matters, sometimes it doesn't, right. You just have to be aware of the way it behaves. And it might be completely fatal if you were to miss those updates. You might say, "Well, no, that's okay if things change and I just get an update every five seconds. That's fine." And plus I just need to say, you said that the DBAs won't be happy, and I thought, "Well, but Abhishek," but then you repeat yourself. Are the DBAs ever really going to be happy with you, a developer? They're probably not. And honestly I think if we were to look at the situation openly and non-defensively, we'd have to acknowledge that we've given them reasons not to be happy with us. They have jobs to do, and I don't think we've really been good partners with them, so anyway, which is not going to stop me from making DBA jokes. I shall go on making DBA jokes and scholar jokes until I'm pulled off the air. So anyway. There's also deletes, right. I don't know if you said deletes. You can't see deletes with a JDBC connector-

Abhishek Gupta:

Yeah.

Tim Berglund:

... unless you separately maintain the entire state of the database, like that's the logical requirement. And if you could do that, well, you wouldn't be using Connect, would you?

Abhishek Gupta:

Yeah.

Tim Berglund:

So that's another key limitation.

Abhishek Gupta:

It's a tricky one, like I said. It is possible but then you'll have to work your way around it. So it might defeat the purpose at the end of the day.

Tim Berglund:

Exactly, it would defeat it soundly, I think. So take us into the cloud. I mean, that's Debezium. How about connectors that are related to Azure. You work with Azure, so I usually say this. I don't want you to be afraid of being a shill or something like that. It's a cloud service and it's a cloud service you know a lot about and that you represent. So, I mean, tell us about what we can do between Kafka and the other Azure services. That's important stuff to know.

Abhishek Gupta:

Yeah, absolutely. So connectors are the workhorse of Kafka Connect, right, these source and sink connectors. There are a lot of connectors obviously out there. Kafka and Kafka Connect, the ecosystem is so rich. But I definitely want to highlight a few for Azure, and specifically the ones which are available on both Confluent Hub and Confluent Cloud, right, because Confluent Hub is sort of, from my point of view at least, the de facto place for this collection of trustworthy and solid and well-implemented connectors.

Tim Berglund:

Absolutely. Please go on. Keep saying this. This is good.

Abhishek Gupta:

No, and if you think about the Azure ecosystem for connectors, I was actually pleasantly surprised. I recently took a look at it and Confluent, I think it's Confluent Hub, it also includes a connector for Azure Synapse Analytics, which is, I wouldn't say it's very new, but it's like a relatively newish offering. So there is a sink connector for Synapse Analytics, like I mentioned. The other data ecosystem, like the ADLS, the Azure Data Lake, is also supported as a sink connector. There is a connector for Blob storage, and for Service Bus and Azure Event Hubs as well. Both of these are source connectors. And I think I remember seeing an Azure Functions connector as well, so essentially, and I think it is in preview right now. So essentially you could invoke your serverless functions, Azure Functions, off of your data in topics in Confluent Cloud.

Tim Berglund:

As a sink? As a sink connector?

Abhishek Gupta:

That's right.

Tim Berglund:

You see, this is why I am a podcast host, so I can learn how the product works. That's cool. I did not know that was available, is what I'm saying. But that's very cool because there is this extremely obvious, the word that's coming to mind is, synergy, and I'm trying not to say synergy. I must have just been in a planning meeting before we recorded this, extremely obvious happy confluence of two good things between an event streaming system. You know, you've got stuff in a topic and serverless functions, right. Like yes, you're going to want to execute serverless functions on messages in the topic. That's just going to happen. And it's chocolate and peanut butter. It's peaches and cream. It's better when they're together.

Abhishek Gupta:

Exactly.

Tim Berglund:

So that's nice to know that that's there. And is that in Confluent Hub, you said?

Abhishek Gupta:

I think this is, if my memory serves me right, this is in preview and it is both in Hub-

Tim Berglund:

In preview in Cloud? Hey.

Abhishek Gupta:

Yeah. Yeah, it is, both on Hub and Confluent Cloud. But yeah, it is in either or both of these places. Yeah. Either way it's a good thing. [crosstalk 00:26:55]

Abhishek Gupta:

... are going the right way from preview to [crosstalk 00:26:57]

Tim Berglund:

Cloud project managers, Confluent, I apologize. I just should have known that. It's because you're shipping so much dang stuff. But no, that again is very cool, because that's like an extremely obvious thing.

Abhishek Gupta:

Yeah.

Tim Berglund:

You're going to want to use your cloud provider's serverless functions mechanism to process data.

Abhishek Gupta:

Absolutely. And if you think about these other connectors also, which I mentioned, right. For example, Event Hubs, so it sort of gives you a lot of permutations and combinations. You might think, hey, you know what, there is some overlap between, say for example, Confluent Cloud, again, managed Kafka, and Event Hubs, Azure Event Hubs, which also offers a Kafka endpoint. But if you think about use cases and scenarios, there are overlaps, but then there are also advantages to be reaped by having these options available to you at the end of the day.

Tim Berglund:

Yeah. Tell us about Azure Data Explorer. What's that all about?

Abhishek Gupta:

Yeah, so of late I have been doing a fair amount of work with Data Explorer, and, for those who might not know, Data Explorer is actually this fully managed big data analytics service. It's kind of similar to or, yeah, and I'll go into that. So, yeah, it's a PaaS at the end of the day, right. It's a fully managed analytics service, and some of the key aspects of Data Explorer are that it allows you to ingest all kinds of data, unstructured, structured, Avro, JSON, CSV, text, Parquet, so on and so forth. And it has the ability to do so from a variety of input sources, and of course Kafka is obviously one of them.

Abhishek Gupta:

And when you have this data within Data Explorer, you can run your Data Analytics, your Data Exploration jobs, using something called KQL. So I should mention that Azure Data Explorer is also sometimes referred to as Kusto, K-U-S-T-O, and the query language, it's called KQL. It's Kusto Query Language. And I'll be honest with you, when I first experienced or started working with Data Explorer and KQL, to be specific, I was blown away. And I'm not just saying that because I work for Microsoft and Azure.

Tim Berglund:

Right.

Abhishek Gupta:

But literally, I was blown away. When I was working with this KQL in the inline editor on the Azure portal, first of all, it's very intuitive. And second, the autocomplete features, it's so simple. Imagine you've just started working, or you're new to some query language, and imagine how difficult it is, right. But the fact that it is so intuitive and all this autocompletion and all these fancy features, which are available to you right there, make it a joy to work with, like really trust me. And so you can obviously use the query language. You can apply some machine learning. I'm not an expert there but you can do that and sort of analyze data, visualize it in dashboards, so on and so forth, right.

Abhishek Gupta:

Yeah, and actually customers are using this at huge, huge scales. We're shipping terabytes of data to Data Explorer per day. And in simple terms, if you look at it hard enough, Data Explorer is actually like an append-only immutable database. Using KQL, for example, you cannot mutate the data. The only way you can get data into Data Explorer is through the ingestion capabilities, and the ingestion part itself is actually quite extensive. Like I said, it supports a variety of sources, including Kafka. But then of course it has very rich integration within Azure itself, as is expected, right. So from an ingestion standpoint you can pull in data from native Azure services like Event Hubs, Event Grid, IoT Hub, right, and it has SDKs to let you do the same things in Java, Python, Go, REST APIs, so on and so forth.

Abhishek Gupta:

And not to forget these connectors which it has, so it has one for Spark. I think one for Logstash as well. And of course, not to forget the Kafka Connect connector. So it has a Kusto sink connector, so it essentially picks up data from Kafka topics and sort of pushes them into Azure Data Explorer, so yeah, it's a pretty solid ecosystem if you look at it from an end-to-end, right, from ingesting data from so many sources and being able to have these huge amounts of data and having the ability to be able to run your exploration and analytics using, again, KQL or other mechanisms. So yeah, end-to-end pretty [crosstalk 00:32:10]

Tim Berglund:

There's just no substitute for being able to look at stuff, right, particularly when you've got multiple places, multiple potentially systems of record for different kinds of data in the system, or just multiple places the data might be at any given time in the execution of a process or a saga or however you look at things. There're just places data can be and you need to see it. That does sound really, really nice to have one UI, one query language, KQL, not to be confused with KSQL. KQL, to be able to project into the view that you want and group and aggregate and just do the real SQL-y kinds of things.

Abhishek Gupta:

Yeah.

Tim Berglund:

That sounds pretty cool. And there's a sink connector to get from Kafka into Kusto or Data Explorer.

Abhishek Gupta:

That's right. And I shouldn't forget, and that's because you mentioned it, right, Azure Data Explorer actually forms the foundation of many other Azure services, right, which push their telemetry log data. It's used externally of course by customers, but it is used by many, many products within Azure as well. So just thought I should mention that. And thanks for reminding me that indirectly. It sort of popped into my head.

Tim Berglund:

Good. No, that sounds fantastic. I want to play with that at some point.

Abhishek Gupta:

Yeah.

Tim Berglund:

Tell me about some things that you see people doing. So we've talked about CDC and Connect and visualization. What do you see, and you're a developer advocate, you're not a sales engineer, but in our line of work we still see customers, particularly if you primarily work with a cloud service. Everybody is a customer. So what do you see people doing? What are some use cases?

Abhishek Gupta:

A couple of them really stand out to me. Of course there are a lot of use cases and a lot of ways to slice and dice them. But at least, like let's focus on Azure Data Explorer for a bit. Typically the analytics and the on-prem to cloud integration are the two big buckets of use cases which I see. So, like I mentioned, customers are using Data Explorer to pump in huge amounts of data, like I said, in the order of terabytes per day. And these are infrastructure logs, application logs, device telemetry data, IoT data, you name it, right. And the Kusto sink connector, which I just mentioned, sort of forms the foundation for that.

Abhishek Gupta:

Yeah, so one bucket of use cases is definitely these analytics types of scenarios. And the other one, like I mentioned, is sort of this bridge to cloud, right, connecting the on-prem environment to the cloud. And one of the key ones which comes to mind is connecting your on-premises Kafka clusters to the cloud. And when I say cloud I mean specifically Azure Event Hubs, because Event Hubs sort of acts as this window to Azure for a lot of use cases, right. So it's this large-scale big data ingestion platform. And how the connection between on-prem Kafka clusters and Event Hubs works is through MirrorMaker, which sits in the middle.

Abhishek Gupta:

And I alluded to this earlier as well. Once the data is within Event Hubs, within Azure, it opens up a lot of possibilities, a lot of opportunities, right, because Event Hubs integrates with a lot of services in the Azure ecosystem, including Azure Data Explorer, right. So one way to think about this is that you have your on-prem data. It could be your on-prem systems, you want to monitor them, or it could be, again, like I said, application logs, so on and so forth. You use MirrorMaker, pump the data into Event Hubs, and from Event Hubs you would use that ingestion capability within Azure Data Explorer or native integration within services like Azure Stream Analytics or Blob storage and so on and so forth.

Abhishek Gupta:

Like I said, the possibilities are endless. And then you could do pretty much anything you would want to. So this is also a pretty common use case which I have seen being implemented at relatively large scales. And I should not betray my original bread and butter background. I admit that, in my experience, I have seen relatively fewer of the typical bread and butter active microservices-oriented use cases, but they are, I'm pretty sure, pretty common. And change data capture also happens to be a very integral part of it, so if you were to build architectures based on event sourcing and CQRS, Command Query Responsibility Segregation, and using these concepts, right, Domain-Driven Design. I have not directly, like I said, been a part of these, but I have had some experience and looked at use cases where customers are trying to build on these concepts and connect them to change data capture, and of course Kafka is at the center of all these.

Tim Berglund:

Yes. You mentioned Azure Event Hubs. How should we think about that vis-a-vis Kafka? How is it like Kafka? How is it different? Why are both of them a part of your world?

Abhishek Gupta:

Yeah, so interesting question, and I tend to get that a lot. So Event Hubs, like I said, is this window to Azure so to speak, right, a big data ingestion service. But it has two protocols, and one of them is its native one, sort of internally. It's based on AMQP, which is obviously abstracted away through SDKs and client libraries. And the other one is the Kafka protocol, right. So when I say Kafka protocol, essentially there are no clusters which you're looking at. You're not setting up anything. You can simply take your existing Kafka clients and other Kafka ecosystem technologies, for example, Kafka Connect or MirrorMaker for that matter, right, point them to Event Hubs, make sure you have the Kafka protocol enabled and boom. That's pretty much it, right. So there's this dual protocol support in Event Hubs and that naturally makes it a huge part of what I do in my day-to-day work as an advocate.
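As a rough illustration of "point them to Event Hubs," the sketch below builds the client settings a Kafka producer or consumer would typically need for an Event Hubs Kafka endpoint. It is a minimal sketch under stated assumptions, not something discussed in the episode: the helper name `event_hubs_kafka_config`, the namespace `my-namespace`, and the connection string are hypothetical placeholders, and the property keys follow the librdkafka naming used by clients such as `confluent-kafka`. Check the current Azure documentation before relying on any of these values.

```python
def event_hubs_kafka_config(namespace: str, connection_string: str) -> dict:
    """Return librdkafka-style settings for an Event Hubs Kafka endpoint.

    Hypothetical helper for illustration only; nothing here is created
    or validated against a real Azure resource.
    """
    return {
        # The Kafka endpoint of an Event Hubs namespace listens on port 9093.
        "bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        # Traffic is encrypted with TLS and authenticated with SASL PLAIN.
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        # The username is the literal string "$ConnectionString";
        # the password is the namespace connection string itself.
        "sasl.username": "$ConnectionString",
        "sasl.password": connection_string,
    }


# Placeholder values (assumptions); substitute your own namespace and secret.
config = event_hubs_kafka_config(
    "my-namespace",
    "Endpoint=sb://my-namespace.servicebus.windows.net/;SharedAccessKeyName=<name>;SharedAccessKey=<key>",
)
```

With the `confluent-kafka` package installed, a dictionary like this could be passed to `confluent_kafka.Producer(config)`; the rest of the producer code would stay the same as it is against a regular Kafka cluster, with an event hub taking the place of a topic.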

Abhishek Gupta:

But when I mention the Kafka protocol, I do have to be honest about it, that it is still a work in progress, right. There are things which are supported right now and there are things which are works in progress, on the roadmap, and so on and so forth, right. For example, if you were to try and use Kafka Streams, you were to say, "Hey, I have Kafka protocol enabled with Azure Event Hubs. Let me try and point my Kafka Streams application." Right now it won't work because that's not supported. At the end of the day it's a protocol layer, right. But still customers are using it. Your Kafka consumers, Kafka producers, Connect, MirrorMaker, all the traditional Kafka clients are going to work, but yes, there are some things which are still works in progress. Just for the sake of public transparency there.

Tim Berglund:

Sure. Sure, that makes sense. What are you excited about going forward? What's in your immediate future that you're able to talk about?

Abhishek Gupta:

So just to clarify, by the immediate future you meant...? I'm just trying to understand your question better. Sorry.

Tim Berglund:

Oh, right, right. I don't mean today, but I mean like coming up in the next few months. What are some things, within your technology horizon, that you might get to play with that are exciting to you?

Abhishek Gupta:

I have been working with the... It's weird, right. People talk a lot about Kafka on Kubernetes, and I have also worked on it a bit, but Kafka Connect on Kubernetes does not get that much love, or at least in my experience. So that is something which I have been working on for some time, and even in the near future I'm going to dig into this more. This area, I think, is pretty rich. And, like I said, Kafka Connect on Kubernetes probably doesn't get that much attention or love, maybe because people think it's easy. Kafka Connect is [inaudible 00:41:02] process. It's essentially stateless. But there are nuances to it and this area is something which I'm really interested in and excited about in general. And, like I said, I've been working on it and I'll continue to sort of dig into this more, both in terms of exploring the integration with Azure as well as some of the open-source components.

Tim Berglund:

My guest today has been Abhishek Gupta. Abhishek, thanks for being a part of Streaming Audio.

Abhishek Gupta:

Thank you so much, Tim. Thanks for having me. It was a pleasure chatting with you today.

Tim Berglund:

Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST, that's 6-0-P-D-C-A-S-T, to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and to use it within 90 days after activation. And any unused promo value on the expiration date will be forfeit, and there are a limited number of codes available, so don't miss out. Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me @tlberglund on Twitter. That's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out in our community Slack. There's a Slack signup link in the show notes if you'd like to join. And while you're at it, please subscribe to our YouTube channel and to this podcast wherever fine podcasts are sold. And if you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support and we'll see you next time.

What’s it like being a Microsoft Azure Cloud advocate working with Apache Kafka® and change data capture (CDC) solutions? Abhishek Gupta would know! At Microsoft, Abhishek focuses his time on Kafka, databases, Kubernetes, and open source projects. His experience in a wide variety of roles ranging from engineering, consulting, and product management for developer-focused products has positioned him well for developer advocacy, where he is now.

Switching gears, Abhishek proceeds to break down the concept of CDC starting off with some of the core concepts such as "commit logs." Abhishek then explains how CDC can turn data around when you compare it to the traditional way of querying the database to access data—you don't call the database; it calls you. He then goes on to discuss Debezium, which is an open source change data capture solution for Kafka. He also covers Azure connectors, Azure Data Explorer, and use cases powered by the Azure Data Explorer Sink Connector for Kafka.
