Most of the time when you want to do an integration with Kafka Connect, the connector you need already exists, but sometimes it doesn't. Today, I talked to Microsoft's Ryan CrawCour about building a connector for Cosmos DB, a multi-model distributed database available in Azure. We talk about what Cosmos DB is all about and some cool things that came up in building that connector, all on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.
Hello, and welcome to another episode of Streaming Audio. I am your host once again, Tim Berglund. And I'm joined in the virtual studio today by Ryan CrawCour. Ryan is a customer engineer at Microsoft, and he's going to talk to us about Cosmos DB, an Azure service, and Kafka Connect. Ryan, welcome to the show.
Thanks Tim. Good to be on the show. Long time listener, first time speaker.
Love it, love it. Well, Ryan, thanks for being a listener. Tell us about yourself, actually, before we get into Cosmos and all that.
Sure.
What does a customer engineer do at Azure anyway?
Great question. I've been an engineer at Microsoft for 10 plus years, and I've worked behind the scenes in product engineering teams. And we have a section in Microsoft called Customer Engineering, where, as engineers, we work closely with our partners and our customers to build solutions to their problems, right? So potentially our first party services don't have the exact capabilities that they need.
So we'll build a bridge in between that enables a much easier adoption of our platforms and services for them, or we'll work with them and co-engineer a solution that works best in our cloud, based on the kind of experience that we've had with other large customers. That's generally what we do: we're engineers who work for customers and with customers, which I guess is the most important part.
And is that the kind of work you do, is it usually infrastructure oriented, or do you do full on app development?
Well, so we've got different parts of our stuff. And I guess, with a lot more infrastructure being code these days with infrastructure as code, we do get involved a lot in setting up infrastructure, and how to create a cloud solution that is secure, and scalable, and flexible enough, and how you reuse resources in your subscriptions and those kinds of things.
We do get a little bit involved with that, but we're typically much more involved in the specific solution. It's the data analytics side of things, or it's messaging solutions, or it's integration type scenarios, or digital features that we may need. It's a bit of both, but I would say it's more solution development for our customers.
Cool, cool. And there is always that to be done. If you're a cloud provider, fundamentally you're an infrastructure provider. Here are the pieces, build a thing out of them.
Oh, exactly. Here's some hardware running in our data center, do something with it, right? And I guess my job is to help the customers do more with it.
Right. Even if what you offer is a database, that could be a high level of abstraction like that. Or we're Confluent, we have Confluent Cloud, here's Kafka in the sky. You still have things to build on top of that, that's still a-
Oh, you have all sorts of stuff to build on top of that. And you need to make sure that... And it could be just a simple thing of this particular customer likes to develop in Rust. Oh, look, there are no Rust Azure service SDKs, right? All right, so let's sit down and work with the customer and build out an SDK in Rust potentially, right? Or whatever the case may be.
But yeah, there is always stuff to kind of work with because every single customer is unique, and every customer does something slightly differently. And sometimes that's their competitive edge, and their differentiator, and you can't expect every customer to snap to exactly the same solution. We take the approach of meet the customer where they're at, and work from there. And that's what my team does.
Love it. Now we're going to talk about Connect, and the process of integrating Kafka and Cosmos DB. But I was thinking, I actually am perfectly willing to admit I don't know Azure Cosmos DB all that well, and I bet I'm not entirely alone in the audience. So tell us about that thing.
Cosmos DB is an exciting product. And before I joined this team, I was actually part of the product engineering team that took Cosmos DB live as a product, right?
You know it fairly well.
I know it pretty well, but back in the day, we had a number of internal Microsoft products, or services, or areas that came to the data team and said, "Hey, we're looking for a database that can do the following things, right?" And we sat down and we looked, and we were like, "Well-
There isn't one.
"There isn't one." And we said luckily we've got an R&D team, and we threw it over the wall to the R&D team, and said, "Can you help us?" And they came up with some really interesting things to solve some pretty awesome challenges. And that was the start of a, of a service, right? Cosmos DB, it's a NoSQL database, it's designed from the ground up to be cloud.
We didn't take some box product and kind of Cloudify it, it was built from the ground up to be a multi-region, distributed cloud database with super low latencies in terms of reads and writes. And I think we're still one of the only cloud database services that provide an SLA on actual performance, right? So we'll give you an SLA that says, at some percentile, I can't remember the numbers offhand, but X percent of your reads will always be within five milliseconds, and X percent of your writes will always be within two milliseconds, or whatever those numbers are. And so we actually guarantee that at an SLA level, which is pretty unique I think.
It's always strange to hear about some extremely popular, ubiquitous cloud data service: there's no SLA on that.
Exactly. I mean, not only is there an uptime SLA, I mean, uptime SLA is great, but there's actually a performance SLA, which we think is really huge-
It's not typical.
It was ridiculously difficult. And I think that's where it comes back to the engineering at the start. It was really engineered from the start to be that, right? It was designed to be this incredibly cool, responsive, low latency kind of service. And it's super easy to... It runs in every single Azure data center that we have, because it's one of our ring zero services, and by ring zero service we mean, many other services rely on it, so it has to be there for other services to exist and operate.
It's in every single Azure region, super easy to switch on multi-region reads, super easy to switch on multi-region writes. You just go to the portal, or your CLI, or the command line, or whatever you want, and enable it through that. And within a couple of seconds, you have your database in multiple regions and you can even write to multiple different regions, and it'll sort out conflicts, and replications, and stuff for you. So it's a really, really super cool database. And one of the other cool things that we did with it is we started out as a core JSON database, right? So JSON documents, same as I was-
I was going to ask, data model API.
Data model, so we started out as a JSON NoSQL database. And then we thought, well, what we're doing under the covers in terms of how we're storing the data, and how we're modeling, and how we're replicating it, and how we're keeping the high availability and the uptime, it kind of makes sense to have other database models on the same platform, right?
We've actually added on top of Cosmos DB, so it's the same core engine, but we can now run a Gremlin graph database. So you can talk to Cosmos with Gremlin, you can talk to Cosmos with a Mongo API, you can talk to Cosmos with a Cassandra API, you can talk to Cosmos with a key-value API. In terms of the database, there's a bunch of APIs that now sit on top of the database, including our core JSON SQL API.
But there's a bunch of these other APIs that sit on top of the core engine. So you get the same multi-region read-write capabilities now across multiple different kinds of models. And you don't have to go figure out the infrastructure under the covers, it's done for you. All you do is click a button and use it.
Right, right. And use the API of your choosing. Pretty cool.
Yeah, it's pretty good.
But there is an underlying representation of the data that at some level you can peel some onion layers back and you can get to that. And there's-
You can peel a couple of layers back, and I mean, the data is all stored in a common data model. And for the most part, if we talk about the core of Cosmos DB, it's stored as JSON documents.
Gotcha, because that's how it started.
Yeah, that's how it started.
Good, good, because I'm sitting here thinking, wow, we have 20 minutes left to talk. How are we going to talk about the connector? Because that sounds hard with all those models, but the connector, I'm assuming, picks one model. There's like one API to get to "Real Cosmos data." And you go through-
Oh, yeah, I mean, and the nice thing with this, and the way the multi-models work is that, if you're using the Mongo API against Cosmos DB, right? We don't have to build a connector for it, because guess what? Mongo has built a MongoDB connector.
Fair enough, there is one.
If you're using the Mongo API, use the Mongo Connector, right? Use your Mongo SDKs, use your Mongo kind of drivers, use your Mongo tooling, and your Mongo ecosystem. That's why we built it, right? It bridges that kind of gap, and you don't have to then shift all your code away from your Mongo driver to a Cosmos DB specific driver. You can just use a different API. The same with Cassandra, I mean, if you're using the Cassandra API on top of Cosmos DB, and you want to pump data from Kafka into Cassandra, use the Cassandra Connector, right?
We built the SQL core connector for our core API, and the reason we call it SQL, it's not because it represents a relational database system. It's not SQL Server, right? It's just the representation of the query language that you use to talk to your JSON documents, right? Because all SQL really is is a structured query language.
So select from some container where some condition is true, order by something, right? And we take that typical SQL syntax, and we will then use it to kind of go and query our back end store. Other JSON databases have a very domain-specific language that you use when you kind of talk to them. We thought people are fairly comfortable with SQL, so why not give them a SQL way to talk to their JSON database?
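For reference, a query against the core SQL API looks roughly like this; the container alias c and the property names here are just illustrative, not from any real schema:

SELECT c.id, c.customerName, c.total
FROM c
WHERE c.status = "shipped"
ORDER BY c.orderDate DESC

It reads like familiar SQL, but it addresses JSON documents in a container rather than rows in a relational table.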
Which is not how most of the JSON databases do it, but sounds downright pleasant.
Yeah, if you're familiar with SQL, it is certainly very pleasant. If you're not familiar with SQL, well, it's not that hard to learn.
Exactly. And you're soon going to be. If you're not familiar with SQL, then welcome to the profession. We're very glad you're here. And you're about to learn.
That's it, I mean, if you're working with a database at some point in your life, you're probably going to come across SQL some way, right? In some flavor, whether it's Oracle, or MySQL, or Maria, or Postgres, or SQL Server, you're going to come across SQL at some point in your life.
Indeed you are. Okay, that's cool. I actually did not know that about Cosmos DB, and there is a lot of cool stuff there. Let's get onto the connector. So you built a connector as a part of your customer engineering day-to-day. This sounds like precisely the kind of thing that would come up: the connector doesn't exist, the customer needs to get data into or out of Kafka, or both. Here we are.
That's right. I mean, we were working with a customer, and the solution architects went in, and spoke to the customer, and it was great. And the customer had a very long relationship with Kafka, and loved Kafka, and did not want to move away from Kafka at all, but wanted to run in Azure.
I like them already.
They came to us and said, "Well, what do we do?" The customer's got these messages, has an entire ecosystem and tooling built around Kafka, and they absolutely want to keep that. But they've seen a lot of Azure Cosmos DB, so how do we build this integration between them? And we kind of looked at it and we went, "Well, you can write a bunch of custom code, and then you can have a bunch of apps deployed as Kafka readers and writers, and you can do all of that stuff yourself."
And we spoke to the customer and they were like, "We don't really want that, our whole Kafka environment is managed, and we don't now want to have this unmanaged thing that-
Not how you do it.
"We don't want to have to go and employ a bunch of developers, to develop, and maintain code, and worry about versions." They weren't interested. We're like, we've got this cool thing already that we're using called Kafka Connect. We just want to keep using Kafka Connect, because we use it, we love it, it's great, it's low touch for us, it sits up, and works to help us.
We looked, and well, there was no connector for Cosmos DB, right? No fault of anybody, but new database, not our normal wheelhouse. We thought, "Okay, let's go take a look." And luckily there's some good documentation by Confluent online, and you can go and look, and there's a sample connector that gets you started. And we thought, this doesn't look too difficult.
It's not, you look at the API and you're like, "Oh, come on, that's a thing to solve in a day."
That was my thought exactly. I almost said to the customer... It was a Friday, and I almost said, "Look, by Monday, I'll have a connector for you." Because you look at it, and there's one interface with like four methods and another interface with like four methods. And I'm like, "Easy, right?" I'm just going to get [crosstalk 00:15:41].
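For readers who want to see what Ryan is describing, here is a bare-bones sketch of the two pieces a Kafka Connect source plugin implements, a connector class and a task class. The class names and behavior are illustrative only, not the actual Cosmos DB connector code.

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.List;
import java.util.Map;

public class ExampleSourceConnector extends SourceConnector {
    private Map<String, String> props;

    @Override public void start(Map<String, String> props) { this.props = props; }
    @Override public Class<? extends Task> taskClass() { return ExampleSourceTask.class; }
    @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Divide the work (for example, source partitions) across up to maxTasks task configs.
        return List.of(props);
    }
    @Override public void stop() { }
    @Override public ConfigDef config() { return new ConfigDef(); }
    @Override public String version() { return "0.1.0"; }
}

class ExampleSourceTask extends SourceTask {
    @Override public void start(Map<String, String> props) { /* open a connection to the source system */ }
    @Override public List<SourceRecord> poll() {
        // Fetch the next batch of changes and hand them to Connect as SourceRecords;
        // the framework takes care of writing them to the Kafka topic.
        return List.of();
    }
    @Override public void stop() { /* release resources */ }
    @Override public String version() { return "0.1.0"; }
}

The sink side looks similar, with a SinkConnector and a SinkTask whose put method receives records to write to the target system.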
And just the concept, this is why the first play was, why don't we just write this? Who needs a connector? It seems so simple.
And that was the thing, right? We first thought, well, we don't even need Kafka Connect and a connector, but then you start digging into the weeds and you think, "Oh, okay, hang on." It gets a little more difficult, right? You've got partitions, and you've got checkpoints, and you've got offsets, and you've got restartability, and you've got-
Scale.
Guaranteed exactly-once delivery of messages, you've got ordered message delivery, and it's like, "Oh, hang on. Now it's starting to get a little bit complicated, right?"
Right, right.
I didn't really want to go and rebuild a whole bunch of that stuff when there was already a platform that did most of it for us, right? I thought, all right, cool, Kafka Connect. Again foolishly, I thought, "Yeah, I'll knock it out in a weekend." Well, let's say it didn't take a weekend, it took a little bit longer. I had probably a working prototype in about a week, that got a message out of Kafka, and stuck it into Cosmos, and took a change in Cosmos, and stuck it into Kafka. And I thought, "Great, proof of concept proved, it works." The customer saw it, they said, "When can we have it?" And that's where the journey with Kafka Connect started. And it's been quite a journey, but it's been a fun journey.
What got interesting after that first week? Because this is like always the data integration story. It's not just a Kafka Connect story. Anytime you're connecting a thing to another thing, it's usually conceptually simple, JSON documents getting them into messages, or messages into JSON. And you're like, come on. [crosstalk 00:17:27] explain that?
[inaudible 00:17:29].
But the answer is difficult.
I don't think it's difficult, conceptually it's easy. And like I said, I got that first version of the connector out really quickly, but then you start finding those little nuances and those things that trip you up, right? And one of them, for instance, seemed quite simple at the time. We got stuck, and luckily I found Kafka Connect had already had other people stuck on that, and it's got a good solution for it: message formats.
You've got JSON, well, that's great. But you also have JSON with schemas, without schemas. You have binary in terms of Avro, and you have all these other kinds of formats and it's like, "Well, how do I deal with that?" And if I'm writing a connector that can be used by anybody, and Kafka takes any of those kinds of message types, I'm going to need to be able to deal with those message types.
And I thought, "Oh, this is going to be a pain." But luckily the way Kafka Connect is structured when you build these things, there's the connector, there's transforms, and there's converters, right? And so you could actually just go and pull one of the converters, if there's a converter for Avro, or for JSON, with schema or whatever it is. You can go pull one of those converters and use them.
And if there isn't a converter that suits, then there is a model where you can actually go and build custom converters, right? If you've got some funky format that you're using, you can then just go build a custom converter, and plug your connector into that. The way Kafka Connect is structured, with converters, transforms, and connectors, makes your life as a developer so much easier, right? Because you can go in and target the piece that you need.
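As a rough illustration of that extension point, a custom converter implements Kafka Connect's Converter interface, which is just a pair of serialize and deserialize hooks. The class below is a made-up example that treats payloads as UTF-8 strings, not anything shipped with the Cosmos DB connector.

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.storage.Converter;

import java.nio.charset.StandardCharsets;
import java.util.Map;

public class FunkyFormatConverter implements Converter {
    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // Read any converter-specific settings here.
    }

    @Override
    public byte[] fromConnectData(String topic, Schema schema, Object value) {
        // Serialize the Connect-side value into the bytes that go onto the topic.
        return value == null ? null : value.toString().getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        // Deserialize bytes from the topic back into a schema and value for Connect.
        String text = value == null ? null : new String(value, StandardCharsets.UTF_8);
        return new SchemaAndValue(Schema.OPTIONAL_STRING_SCHEMA, text);
    }
}

A worker then selects the converter through the key.converter and value.converter settings, so the connector itself never has to know about wire formats.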
Those are kind of the three chunks of architecture that matter, probably in that order: converter, transform, sorry, connector, transform, converter, thinking from an input source connector coming into Kafka, and in the reverse order for a sink connector. But yeah, those are the three pieces. And by the way, if we're just talking about Kafka Connect, like everybody knows what Kafka Connect is and gee, what's your problem?
If you don't, there's a link in the show notes to an episode, we're going to try to include a link in the show notes to one or more episodes that are some good intros, or just other materials online. If you're brand new and you don't know what Connect is, it's actually very normal for a human being to not know what Kafka Connect is. Until some of the details [crosstalk 00:20:07].
I didn't know what Kafka Connect was, I'd worked with Kafka for many, many years. And I had no idea what Kafka Connect was.
Right, you can visit... So follow the links, we're intentionally-
[inaudible 00:20:15] forget.
We're intentionally not giving all the background on that, but there's links in the show notes for you to check out.
And Robin Moffatt as a spoiler.
And it's probably going to have Robin Moffatt's name on it. That's probably all of them basically.
Absolutely.
Give us an example of something that got interesting, because you said, you deploy it, here's the basic thing. And then you discover, "Oh, wait, it's difficult." What's the thing that tripped you up?
The customer was running it for a while, and kind of came back to us and said, "Well, we're losing messages, or we're getting duplicate messages." And we were like, "Oh, this is not what I want to hear." I mean, Kafka's got its awesome way of managing the checkpoints, and watermarks, and kind of knowing where you are in reading a Kafka topic, right?
And Kafka Connect does a really good job with Kafka of managing those offsets and those kinds of checkpoints, right? So when the connector restarts, it can say, "Hey, I was at this point when I failed, give me all the stuff from this point on, right." And that's awesome. That worked really, really well. But going the other way from Cosmos DB into Kafka, they started missing messages, or they started getting duplicates, and we were like, "Oh, what's going on?"
We dug into it a little bit more, and Cosmos DB has got this awesome concept called, the Change Feed, right? And you can subscribe to the change feed-
That seems handy.
Any change you make in the database, it pumps those changes into the change feed, right? Kind of like change data capture, but it's an actual feed that you can subscribe to, and all the changes to a document just appear in order in that change feed.
That seems like just the thing if one were writing a source connector.
Absolutely. If that wasn't there, I wouldn't have even thought about writing a source connector, right? Because it would be really, really difficult to do that. I'd have to go and figure out what's changed since the [inaudible 00:22:22], and then I would have just built a sink connector. But luckily there is that change feed. So I wired up the source connector on top of the change feed.
Now the change feed's also got its own way of managing checkpoints, right? When you talk to the change feed, you're telling it, this is where I got to, give me all the changes from this point forward, and you can set the checkpoint back, and you can set it forward and all sorts of cool things. Turns out when you've got a system on the left that has checkpoints, and you have a system on the right that has checkpoints, it's probably a good idea to keep those checkpoints in sync with each other, right? Because if they aren't in sync with each other, all sorts of cool things can happen.
And digging through those details, that's kind of what happened. We were getting those checkpoints out of sync with each other, and that's where we were getting skipped messages, or duplicate messages. It took a while to kind of get that right. And the concept of exactly-once delivery when you're on a distributed worker platform as well is really hard.
Cosmos DB being a distributed database, Kafka Connect having multiple workers, and multiple workers trying to kind of read the changes out of your database into a Kafka topic. That distributed nature and having exactly-once kind of delivery gets quite tricky, right? Those messages are delivered, and this worker says, "Oh, that message I'm delivering... oh, now, hang on, that worker has already delivered this message, don't go and deliver it again," kind of thing.
Those semantics of making sure that exactly once worked in a distributed manner, that became a little bit challenging, and that took us a considerable amount of time to engineer, definitely longer than building the initial connector did. It was just figuring out how to get it to do exactly once in a distributed manner. Super hard to do.
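For context on how those checkpoints work on the Connect side: a source task attaches a source partition and offset to every record it emits, and on restart it can read the last committed offset back and resume the upstream feed from there. The sketch below is illustrative only, assuming a hypothetical continuation-token style checkpoint; it is not the actual Cosmos DB connector code.

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.List;
import java.util.Map;

public class ChangeFeedSourceTask extends SourceTask {
    // Identifies which upstream feed partition these offsets belong to.
    private Map<String, String> sourcePartition;
    private String continuationToken = "start";   // hypothetical initial checkpoint

    @Override
    public void start(Map<String, String> props) {
        sourcePartition = Map.of("feedPartition", props.getOrDefault("feed.partition", "0"));
        // On restart, ask Connect for the last committed offset for this partition
        // and resume the upstream feed from there instead of re-reading everything.
        Map<String, Object> lastOffset = context.offsetStorageReader().offset(sourcePartition);
        if (lastOffset != null && lastOffset.get("continuation") != null) {
            continuationToken = (String) lastOffset.get("continuation");
        }
    }

    @Override
    public List<SourceRecord> poll() {
        // Read the next batch from the feed starting at continuationToken (omitted here),
        // update the token, and attach it as the source offset so Connect commits it.
        Map<String, String> sourceOffset = Map.of("continuation", continuationToken);
        SourceRecord record = new SourceRecord(
                sourcePartition, sourceOffset, "my-topic",
                Schema.STRING_SCHEMA, "document-id",
                Schema.STRING_SCHEMA, "{\"example\": true}");
        return List.of(record);
    }

    @Override public void stop() { }
    @Override public String version() { return "0.1.0"; }
}

The hard part Ryan describes is that the upstream feed keeps its own checkpoint too, so the two positions have to advance together; if either side gets ahead of the other, you see duplicates or gaps.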
I don't know if there's time to dig into all of it, but was idempotency key to that? Making writes, from the standpoint of the source connector, making writes idempotent, is that the magic?
Yeah, I think we looked at that, and it's super difficult to do, but idempotency is definitely one of those. I guess the challenge was more when, let's say, the connector failed for some reason. Someone shut down the Kafka Connect cluster, or whatever the case, your connection broke. And you start your connector up, and it was really difficult to kind of know what had been written to Kafka already, and what had changed in the database, and making sure that we were kind of in sync with each other, so we started at the same point.
And that's where the challenge kind of got us. Sometimes the connector would be ahead of, or behind, where it thought it was in Kafka. And then it would start sending duplicate documents, or it would think it was further ahead and it would start at a point further than Kafka, and then we were missing messages, effectively, kind of thing.
Gotcha.
It was more around that kind of resiliency of having distributed multiple workers starting and stopping independently of your database and your Kafka, and then being able to kind of resynchronize those things. There was definitely an idempotency issue, but there was a little bit more as well. And there were some other challenges, when you're looking at a partitioned system like Kafka, and then you're looking at a partitioned system like Cosmos DB, because Cosmos DB is infinitely scalable horizontally, right? By just partitioning out the data, same as Kafka. You can have thousands and thousands of partitions. You have the same thing in Cosmos DB.
And now writing something that's efficiently doing that multi-partition story, and keeping things in order and in sync, that got a little challenging as well. Being able to say, "All right, this worker, you're dedicated to that partition, or this worker, you're working with that partition, or you're working with this set of partitions."
I guess the biggest challenge for us was, at scale and with that distributed nature, making the thing resilient and stable, right? It was really, really simple to get it out with a single topic, with low partitioning, in a small Cosmos DB environment. Getting messages flowing both directions, trivial, it was a weekend's worth of work. Getting it to do that at scale reliably without missing messages, that took a little bit more engineering.
It's always weeks harder than it seems for any development task. It's always funny to me how deceptively simple integration code is. It's exactly the kind of code you should be able to knock out right away, and some unseemly number of hours later, you're feeling comfortable.
80% of the functionality takes 20% of the time, and 20% of the functionality takes 80% of the time.
Right, right. There's definitely an odd sort of Pareto distribution thing happening here.
Well, it took you a week to write the whole connector. Now it's taking you six weeks to kind of just do this one feature, what's going on?
Yes, yes, it is.
Let me explain that to you.
Right. And this is out in the wild and available now. I mean, I know it's on Confluent Hub as an option, so it's being used by others.
It's being used by a good number of our customers at the moment. I'd probably say 10 customers is a good guess, but there are large customers in production. You'll see, if you go to Confluent Hub and you look for the connector, it's there. You can grab it, you can use it, it's either called verified or called certified. It says it's still in preview, which it is. And even though it says it's in preview, it's fully supported by Microsoft. We'll support it, we'll add features, we'll address issues.
That's okay.
And it's also open source, so go to GitHub, search for Kafka Connect, Cosmos DB, and you'll find it. But we can put a link in the show notes to the repo as well.
Absolutely.
If you have any issues, just open issues on the repo, they come to my desk, and we triage them, and we'll deal with them. And of course we've already had contributions from the community, so thanks for that, but it's fully open source. If you see something and you want to contribute, and you want to add features, or you want to change something, you're free to go and submit a pull request, and it, again, lands on our desk and we will review and merge as we need to.
We've already had, just yesterday, I think, someone from Confluent wrote a config validator. Now when you're actually entering your configuration, it does real time validation of the configuration, so you know whether your configuration's valid or not, which was a cool contribution. So thank you.
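That kind of validation builds on the configuration definition a connector exposes. Below is a minimal sketch of that hook; the property names are made up for illustration, not the connector's real settings. Connect can then check a proposed configuration against this definition and report errors before the connector ever starts.

import org.apache.kafka.common.config.ConfigDef;

public class ExampleConnectorConfig {
    public static ConfigDef definition() {
        return new ConfigDef()
                // Required, and must be a non-empty string.
                .define("database.endpoint", ConfigDef.Type.STRING, ConfigDef.NO_DEFAULT_VALUE,
                        new ConfigDef.NonEmptyString(), ConfigDef.Importance.HIGH,
                        "URI of the database account")
                // Required secret; the PASSWORD type keeps the value masked in logs.
                .define("database.key", ConfigDef.Type.PASSWORD,
                        ConfigDef.Importance.HIGH, "Access key for the account")
                // Optional, with an empty default.
                .define("topic.container.map", ConfigDef.Type.STRING, "",
                        ConfigDef.Importance.MEDIUM, "Mapping of Kafka topics to containers");
    }
}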
And maybe it logs descriptive error messages, if it's not, I mean, I'm just hoping.
It does. It logs very nice descriptive error messages now. The good thing about it is, even though it's beta now, we've handed it over to the Cosmos DB engineering team, so it's going to become a fully supported official product from them. You'll be able to open support cases against Microsoft support if you need to. But we also just want to kind of keep it open and part of the community. We love to see contributions, and if you've got issues, or feature requests, or whatever, just go log them in GitHub, and we'll take them from there.
My guest today has been Ryan CrawCour. Ryan, thanks so much for being a part of Streaming Audio.
Thank you very much Tim, for having me on the show.
And there you have it. Hey, you know what you get for listening to the end, some free Confluent Cloud. Use the promo code 60PDCAST, that's 6-0-P-D-C-A-S-T, to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value after the expiration date is forfeit, and there are a limited number of codes available, so don't miss out.
Anyway, as always, I hope this podcast was useful to you. If you want to discuss it or ask a question, you can always reach out to me on Twitter @tlberglund, that's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video, or reach out on Community Slack, or on the Community Forum. There are sign-up links for those things in the show notes if you'd like to sign up.
And while you're at it, please subscribe to our YouTube channel, and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple podcasts, be sure to leave us a review there that helps other people discover it, especially if it's a five star review, and we think that's a good thing. So thanks for your support, and we'll see you next time.
When building solutions for customers in Microsoft Azure, it is not uncommon to come across customers who are deeply entrenched in the Apache Kafka® ecosystem and want to continue expanding within it. Thus, figuring out how to connect Azure first-party services to this ecosystem is of the utmost importance.
Ryan CrawCour is a Microsoft engineer who has been working on all things data and analytics for the past 10+ years, including building out services like Azure Cosmos DB, which is used by millions of people around the globe. More recently, Ryan has taken a customer-facing role where he gets to help customers build the best solutions possible using Microsoft Azure’s cloud platform and development tools.
In one case, Ryan helped a customer leverage their existing Kafka investments and persist event messages in a durable managed database system in Azure. They chose Azure Cosmos DB, a fully managed, distributed, modern NoSQL database service as their preferred database, but the question remained as to how they would feed events from their Kafka infrastructure into Azure Cosmos DB, as well as how they could get changes from their database system back into their Kafka topics.
Although integration is in his blood, Ryan confesses that he is relatively new to the world of Kafka and has learned to adjust to what he finds in his customers’ environments. Oftentimes this is Kafka, and for many good reasons, customers don’t want to change this core part of their solution infrastructure. This has led him to embrace Kafka and the ecosystem around it, enabling him to better serve customers.
He’s been closely tracking the development and progress of Kafka Connect. To him, it is the natural step from Kafka as a messaging infrastructure to Kafka as a key pillar in an integration scenario. Kafka Connect can be thought of as a piece of middleware that can be used to connect a variety of systems to Kafka in a bidirectional manner. This means getting data from Kafka into your downstream systems, often databases, and also taking changes that occur in these systems and publishing them back to Kafka where other systems can then react.
One day, a customer asked him how to connect Azure Cosmos DB to Kafka. There wasn’t a connector at the time, so he helped build two with the Confluent team: a sink connector, where data flows from Kafka topics into Azure Cosmos DB, as well as a source connector, where Azure Cosmos DB is the source of data pushing changes that occur in the database into Kafka topics.
If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.