Now suppose, just suppose, you've got data in Confluent Cloud (that's a good thing) and you want to understand it using Databricks running on Azure. Well, Angela Chu, she's with Databricks, and Caio Moreno, he's with Microsoft, tell us how on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.
Hello and welcome to another episode of Streaming Audio. I am your host, Tim Berglund, joined in the virtual studio by Angela Chu and Caio Moreno. Angela is a solution architect at Databricks, and Caio is a senior cloud solution architect at Microsoft. Angela and Caio, did I get those right? And welcome to the show.
Yeah, I think that's right. Thank you very much, it's great to be here.
Yeah. Thank you for having me.
You got it. I'm excited to talk about this. So you guys recently co-wrote, with, I think, a partner solution engineer at Confluent as well, so there were three of you, a blog post on the Confluent blog. We're recording this in the middle of March; in fact, it's St. Patrick's Day today, and I didn't make green pancakes this morning, which I feel was maybe a little bit of a miss.
But a month and a half ago, you guys wrote a blog post called Consuming Avro Data from Apache Kafka Topics and Schema Registry with Databricks and Confluent Cloud on Azure. And it was a really, really carefully documented, well-thought-out description of this integration. I really just wanted to talk about it today: what's important about it, how you did it, and to really talk through the work that you described in that blog post.
That sounds great.
So, either one of you, jump ball here: if you could just summarize, what is the thing that you were doing in this blog post? Like I said, it's a good, thorough, step-by-step description. But before we get into the step-by-step, what is it you were trying to show?
Did you want to answer that?
Yeah, so I think it all came from customers that we all work with together, where they see the potential and the opportunity to use Spark and Kafka and to run this in the cloud. So we use Confluent Cloud and Azure Databricks and all the Azure services to build a streaming solution where we can do real-time predictions and also do all the data transformations in real time. And some customers want a fully managed service; they want to avoid the complexity of having to manage Kafka, manage Spark, manage the data lake, and manage all the pieces that you need. So we worked together on this integration, from the Confluent side, the Databricks side, and Microsoft, and then we decided to put out this blog post to help other customers and other people that have the same desire.
Awesome. Yeah, some customers want that, and the only customers that don't, I think, are the ones being told they can't have it. It's way better not to manage your own data lake or your own event streaming system; I mean, that's all a given. So yeah, we're talking about Confluent Cloud and Databricks running on Azure, and potentially some other Azure services, and honestly the integration story is really neat. So let's get started in Confluent Cloud. I mean, Databricks, fundamentally... we were talking before we started recording; I usually say Spark to myself, and I realize those are not identical things, just like Confluent Cloud is certainly not identical to Apache Kafka. But I guess I'm doing to you what people do to us. The way I think about Spark, and this is a slightly biased view clearly, since event streaming is our platform and everything, is that there is data that has gotten someplace and I need to do potentially very sophisticated analysis on it.
I need a place for people to think about what has happened, even what has happened in the very recent past; that's what Databricks is for. So there's some application running, and the microservices are all, let's say, event driven and reactive, and talking through Kafka topics and all that; that stuff is going on. We want to get that data somewhere else, so that data scientists, business analysts, whoever it is, can build out that kind of analytics backend. And so it's important to have a system that does that, a system like Databricks. So we start with Confluent Cloud, there's data there. What data did you have in Confluent Cloud in the case of the demo that you built?
Well, we used Datagen to generate some clickstream data, because that would be really effective for showing a real-time application. Clickstream data in the real world is always coming in; people are always clicking on things. And so we thought that would be a good way to illustrate how you can build something real-time with Confluent Cloud and Databricks.
I love it. And it just doesn't take a lot of effort to get that data going, right? There's all this other stuff that does take effort, and you're trying to show this complex thing and these pieces working together. Datagen is a connector: you add the connector, you select a schema, you give it a rough idea of how fast you want it to run, and it generates the data. So I love that you started with that. And I like to feature that; I just like to talk about it when I can, and I try to use it in demos when I can and things like that, because it's just there and it's super easy. And here's an idea... let's see, who's the PM on this? Nathan, confidential to you: we should be able to put custom schemas, or customized schemas, on Datagen, so you could tweak the statistics of the clicks and things like that. So when you're making analytics demos, you could change things, right?
Yes.
Okay. Well, we'll say the announcement's been made on air, so I'm sure it'll make it to the top of the backlog next quarter.
Yeah. Nathan, I hope you're paying attention, definitely.
I won't use his last name.
Yeah. The way I think about using it is that we could create a repeatable scenario in the blog post that people could easily use without having to worry about the data and how to generate it.
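For anyone who wants to follow along, the fully managed Datagen source connector only needs a handful of settings. The sketch below shows roughly what that configuration looks like, written as a Python dict for readability; the connector name, topic, and credentials are placeholders, and the exact field names should be checked against the current Confluent Cloud connector docs rather than taken from this transcript.

```python
# Approximate configuration for the fully managed Datagen source connector in
# Confluent Cloud, expressed as a Python dict. Values in angle brackets are
# placeholders; field names may differ slightly in your environment.
datagen_config = {
    "name": "clickstream-datagen",
    "connector.class": "DatagenSource",
    "kafka.api.key": "<KAFKA_API_KEY>",
    "kafka.api.secret": "<KAFKA_API_SECRET>",
    "kafka.topic": "clickstream",
    "output.data.format": "AVRO",   # registers the schema in Schema Registry
    "quickstart": "CLICKSTREAM",    # the built-in clickstream sample schema
    "tasks.max": "1",
}
```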
Right. And without sounding like a Confluent Cloud commercial: Schema Registry, I know you mentioned that you used Schema Registry, did it participate in interesting ways in the setup? What was its role?
Oh, that was one of the key reasons that we created the blog, actually, because in open source, the Schema Registry can often be left wide open. And so there was a way to get to it when it was wide open, but in Confluent Cloud, as it really should be for the security of your information, you need proper authentication, and people didn't know how to integrate with the secured Schema Registry specifically. So you could pull Avro data from Confluent Cloud, and then you couldn't parse it because you couldn't get to the schema. That was the premise. The customer that I was working with at the time, they switched from JSON to Avro, and they were like, well, great, we're going to use this awesome Schema Registry, we get schema versioning and control, this'll be great... why can't we get our data out? So that's when I had to come in and solve the problem and figure out how to integrate. So it was the key reason for it.
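To make the authentication part concrete, here is a minimal sketch of connecting to a secured Confluent Cloud Schema Registry from a Databricks notebook with the confluent-kafka Python client. The URL, API key, secret, and schema ID are placeholders; this is not the exact code from the blog post.

```python
from confluent_kafka.schema_registry import SchemaRegistryClient

# Confluent Cloud's Schema Registry requires basic auth with an API key/secret.
schema_registry_client = SchemaRegistryClient({
    "url": "https://<schema-registry-endpoint>.confluent.cloud",
    "basic.auth.user.info": "<SR_API_KEY>:<SR_API_SECRET>",
})

# Given a schema ID (extracted from the Avro wire format, discussed below),
# fetch the schema definition so the payload can be decoded.
schema = schema_registry_client.get_schema(100001)  # placeholder ID
print(schema.schema_str)
```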
Oh, nice.
Yeah.
Okay. The way you gave that answer, it sounds like that was a contrived setup question; it wasn't, I literally just came up with it. So I like your answer. But that's cool, I'm glad it was important. One might say, well, it's just metadata, but of course we all know how that goes.
Yeah. So I definitely agree that the Schema Registry is a key part of the solution, to make sure we have the important parts of the architecture in place. Customers are looking for that as well.
So we've got Datagen, we've got Schema Registry; let's move over to the Azure Databricks side of things. First of all, what does it mean to say Azure Databricks? Clearly that means it's Databricks running on Azure. Are there other Azure platform components that are of interest to your pipeline? Caio, this is a question for you, tell me about that.
Yeah. So maybe Angela can give more details, but Azure Databricks, for the people listening to this podcast who don't know, is Databricks running on Azure. So you have all the integration with the Microsoft ecosystem and all the other Azure services, for example Azure Active Directory, and other things that you will need if you want to run this on Azure. So it just makes your life easier if you're building your Azure architecture.
Cool. Now, Angela, whatever way you want to take this, you're the one who knows, so just walk us through it. But I want to know things like, number one, what's the integration? I've got stuff in Confluent Cloud, we've established that; how does it get into Databricks? What does it mean for it to get into Databricks? What storage is it being written to? How does that happen? And then, what are you trying to do with it? I mean, we want to do analytics on click streams, but just kind of walk us through that.
Yeah, absolutely. So open source Spark already has a Kafka connector, and we use that Kafka connector to pull data from the topics that are inside Confluent Cloud. Now, what makes it different is that when Avro data goes through Confluent Cloud, it's special: it's not plain open source Avro, it has elements of the Schema Registry attached to it. There are five bytes at the beginning of each binary message, and those five bytes need to be pulled out specifically; we need to get the ID for the schema, otherwise we can't parse it, it's just binary. Those IDs then allow us to go and query the Schema Registry with a client that you can set up simply in a notebook, or in your application if you're writing it with an IDE; you don't have to use a notebook.
The blog uses a notebook for simplicity, so that we can show how easy it is. But when you create that client, you parse out the schema IDs, it's able to pull the schema, and then you can use from_avro, which is also an open source Spark function, to go and parse that data. So what it's doing is ingesting the data that is going through Confluent Cloud, parsing it, and then dropping it in Delta format on Azure storage, so it's helping to build your data lake.
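Here is a minimal PySpark sketch of the flow Angela describes, assuming a Databricks notebook and the Schema Registry client from the earlier sketch. The broker address, topic name, and credentials are placeholders, and it decodes against a single known schema rather than handling multiple schema IDs the way a production job, or the blog post itself, would.

```python
import pyspark.sql.functions as fn
from pyspark.sql.avro.functions import from_avro

# Read raw binary records from a Confluent Cloud topic over SASL_SSL.
# On Databricks the Kafka classes are shaded, hence the "kafkashaded." prefix;
# plain open source Spark uses org.apache.kafka... without that prefix.
raw_df = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "<bootstrap-server>.confluent.cloud:9092")
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        .option(
            "kafka.sasl.jaas.config",
            "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
            'username="<KAFKA_API_KEY>" password="<KAFKA_API_SECRET>";',
        )
        .option("subscribe", "clickstream")
        .load()
)

# Confluent's Avro wire format prepends five bytes to each value:
# byte 0 is a magic byte and bytes 1-4 are the schema ID as a big-endian int.
parsed_df = (
    raw_df
        .withColumn(
            "schema_id",
            fn.expr("conv(hex(substring(value, 2, 4)), 16, 10)").cast("int"),
        )
        .withColumn("avro_payload", fn.expr("substring(value, 6, length(value) - 5)"))
)

# Decode the remaining bytes with the schema fetched from Schema Registry
# (schema.schema_str comes from the client sketch earlier).
decoded_df = parsed_df.select(
    from_avro("avro_payload", schema.schema_str).alias("record")
)
```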
The example that's in the blog is helping to build what we like to call the bronze layer, or raw layer, of the data lake, so that you can start a curated data lake directly on ADLS Gen2 storage. From there you can start the process of manipulating the data and cleaning it, maybe applying some basic business logic, and then in another phase you can do aggregations, business aggregates. You can query it directly on your data lake, or you can push it to a serving layer. So that's what Databricks is allowing you to do: it's ingesting that data and then giving you control to lay it out, manage it, and query it later on.
I guess these are really data lake questions, but concretely, what is the storage? We've got Avro records; where in Azure, to what sort of storage, are they written in the case of this pipeline?
So it's written into Azure Data Lake Storage Gen2. Databricks itself is just the platform; it's not actually storing data within itself. The data is being stored in Azure storage in the customer's account. We don't take the data and store it in our own account; it stays in the customer's account, in their storage accounts, so they've got complete control. In the case of the blog post here, we're storing it in Delta format, because Delta format, it's Parquet under the hood, but it can take immutable files and it puts a layer on top. So then you can suddenly do updates, deletes, and merges directly on your data lake with Delta format, without having to go and figure out, okay, these are immutable files, so which files have the rows I'm updating? Okay, grab all those, rewrite all of those, delete all the old ones. Delta does all of that for you, so you can just write the equivalent of SQL update statements directly on this immutable file format, and you don't have to worry about it.
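As a rough illustration of the write side, continuing from the sketch above: the decoded stream lands in Delta format on ADLS Gen2, and once it is there, ordinary SQL-style updates and deletes work directly against the lake. The storage account, container, and the delete predicate below are all placeholders.

```python
# Land the decoded records in Delta format on ADLS Gen2 (the bronze layer).
bronze_path = "abfss://datalake@<storage-account>.dfs.core.windows.net/bronze/clickstream"

(
    decoded_df.writeStream
        .format("delta")
        .option("checkpointLocation", bronze_path + "/_checkpoints")
        .start(bronze_path)
)

# Because it is a Delta table, data manipulation statements run directly
# against the files in the lake; Delta works out which Parquet files to
# rewrite. The predicate here is purely illustrative.
spark.sql(f"DELETE FROM delta.`{bronze_path}` WHERE record IS NULL")
```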
Delta is a layer on top of Parquet. You said Delta the first time and then I said Avro; you did in fact say Delta. But it's a layer on top of Parquet, that's the way we should understand it, right?
Basically, yes. It's another open source format that Databricks created and now uses in our product, and it's a metadata layer that sits on top of Parquet.
Got it. So it sounds like to make this work, there's also some kind of metadata registry to keep track of schemas and all the Deltas and all that stuff that's invisible to me. [inaudible 00:13:45]
It actually stores directly with the data, so there is no separate piece; all the metadata and everything is just sitting there as part of Delta, even when you want to register it in a catalog or something. When you want to register a Delta table in a catalog, you just point to the location of the Delta table and give it a name. You don't have to specify the schema or any of the properties, because that's all stored right along with the data on storage.
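A quick sketch of what that registration looks like, assuming the bronze path from the earlier example; because the schema lives in the Delta transaction log, the statement only needs a name and a location.

```python
# Register the existing Delta data as a catalog table; no column definitions
# are needed because they are read from the Delta log at the given location.
spark.sql("""
    CREATE TABLE IF NOT EXISTS clickstream_bronze
    USING DELTA
    LOCATION 'abfss://datalake@<storage-account>.dfs.core.windows.net/bronze/clickstream'
""")
```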
Nice. Okay. So you just go, look at it, introspect it, and the schema comes out. I like it.
Mm-hmm (affirmative).
And imagine, Caio, there's a listener who maybe doesn't know all the Azure things very well. What is Data Lake Gen2 storage?
Yeah. So when Microsoft released Azure Data Lake Storage, there was Generation 1, and then they basically created Generation 2, which brings more functionality. I think what is interesting for these use cases is, let's say you're a data scientist and you want real-time data coming in, and you want to build a machine learning pipeline using Databricks, Confluent, and, for example, Azure Machine Learning; you could integrate everything, and that was my main focus as well: to have a machine learning streaming platform where we could predict in real time. For the blog post we didn't go that far, but if you look at the architecture, you will see that we put in a service like Azure Machine Learning, one that I usually use more in my daily activities.
And then we can integrate everything. In this case it's clickstream data coming in, but it could be IoT sensor data as well. You ingest it into your platform and then you can do real-time predictions. So, to respond to your question, Generation 2 is what I would recommend people use right now. Some customers in the past started with Generation 1, but now they should be using Generation 2. That's why we chose ADLS Gen2 for this blog post; it has many more features and much more...
Azure Data Lake Storage Gen2.
Yeah.
And Gen2 is better than Gen1?
So that's the idea. Well, we can go into more detail if you want.
ADLS Gen3: this time, it's not ADLS Gen2. So we've already got the tagline figured out.
But to talk technology about it, there are many blog posts on the Microsoft website where you can compare the differences. But I mean, to make it easier for people that are listening to the podcast: just use Gen2, that's what we recommend now.
Got you. I'm going to try to find a link out to that, if you're interested to know more about how that works, and why it's not blob store, what's different about it, it's a place to put it [inaudible 00:16:43] -
Yeah, so that's a very, very good question. Usually when you talk about a data lake, you need more features, I would say, that you will use. It's basically a storage account; you create the storage account, and when you create it there are some options that, when you choose them, it becomes ADLS Gen2. So basically that was it for the use case, yeah.
Cool. So Angela, back to the flow here. It's in storage, we've got that. For trying to get this done inside 30 minutes, I need to ask fewer questions, so take us from there. You also said notebook, and I'm going to assume everybody knows that's a Databricks notebook, which is like a Jupyter notebook; it's the browser-based thing with code that runs behind it. So keep talking. Where do we go from here?
Where do we go? Well, we've pulled the data from Confluent Cloud, we've pulled the schema, and we've parsed it, so now it's actually visible; you can parse the binary and you can see it. And we've stored it in a raw Delta table. For the blog post, you can see that we just put the nested data in one column in the Delta table and stored it in storage. But from there, you can pick it up again, pull out specific nested data items from the parsed data, flatten them, adjust them, and then write it out to a silver layer in Delta. That silver layer is often what the data scientists, doing the machine learning I just talked about, are going and pulling from in order to run their machine learning algorithms. And utilizing functionality like Spark, you can train your models with large quantities of data in a much shorter amount of time.
And the more data you can use to train your model, the more effective that model is going to be. So that's something that can be done directly off that silver layer that was just created. Then, from the silver layer, if you want to aggregate your data: often when analysts are doing their job and they want to create reports for the upper levels, who want to make business decisions based off of their analytics, it's nice to pre-aggregate that data to make things go faster. You aggregate that data, and again you can pull it from the silver Delta table, either in a batch or a streaming process if you want to do it more real-time, and you put it in the gold layer. Now, once it's in the gold layer, you can either hit it directly with your favorite BI tool, right there on storage.
Or you can push it to something like Azure Synapse, if you wanted to use that as a serving layer; you have a lot of choices there. But then you've got that gold data set that is guaranteed to be aggregated the way that the users need it, and they can then pull from it and do their analytics off of that gold layer. But you've got a single source of truth right there on Azure storage, and it's curated; it's not a data swamp, as I know a lot of people like to call data lakes from the past. It can be managed, if you need to go in and update it or rerun...
I guess I said that once.
Yes. Uh-huh (affirmative). If you need to update things, if you need to reprocess things, all of the data is right there, and it can be done in a controlled manner so that the data is usable and well-managed.
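To make the medallion flow concrete, here is a hypothetical bronze-to-silver-to-gold sketch in PySpark. The column names are assumptions loosely based on the clickstream example, not the exact fields used in the blog post, and the paths are placeholders.

```python
import pyspark.sql.functions as fn

bronze_path = "abfss://datalake@<storage-account>.dfs.core.windows.net/bronze/clickstream"
silver_path = "abfss://datalake@<storage-account>.dfs.core.windows.net/silver/clickstream"
gold_path = "abfss://datalake@<storage-account>.dfs.core.windows.net/gold/clicks_per_minute"

# Bronze -> silver: flatten the nested record into proper columns.
silver_df = (
    spark.readStream.format("delta").load(bronze_path)
        .select(
            fn.col("record.userid").alias("user_id"),  # assumed field names
            fn.col("record.ip").alias("ip"),
            fn.col("record.request").alias("request"),
            fn.to_timestamp(fn.col("record.time")).alias("event_time"),
        )
)

(
    silver_df.writeStream
        .format("delta")
        .option("checkpointLocation", silver_path + "/_checkpoints")
        .start(silver_path)
)

# Silver -> gold: pre-aggregate for reporting. With a watermark, finalized
# one-minute windows can be appended to the gold Delta table as they close.
gold_df = (
    spark.readStream.format("delta").load(silver_path)
        .withWatermark("event_time", "10 minutes")
        .groupBy(fn.window("event_time", "1 minute"), "request")
        .count()
)

(
    gold_df.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", gold_path + "/_checkpoints")
        .start(gold_path)
)
```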
What I just heard: the difference between gold and silver is that gold has pre-computed aggregates and is directly queryable. They're both fundamentally the same kind of logical storage medium, but one of them is queryable and has had some pre-computation done.
So they're all queryable, but it depends. Typically a business analyst is coming in and they've got to make their daily dashboard for their management, and they want to query that pre-aggregated data. But if they want to do data traceability, or if you have data scientists who don't want aggregates, or some power users that don't want aggregates, you can query the silver layer with the same tools that you query the gold layer with. And then if you've got someone who is responsible for the traceability of that data all the way through, they can even query the raw layer.
Although, of course, that is less friendly. It'd be just like a staging layer, if you're talking about a traditional database, where you don't necessarily want people in there, but you've got all three layers. So you can see completely, if something shows up in the gold layer as an aggregate and you've got someone from management questioning that number, where did that number come from? You've got traceability all the way back to the beginning. I know that was an issue I had in a previous job, trying to trace: why does this number look like this? Tell me what data went into this number. You can trace it all the way back to the row with this method.
And that traceability metadata is in the storage system. When I've got something in gold, I know where it came from.
Yeah, typically you identify your rows of data, but you've also got your logic right there. You can tell, when you're querying your silver data, what the business rules are to go and pull the gold data, and you can even reprocess from the silver. So it gives you all of these options, because you've got your bronze, silver, gold medallion model, as it's often called, within your data lake.
Got you. Is there a Mithril layer? And if not, why not? Seems like that's a bit of a miss; there should be a Mithril layer. Maybe that's a...
You know, I'll take that back to the engineers; they need to work on that.
Yeah. Maybe a product thing, and even a product marketing thing. Get some buy-in there and they'll put pressure on that. We need a Mithril layer; this will appeal to the users, at least, if not the buyers. All right.
Definitely.
So it's in the gold layer, it's in the silver layer, I'm querying it directly from a notebook. Tell the story of the actual analysis you did. And I want to preface this with: we talked about Datagen and how awesome it is, because it's just there and it does things, and how I wish the statistics were customizable, because that's one of the problems with randomly generated data when you're trying to show some cool ML thing. It's like, well, the statistics aren't that cool: 20% probability it was this, 80% probability it was that, and that's what you got. But how did you put all that together with the data that you had?
Well, we didn't.
There you go.
We stopped at landing it in Delta format in the data lake, though I was doing a little bit of analysis myself. And you don't have to use a notebook; I mean, you can use your favorite BI tool to hit it directly, Databricks allows that.
Absolutely.
But I was trying to do some queries and things like that against it, and I've got IPs and users and funny little commands and that. So I wasn't able to do too much analytics with the data, but it was perfect for showing how well this whole system works. Because really, I didn't need the data to be complex to show the integration, and I didn't need the data to be complex to show how easy it was to pull that data, parse it, and then drop it in Delta format on storage. It made it so easy.
Nice. Yeah, that is always how it goes with auto-generated data: it's clean. The real world is messy, and we build analytics tools because of the impenetrable mess of the real world that we can just barely, maybe, get a glimpse into. But I guess, like you said, your point is that the data's in topics in Confluent Cloud, and it doesn't matter how it got there. I mean, you picked the right solution for this demo, but if it gets there from messy, impenetrable, real-world phenomena, then the same pipeline does the work, and all those backend tools can be brought to bear on it.
Absolutely. Change data capture data is what I ended up having to deal with for the customer that I was working with, which was definitely much more complex. And it worked beautifully.
Nice. So what... Oh, go ahead.
If I can add to that: yeah, so for the blog post we were basically showcasing, okay, you can get the data and then you can do all the fancy analytics. But with real customers that I'm working with, we're continuing the story and using the machine learning tools to do predictions. In one scenario I worked on, it was predictive maintenance, so once you have the data, you can easily bring your models and do the end-to-end story. That's possible. We just wanted to showcase to other people how you make the integration; then it's really up to the users, to the developers, to continue the story and bring in whatever they feel most comfortable using.
There you go. A Mike Wallace kind of grilling question here: in the integration, what was your favorite part, and what do you think felt bumpy and could be smoothed over in the future? And that could be on the Confluent Cloud side, too. So, frank question: what did you love and what did you not love about the whole process?
That's a good question. Probably trying to figure out the correct libraries. I did need to add libraries to the Databricks cluster to use the functionality that's shown in the blog, and if I didn't use the correct versions of the libraries, there were version conflicts, depending on whether I got something too old or too new. And so I think that was the bumpiest part: getting the correct versions.
I'm going to quote Andrew [Lochiwes 00:27:00] having said, dependencies are a pain.
Yes, they are. And that's why I felt it was important to figure that out and then share it with everyone, so no one else has to get hurt.
That is so good. "I did it, here is how, just listen and do this," that is solid gold. So thank you for that. But anyway, keep going.
I agree.
You were going to say another thing, Angela, I cut you off.
I don't remember what I was going to say; go ahead, Caio.
I agree the dependencies and the libraries, yeah, exactly.
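As a small, hedged example of the dependency side: the Spark Kafka and Avro integrations ship with the Databricks runtime, so the main thing installed by hand is the Python Schema Registry client, and its version has to line up with the cluster's runtime. The snippet below is a notebook-cell sketch, not the specific setup used in the blog post.

```python
# In a Databricks notebook cell; pin a confluent-kafka version that is
# compatible with your cluster's Databricks runtime and Python version.
%pip install confluent-kafka
```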
And the keys made it very convenient, but I know that there are some customers that don't like keys. I did use Databricks secrets, which is a method you can use so that you can use a value without actually displaying the value anywhere. Using Databricks secrets, you can retrieve the value, but it will display as redacted. So people can use them, but not know them. So that's good for not displaying it directly, but [crosstalk 00:28:04]
So, like, the API key and secret, not the key in the message in the topic. Okay. Okay.
Oh, no. Yeah. The API key and secret for the registry and then the API key and secret for the topics as well. So it made it easy for me, but I know some customers don't like using keys.
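Here is a minimal sketch of how those keys can be kept out of the notebook with Databricks secrets; the scope and key names are placeholders that would have been created beforehand with the Databricks CLI.

```python
# Values retrieved this way display as [REDACTED] in notebook output, so the
# notebook can use the credentials without ever showing them.
kafka_api_key = dbutils.secrets.get(scope="confluent-cloud", key="kafka-api-key")
kafka_api_secret = dbutils.secrets.get(scope="confluent-cloud", key="kafka-api-secret")
sr_api_key = dbutils.secrets.get(scope="confluent-cloud", key="sr-api-key")
sr_api_secret = dbutils.secrets.get(scope="confluent-cloud", key="sr-api-secret")
```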
Got it. And they want to... I'm not sure what the option is there, but right, security is also a pain. These are elemental truths you're expressing here, and you are 100% right. Well, my guests today have been Angela Chu and Caio Moreno. Angela and Caio, thanks for being a part of Streaming Audio.
Thank you very much for having us.
Thank you very much.
Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST, that's 6-0-P-D-C-A-S-T, to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value after the expiration date is forfeit, and there are a limited number of codes available, so don't miss out.
Anyway, as always, I hope this podcast was useful to you. If you want to discuss it or ask a question, you can always reach out to me on Twitter at @tlberglund, that's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video, or reach out on Community Slack or on the Community Forum. There are sign-up links for those things in the show notes, if you'd like to sign up. And while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcasts, be sure to leave us a review there; that helps other people discover it, especially if it's a five-star review, and we think that's a good thing. So thanks for your support, and we'll see you next time.
Processing data in real time is a process, as some might say. Angela Chu (Solution Architect, Databricks) and Caio Moreno (Senior Cloud Solution Architect, Microsoft) explain how to integrate Azure, Databricks, and Confluent to build real-time data pipelines that enable you to ingest data, perform analytics, and extract insights from data at hand. They share about where to start within the Apache Kafka® ecosystem and how to maximize the tools and components that it offers using fully managed services like Confluent Cloud for data in motion.