March 22, 2022 | Episode 205

Building Real-Time Data Governance at Scale with Apache Kafka ft. Tushar Thole

Transcript

Kris Jenkins: (00:00)

Today we're talking about stream governance. Now, wearing my developer hat, I first saw this as tools for debugging and monitoring, but it's actually a lot more than that. For this episode, I'm joined by Tushar Thole, who's the engineering manager for stream governance at Confluent. And he very kindly gave me the bigger picture on what governance is, and the tools that they've been working on to help with debugging and monitoring, yes, but also a lot more, within organizations of various sizes. Before we start on that, let me tell you that Streaming Audio is brought to you by Confluent Developer, which is our site that teaches you everything you need to know about Kafka. From how to start it and run it, how to write your first app, all the way up to architectural patterns, performance tuning, maintenance, all that good stuff.

Kris Jenkins: (00:49)

Check it out at developer.confluent.io. And if you want to take one of our hands-on courses that are there, you can easily get Kafka running using Confluent Cloud. Sign up with the code PODCAST100, and we'll give you $100 of extra free credit to get you started. And with that, I'm your host, Kris Jenkins, this is Streaming Audio. Let's get started.

Kris Jenkins: (01:17)

My guest today is Tushar Thole, who spent 10 years working at VMware, where he picked up three software patents in security and distributed file systems. And then he came to work with us at Confluent as head of engineering for the stream governance product. Tushar, thanks for joining us.

Tushar Thole: (01:36)

Thank you, Kris. Glad to be here.

Kris Jenkins: (01:40)

It's a pleasure to have you. Stream governance is something I find it can be a bit nebulous, because it's not one thing and it's not for one user, it's covering several different fields in a group. So just let's start off, give me an idea of what kinds of things stream governance is trying to achieve.

Tushar Thole: (02:02)

Yeah, that's a great first question. And before I start, I would like to give a shout out to my team, my talented colleagues in product and engineering. I'm merely representing the work that is done by many others, so thank you to all of them.

Kris Jenkins: (02:18)

If you want to achieve anything, you need a lot of good people behind you, right?

Tushar Thole: (02:25)

Yes. And to answer your question: if you look at where the industry is going, there are many ways to describe it. People say that data is the new oil. Another way to describe it is that this is a data-everywhere moment. A third way to describe it would be, hey, software is [inaudible 00:02:44] the world. At the end of the day, looking at it from Confluent's perspective, we firmly believe that you have to leverage your data in motion to make strategic and tactical decisions. And it doesn't matter which industry you are in, whether you are an e-commerce organization, or you are an oil rig, or you are helping people hail a ride, or you're trying to [inaudible 00:03:10]. You want to rely on data, not just to make quick decisions, to figure out, hey, is this transaction fraudulent? But also to figure out, hey, should we be investing in this new initiative at our organization?

Tushar Thole: (03:24)

In both areas you have to rely on data. And one thing that we realized is that the key inhibitor to this moment is, can people trust the data they have? Can people find the data they have? And can they visualize in real time, hey, do they comply with government regulations? Which are also coming into play more and more these days. So with that in mind, we decided to look at it from the customer's point of view. And we realized that there is clear demand here. There is a clear ask from customers where they want to search and discover data. And they want to find how the data is changing as it navigates through each stage of the data pipeline. And they want to be sure that they can trust the data they have. So with that in mind, we basically started looking at stream governance. And we think of it as a logical evolution of our Schema Registry product, which we consider the de facto metadata management standard for Kafka.

Kris Jenkins: (04:33)

Okay. Yeah. So it's not just about types anymore, it's going beyond that to answer larger questions about the structure of your data.

Tushar Thole: (04:41)

Yeah, totally. Basically, if you look at Schema Registry, it has been around for a while. If you look at [inaudible 00:04:48] story, it started pretty much very quickly after our founders started a project called [inaudible 00:04:54], or Kafka [inaudible 00:04:56] LinkedIn. Because what we realized is that, Kafka being a binary protocol, people need to be able to trust the data that is being written to the broker and the data that they're reading from the broker. And for that, we basically came up with a neat project called Confluent Schema Registry, which is under the Confluent Community License. There are a bunch of external committers to this project.

Tushar Thole: (05:23)

And it helps you do a few things really well. It helps you define a contract between the producer and consumer; that is the standardization part. Then it can check, is the data being written, does it comply with the schema? That's the validation part. And it lets you evolve the schema. So that's how we started with Confluent Schema Registry. And what we think is that customers are today underserved by Schema Registry, because what they're trying to do is solve this governance use case. And as we talk to customers more and more, we realize that, hey, we can take it to a whole new level. And that's how governance was born.
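To make the three parts concrete (contract, validation, evolution), here is a minimal sketch in plain Python. It is not Schema Registry's implementation or API: the `Field` type, `validate`, and `is_backward_compatible` are hypothetical names, and the compatibility rule is a simplified assumption (a new schema can read old data if shared fields keep their type and every added field has a default).

```python
from dataclasses import dataclass
from typing import Any, Dict, List

_NO_DEFAULT = object()  # sentinel: "this field has no default value"


@dataclass(frozen=True)
class Field:
    """One field of the contract (hypothetical, simplified)."""
    type: type
    default: Any = _NO_DEFAULT

    @property
    def has_default(self) -> bool:
        return self.default is not _NO_DEFAULT


Schema = Dict[str, Field]


def validate(record: dict, schema: Schema) -> List[str]:
    """Validation: return a list of problems; an empty list means the record complies."""
    errors = []
    for name, spec in schema.items():
        if name not in record:
            if not spec.has_default:
                errors.append(f"missing required field: {name}")
        elif not isinstance(record[name], spec.type):
            errors.append(f"wrong type for field: {name}")
    for name in record:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors


def is_backward_compatible(old: Schema, new: Schema) -> bool:
    """Evolution: can a reader using `new` still read data written with `old`?"""
    for name, spec in new.items():
        if name in old:
            if old[name].type is not spec.type:
                return False  # a shared field changed type
        elif not spec.has_default:
            return False  # an added field needs a default to read old records
    return True


# Standardization: the contract both producer and consumer agree on.
v1 = {"order_id": Field(str), "amount": Field(float)}
v2 = {**v1, "currency": Field(str, default="USD")}  # safe evolution
v3 = {**v1, "currency": Field(str)}                 # unsafe: no default

print(validate({"order_id": "o-1", "amount": 9.5}, v1))
print(validate({"order_id": "o-1"}, v1))
print(is_backward_compatible(v1, v2), is_backward_compatible(v1, v3))
```

Real schema formats such as Avro define richer resolution rules (type promotion, aliases, unions), so treat this only as an illustration of the idea of a checked, evolvable contract.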

Kris Jenkins: (06:08)

I get the feeling it's like your first stage is to get all the data from different silos into one place where you can actually see it. Maybe with Connect and things like that. And then once you've got it there, you're like, oh my God, what am I looking at? And where did it come from, right?

Tushar Thole: (06:25)

Yeah, yeah. You're totally right. If you look at Confluent's mission statement, we want to set data in motion and we want to unlock the power of the data. And if you look at the way modern applications are being written, they are microservices based. That leads to this spaghetti architecture where there are multiple services talking to each other. And that's how we said that, hey, Kafka can solve this problem, it can become your central nervous system. And now what we want to do is take that idea to a new level: we want to enable you to search and discover the data that is there in your data in motion infrastructure.

Tushar Thole: (07:13)

And by that what we mean is, what data are you storing in Kafka topics, in various brokers that can be on-prem, that can be in different cloud providers? Or which connectors you're running, which KSQL applications you're running, which custom apps you have written that are producing and consuming data. So in order to search and discover all this data in one place, that's how the stream catalog was born. And as you said, the step here is that you should instrument the entities that you're interested in. You should ingest them in a centralized place, which is the catalog. So that customers have one place where they can find the data that they're interested in, in their organization.

Kris Jenkins: (07:58)

How does that play out as a tool? For the discovery, is it mostly front-end UI that lets you search? Is it like an app store for your data?

Tushar Thole: (08:11)

Yeah. What we believe in is that our... We want to have a great user experience for this, because again, people want to get this insight quickly. One way is that we built a UI in Confluent Cloud, with which you can actually search all the data that you have. So there is a front-end aspect to the catalog. And at the same time, we believe that many of our customers have developers for building applications, and they want to use APIs. So we have public APIs for stream catalog in addition to the public APIs that we have had for Schema Registry today.

Kris Jenkins: (08:55)

Okay. That's actually getting a bit meta, because you're using an API to tell you which APIs you can access, right? In a way.

Tushar Thole: (09:01)

Yeah. Yeah.

Kris Jenkins: (09:01)

Yeah.

Tushar Thole: (09:02)

And in addition to that, we also have CLIs. So API, CLI, UI. We want to enable our customers to get the most value out of their data.

Kris Jenkins: (09:13)

Okay. That's your first piece, the discoverability of the data you've actually ingested. And then I guess the second question you're going to ask is, what about the quality of that data? Where does that come in?

Tushar Thole: (09:29)

Yeah. One of the solutions that we... If you look at the problem, first of all, the customer starts with, "Hey, I don't know what data I have because it's siloed." Using Kafka in the enterprise, you can break these silos. With stream catalog, you can search and discover data in one place. Now the next question is, now I have access to this mountain of information, can I [inaudible 00:09:57] this information? And in order to achieve that, there are a few steps. So this was a problem that we saw customers were trying to solve, and we are tackling it in a couple of ways. One is our Confluent Schema Registry. As I mentioned earlier, we consider it the de facto metadata management standard for the Apache Kafka ecosystem.

Tushar Thole: (10:17)

It lets you define this contract between producer and consumer. And what we realize is that it defines this contract, but again, producers can ignore the contract. You can get into a situation where you are writing garbage to the broker and then clients have to deal with that garbage. And it'll make sure, can [inaudible 00:10:40] use data? So what we decided to do is that we want to continue enhancing our Schema Registry product, because it is very loved by our customers as well as by the community. We continue adding features to it. For example, we added support for JSON Schema, [inaudible 00:11:00] in 2020, which was a highly requested feature by the community.

Tushar Thole: (11:03)

So we continue evolving that. At the same time, we want to figure out, hey, how can we make it more useful to the customer?

Tushar Thole: (11:12)

And one concrete way that we did that until now is that, how about we make the Kafka brokers [inaudible 00:11:18]? That way, if a producer is trying to write data that doesn't comply with the schema, broker can simply reject it because it knows that, hey, it doesn't comply with the schema that it should have. That way, we are reducing the work on the consumer side so that they can trust the data that is on the broker.
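As a rough illustration of what a broker-side check could look like, the sketch below relies on the Confluent wire format, in which a serialized value starts with a magic byte of 0 followed by a 4-byte big-endian schema ID. Everything else is an assumption for the example: `broker_accepts`, `frame`, and the `ids_for_topic` set (standing in for a lookup of the schema IDs registered for a topic) are hypothetical names, not the broker's actual code or configuration.

```python
import struct
from typing import Optional, Set

MAGIC_BYTE = 0  # first byte of a value serialized in the Confluent wire format


def extract_schema_id(value: bytes) -> Optional[int]:
    """Return the schema ID embedded in the value, or None if it is not framed."""
    if len(value) < 5 or value[0] != MAGIC_BYTE:
        return None
    (schema_id,) = struct.unpack(">I", value[1:5])  # 4-byte big-endian integer
    return schema_id


def broker_accepts(value: bytes, ids_for_topic: Set[int]) -> bool:
    """Hypothetical check: accept only values whose schema ID is known for the topic."""
    schema_id = extract_schema_id(value)
    return schema_id is not None and schema_id in ids_for_topic


def frame(schema_id: int, payload: bytes) -> bytes:
    """Helper for the example: prepend the magic byte and schema ID to a payload."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


registered = {42}  # assumed: schema IDs registered for this topic
print(broker_accepts(frame(42, b"payload"), registered))  # known schema ID
print(broker_accepts(frame(7, b"payload"), registered))   # unknown schema ID
print(broker_accepts(b"garbage", registered))             # not framed at all
```

Note that a check like this only confirms the record points at a known schema. It does not deserialize the payload, which is why richer syntactic and semantic validation is a separate step.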

Tushar Thole: (11:44)

So this is one of the steps that we are taking in the data quality space. And we, as we call it, [inaudible 00:11:53] quality pillar. And I think this is very exciting going forward, because these are just couple of things that we have done so far, but there is right now long set of things that we have planned to make it more useful.

Kris Jenkins: (12:10)

So you've got now, you can have no schema, you can have an advisory schema and you can have a mandatory schema. And is there support for, what happens to those rejected records? Is it all handled at the application layer or is there support for kind of a dead letter queue type affair? Or do you manage that yourself?

Tushar Thole: (12:35)

Yeah. So today what we are doing is that it is left to the producer to deal with the rejected records. But this is something that we are thinking of adding that... The natural next step for us here is that we want to enhance it in few directions, right? We can take this data quality story forward in few ways.

Tushar Thole: (12:56)

One is that we are adding a schema ID validation in the broker. That was a first step that we took. We can do many more syntactic and semantic validations. So this is the validation part, right?

Tushar Thole: (13:10)

Second part is that when the data gets rejected, how can we give some actionable feedback to our customers? So one way is that we can showcase data quality dashboards with which you can see which data is getting rejected, why it is getting rejected, so that you can infer some insights from it.

Tushar Thole: (13:33)

And the next step after that would be that we can also think of doing reconciliations: how about we remediate the bad data so that the application can process it again? So these are a few steps that we are thinking of taking in that space.

Kris Jenkins: (13:53)

That makes sense. I've worked with some pretty dodgy data feeds in the past, and you get it working, then your third party changes the data feed and suddenly the most important job of the day is to find out what the new format is and why it's sometimes breaking.

Tushar Thole: (14:08)

Yeah.

Kris Jenkins: (14:09)

Right.

Tushar Thole: (14:10)

True. True. And I think-

Kris Jenkins: (14:11)

That sounds useful to me.

Tushar Thole: (14:14)

Yeah. And what we see is, we talk to customers regularly, and one thing that is very apparent is that everybody instantly agrees why they care about the quality of the data. And one thing I see is that Confluent has an edge in this case in particular, because we have been thinking about this problem for a long time. Confluent's Schema Registry has been around for, I think, seven years at this point.

Tushar Thole: (14:47)

So we have thought about it. We have heard not only from customers, but from community who are putting this product in use in multiple use cases. So we are pretty familiar with what customers are trying to do.

Tushar Thole: (14:59)

And other thing I would say is that Confluent also has ability to solve really hard problems in Kafka broker space, by the virtue of having really talented [inaudible 00:15:15] system engineers who work at Confluent. So I think Confluent will be able to add lot of value to our customers.

Tushar Thole: (15:24)

And another thing is that customers already have Apache Kafka as a central piece of their data infrastructure today. So if we can reduce the burden on our customers where they have to have these separate jobs, where they have to clean the data, we can help them focus on their business problem.

Kris Jenkins: (15:46)

Yeah. And monitoring is always a big part of this, right? Being able to take that off the table is huge.

Tushar Thole: (15:54)

Yeah. Yeah.

Kris Jenkins: (15:56)

So that's kind of the producer side, but then you've got the flip side of when I'm consuming data. Sometimes I just want the data, but sometimes I want to see the bigger picture of how it came to be.

Tushar Thole: (16:06)

Mm-hmm (affirmative).

Kris Jenkins: (16:09)

So that presumably is another thread. What have we got there?

Tushar Thole: (16:12)

Yeah. So that's a very interesting question. So a lot of people are interested in the origin story, right? Or the lineage of the data. Where is the data coming from? Right?

Kris Jenkins: (16:23)

Yeah.

Tushar Thole: (16:25)

So with Catalog, it's democratizing data in your organization; now you can suddenly see what you have. With quality, you can see, "Hey, can I truly process this data?" And now I'm interested, hey, how did this data change? Where did this originate? Who is using this data downstream?

Tushar Thole: (16:42)

And after hearing about this from many customers, what we realize is that this is something that will be very useful to customers and that's how Stream Lineage was born. So Stream Lineage is a knowledge graph of your data pipelines.

Kris Jenkins: (16:59)

Right.

Tushar Thole: (16:59)

So what we are doing is that we don't expect customers to do any configuration on their part. On Confluent Cloud, if you're using Stream Lineage, it will showcase how your data is changing in every stage of your pipeline, as it passes through various producers to topics, to connectors, KSQL applications.

Kris Jenkins: (17:23)

Yeah.

Tushar Thole: (17:24)

So it gives you a bird's-eye view of the data, so that you can find out the lineage of the data and also where it is going, and should that producer or consumer have access to this sensitive data?
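The "where did it come from, where is it going" questions are graph traversals. The following is a small sketch of that idea, not how Stream Lineage is built: the `LineageGraph` class and all node names are made up for illustration, with producers, topics, queries, and connectors modelled simply as strings.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


class LineageGraph:
    """Directed graph of data flow: an edge (a, b) means data flows from a to b."""

    def __init__(self, edges: Iterable[Tuple[str, str]]):
        self._down: Dict[str, Set[str]] = defaultdict(set)
        self._up: Dict[str, Set[str]] = defaultdict(set)
        for src, dst in edges:
            self._down[src].add(dst)
            self._up[dst].add(src)

    @staticmethod
    def _reach(start: str, adjacency: Dict[str, Set[str]]) -> Set[str]:
        """Breadth-first search: every node reachable from `start`."""
        seen: Set[str] = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def upstream(self, node: str) -> Set[str]:
        """Where did this data come from?"""
        return self._reach(node, self._up)

    def downstream(self, node: str) -> Set[str]:
        """Who is using this data?"""
        return self._reach(node, self._down)


# Hypothetical pipeline: producer -> topic -> query -> topic -> sink connector.
graph = LineageGraph([
    ("orders-service", "orders"),
    ("orders", "enrich-orders-query"),
    ("enrich-orders-query", "orders-enriched"),
    ("orders-enriched", "warehouse-sink"),
    ("orders", "fraud-detector"),
])

print(sorted(graph.upstream("orders-enriched")))
print(sorted(graph.downstream("orders")))
```

Walking `upstream` from a misbehaving consumer is the root-cause use case; walking `downstream` from a sensitive topic is the "who has access to this" use case.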

Kris Jenkins: (17:37)

Oh, right. So you've got visibility over the topology, but also, you are starting to bring in the whole idea of security?

Tushar Thole: (17:46)

Yeah. So Lineage is a very interesting tool, because I think of it as a set of Legos: we are giving customers the set of Legos and they can figure out what they want to build out of it. And it is true for Governance as a whole, but it is particularly true for Lineage, because we are trying to solve a few problems with Lineage in particular.

Tushar Thole: (18:16)

So first one is that, "Hey, I want to provide you visibility so that you clearly understand how data is flowing," right? And this is maybe addressing your auditing use case that, "Hey, I want to audit the data accesses." And that's a security story that you're talking about, right?

Tushar Thole: (18:31)

Second thing is that we are also overlaying this Lineage graph with an operational view of each node in pretty much real time. What I mean by pretty much real time is that there is a lag of a few minutes. You can also choose to have it truly real time, just that there will be performance implications, but it is pretty real time, is what I would say.

Tushar Thole: (18:56)

And then with that, you can see which nodes in your infrastructure have an operational problem, whether they have high latencies or low throughput, right? And with that, we have customers using it to do troubleshooting in real time in their infrastructure.

Tushar Thole: (19:15)

When they see something, some application not working properly, they can actually go to Lineage graph, figure out, hey, what is going wrong? And shout out to my colleague, David [inaudible 00:19:30]. So he has a really nice demo that we published as part of the product launch.

Kris Jenkins: (19:37)

Mm-hmm (affirmative).

Tushar Thole: (19:38)

It clearly shows how you can leverage Lineage and figure out, "Hey, there's something wrong with this schema. That's how this application is not able to do its job correctly."

Kris Jenkins: (19:46)

Right. So you start with the piece that isn't working and you're able to easily track back until you find where the problem began, right?

Tushar Thole: (19:54)

Correct. Yes.

Kris Jenkins: (19:55)

Which can be a nightmare if you don't have the right tools. That can be hideous. So I'm glad you're building the right tools.

Tushar Thole: (20:04)

Yeah. And other thing I would say is that again, at Confluent we have always put among few company values that we have, the number one thing is we want to earn love of our customers. And again, I would say, if I look at Lineage in particular, right?

Kris Jenkins: (20:20)

Mm-hmm (affirmative).

Tushar Thole: (20:21)

So again, maybe a couple of years ago, we, again, heard from customers that they were trying to solve this problem about how do they do root cause analysis? And as a response to that, we figured out, hey, let us build a feature called Data Flow, which tells you how the data is flowing. You can use it for root cause analysis.

Tushar Thole: (20:43)

And Lineage takes it to a whole new level, right? Data Flow used to give you just two steps: a producer, topic, consumer view. And Lineage today gives you this entire view within a cluster. Within a cluster, no matter how many [inaudible 00:21:02] your data is taking, you can actually see that. So again, it is purely based on customer feedback that, hey, this is something that they wanted to do, and we thought, "Okay, we can do this better."

Kris Jenkins: (21:11)

Yeah.

Tushar Thole: (21:12)

And another thing I would say is that we also see that customers are using this tool more and more to answer how do they comply with various regulations such as GDPR or CCPA. Yeah.

Kris Jenkins: (21:35)

Yeah. Tracking those kinds of things. So how does it... That's a good one to drill into. How is it going to help me specifically with GDPR? Because that's a common headache.

Tushar Thole: (21:45)

Yeah. So currently what we are trying to do is that with Catalog, we will, so again, it crosses the boundary between Catalog and Lineage, sort of. So let's say, let's take a hypothetical regulation which says that your data-

Tushar Thole: (22:03)

Let's take a hypothetical regulation, which says that sensitive data should not be accessible to everybody. Let's say this is our use case. So what you want to be able to do is find what personally identifiable information is present in all the topics that you have. And you want to make sure that only Kris has access to those topics. And in order to do that, first of all, you want to be able to find such topics. And again, what you would say is that, hey, the data is defined by the schema. The schema says that this data is going to have these five fields. Out of those, let's say there is a field called social security number, which is personally identifiable information. And I want to make sure that any topic that contains this data should be tagged as PII, so that only Kris has access to it. Nobody else would.

Tushar Thole: (23:02)

So, in that case, there are a few things that we enable you to do out of the box. First one is that using Schema Registry, you can define the schema that you want, right? Then using our stream catalog product, you can search and quickly find out, hey, do I have a field called social security number in any of my schemas? Once you do that, you can tag, or you can annotate, the schemas with the tags or key-value pairs that you're interested in. You can say, "Hey, I want to add a tag to this field saying PII." So now we have annotated your schema and now [crosstalk 00:23:41]-

Kris Jenkins: (23:41)

Is that tagging the individual field or the-

Tushar Thole: (23:44)

Individual field.

Kris Jenkins: (23:45)

Right, okay. Yeah. So, it's granular.

Tushar Thole: (23:46)

Yeah. It's granular. And now if you go to your stream lineage, it can actually show you these labels. You can see which topics store the data with this PII field. And so this way you can see that, hey, maybe this data should not leave a particular region, but it is going to a topic which is stored in a cluster, which is stored in a region that is outside the geographical boundary. So, this is one way customers can figure out whether they comply with certain regulations or not.
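The workflow just described (find the field, tag it, see which topics carry the tag, check where those topics live) can be sketched with plain data structures. This is not the Stream Catalog API: `SchemaInfo`, `TopicInfo`, the function names, and every schema, topic, and region name below are hypothetical and exist only to show the shape of the check.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SchemaInfo:
    fields: List[str]
    tags: Dict[str, Set[str]] = field(default_factory=dict)  # field name -> tags


@dataclass
class TopicInfo:
    schema: str  # name of the schema describing the topic's values
    region: str  # region of the cluster that stores the topic


def find_schemas_with_field(schemas: Dict[str, SchemaInfo], field_name: str) -> List[str]:
    """Search step: which schemas contain this field?"""
    return sorted(name for name, s in schemas.items() if field_name in s.fields)


def tag_field(schemas: Dict[str, SchemaInfo], schema_name: str, field_name: str, tag: str) -> None:
    """Annotation step: tag an individual field, not the whole schema."""
    schema = schemas[schema_name]
    if field_name not in schema.fields:
        raise KeyError(f"{schema_name} has no field {field_name}")
    schema.tags.setdefault(field_name, set()).add(tag)


def topics_with_tag(topics: Dict[str, TopicInfo], schemas: Dict[str, SchemaInfo], tag: str) -> List[str]:
    """Which topics store data whose schema has any field carrying this tag?"""
    return sorted(
        name for name, info in topics.items()
        if any(tag in tags for tags in schemas[info.schema].tags.values())
    )


def region_violations(topics, schemas, tag: str, allowed_regions: Set[str]) -> List[str]:
    """Tagged topics stored outside the regions where that data is allowed to live."""
    return [t for t in topics_with_tag(topics, schemas, tag)
            if topics[t].region not in allowed_regions]


schemas = {
    "customer-value": SchemaInfo(["name", "social_security_number"]),
    "order-value": SchemaInfo(["order_id", "amount"]),
}
topics = {
    "customers-eu": TopicInfo("customer-value", "eu-west-1"),
    "customers-copy": TopicInfo("customer-value", "us-east-1"),
    "orders": TopicInfo("order-value", "us-east-1"),
}

for schema_name in find_schemas_with_field(schemas, "social_security_number"):
    tag_field(schemas, schema_name, "social_security_number", "PII")

print(topics_with_tag(topics, schemas, "PII"))
print(region_violations(topics, schemas, "PII", allowed_regions={"eu-west-1"}))
```

In this made-up setup the copy of the customer topic in the second region is the one flagged, which mirrors the "data should not leave a particular region" example.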

Tushar Thole: (24:27)

And the way I look at stream governance today is that we have just started this journey, right? Again, it is based on customer feedback. The product has been out there for more than nine months in a preview phase. And now we have announced general availability of it. But still, I think that it is the beginning of a long journey for us, along with our customers. And right now, we have released the building blocks that we will continue enhancing, so that we can build these higher order solutions.

Tushar Thole: (25:05)

So, one use case that we can solve for our customers going forward is that customers might say, "I have all the tools, right? I can figure out, do I comply with some regulations such as the GDPR, but I don't even want to do that step, right? Why don't you give me a report? Do I comply with GDPR? Because you have all the insights, you can actually look at it." And this is something that we definitely want to explore. So again, I would say that today we enable customers to figure out whether they comply with the regulation that they're trying to comply with. But in the future, we'll try to, again, take that pain point away. Because again, at the end of the day, our mission is very simple, for stream [inaudible 00:25:52], that we want to set data in motion so that you focus on, can I make a business decision? All these things that we know you need to worry about, if we can take care of it ourselves, we'll try to do that.

Kris Jenkins: (26:05)

Yeah. And if we can't, then we'll give you the tools that you can at least do it yourself, for now.

Tushar Thole: (26:10)

Yeah. That is the first step. That's where we are, correct. Yeah.

Kris Jenkins: (26:13)

Yeah. On the journey. Okay. Is this something that just affects large companies or specific types of companies? Or do you think it's kind of general use no matter your size?

Tushar Thole: (26:27)

Yeah. I would think that, again, as with the first thing that we talked about, it applies regardless of the organization or the domain or the industry that you are in, if you care about data. Say I'm an eCommerce vendor and I have something on Walmart or Amazon. I want to know who the customers ordering from my store are. Or, I want to know how to charge customers for the ride they just took in my car. So, regardless of the industry you are in, if you care about data, I think you should care about governance.

Tushar Thole: (27:12)

And we used to say that if you have data, you should have schema. That's what we used to say a couple of years ago. It's still true. And what we say is that we want to take it to the next step, that if you have data, you should have data governance. Because, if you look at the recent developments past, if you just look at past four months, cybersecurity is a big deal. People realize that data is the key asset to an organization, that gives them a strategic advantage today and in the future. So, you want to make sure it is accessible to the right set of people and it helps you do the job you want to do quickly. So, I think with that, I think governance would appeal to customers of any size or shape.

Tushar Thole: (28:12)

And one thing that we realized was that in 2021, we had kept governance in a private, invite-only preview. We worked with companies across various industries. Four of those customers became our reference customers, and shout out to them: Care.com, Judo Bank, Ritchie Bros., [inaudible 00:28:35]. And they're in different verticals, and they all find value in it. And one thing that we have realized is that there is this tendency to over serve our customers. But, we think that, I mean, as the product matures, and right now, I think we're in a stage where everything that we build has value, no matter how big or small you are.

Kris Jenkins: (29:04)

All right. I was wondering, so you've had it in preview for a while. You've worked with customers who are using it actively. Have you had any surprising uses of it? Has anyone used this toolset for things you weren't thinking of?

Tushar Thole: (29:23)

Yeah. So I would say that, let me think. So one of the interesting use cases that we found was internal, where the way we, when we even talk about lineage, we talk about it as it is a knowledge graph. It helps you see all your data in one place. And of course we know that root cause analysis is a use case that customers are interested in, but we weren't sure that, hey, this is something that they will latch onto. And one thing that we started realizing is that we started getting these kudos from our support team, who found it tremendously valuable to deal with ongoing escalations. That when there is escalation that is ongoing and they want to quickly figure out what is the problem, they could get to a conclusion quickly by using stream lineage.

Tushar Thole: (30:36)

And that actually was a pleasant surprise to us. We always knew that this is something they can use it for, but it is actually making a difference to somebody whose main job is to get to a solution quickly, who's under pressure. And they do use it now as one of their primary tools. So this is something that was a pleasant surprise. And another reason I would like to highlight this use case is that one thing I believe in is that we want to drink our own champagne and not because-

Kris Jenkins: (31:11)

I've not heard that version of it before.

Tushar Thole: (31:14)

Yeah. Not because you are biased that, hey, this is something that we have built. But the thing is that if you are solving your own problem, then you have better empathy with your customers. You know how your customers are going to use it. They may be in different industry, but you empathize with them better. And this was a very great use case where we can see that, hey, if Confluent is a company that builds infrastructure for data, it is now becoming a core DNA of any organization in the world. And we are leveraging our tools to solve our own use case, I mean, which is very encouraging to see.

Kris Jenkins: (31:56)

Yeah. You think you're building a product for customers and then you accidentally find it solving your own problems, that's probably a good sign, right?

Tushar Thole: (32:04)

Yeah.

Kris Jenkins: (32:05)

And also when you've got those kinds of feedback loops, it always leads to better quality software. You write the best software when you rely on it yourself.

Tushar Thole: (32:14)

Yeah. Totally.

Kris Jenkins: (32:15)

We had Domenico Fiorentini on the show recently, and he's done a Kafka transformation where they're dealing with data sources from shops and online and different third party systems. And they've spent the past 18 months pulling all that data into Kafka. And I wish I'd done the podcast in the other order, because I would ask him, have you tried this? What do you think of it? Because you've got all these different data feeds that you've been busy pulling together and cleaning up, and now you need to see the bigger picture. You need to step back and say, where did this all come from? And the lineage of that. I might get in touch with him and see what he thinks.

Tushar Thole: (32:57)

Yeah, definitely. I mean, I'll definitely reach out to Domenico.

Tushar Thole: (33:03)

[inaudible 00:33:00]. And actually that is something that... One call to action I have for anybody who listens to this podcast, that if you are a developer, we want you to try out the API and see, hey, is there something that you would like to see? And again, if you are a customer, we definitely want you to give it a try. And again, no set up necessary. Because what we want to do is that our goal is to make sure that, are we solving the right problem? And we can only understand that better the more we talk to customers.

Kris Jenkins: (33:39)

Yeah. Again, it's that feedback loop, right?

Tushar Thole: (33:41)

Yeah. Yeah.

Kris Jenkins: (33:43)

Okay. So that's where we are today. You must have plans for the future. Can you tell me what's coming up on the horizon?

Tushar Thole: (33:51)

Yeah, that's a very interesting question. So before I answer, I just want to recap from the customer's point of view where we are. So from the customer's point of view, I think they face a paradox of choice. What I mean by that is there's a proliferation of open source options in the governance space today. Then at the same time, there are a lot of solutions offered by incumbents in this space. But today, if I look at it from the customer's point of view, there are a few qualities that they're looking for. One is, hey, does it offer a managed solution or do I have to manage it myself? Second is, does it offer all the enterprise-grade features that my business needs to make quick decisions? And do I believe in this team? Can they innovate in this space so that I'm always on the cutting edge of the technology? And is this domain part of this company's core mission? So these criteria are top of mind for our customers.

Tushar Thole: (35:10)

But with this context, there are a few things that we can publicly talk about regarding what we want to do next in the governance space. The first one is that we want to keep refining the building blocks that we announced as part of our GA launch at Kafka Summit in the US in September 2021. And you will see some exciting announcements coming up in the near future about the additional features that we are adding in this space. And again, the focus here is twofold. One is that, as a whole, we have a managed, cloud-agnostic offering for our customers. Can we improve the SaaS quotient of that offering? That means, can we give customers freedom of choice? They want to have their managed governance offering running in a particular region. Can we do that? They want to have some specific private networking options offered by a specific cloud provider. Can we offer that? That enables customers to solve the problem in their own way. And at the same time, can we reduce cost for our customers so that they can spend their money where their business is?

Tushar Thole: (36:25)

The second thing I would say, in terms of refining these building blocks, is that in all these areas, for example catalog, we want to figure out, hey, can we improve the experience for the users? That means we'll continue innovating in the space. At the same time, can we partner with some of the offerings out there so that customers can immediately get value out of it? That is something we would like to do. In the case of lineage, which you talked about, we want to take it to the next level: today we show you the lineage view within a cluster, and we can take it across clusters. We can figure out how to make it easy for customers to work out whether they comply with a certain regulation, so that they can confidently answer that question from the data steward in their organization: do they comply with something or not?

Tushar Thole: (37:23)

Last but not least, the most exciting thing in my mind is that we'll make strides in our data quality story, because there is so much we can do in terms of the different validations that we can introduce, giving customers better monitoring and visibility into the trustworthiness of data, and introducing various transformations. So this is the view I see. And the last thing I would say is, as we think about the road ahead, the first part I talked about was how we want to go deep and refine what we have. The second aspect I look at is that we want to go broad. And what I mean by going broad, and I'll just take one example, is that customers are solving many complex use cases with our stream governance portfolio, and we can communicate the value proposition of stream governance better to our customers if we take a solutions-based approach to explain, "Hey, this is a targeted problem. Let's say this [inaudible 00:38:27]. How can you do something better using governance?" One concrete example I can talk about today is that our talented Office of the CTO group at Confluent put together this demo about data mesh. It leverages governance behind the scenes. And that is just an example that, hey, you can not only use governance for the targeted use cases we talked about, but hey, now we are trying to figure out this next big thing about data mesh. And if you want to stop it from becoming a data mess, you can use governance. And this is something we want to communicate better to our customers. And again, as you said, this is going to be a feedback loop. The more we talk to our customers, the more we are going to learn and feed that back into our product.

Kris Jenkins: (39:18)

Yeah. Yeah. The thing with data mesh is that you are not just tagging for sensitive GDPR information, but you can also tag as, this is a data product we're publishing with this SLA within the organization. So that tagging wears many different hats, depending on what you want to say about metadata.

Tushar Thole: (39:37)

Yeah. Yeah. And yeah, Jay's keynote from Kafka Summit paints a really nice picture, so yeah. I highly recommend listening to that.

Kris Jenkins: (39:48)

We'll put a link in the show notes. I'm going to finish on one very practical question. If someone is a user of Confluent Cloud, do they need to do something to enable this or could they go and kick the tires right now?

Tushar Thole: (40:01)

Oh, they can kick the tires today. So we also give some credit for customers to try out the product. We can include the link to the coupon codes that we have, and we would highly encourage everybody to try the product. It is zero setup on your part, so please give us feedback. We are here to listen.

Kris Jenkins: (40:25)

Cool. Cool. I imagine some people will go and kick the tires on it immediately and some will wait until they've got a problem that they suddenly need insight into.

Tushar Thole: (40:32)

Yeah, whatever works.

Kris Jenkins: (40:32)

Hopefully we'll make that day a bit better.

Tushar Thole: (40:33)

Yeah, totally.

Kris Jenkins: (40:37)

Well, Tushar, thank you very much for coming and talking to us, and maybe we'll have you back in a year and you can tell us what progress you've made and how people are using this.

Tushar Thole: (40:45)

Yeah. It was a pleasure chatting with you. Thank you very much for having me.

Kris Jenkins: (40:49)

Thanks for joining us. Bye.

Tushar Thole: (40:51)

Bye.

Kris Jenkins: (40:52)

So that was stream governance, which kind of feels to me like a second-generation product. Your first problem is capturing data and using it, but once you've got that solved, you've kind of got to step back and say, well, I've got all this data. What is it? What can I do with it? How can I put it to better use? And that's the next chapter in event-driven architectures. Anything we can do to unlock that data more easily for use cases we hadn't even thought of, that's exciting. Before we go, let me remind you that if you want to learn more about event-driven architectures, we'll teach you everything you need to know at Confluent Developer, which you'll find at developer.confluent.io. If you're a beginner, we have getting started guides, and if you want to learn more, there are blog posts, recipes, and in-depth courses. And of course, if you want to see those stream governance tools we talked about for yourself, then sign up for a Confluent Cloud account and use the code PODCAST100. That'll give you $100 of extra free credit so you can give it a proper test.

Kris Jenkins: (41:56)

If you liked today's episode, then please click like or thumbs up or star or whatever's on your user interface and let us know that you're interested so we know what we can do in future. If you have a comment or want to get in touch, please do. I'd love to hear from you. If you're listening to this, my contact details are in the show notes. And if you're watching this, then there's a comment box down there. Use that. Be nice. Be nice on the internet always. With that, it just remains for me to thank Tushar Thole for joining us and you for listening. I've been your host, Kris Jenkins, and I'll catch you next time.

Data availability, usability, integrity, and security are words that we sometimes hear a lot. But what do they actually look like when put into practice? That’s where data governance comes in. This becomes especially tricky when working with real-time data architectures.

Tushar Thole (Senior Manager, Engineering, Trust & Security, Confluent) focuses on delivering features for software-defined storage, software-defined networking (SD-WAN), security, and cloud-native domains. In this episode, he shares the importance of real-time data governance and the product portfolio, Stream Governance, which his team has been building to foster the collaboration and knowledge sharing necessary to become an event-centric business while remaining compliant within an ever-evolving landscape of data regulations.

With the increase of data volume, variety, and velocity, data governance is mandatory for trustworthy, usable, accurate, and accessible data across organizations, especially with distributed data in motion.

When it comes to choosing a tool to govern real-time distributed data, there is often a paradox of choice. Some tools are built for handling data at rest, while open source alternatives lack features and are not managed services that can be integrated with the Apache Kafka® ecosystem natively.

To solve governance use cases by delivering high-quality data assets, Tushar and his team have been taking Confluent Schema Registry, considered the de facto metadata management standard for the ecosystem, to the next level. This approach to governance allows organizations to scale Kafka operations for real-time observability with security and quality.

The fully managed, cloud-native Stream Governance framework is based on three key workflows:

Stream catalog: Search and discover data in a self-service fashion
Stream lineage: Understand the complex data relationships with interactive, end-to-end maps of event streams
Stream quality: Deliver trusted, high-quality event streams to the organization

Tushar also shares use cases around data governance and sheds light on the Stream Governance roadmap.
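
As a rough illustration of the stream lineage idea described above (asking "where did this all come from?" for a given stream), here is a minimal Python sketch. It is not the Stream Governance API: the `UPSTREAM` map, the stream names, and the `upstream_of` helper are all hypothetical, invented only to show lineage as a walk over a dependency graph.

```python
from collections import deque

# Hypothetical lineage map: each stream lists what it is derived from.
# These names are invented for illustration, not taken from a real cluster.
UPSTREAM = {
    "orders_enriched": ["orders", "customers"],
    "orders": ["shop_pos_connector", "web_checkout_connector"],
    "customers": ["crm_connector"],
    "daily_revenue": ["orders_enriched"],
}


def upstream_of(stream, upstream=UPSTREAM):
    """Return every direct and indirect source feeding `stream`, sorted by name."""
    seen = set()
    queue = deque([stream])
    while queue:
        current = queue.popleft()
        # Streams with no recorded parents are treated as original sources.
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)


print(upstream_of("daily_revenue"))
```

Under these assumptions, asking for the lineage of `daily_revenue` walks back through `orders_enriched` to the three connectors that originally brought the data in.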
