OpenTelemetry is getting a lot of press these days. I talked today to Confluent engineer Xavier Léauté about the role it plays in the overall observability solution behind Confluent Cloud itself. Listen in on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud. Hello and welcome to another episode of Streaming Audio. I am your host, Tim Berglund, available these days as an audio feed and on YouTube as well. So check that out, if you're the sort of person who likes to consume your podcasts as video artifacts; that's a thing you can do. I'm joined in the studio today by my coworker Xavier Léauté. Xavier, welcome to Streaming Audio.
Thank you, Tim. It's a pleasure to be here.
We are going to talk today about OpenTelemetry. Now this is one of those things that, six months ago, I had heard of, but just barely. It was not on my radar at all. And now, let's just say my radar is full of little pings to do with OpenTelemetry. It's on everybody's mind and it seems like a thing everybody's talking about. So I want to dive into what it is, what it does, and talk about implications for Kafka and Confluent Cloud and just whatever you've got. But first, could you tell us a little bit more about yourself? What do you do?
I work on the engineering team here at Confluent. One of my primary responsibilities is leading what we call the observability effort. It's a pretty broad term, especially at Confluent, since we have both internal needs and external customers, and both want visibility into the systems we run. And oftentimes those are the same systems: the ones we maintain and manage for our customers are the same ones our customers also want visibility into. So it's a very cross-cutting role, and it's maybe different from some of the other companies out there that just do this for their internal purposes. That also makes it an interesting challenge to work on these problems. That's been my focus for most of my tenure here at Confluent, starting out with some of the on-prem concerns and now moving on to the cloud, where we're going full steam getting all these things up and running.
Quite so. And observability, I guess we used to call that monitoring and alerting and things like that. It's not just the new word, but a slightly broader and more inclusive category for that set of activities: wanting to know what's going on with a system that you're operating. Does that sound fair?
I think that's fair. I think there are a lot of people out there with much stronger opinions about what these terms mean. On Twitter you can find-
Those are productive debates.
People that work at Lightstep or Honeycomb can probably tell you more about how they think about these things. I think we address both here, and the way we use this data is actually both for observability and for monitoring. Today we're mostly covering the monitoring aspects, but we're adding more and more of the observability part, which is really about debugging the system, trying to understand what's happening when things go wrong, rather than just, okay, I want some health indicators, which is what monitoring is about. I'll probably stop there before other people start complaining about what I say.
Before we get into that. This is like asking what a developer advocate does or something like that. It's just not something you should do on a podcast. It's too political. Anyway, tell us about OpenTelemetry. I mean, it is a relatively new thing in terms of its participation in the popular community consciousness, just as a piece of technology we talk about. So assume listeners don't know what it is. Is it a set of standards? Is it an API? Is it infrastructure? What is it?
In some ways it's all three things combined, which is actually the nice thing about it. People that have been working in this space probably know about OpenTracing, which was basically an API standard people could install into applications to get traces collected without having to use vendor-specific APIs. There was also a project from Google addressing more the metrics side of things, called OpenCensus. These two projects basically decided to join efforts: let's become the standard for instrumenting [inaudible 00:05:02] your applications. And they went a bit further than what OpenTracing did, which is to provide not only the APIs to instrument your applications, but also to define the formats this data is collected in.
And the nice thing about that is that it makes your system completely vendor agnostic, and you can then send this data directly to various observability or monitoring providers without having to install custom agents or things like that for every single vendor you want to integrate with. So you can now instrument your libraries, or your applications, once and then get all this data wherever you want it, without having to redo the work every single time you want to switch from one vendor to another. Or use the data yourself, for that matter.
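To make that concrete, here is a minimal sketch of what coding against the vendor-neutral API looks like in Java. The tracer and span names are made up for illustration, and nothing in this code knows which backend ultimately receives the data; that is decided by SDK configuration outside the application logic.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class InstrumentedHandler {
    // The OpenTelemetry API is vendor neutral; exporters configured in
    // the SDK decide where this telemetry actually goes.
    private static final Tracer TRACER =
        GlobalOpenTelemetry.getTracer("my-instrumented-app");

    public void handleRequest() {
        Span span = TRACER.spanBuilder("handle-request").startSpan();
        try (Scope scope = span.makeCurrent()) {
            // ... application logic runs inside the span ...
        } finally {
            span.end();
        }
    }
}
```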
So that instrumenting step, how should I think about that? Is that a library that I code against? Is it a step in the build process that goes through my Java bytecode and does things to it? I mean, what's that-
There's various levels of how involved you want to get in the process. I think most people-
By the way, I just made a language assumption there. I just said Java, and I assume there is support for Java, but maybe talk to us about that little bit of parochialism of mine as well. Maybe there's more than Java.
There's a lot more than Java. The nice thing about this project is that, especially recently, it's really gotten traction. There are implementations for many, many different languages. Java and Go are the two I deal with the most, but there's also Python and JavaScript now. The whole ecosystem is really flourishing. So there is a lot coming down the pipe, and you can hear more about all these things on the OpenTelemetry Twitch channel or other places where they talk about recent developments.
Sounds like we should link to that in the show notes, the Twitch channel.
Yeah. So there's a lot more being discussed there, and people go into the core details of what's happening. You were asking about instrumenting the application versus just letting it do its thing. Depending on the language you use, there are various levels of work involved, due to the way the language works. For Java, you can just have an agent that auto-instruments a whole bunch of things and does things for you. You get a lot of nice things out of the box, which can be useful. It's similar to what you get out of the box for most frameworks that do an API layer: automatic collection of your traces. You'll have useful information, but it doesn't tell you anything specific about your application. So you still have to go in there-
So there's a runtime component.
That's right. So they would be-
The app would be just deployed, and there was nothing about OpenTelemetry in it, but here's the agent.
Right. So in the case of Java, you inject this agent that will go instrument your bytecode and figure out all the classes it knows about, whether it's your RPC framework, your Spring framework, your Dropwizard framework. It figures these things out and basically adds instrumentation wherever it sees anything it knows.
Yeah. And it knows those usual suspects, some set of the usual frameworks, looking for a class name to instrument, and methods and so forth.
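For reference, attaching the agent is typically just a JVM flag, with no application code changes; the jar path, service name, and endpoint below are placeholders:

```
java -javaagent:/path/to/opentelemetry-javaagent.jar \
     -Dotel.service.name=my-service \
     -Dotel.exporter.otlp.endpoint=http://localhost:4317 \
     -jar my-app.jar
```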
And in other languages, you still have to go do a bit more work, but they usually have some kind of hooks. And then the framework basically gets into these hooks and you explicitly have to code against it, but it will take care of a lot of the heavy lifting for you.
Got you.
The interesting part is when you actually want to provide additional details about your own application. Say you have certain attributes or properties you want to capture with every single request in your traces, or you want to inject those things into the metrics you're collecting. That's when you have to do a little bit more work. But that's also where it becomes a lot more powerful and interesting, because you get visibility into the things you care about, without having to use anything vendor specific or anything locking you into whichever vendor you're currently using. You're agnostic to that, and you don't have to worry about rewriting this in the future. So that's the holy grail, so to speak, that we're trying to get to.
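As a rough sketch of that extra step in Java (the attribute and metric names here are hypothetical), you might enrich the current span and tag your own metrics with the same application-specific attribute:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.trace.Span;

public class TenantInstrumentation {
    private static final Meter METER = GlobalOpenTelemetry.getMeter("my-app");
    private static final LongCounter REQUESTS =
        METER.counterBuilder("requests.processed").build();
    private static final AttributeKey<String> TENANT_ID =
        AttributeKey.stringKey("tenant.id");

    public void recordRequest(String tenantId) {
        // Attach an application-specific attribute to the active span,
        // e.g. one opened by the auto-instrumentation agent.
        Span.current().setAttribute(TENANT_ID, tenantId);

        // Tag a custom metric with the same attribute so traces and
        // metrics can later be correlated on it.
        REQUESTS.add(1, Attributes.of(TENANT_ID, tenantId));
    }
}
```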
Got you. And so again, asking on behalf of all the people whose specialty is not observability, that's my version of asking for a friend, you know. When you say vendor: so we've got instrumented code, we've got the standard ways of instrumenting code, and that's either a runtime thing or a build time thing or an API. There seems to be a spectrum of options depending on my language, build time or runtime. What's the vendor? This data is going somewhere. This method got called, this thing happened, and I'm presuming OpenTelemetry wants to send that to something. Those are the vendors you're referring to, right?
That's right. So once that data gets collected, [inaudible 00:10:40], you want to actually be able to act on it. So you want to have some nice dashboards, and you may want to use some of the APM tools out there, whether it's Datadog, Honeycomb, Lightstep, Dynatrace, all these things. There are a lot of players out there, and a lot of them are actually part of these OpenTelemetry efforts now and really participating. The nice thing is that they're all adopting the standards. Once you have this data collected, you can basically send it directly to any of them without having to make any changes or deploy more agents or things like that to get this data into these systems.
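That hand-off is just an exporter pointed at an OTLP endpoint. Here is a minimal sketch in Java, assuming a hypothetical vendor endpoint; switching vendors means changing the endpoint, not the instrumentation:

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class TelemetryPipeline {
    public static OpenTelemetrySdk init() {
        // OTLP is the standard wire format; any participating vendor
        // can receive it at their OTLP endpoint.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
            .setEndpoint("https://otlp.some-vendor.example.com:4317") // hypothetical
            .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
            .build();

        return OpenTelemetrySdk.builder()
            .setTracerProvider(tracerProvider)
            .build();
    }
}
```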
So the vendors are part of the standards body. And at a lower level, do you get things like Splunk and Elastic, just more general places to put things? Is that a little bit out of the domain?
So there is a separate effort, which is kind of nascent, to also get logs through the same system, beyond metrics and traces. So they're making all the log collection vendor agnostic as well. That's just starting out; I think it's one of the more recent efforts in OpenTelemetry. Those will probably integrate with things like Splunk, and Elastic will probably come out with some kind of integration at some point, I imagine. And Splunk is actually a big participant in these efforts, since they also have SignalFx as one of their products out there. So they're heavily invested in the standard as well.
Oh, nice. Okay. And it seems like there are probably some cost implications, so I want to talk about those [inaudible 00:12:39]. Here are these streams of data coming out of the nodes of your application. So gee, where's that going to go? There are probably cost implications, and it would be nice, to the extent that you can talk about it, to hear how we use this in Confluent Cloud. It's always interesting to know a little bit about how the product is made. I mean, the nice thing about it being in the cloud is that, if you're building something on top of Confluent things or Kafka, you don't have to care how it's built. You just have some APIs and features and you get things done. You build your own stuff, but it's always nice to look under the hood a little bit. So to the extent that you can talk about it, it'd be great to hear how we do things in the cloud.
Yeah. So we actually have been early adopters of OpenTelemetry, or rather OpenCensus and some of these standards, in the cloud. Our main motivation was that we're running all of these clusters for our customers, and we have many different teams working on instrumenting their applications. Those are not just Kafka or ksqlDB or connectors; there are other types of services we also run that are internal services. So everyone wants a way to instrument their applications, to monitor them, to get visibility into those things. We're hoping to get to a point where we can all follow the same standard to do that. And OpenTelemetry kind of presents itself as that holy grail solution that would allow us to give people one way to instrument all their applications and never have to think about it again, irrespective of how we then use this data.
And that's where I want to get to. As I mentioned earlier, a lot of these metrics and traces that we collect from our systems are not only useful to our internal operations, they're also useful to our business. We want to understand how people use the product and what the distribution of usage looks like across the different clusters or different clouds. We also want to expose this data to our customers, since they want visibility into their own clusters. Now, not everyone needs access to the full granular data that we collect, but everyone wants a slightly different view of that data. You may want high-level metrics for our executives. You may want very low-level, fine-grained metrics for engineering teams that want to troubleshoot things down to the specific user or customer or request type in Kafka.
And then we have our customers. They don't see the nitty-gritty details; we don't want to expose things to them that they shouldn't have to care about. That should be something we take care of. But they still want to see, okay, what is my throughput per topic? And which applications are consuming from which topics, and how much? Having that data is very useful to them. But ultimately all these things can be derived from the same source of data. Traditionally, though, if we want to use some third-party tool for our own internal dashboards and alerting, we have to deploy an agent for that tool to collect the metrics. And then we also have to deploy our own agent to collect the metrics for ourselves, and then pipe them into our business systems and into our APIs and things like that.
So now we're basically getting to the point where we can replace all of that with just one thing: we collect the data once, and then we can funnel it to the different places where we want to use it. Whichever vendor we're using for monitoring and alerting, we can send it directly to them. We can push it into a Kafka topic to have it consumed by our data science team, who can do whatever fancy analytics they want, join it with business reporting, and all that. And we also pipe it into our infrastructure to serve to our customers through our metrics API. So with all this data collected in one place, we can now funnel it into many different systems and build new products off of it.
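In SDK terms, that collect-once, funnel-everywhere idea amounts to registering several processors on the same pipeline; in practice the fan-out often lives in an OpenTelemetry Collector rather than in the application, so treat this Java sketch (with placeholder exporters) as illustrative:

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import io.opentelemetry.sdk.trace.export.SpanExporter;

public class FanOutPipeline {
    // Both exporters see the same spans: one could feed a third-party
    // monitoring vendor, the other an internal pipeline that might land
    // the data in a Kafka topic for downstream analytics.
    public static SdkTracerProvider build(SpanExporter vendorExporter,
                                          SpanExporter internalExporter) {
        return SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(vendorExporter).build())
            .addSpanProcessor(BatchSpanProcessor.builder(internalExporter).build())
            .build();
    }
}
```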
And it's all consistent, which is a nice thing, because in the past, someone from the engineering team or the support team might see some issue, and then they look at what the customer sees in the API, which is different from what the engineer is looking at in their dashboard. Usually the source is the same, but the data might be slightly different because of the way it gets instrumented, and the namings are not consistent. It gets very confusing very quickly, especially when you have to deal with thousands of different clusters and are trying to figure out what's going on. By having all this data collected just once, we're all seeing a consistent view of it. And it's also a standard, which means we can use that standard to encode this data and then build components that work with that standard, or funnel it into other systems that understand the same standard, which makes this very powerful.
Yeah, it is. Some things like this I learn even about our own product, and they're surprising, and then I think, why was that surprising? Because it just makes sense. What starts as, "Hey, we need to keep on top of this cloud service we're operating, because it needs to remain available for customers, because they pay for that, so let's have an observability solution in place," also becomes a part of the product. Like you said, throughput per topic. Well, that's in the UI. All you need to do is click on topics and you see that, if that's the same number we're talking about. I mean, that shows up in the UI, right?
Right. Exactly. That's the exact same number that our engineers would see, [inaudible 00:19:06] coming from the same messages we collect at the source.
It's important; you might really need to know that just by looking in the UI. But to some degree, that's a little bit of syrup on top of the thing. It's just, oh look, I can see the throughput. That's nice and it looks pretty, and the other parts of the system are not looking at that UI. But the metrics API, that's part of the product. That's a meaningful, valuable feature of the thing. I don't want to turn this into a Confluent Cloud commercial, but I am putting a link in the show notes to something describing the metrics API, because if you don't know what it is, you should. And it makes sense that it's driven by the same infrastructure. It's just kind of cool to me. And again, it shouldn't be cool, it should be obvious, but it's cool. And I'm just going to sit in the coolness for a moment: we get both done at once. We operate the cloud and make sure it remains available, and that same data drives actual features that make it valuable.
That's right. And it's not just for the cloud. We also have Confluent Platform for our on-premise customers. They now have the ability to send us data to the cloud so that we can get ahead of the curve, if you will, and detect issues with their clusters and let them know about those things before they become problems. So that infrastructure-
So this is the Proactive Support feature, right?
Yes. I think the product name will probably change in the future, but Proactive Support is what we call that today.
Okay. Okay. And that's all the same foundation, the same instrumentation?
So that's exactly the same instrumentation. So we're using exactly the same format, the same way of collecting this data from on-premise customers and cloud customers. So it means we can reuse all the same infrastructure we've already built for the cloud for on-prem customers. And that makes it very easy to then build more things on top of it.
Yeah. Yeah. So there's one place that investment goes, and both types of people, whether you're on-prem or in the cloud, benefit either way. This is totally sounding like a commercial; I don't mean it to be one, but that is super cool that you can double up on that investment and not have to build all that stuff twice. Because if you had to build it twice, one of them would be a quarter as good and the other one would be an eighth as good, and it never works out; the whole never ends up being equal to the sum of its parts there, unfortunately. So that is excellent. How about Kafka itself? Does this land on the open source stuff in any way?
So today, for Confluent Cloud and Confluent Platform, most of these components that we built actually use the OpenCensus format, which is also a standard. We adopted that initially because OpenTelemetry wasn't quite mature yet, not mature enough for us to build on, and we started these efforts almost two years ago. OpenTelemetry has come a long way since then. So now we're in the process of migrating the components that collect data from Kafka brokers to start using the OpenTelemetry format. Once we do that, I think there will be an opportunity to also make these things available in Kafka itself. And I think it would be great to have native integration with OpenTelemetry in Kafka directly. Of course, there's a lot of other things happening at the same time, so we can't get to all of these things at once.
There's a KIP that's not officially published yet, but a draft is out there, and it addresses a slightly different problem. Having OpenTelemetry would be nice for Kafka brokers, but a lot of people already have instrumentation for their Kafka brokers, so that would just be nice to have, so to speak. When it comes to new functionality, there is something that you can't really do today, which is to instrument your client applications. Kafka client applications are often very difficult to instrument, or people forget to instrument them, or the Kafka clients are embedded [inaudible 00:23:55] into applications that you don't have very good control over. So it's usually more difficult to instrument them. What we're trying to do is make it easier to get data about these clients and collect it on the broker side, because brokers are usually pretty well instrumented.
So if we can get the clients to actually send us metrics, or some kind of standardized information about themselves, that can help us debug applications that aren't behaving well or that are seeing issues. Or, if we're seeing issues on the brokers, we can tie them back to issues on the client side, if we have this data. And I think that-
That's KIP-714 you're talking about, right?
That's right. So KIP-714 is basically trying to establish a protocol for clients to start sending metrics, and other telemetry in the future, to the brokers, where it can then be exposed at the broker level. The operator of the Kafka cluster can then also have visibility into the health of the clients that are writing to and reading from these clusters.
Got it. And that's why you started under the assumption that people have a broker observability solution in place; there are APIs to get stuff out of brokers, and everybody's kind of plugged into that. So now let's try to pull client metrics into the broker through a standard mechanism. And is that then based on OpenTelemetry somehow? Is there an assumption that OpenTelemetry is in the client, or not? I haven't read the KIP, obviously.
Right. So we figured there's already a protocol out there to define metrics, and there's a well-defined format for how to transport them. We don't want to reinvent the wheel and invent our own protocol here. So we figured, well, this is a very good fit for this particular use case. The clients just generate the native OpenTelemetry format and send it to the brokers; the brokers can then forward this data directly to some kind of agent without even having to look at it. And then the cluster operator can get visibility into the clients without having to do much additional work.
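To give a feel for the shape of this, here is a hedged sketch of a broker-side plugin along the lines the KIP describes. The ClientTelemetry and ClientTelemetryReceiver interfaces follow the KIP's proposal, so treat the names, signatures, and forwarding logic as illustrative rather than a final API:

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.metrics.KafkaMetric;
import org.apache.kafka.common.metrics.MetricsReporter;
import org.apache.kafka.server.telemetry.ClientTelemetry;
import org.apache.kafka.server.telemetry.ClientTelemetryReceiver;

// Sketch of a broker plugin that accepts OTLP-serialized metrics pushed
// by clients (per KIP-714) and forwards them without interpreting them.
public class ForwardingClientTelemetry implements MetricsReporter, ClientTelemetry {

    @Override
    public ClientTelemetryReceiver clientReceiver() {
        // The payload carries OTLP-encoded client metrics; the broker
        // can hand it straight to an agent or collector.
        return (context, payload) -> forward(payload.data());
    }

    private void forward(ByteBuffer otlpMetrics) {
        // e.g., ship to an OpenTelemetry collector or produce to a topic
    }

    // Standard MetricsReporter callbacks, unused in this sketch.
    @Override public void init(List<KafkaMetric> metrics) {}
    @Override public void metricChange(KafkaMetric metric) {}
    @Override public void metricRemoval(KafkaMetric metric) {}
    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}
```

A plugin like this would presumably be registered through the broker's metric.reporters configuration, assuming the final implementation keeps that mechanism.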
So that's the idea behind that. And the reason is that it's very difficult to get access to information about these clients unless the same team is running both the broker and the clients. It's hard to get the visibility you need. And for us in Confluent Cloud, if we had visibility into some of the clients running out there, we could help our customers better and help them improve the way they interact with the cloud. So that is another reason why we're doing this.
So folks at Confluent are interested in the success of KIP-714, to be more precise about that, because it would be a revolution in the ability to optimize client-to-cluster connections. I mean, getting all that data... and obviously this would be subject to opting in and sharing and all of that, so nobody panic. It's not like we're going to start extracting a bunch of things from your clients connected to Confluent Cloud that you didn't know we were doing. But with all that happening in an organized and well-governed way, dude, that's a lot of data. And putting that in a data warehouse somewhere, if people start asking questions about it, would probably result in, my guess is, a new series of KIPs that improve other things that we don't even yet know need to be improved.
That's right. I think everyone usually thinks about the Java clients when thinking about these things, but here we're actually trying to do this in a way that is agnostic to the clients interacting with the cluster. So this will also be applicable to some of the other clients out there, like librdkafka or others; we're hoping they will all implement the same functionality so that we can better understand what's happening, and not make this specific to one particular client. Of course, there might be additional metrics that we also collect from specific versions to better understand how they behave.
And yes, making this part of the open source project is key, since most people don't use custom client versions or Confluent versions or whatever. We want this to be something that benefits end users. And these efforts have to be started early, because it takes people years to upgrade their clients. So we want to get this in [inaudible 00:29:11] relatively soon, so that within a year or two we'll see some of the fruits of these efforts.
I was going to say five, but I appreciate your optimism.
There are early adopters out there that are always a bit more happy to go try these things out, but yeah. People will say that things tend to move very slowly on the client side, which is also what makes this harder for us, because oftentimes we know there are clients that are misbehaving because they have bugs. And until recently it was very difficult to even figure out what client versions people were running. We've added some of that functionality in the recent KIP-511, which makes that easier. If we have even more information about the health of the client and what is going on, that will make things even more powerful, because it will not just be about, oh, we noticed the client has a bug. We'll be able to tell people, "Oh, you have a suboptimal configuration." Then the cluster operator can help application developers tune their clients better so that they work well with the system.
Exactly. This is the way the whole community benefits, even people who aren't cloud users. With Kafka becoming more and more a thing that runs in the cloud, KIPs like this being accepted and built and merged end up helping everybody who uses Kafka in almost any way, which is kind of cool.
And it also reduces the amount of work you have to do to instrument all your different applications. You just have to instrument your broker, and everything flows from there. The application teams can focus on instrumenting their actual business logic, and not have to worry too much about the client nitty-gritty, which always runs pretty deep and may not be something they think about much.
My guest today has been Xavier Léauté. Xavier, thanks for being a part of Streaming Audio.
Thank you, Tim. It's been a pleasure.
And there you have it. Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST—that's 60PDCAST—to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value after the expiration date is forfeit and there are a limited number of codes available. So don't miss out. Anyway, as always, I hope this podcast was useful to you. If you want to discuss it or ask a question, you can always reach out to me on Twitter @tlberglund, that's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out on Community Slack or on the Community Forum. There are sign-up links for those things in the show notes. If you'd like to sign up and while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple podcasts, be sure to leave us a review there that helps other people discover it, especially if it's a five-star review. And we think that's a good thing. So thanks for your support, and we'll see you next time.
Collecting internal operational telemetry from Confluent Cloud services and thousands of clusters is no small feat. Stakeholders need to rely on the same data to make operational decisions. Whether it be metrics from clusters in Confluent Cloud or traces from our internal services, they all provide valuable insights, not only to engineering teams but also to customers, for their own operations and for business reporting needs. Traditionally, this data needs to be collected in multiple ways to satisfy all the different requirements. We leverage third-party vendors for our operational needs, which usually means deploying vendor agents or libraries in addition to our own, as we also need to collect some of the same data to expose to customers.
However, this sometimes leads to discrepancies between various systems, which are often hard to reconcile and make it harder to troubleshoot issues across engineering, data science, and other teams.
One of the earliest software engineers at Confluent, Xavier Léauté is no stranger to this. At Confluent, he leads our observability engineering efforts in Confluent Cloud.
With OpenTelemetry, we can collect data in a vendor-agnostic way. It defines a standard format that all our services can use to expose telemetry, and it provides Go and Java libraries that we can use to instrument our services. Many vendors already integrate with OpenTelemetry, which gives us the flexibility to try out different observability solutions with minimal effort, without the need to rewrite applications or deploy new agents. This means that the same data we send to third parties can also be collected internally (in our own clusters).
The same source of data can then be leveraged in many different ways:
- Sent to third-party vendors for our internal monitoring, dashboards, and alerting
- Pushed into a Kafka topic for consumption by our data science team for analytics and business reporting
- Served to customers through the Confluent Cloud Metrics API
We’ve also adopted the same approach for on-prem customers, which enables us to collect telemetry into our cloud and help them troubleshoot issues, leveraging the same infrastructure that we already built for Cloud.
Regarding OpenTelemetry efforts in Apache Kafka®, we're working on KIP-714, which will allow us to collect Kafka client metrics to help better understand client-side problems without the need to instrument client applications. Our ultimate goal has always been to migrate to OpenTelemetry, which is now underway. We'd like to pave the way for direct integration with OpenTelemetry in Kafka, based on the work that we've done at Confluent.
EPISODE LINKS
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question and we hope to answer it on the next episode of Ask Confluent.
Email Us