Tracing and monitoring. It's a problem that we often talk about on this podcast because it's a thorny one in distributed systems, and we often talk about it from the performance angle. How do you ensure that everything's being processed fast enough? How do you deal with bottlenecks? But this week we're going to look at it in a much more fine-grained way, at the level of individual events. What if you have, say, a stock trade coming in over REST, getting processed through various systems, and eventually it has to go out over a connector to meet a reporting deadline?
Maybe every trade has to be reported to the regulator within the hour. That's fairly common these days. How do you debug that processing stream at the level of individual events? How do you track a single event through all those steps? Well, joining me this week with one answer is Roman Kolesnev, who's here to talk to us all about OpenTelemetry, what it can do for you, how it works, how you get started with it, and what it does for Kafka and event systems specifically.
He also covers what he found it lacked, the holes that he's been trying to fill, specifically around Kafka Streams and Kafka Connect support. He's made a lot of progress and he's got a lot to teach us, and there is more than just event level tracing on offer here, but if you need event-level tracing, this is a particularly good podcast to catch. So before we begin, this podcast is brought to you by Confluent Developer. More about that at the end. But for now, I'm your host, Kris Jenkins, and this is Streaming Audio. Let's get into it.
On the podcast today, we have Roman Kolesnev, who's joining us from Ireland, I believe.
Yep. Northern Ireland.
Yeah. Yeah. Now you work for, let me get this right, it's CSID. The Customer Solutions and Innovation Division. Is that right?
Yep.
Which gets pronounced CSID. So this is the first thing I have to ask you. Why is it not just called CSI Confluent? That feels like it would be cooler.
CSI. Yeah, maybe. Maybe.
Okay.
Looking for a,
Nevermind. Nevermind. So you guys get to do loads of cool practical things, right? Your department's been doing things like Parallel Consumer and end-to-end encryption and stuff?
Yes. We have kind of two strains of work. One is experimentation and being the crazy scientists/A-team, trying to work out how things could be implemented, how they can be done, and what's feasible. And part of it is taking findings from professional services engagements, code that can be generalized and made more reusable. So the things that get implemented multiple times for clients in different engagements that generally could be made a bit more robust, a bit more usable, and a bit more generic. And-
Right. Pulling out those bespoke patterns into something more useful.
Yes.
Yeah, that makes sense. Yeah. So that brings us to what you've been working on lately because there is a very, very common pattern in building large distributed systems, right?
Yes. I'm experimenting with OpenTelemetry and distributed tracing and how it works with Kafka clients: what's available and what's not available, where the gaps are, what we can fix. It all started as a more general question of how we can do event-level tracing on Kafka and the Kafka ecosystem. Then we did different evaluations, exploring what's already available out of the box, different frameworks. And we landed on OpenTelemetry as the new, up-and-coming choice for distributed tracing.
Okay. And tell me something about OpenTelemetry.
So OpenTelemetry is an evolution of OpenTracing and OpenCensus; Uber was one of the driving forces there. OpenTracing was doing the tracing side of it, OpenCensus was doing the metrics side, and OpenTelemetry is a new evolution that tries to cater for both. Its goal is to be vendor agnostic, so that you can instrument your applications on different platforms and in different programming languages, but with a common semantic convention, so that those traces and that information can be streamed into a common system, all conforming to similar specifications, all able to understand each other, all conforming to the same standards.
Okay. And so if you've got your custom microservices around your organization made with all different kinds of technology, you could use OpenTelemetry?
Yes. That's a -
Or if you've got a straightforward Kafka cluster, you could still use it?
Yes. But even with a Kafka cluster, for example, some of your clients may be written in Java and some may be written in Go. So even using the same platform for messaging, because you're instrumenting your clients, not just the messaging platform, or sometimes you're not instrumenting the platform itself at all, you still need that conformity across your different clients, so that the traces they stream look the same to the backend and it's able to process and understand them.
And then a different part of it is that it's vendor agnostic. So you're not streaming traces that only, say, Elasticsearch can consume, or Datadog, or whoever. You have your collection system slightly separated from your processing system, and you can switch the backend much more easily, because the exporter layer can be configured to point to different systems and move from one backend to another without changing how you collect the data on your application side.
Right. Okay, so that sounds very nice and reusable. Let me just check I've got this right. So you're saying it's both for kind of metric-level aggregate statistics, and, taking the Uber example, I could trace the events related to a particular trip through the system?
Yes. So OpenTelemetry has three subsystems: tracing, metrics, and logging. Logging is still very experimental, even more experimental than tracing and metrics. It's not marked as stable on their site, if I'm correct, unless that changed very recently. But the idea is that you generally need all three in your observability.
From a trace, you want to jump to logs, to look at the logs that correspond to the trace. And then maybe to look at the metrics that were collected, either based on the timing or again if you are tagging metrics with trace information because sometimes it's not enough to just have one side of it.
What's the distinction between tracing and logs? I guess they kind of merged in my head.
So the trace records the operations that are happening and some metadata. And as a developer, you can record business-level traces and metadata around the business logic and around the message. But the logs are your standard logs that you write from the system. They may be message related, but they may be more system related, in the sense of consumer polling logs or whatever else is happening on the system.
Generally, again, I guess if you are writing a completely new system from scratch, for all your message-level processing you could write everything as events on the spans, on the traces, and so on. You could probably get away without writing any traditional logs at all. But if you already have a big system with all the logging implemented, and libraries that do the logging and so on, then you need to stitch them together, traces and logs, potentially collecting them in different places as well.
So logs are just kind of the general, traditional logging: a dumping ground for everything we know.
Yes.
And tracing is more kind of business logic above it.
It's kind of separate, a little bit. I mean, there is a very thin line, because you can do tracing through logs. You can just dump logs that contain, "Oh, I received the message with such and such metadata. I processed it, I sent it somewhere." And you can do it purely through logging. And then, say you're using Elastic or Splunk or whatever for parsing, and you're just parsing out the logs that are specific to the message flow, for example, ignoring everything else: here's your trace. And you can visualize it, you can derive metrics out of it, use the log timestamps to work out the performance timings, and things like that, and do all that purely on logs.
But then it's more work. If you have a library that has an SDK, an API, it does it in an easier, more understandable fashion, where it's talking about, "Okay, let's record an operation," and the operation spans from here to here. And then the next operation depends on it [inaudible 00:10:22] spans from this point to the next point, and together they make up a trace that captures the flow of this message through services.
Oh, okay. Yeah.
That kind of records it with the natural semantics. And both for the developer and then for the architect and operational people looking it up, it's all talking about operations and spans and traces rather than just a dump of logs.
Yeah, yeah. Okay, I'm with you now. Okay, great. So that kind of -
But then, in that context, sometimes logs are still useful, because obviously it's sometimes hard to map the information that you want to capture into the semantics of spans and traces and even the events happening there. You may want to add additional data that you can look up in the logs: once you've traced to the point you want to look at, here's additional information related to it that I want to lift out of the logs.
Okay. Yeah, that makes sense to me. So if you want to do that kind of application-level tracing, what do you do currently with OpenTelemetry? Does it have a Kafka plugin that you just wrap into, or what's the deal?
It depends.
My favorite answer.
So if you're talking about Java... let's take Java, because Java and JVM clients are probably the most prolific, the most popular in the Kafka ecosystem. So if you're using consumers and producers, the standard Java consumer and producer, you have a few ways to use OpenTelemetry with them. You can add a dependency to your project, to your application, and write a bunch of code manually: add a span here, add a span there, on your application side, just before sending or just after consuming. That is the very basic approach. But then you have interceptors you can use, so you can attach a consumer interceptor and a producer interceptor, which would inject the trace points, the spans, for send and for receive automatically.
Okay.
And the other approach -
So you can potentially just drop something in and get automatic producer consumer tracing into OpenTelemetry without changing your code.
Yes, you configure the interceptors and they will be loaded for you and do that. Then, another approach is that they provide a wrapping producer and consumer. So you can set up an OpenTelemetry object in your code and say, "Give me a consumer," and the consumer it gives back is a tracing consumer, and all its operations will be traced.
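For readers who want to see what that interceptor approach looks like, here is a minimal configuration sketch. The interceptor classes come from the OpenTelemetry kafka-clients instrumentation library; the exact package name varies between versions, so treat it as illustrative rather than definitive.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

public class TracingInterceptorConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Tracing interceptor shipped with the OpenTelemetry kafka-clients
        // instrumentation library (package name differs between versions).
        props.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG,
            "io.opentelemetry.instrumentation.kafkaclients.v2_6.TracingProducerInterceptor");
        return props;
    }

    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG,
            "io.opentelemetry.instrumentation.kafkaclients.v2_6.TracingConsumerInterceptor");
        return props;
    }
}
```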
The third way is to use Java auto-instrumentation. So OpenTelemetry has a Java agent that works basically like an APM library. You load it alongside your jar through the -javaagent command line parameter, so it just loads a jar in addition to your app. And that auto-instrumentation has a lot of different implementations, so it adds automatic tracing for, say, [inaudible 00:13:53] if it's a web service that you're hosting, and the same for Kafka producer and consumer clients, and things like that.
And that's just using the usual JVM-level agent tracing type API?
Yes. The way it works under the hood is that it uses Byte Buddy, which is a wrapper for the Java ASM library. So at class loading it injects additional bytecode that has advices marked for specific libraries, SDKs, and frameworks, to wrap them in this tracing behavior.
Ah, okay. Right. Okay, I'm with you. Let's get a recommendation out of you, which of those three would you generally recommend people use?
I guess the auto-instrumentation agent is the go-to. If you only want to trace Kafka, and it's only Kafka, then interceptors are easier to configure and put in place. But if you want to add additional things, like a business-logic span in the middle, or you are doing a web service call, for example, things like that, and you want the additional instrumentation to be there, then the APM agent is the go-to.
Yeah, that makes sense to me, I think. If I'm going to the effort of installing it anyway, I might as well get things like REST calls thrown in for free, right?
And you can mix and match. So you can have the agent running, which traces the libraries and SDKs you use, but then you want to add a span inside your code to create a business-logic-specific span, specific to your process. You can add it in; they don't clash. You can use both at the same time.
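As a rough illustration of mixing a hand-written business span with the agent's automatic ones, here is a minimal sketch using the OpenTelemetry API. The span name and attribute are invented for the example.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class BusinessSpanExample {
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("my-app");

    public static void handleTrade(String tradeId) {
        // Hypothetical business-level span; it nests under whatever span the
        // agent already has active (e.g. the Kafka consumer process span).
        Span span = tracer.spanBuilder("report-trade").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("trade.id", tradeId); // business metadata on the span
            // ... business logic; agent-created spans (HTTP calls, Kafka sends)
            // started in here become children of this span ...
        } catch (Throwable t) {
            span.recordException(t);
            throw t;
        } finally {
            span.end();
        }
    }
}
```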
Okay. So I'm leading in to work that I know you've been doing, but what's it like if OpenTelemetry's agent doesn't support a particular API you're interested in? What's it like to add in some extra support?
It supports loading extensions. So in the same fashion as all the instrumentation is written inside the instrumentation agent, you can write your own additional instrumentation as an extension. Then you instrument the application using the Java agent command line parameter and all that, and as an additional parameter to it you can specify, "Okay, in addition to the main agent, load those extensions," one or more, up to you. And then you write the same kind of aspect-oriented interceptor and advice classes, using Byte Buddy and so on, to add the additional behaviors that you need.
That's written in Java?
Yes, that's for Java.
Okay. So you write some Java code that says, "Look out for a call to this class and this method," and instrument it this way?
Yep. Exactly, yes. In your advice, you're saying: for this namespace and class, for this kind of method, for example a public send that takes two parameters where the first parameter is an integer and the second parameter is a string, apply this advice. And then you have the capability to use OnMethodEnter and OnMethodExit to specify what additional code you want to run. You can intercept the parameter values that are passed into the method. You can change the value that's returned from the method, replace it with a rewritten value if you need to. It gives you a lot of flexibility.
Yeah. It's been a long time since I've done any aspect-oriented programming, but it's very useful for that kind of generic rewriting. It always reminded me of database triggers: before you insert this row, do this stuff, right?
Yep. But actually, yes, it's quite similar in that aspect.
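To give a flavour of what such an advice class looks like, here is a minimal Byte Buddy sketch. The target method (a hypothetical public send(int, String)) and everything it prints are invented for illustration; a real OpenTelemetry extension also needs the agent-side matcher code that binds the advice to that method.

```java
import net.bytebuddy.asm.Advice;

public class SendAdvice {

    @Advice.OnMethodEnter(suppress = Throwable.class)
    public static void onEnter(@Advice.Argument(0) int id,
                               @Advice.Argument(1) String payload) {
        // Runs before the original method body; a real extension would
        // start a span here rather than just printing.
        System.out.println("entering send for id=" + id);
    }

    @Advice.OnMethodExit(onThrowable = Throwable.class, suppress = Throwable.class)
    public static void onExit(@Advice.Return Object result,
                              @Advice.Thrown Throwable thrown) {
        // Runs after the method returns or throws; this is where you could
        // end the span, record the exception, or (with readOnly = false on
        // @Advice.Return) rewrite the returned value.
    }
}
```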
Cool. Same concept in a different domain. So you have been writing these, I know. What have you been doing? What's missing in the current state of OpenTelemetry's native Kafka support?
So for Java applications specifically, again, it covers the consumer and producer tracing quite well. But if you venture a bit outside, into Kafka Streams, into Kafka Connect, then the support has gaps in it. So there is an instrumentation for Kafka Streams, but it only targets the handover from the consumer into the topology and then tracing the send by the producer. So there is nothing to help with tracing inside the topology.
Now, if your Kafka Streams topology is not using any stateful operations, if you're doing everything in flight, like some mapping and then sending out, that works fine. The problems start if you're using stateful operations: if you want to write some data into a state store, then look it up later for aggregation, for joins, for things like that, then your tracing context gets lost.
Okay. So as soon as you do something like a KTable or a join, the support falls apart?
Yeah.
Okay. So you've been fixing that?
Yes, I was playing about with that. And the main problems: there are two things that kind of apply, not just to Kafka Streams, but in general to frameworks that do these things. One is when the processing has to be kind of one-to-one, even with consumer and producer tracing. For example, the way Kafka works, when you poll data, it pulls a batch. So it pulls a batch of messages, yeah?
Yeah.
And then you have your ConsumerRecords object. Normally, you would have a for-each loop where you do something and then you produce out, if you have a consume-and-produce kind of flow, like Kafka Streams would be. And that works fine as long as your produce is within that iteration loop. Because the way the interceptors, the tracing code, work is that the ConsumerRecords object is augmented to return a tracing iterator. So when you iterate, it's not the standard iterator, it's actually a tracing iterator that has hooks, so that on each next call it takes the tracing context, puts it into a thread-local, makes it the current context basically. And then when you do producer.send, it looks: "Do I have a tracing context in memory? Do I have one active?"
Right.
It uses it as a parent and links them together. And now the send is linked to the previous consumer process, basically.
Right. Yep.
Now, where it falls apart is if, instead of doing the send directly within that for-each, you, for example, store the records into a list, do something, and then iterate that list. Because that list no longer returns the tracing iterator, it has no idea.
Oh, right. Yep.
And the context propagation is lost. So where I'm going with this-
It's really thorny, because you wouldn't know as a programmer that you've just broken that chain, right?
Yes. Again, I guess it tries to cover 90%, 95%, whatever, of the standard patterns, the ones that you can anticipate. Because the problem that I found as well, writing that extension, is that when you're working as a developer on a platform or on the application, you know what you're doing, you know what your processing is, you know what your domain is, you know how you're using it. Slightly less so if you are working on a platform or libraries that you give to your internal developers, but even then you can give some constraints or capture some usage scenarios, because you're within the company, within the departments.
But, then when you're writing that instrumentation for general use, you can anticipate only so much. Someone is doing something crazy, someone is doing something completely different that you haven't thought about. And sometimes you can't really just do anything about it. Or you can say, "Okay, this way it works. This way you need to think how to solve it in your code additionally."
Yeah.
And it's not hard, in the sense that if you need to do that handover, if you're writing to some intermediate buffer or list or whatever before you do the processing and send, or before you do the send, you can always read the trace context that's available while you're iterating over the consumer records.
Okay, yeah.
Store it somewhere, and with the message store a pair: here's a message, here's a trace context. Now, when you're passing it to produce, you can say, "Here's a message, use this trace context for your send as a parent."
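A rough sketch of that workaround: capture the context per record while the tracing iterator is active, and re-activate it around each send. The topic name and the Buffered pair type are invented for the example; this is not the actual extension code.

```java
import java.util.ArrayList;
import java.util.List;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BufferedSendExample {
    // Pair of a record and the trace context that was current while the
    // instrumented (tracing) iterator was positioned on it.
    record Buffered(ConsumerRecord<String, String> rec, Context traceContext) {}

    static void process(ConsumerRecords<String, String> records,
                        KafkaProducer<String, String> producer) {
        List<Buffered> buffer = new ArrayList<>();
        for (ConsumerRecord<String, String> rec : records) {
            buffer.add(new Buffered(rec, Context.current()));
        }

        // ... batch-level work that breaks the one-to-one consume/produce flow ...

        for (Buffered b : buffer) {
            // Re-activate the captured context so the traced send sees it as
            // its parent rather than finding no active context at all.
            try (Scope ignored = b.traceContext().makeCurrent()) {
                producer.send(new ProducerRecord<>("output-topic", b.rec().key(), b.rec().value()));
            }
        }
    }
}
```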
But you have an insight into how Kafka Streams tends to work, so you can pick up some of those common patterns.
Yes. So the same thing with Kafka Streams, with trace propagation and context. It had kind of two areas that we looked at. One is getting the trace to propagate, not losing it when we are doing stateful operations. And the second part of it is actually tracing the stateful operations themselves. So whenever we do a store [inaudible 00:24:07] in the state store, whenever we do a read from it, it records spans for them, it records those operations as part of the chain, so that you can see it in the trace.
And the trace propagation issue we are fixing exactly in the ways I described. So the problem is if you have caching enabled for state stores. Kafka Streams has a caching layer, where the state store operations don't go directly to the state store but are written to [inaudible 00:24:39], and then the cache is flushed at intervals, either on a size trigger or during commit, and you have the same problem. You're iterating over messages, you're hitting the operation that does, for example, a state store put, writing to the state store, but instead of processing the follow-on operations, they're just sitting in the cache until the flush happens. So now your context flow is kind of broken, because you're moving on to the next message, you're closing the context of this message's processing and you're picking up the next message and starting its own process span. So what we're doing is storing the context along with the cached message, so that on flush we can read it back, rehydrate it, and continue the chain.
The second part of it is state stores: they don't store headers. They only store a message key and payload, and the tracing context is typically stored as a Kafka header, which is part of what causes the issue. So we needed to devise a way to store the header data alongside the actual payload of the message. So I'm doing merged serialization: I take the bytes of the payload, take the bytes of the trace context, merge them together, and add magic bytes so I can detect whether it was a traced message or a non-traced message when I write to the state store. And then on read, again, I check for the magic bytes, split them out, and read it correctly.
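The merged-serialization trick might look roughly like the sketch below: a magic byte, a length-prefixed trace context, then the original payload. The marker value and class name are purely illustrative, not the actual CSID implementation.

```java
import java.nio.ByteBuffer;

public class TracePayloadMerger {
    private static final byte MAGIC = (byte) 0x7E; // arbitrary illustrative marker

    /** Prepend a magic byte and a length-prefixed trace context to the payload. */
    public static byte[] merge(byte[] traceContext, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + traceContext.length + payload.length);
        buf.put(MAGIC);
        buf.putInt(traceContext.length);
        buf.put(traceContext);
        buf.put(payload);
        return buf.array();
    }

    /** Returns {traceContext, payload}; a value without the magic byte was written untraced. */
    public static byte[][] split(byte[] stored) {
        if (stored.length == 0 || stored[0] != MAGIC) {
            return new byte[][] { new byte[0], stored };
        }
        ByteBuffer buf = ByteBuffer.wrap(stored);
        buf.get(); // skip the magic byte
        byte[] traceContext = new byte[buf.getInt()];
        buf.get(traceContext);
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        return new byte[][] { traceContext, payload };
    }
}
```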
Oh, right, okay. Does that mean if I shut down the JVM and remove the agent and then started it back up, it wouldn't be able to read the state store?
Yes, that's a problem.
Right, that's a bit of a project.
It would rebuild the state store. Yes.
Okay.
And the change log at the moment, the stuff that it writes to the change log, has those magic bytes as well. So you would need to replay the messages from the original topic. But I have a ticket in the backlog to address it, to move the trace data from the payload into headers when writing to the change log. And then when re-reading, rebuilding the state from the change log, if it understands those headers it can add the trace context as needed. If not, if the tracing library is removed, it'll just ignore them, because it doesn't care about those headers in the normal flow.
Okay. Yeah, that makes sense. That'd be a lot faster. And so you've done that for Kafka Streams. You've added a lot in for Kafka Streams. You've also said you've been doing work with Connect as well, is that ...
Yeah, Connect has similar and different issues, two kinds of gaps. So there is no specific instrumentation for Connect in OpenTelemetry at all. You can use the same tracing interceptors or the agent, but it just targets the consumer and producer within the Connect workers and connectors. So depending on whether you're using a sink or source connector, for the sink connector your consumer would record the span for receiving the message. But that's basically the last thing it'll record. Whatever happens after that in your connector is disconnected or lost.
Oh, okay. Yeah.
And the same for the source connector. The producer send will be recorded, but where it came from will be lost. So mainly I was looking at Replicator, which is a combination of a consumer and a producer, but the propagation between the consumer and the producer was broken in it, because it has similar logic. When it consumes the messages, whatever the source is, whether it's a database, or Replicator, which uses a Kafka consumer, or MQ, first it writes them to an intermediate buffer, then converts them to internal Connect record objects, and then iterates over the buffer and produces them all. And as I was explaining, the tracing context is lost because you're using that intermediate buffer, so you're no longer using the tracing iterator.
So again, similarly, we are recording the context, storing it, and rehydrating it when we go into the produce. We added a few extra features, like tracing for single message transforms, so that if you have a chain of transformations you can pinpoint whether some of them are slow or breaking, to have that correlation.
Or corrupting things, simply?
Yeah.
Yeah.
And then a second part of it that is not very well kind of addressed yet in OpenTelemetry I find, is that it's more targeting individual services like microservices. Yeah?
Right.
You have a resource attribute that says my service is, I don't know, let's say the place-order service or the add-to-cart service or something. But Connect is a platform, so a single instance of Connect can run 20 different Connect workers, and all of them are actually individual services that belong to different flows. So you don't want to name that instance "Connect platform one" because -
That's not very helpful.
... exactly, yes, with all the workers recorded as a single service. So what I had to add is some overrides: I extract the connector name from the config of the connector instance that's running, and then override that resource attribute based on the connector name, so that the traces recorded by all the different connector worker instances have names that correspond to the connector, not to the instance that runs them and not to the overall platform instance.
Right. Yeah.
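Conceptually, the override amounts to building the trace resource from the connector's configured name rather than from the worker's identity. A minimal sketch with the OpenTelemetry SDK follows; the real agent extension wires this in differently, so treat it as an illustration of the idea only.

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.sdk.resources.Resource;

public class ConnectorResource {
    /** Name the traced "service" after the connector, not the Connect worker. */
    static Resource resourceFor(String connectorName) {
        return Resource.getDefault().merge(
            Resource.create(Attributes.of(
                AttributeKey.stringKey("service.name"), connectorName)));
    }
}
```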
And that would be applicable to a lot of things, say tracing Flink or tracing KSQL, or any platform that runs jobs, where you want the name of the service to correspond to the job itself, not the platform instance.
Yeah. Yeah, that makes sense. That makes me wonder, have you done any work with KSQL on this yet?
Not yet. I haven't looked into it deeper. I know that it uses Kafka Streams as the runtime. So my hope is that most of the Kafka Streams work would be applicable to fix the same issues again: propagation, the state store operations, and things like that. But it'll probably be a mix of the Connect issues and the Kafka Streams issues, because again there's the platform side of it, extracting individual workflows, individual jobs and their metadata, and capturing it accurately as part of the trace.
You must spend a lot of time diving into the Kafka Streams' code base and just figuring out exactly how it works and writing new rules.
Yes, yes. I did. Actually, on the kind of separate topic. We did a hackathon with a colleague on geospatial Kafka Streams' usage.
Oh yeah.
And as part of it we were substituting [inaudible 00:32:17] and figuring out how that would work. The previous work on OpenTelemetry, figuring out the whole flow of where the writes are, where the reads are, how it all hangs together, helps you a lot.
Yeah, I can see how you build on that. That's cool. So what's the actual final experience at the moment with this? As a user, as a debugger who's got nicely traced code, what am I actually going to see? Is there an OpenTelemetry GUI that now shows me my Kafka Streams topology or something?
Not the topology as such, but the flow of the message. So you have your spans... so I have a demo application, a test application, that I'm using that is like a transaction bank or something like that, processing withdrawals or deposits, a really simple flow. So it has an accounts KTable that stores account details, whether a bank account is open or closed, and then transactions, which are deposits or withdrawals. So a typical transaction, say a withdrawal, goes in and gets traced. You have your producer send recorded, then the Kafka Streams application picking it up and processing it.
Then the lookup, so the accounts KTable fetch, reading the account details. Next, a fetch from the state store for the account balance, the current aggregate value. Then an update of it as the next operation, and then the send of the output, and then the consumer picking it up. So it shows that kind of flow of operations, anything touching either a state store or external systems. So we're not tracing each node in the topology. If there is a flatMap or a filter, that doesn't get traced as an individual span at the moment, but anything that stores or fetches data gets traced, all the storage points, I guess.
You would see the data as it goes from topic to topic, maybe not internal topics?
It would show internal topics, repartitions, change logs, sends and reads, yes. And the good thing about it is that OpenTelemetry has a concept of linked traces, or linked spans. So when I'm recording, say, a read from a state store, the account state store, it records a link to the previous update of it. So you can navigate from this trace straight to the trace that shows the flow that updated the value I just read. So if there is an issue with that data, you can go and look at the last update to it and its flow, and find its logs and trace and all that.
Oh. So you could track back through someone's account history in your example?
Yes. Yeah.
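Span links are part of the standard OpenTelemetry API, so a state-store read span linking back to the last write might be created along these lines. The span name and how the previous SpanContext is recovered are assumptions for the example.

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import io.opentelemetry.api.trace.Tracer;

public class LinkedReadSpan {
    /** Record a state-store read, linked to the span that last updated the value. */
    static Span startReadSpan(Tracer tracer, SpanContext lastUpdateContext) {
        return tracer.spanBuilder("account-store read")
            .addLink(lastUpdateContext) // lets the UI jump to the flow that wrote the value
            .startSpan();
    }
}
```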
Oh, cool. Okay. That's groovy. And this is all some OpenTelemetry front-end GUI, or what's the actual tool?
So OpenTelemetry doesn't have a GUI or backend as such. It terminates at the collector and exporter, the data agent side. So you have an agent running in your apps that sends traces to the collector. And in the collector you can specify additional batching, pre-processing, and filtering, and then export to whatever the storage and GUI is. So there is open-source Jaeger, which again is Uber's open-source product, but you could be using Elasticsearch with Kibana, or Splunk, or Datadog, New Relic, LightStep.
Right.
Lots and lots of [inaudible 00:36:07] systems.
[inaudible 00:36:08] Analytics.
Yeah.
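Wiring an application directly to a collector over OTLP looks roughly like this with the Java SDK; the endpoint shown is the conventional local default and would normally come from configuration (the auto-instrumentation agent does the equivalent for you via system properties or environment variables).

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class OtlpSetup {
    static OpenTelemetrySdk init() {
        // Export spans in batches to a local OpenTelemetry Collector; the
        // collector's own config decides where they go next (Jaeger, Elastic, ...).
        return OpenTelemetrySdk.builder()
            .setTracerProvider(
                SdkTracerProvider.builder()
                    .addSpanProcessor(BatchSpanProcessor.builder(
                        OtlpGrpcSpanExporter.builder()
                            .setEndpoint("http://localhost:4317")
                            .build()).build())
                    .build())
            .build();
    }
}
```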
Yeah. Okay, cool. So what's the state of this at the moment? Is it open-source? Alpha, beta? If someone wants to get their hands on it and try it.
It's not open-source yet. We don't have a trial with a customer yet, so we want to find a customer within a Confluent engagement, I guess, that needs it and wants to partner with us a bit to battle-test it, basically. Because it's all written, it has integration tests and things like that, but it's never run on a production platform, at production or pre-production UAT, with a proper level of traffic and with use cases that maybe we haven't thought about, and we want to iron out corner cases on something that is a bit more real than our demo application or integration tests.
Right. So you're at the, "We need to battle test this" stage?
Yes, yes.
Cool. Okay. It sounds like if someone wanted to get started with this, then the first thing to do is grab OpenTelemetry and maybe start writing their own interceptors if they're feeling spicy.
Yes. It depends. I find interceptors or the agent itself are best for tracing libraries, because you don't want to fork Kafka Streams, you don't want to fork the library you're using. So the way to trace it is usually to write the aspect-oriented code that runs at the right places within the library. But then if it's your own application side, your business logic and so on, it's easier and probably better most of the time just to add a span annotation, or create a span here, create a span there.
Right.
These are the attributes relevant to my domain that I want captured along with the span for my message, and things like that.
Oh, okay. Okay, I see. Yeah. So the difference is do I control the source code, happily?
The source code and how generic it is. If it's my app that is only doing my thing and it's my business logic, then there is no point writing interceptors for it, because it's singular, basically, or not very widely reusable.
Yeah. There's no value to the overhead of writing Byte Buddy code.
Yeah. Now, if you are writing a library, if you're a big corporation, Netflix-size, say, or whoever, and your platform team is writing a library to be used by your application teams, then you can either bake in the tracing code, because again you control the source code, or use interceptors to keep the code that does the legwork of messaging separate from the tracing, and potentially distribute them separately.
Cool. I think I've got that. Yeah, there was one other question I wanted to ask you which kind of relates to all this. It's the distinction between synchronous processes and asynchronous processes. I got the sense from reading some of the OpenTelemetry documentation that it had come from a background of dealing with traditional synchronous processing. And the -
Yes, that's correct.
... the [inaudible 00:39:48] of Kafka is something that's been less well-thought through. Is that fair?
That is correct. The semantic convention for messaging, for asynchronous messaging, is still being developed. It's very recent, and the main use case was hundreds of microservices, but the whole deal built around the request-response pattern, having a clear request and response. Maybe not necessarily blocking, not necessarily always synchronous, but always more or less going through a chain of calls. You still have that: "Here's a request, I need to build up the response in whatever way, hitting whatever number of services."
Right?
With Kafka and general messaging, it's a bit different. Sometimes you have that request-response, but more often than not, it's "here's a signal of something."
Yeah, yeah.
The event is a signal that then flows and is aggregated, processed, routed, split and enriched, and can go different ways in different systems. There is no direct correlation between request and response there.
Yeah.
And on top of that, there are the transport versus logical flow considerations, because the producer accumulates a batch of messages and sends them as a block. Now, do you want the trace that records that operation, the send of the batch, or do you want the trace that traces the individual message inside the batch, and so on?
Oh. Yeah.
Or both, and then how do you link them together? And the same on the consumer side. You polled 20 messages and it took, I don't know, 30 milliseconds, but that 30 milliseconds is the operation that polled those 20 messages. Do you want to apply that to each message, split it, and record it per message, or do you record both the poll and then the processing of each message individually? Because depending what you're doing, if you're doing performance analytics, then sometimes it's not fair to say it took 50 milliseconds to consume this message, because alongside this message you actually consumed 50 more messages in this case.
Yeah. But it's also not true to say it took a fiftieth, like one millisecond, because you can't just divide straight by 50. Yeah.
Exactly. And the thing is, it's like what you're tracing, what you're trying to look at, are you tracing a flow of the message through the system? Are you trying to find performance bottlenecks and you are more interested in kind of metrics? But then should you be looking at the metrics then instead rather than the trace timings?
Yeah.
Or you're trying to do A/B-test stuff. Say when it's taking this path it's taking that long, and when it's taking that path it's taking that long. And then you are interested in performance, but you want it to be filterable, so you're looking only at some subset that's grouped in some way. Say you want to test the performance improvement of a potential config change, so you change it on some applications, some instances, and then you're comparing one against the other.
Right.
So there is a lot in it.
Yeah. So it sounds like the tools are all there for doing it, but the asynchronous world brings up some new questions you have to answer.
Yes. And the specification, the semantic convention, is discussing that. There is a lot of movement in that regard. I think it's getting ironed out so that you have the best of both. So you record the per-message flow, but then you link, using trace links, to the transport-level traces, and they're kind of separate but alongside each other, so you can jump from one to the other.
Okay. I could see people also using this to trace things like Akka Streams, like actor-level models, which is very similar, right?
Yeah. Yeah. It supports Akka, to an extent. I haven't dug deep enough to say how much it supports it, but it supports a lot of those things. A nice thing that I found is that it automatically correlates the logs for you as well. So for the trace and span that you generate, you can just enable MDC-based in-context logging to inject those into the logs that are generated while your code is inside that span.
Oh, right.
So then you can jump from the trace to the logs. But I read an article on Medium recently from Wix. They wrote their own kind of Kafka library, Greyhound I think, and what they do, they go even further. They inject trace IDs or message correlation IDs into the metrics that they generate. So then you can correlate the metrics back to individual message flows as well, not just the logs, where it makes sense, I guess.
Okay. Yeah. Is that something ... Sorry. Is Greyhound the level above OpenTelemetry or is it completely separate?
Greyhound is Wix's Kafka client, kind of a richer Kafka client, so it's a wrapper around the Kafka consumer and producer, and they added the capability to write traces and metrics and so on into it.
Right.
But it's just an interesting idea I read about there: "Yeah, let's not just correlate logs. Let's correlate metrics on top of correlating logs and traces, to stitch it into one kind of cross-correlated set of data."
An idea you could potentially steal for OpenTelemetry?
I need to have a look, yes, I guess. But then again, it depends which metrics you care to capture per message versus more aggregatable per instance, per thread, or whatever, because again, it's horses for courses. What's the use case, what fits where, when you should do it and when you shouldn't.
Yeah.
Because a lot of performance analytics metrics are perfectly fine as they are for performance analytics. You don't need per-trace granularity to see that your I/O is slow, for example, or that you're hitting disk too much. It doesn't matter for which message you're doing it; it's just for that instance, for that node.
Now, the things that I guess are more relevant to per-message performance analytics are where something in those messages is different from other messages. Like, for example, a checkout process that calculates some discounts using some specific heuristic, and this combination of parameters going into it is much slower than everything else. Now you're trying to look: is there a different process at work? Is it taking time because there is a bug that only hits when it's people from Canada ordering, or some combination, or something.
Yeah. Those kinds of weird combinations come up, and you've got to have some way of finding that that's the problem, right?
Yes. Yeah. Okay. That's where correlatable metrics data comes in.
It sounds very useful. It sounds like we have to wait for you to publish more, but it will come. If someone wants to get in touch with you, especially if they want to battle-test it, can we leave your contact details in the show notes?
Sure. And there is the Confluent, I think, open community Slack.
Oh yeah. Confluent Community Slack. If you're on there, that's probably the best place to find you, right?
I'm on there, yes, and I think there is a tracing channel or observability channel as well that I check once in a while, and other people will ping things to me as they notice them.
Perfect. If you're interested, head there. And we're going to get some more people from your department on, because you're doing all the fun stuff. As I said. Thank you very much for joining us.
Thank you for having me.
Pleasure. Thank you, Roman. As we said, if you want to go and talk to Roman some more, maybe pick his brains, check out the Confluent Community Slack channel. Link in the show notes. It's easy to get in and there are always plenty of us here from Confluent, chatting with the rest of the community.
In the meantime, if you've got a hankering for more monitoring-related information, you might want to go back and check out an old episode of Streaming Audio we recorded with Anton Rodriguez, talking about eBPF, which is interesting because it approaches exactly the same domain, monitoring, but from the level of hooking into the Linux kernel. I could definitely see a large enough organization wanting to use both. Monitoring is a large enough field to encompass both angles. So go and check it out. Link in the show notes.
With that said, if you want to monitor more of what we've been doing here in the developer experience corner of Confluent, head to developer.confluent.io, where we have an ever-increasing library of courses, tutorials, blog posts, and videos, long and short, teaching you everything we know about event systems, from design and implementation to monitoring and performance.
It remains for me to thank Roman Kolesnev for joining us, and you for listening. I've been your host, Kris Jenkins, and I'll catch you next time.
How can you use OpenTelemetry to gain insight into your Apache Kafka® event systems? Roman Kolesnev, Staff Customer Innovation Engineer at Confluent, is a member of the Customer Solutions & Innovation Division Labs team working to build business-critical OpenTelemetry applications so companies can see what’s happening inside their data pipelines. In this episode, Roman joins Kris to discuss tracing and monitoring in distributed systems using OpenTelemetry. He talks about how monitoring each step of the process individually is critical to discovering potential delays or bottlenecks before they happen; including keeping track of timestamps, latency information, exceptions, and other data points that could help with troubleshooting.
Tracing each request and its journey to completion in Kafka gives companies access to invaluable data that provides insight into system performance and reliability. Furthermore, using this data allows engineers to quickly identify errors or anticipate potential issues before they become significant problems. With greater visibility comes better control over application health - all made possible by OpenTelemetry's unified APIs and services.
As described on the OpenTelemetry.io website, "OpenTelemetry is a Cloud Native Computing Foundation incubating project. Formed through a merger of the OpenTracing and OpenCensus projects." It provides a vendor-agnostic way for developers to instrument their applications across different platforms and programming languages while adhering to standard semantic conventions so the traces/information can be streamed to compatible systems following similar specs.
By leveraging OpenTelemetry, organizations can ensure their applications and systems are secure and perform optimally. It will quickly become an essential tool for large-scale organizations that need to efficiently process massive amounts of real-time data. With its ability to scale independently, robust analytics capabilities, and powerful monitoring tools, OpenTelemetry is set to become the go-to platform for stream processing in the future.
Roman explains that the OpenTelemetry APIs for Kafka are still in development and unavailable for open source. The code is complete and tested but has never run in production. But if you want to learn more about the nuts and bolts, he invites you to connect with him on the Confluent Community Slack channel. You can also check out Monitoring Kafka without instrumentation with eBPF - Antón Rodríguez to learn more about a similar approach for domain monitoring.
EPISODE LINKS
If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.
Email Us