
April 13, 2022 | Episode 209

Monitoring Extreme-Scale Apache Kafka Using eBPF at New Relic


Kris Jenkins: (00:00)

Who watches the watchers? That's the famous quote about monitoring people who do the monitoring. Well, if you'll allow me to butcher the English language just for a second, you could almost call this episode how watches the watchers. In this week's Streaming Audio, we're taking a look inside New Relic. They use Apache Kafka to provide a monitoring service at, frankly, an astonishing scale. And Anton Rodriguez is going to tell us about the project he's been working with to monitor those huge Kafka clusters themselves, and get some insights out that I wouldn't have thought possible. Or, at least I wouldn't have thought practical. And sometimes, that's the most important difference, right?

Kris Jenkins: (00:45)

Before we start, let me tell you that the Streaming Audio Podcast is brought to you by Confluent Developer, which is our site that teaches you everything about Kafka, from how to start it running and write your first app to systems design and monitoring. Check it out at developer.confluent.io. And if you want to take one of the hands-on courses you'll find there, you can easily get Kafka running using Confluent Cloud. Sign up with the code PODCAST100, and we'll give you an extra $100 of free credit to get going. And with that, I'm your host, Kris Jenkins. This is Streaming Audio. [inaudible] Let's go. (Silence). My guest today is Anton Rodriguez, who is the principal software engineer at New Relic. Now, I know New Relic as a big monitoring platform, but I'm sure he's going to tell us a lot more under the hood than that. Anton, thanks for coming on.

Anton Rodriguez: (01:45)

Thank you, Kris. Big pleasure to be here today. Big pleasure to see you in the Streaming Audio Podcast.

Kris Jenkins: (01:52)

Oh, thank you.

Anton Rodriguez: (01:54)

The podcast is a source of inspiration for all of us. Really happy to see you moving it forward, continuing with the amazing work Tim was doing on that.

Kris Jenkins: (02:03)

He did a great job with this podcast and it would've been a tragedy to let it drop. So, I'm doing my best to fill his shoes.

Anton Rodriguez: (02:10)

Sure. You're going to do it great.

Kris Jenkins: (02:12)

Thank you. So, let's start with where you're at. So you're the principal software engineer at New Relic and New Relic is, I know them, they do monitoring. And I know they do monitoring at scale, but give me an idea of what kind of scale we're talking at.

Anton Rodriguez: (02:30)

Yeah. Okay. Yes. New Relic is all about monitoring. Because of that, we have customers around the world, a lot of them, and all those customers are sending metrics and logs and all that type of information to us. So we ingest a huge amount of information. Just to give you some numbers, which I think are good to understand the scale, we ingest around 125 petabytes of data per month. That's around ...

Kris Jenkins: (03:04)

125 petabytes.

Anton Rodriguez: (03:06)

Petabytes, yeah. It's pretty crazy. I mean, I'm sure there are other companies doing similar or even more, but at least in my experience working with Kafka, it's a lot. And many of the best practices typically used in other places don't work here because of that. We need to move some of those things to the next level to make things work at that scale. Just think how many messages we ingest per minute. It's around three billion. So just in this minute, we are ingesting a lot of data.
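As a rough sanity check on the two figures quoted above (125 petabytes per month and about three billion messages per minute), the arithmetic can be sketched as follows. The decimal petabyte and the 30-day month are assumptions made here for illustration, not numbers from the episode.

```python
# Back-of-the-envelope check of the figures quoted in the conversation.
# Assumptions (not from the episode): decimal petabytes and a 30-day month.
PETABYTE = 10**15

bytes_per_month = 125 * PETABYTE
minutes_per_month = 30 * 24 * 60
messages_per_minute = 3_000_000_000

# Average ingest volume per minute, and what that implies per message.
bytes_per_minute = bytes_per_month / minutes_per_month
avg_message_bytes = bytes_per_minute / messages_per_minute
messages_per_second = messages_per_minute / 60

print(f"{bytes_per_minute / 10**12:.2f} TB ingested per minute")
print(f"{avg_message_bytes:.0f} bytes per message on average")
print(f"{messages_per_second / 10**6:.0f} million messages per second")
```

Under those assumptions the two figures are consistent with each other: they imply messages of roughly one kilobyte on average.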

Kris Jenkins: (03:41)

That's insane. I imagine everybody listening to this has had a server go down because one of their log files filled up. But you're filling up like a whole rack full of stuff.

Anton Rodriguez: (03:52)

Yeah.

Kris Jenkins: (03:53)

Minutes, seconds?

Anton Rodriguez: (03:55)

Yeah. It's a lot.

Kris Jenkins: (03:57)

Okay.

Anton Rodriguez: (03:59)

And yeah, because of that, we started, like many companies, with a single Kafka cluster that was growing a lot. At some point, we had more than 275 brokers. It was very painful to maintain and to operate that thing. So we started with what we call a cell architecture, which basically splits that huge cluster into smaller pieces and [inaudible] in the cloud. And that's working much better for us, but it's also introducing other challenges in ...

Kris Jenkins: (04:33)

Just out of interest, how do you split it? Do you, what, group together customers and what's the logical split of cells?

Anton Rodriguez: (04:42)

Basically, we do it with two different things. One is customers. So we shard basically on the accounts, and in that way, we can move a specific account to a specific cluster. And also data type, so we have different types of data like metrics, distributed tracing, logs, events. So basically, we use both things to try to split the load. The good thing about this is, if we have a specific problem in one of our Kafka clusters, it shouldn't affect the rest of the data of that particular customer or other customers. So it allows us to have some level of isolation for our customers and to make sure things work for them all the time, or almost all the time.
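The two-way split Anton describes, by account and by data type, could be sketched roughly like this. The cell names, the hashing scheme, and the pinning table are all hypothetical; the episode does not say how New Relic actually assigns accounts to cells.

```python
import hashlib

# Hypothetical cells per data type; names are made up for illustration.
CELLS = {
    "metrics": ["metrics-cell-1", "metrics-cell-2"],
    "logs": ["logs-cell-1", "logs-cell-2", "logs-cell-3"],
    "traces": ["traces-cell-1"],
    "events": ["events-cell-1", "events-cell-2"],
}

# Hypothetical manual overrides, e.g. to move one large account to a specific cell.
PINNED = {("acct-big", "logs"): "logs-cell-2"}


def pick_cell(account_id: str, data_type: str) -> str:
    """Route one account's data of one type to a Kafka cell."""
    pinned = PINNED.get((account_id, data_type))
    if pinned is not None:
        return pinned
    cells = CELLS[data_type]
    # Stable hash so the same account always lands on the same cell.
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return cells[int.from_bytes(digest[:8], "big") % len(cells)]


print(pick_cell("acct-42", "metrics"))
print(pick_cell("acct-big", "logs"))
```

Because routing depends on both the account and the data type, a problem in one logs cell leaves the same customer's metrics untouched, which is the isolation property mentioned above.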

Kris Jenkins: (05:31)

That makes sense. So, okay. So, that's your first level of, I can see how that would introduce management problems. Take me through it. What happened next, when you split that up?

Anton Rodriguez: (05:45)

Yeah. That's the thing, because there are two things here. First one is, now we work with a lot of Kafka clusters at the same time. So the typical thing I was doing in the past, where you have one Kafka cluster, and then you add a new one and you are configuring every cluster specifically for its traffic, that's just not possible. Now we have more than 100 clusters. We cannot go cluster by cluster, optimizing for the traffic of each one. So basically, we group them by functionality. Like for example, a group of clusters for ingestion, and all those clusters have the same configuration, the same topics. Everything is the same.

Anton Rodriguez: (06:27)

So if we change something, we change something for all of them, which is really good for being able to manage and operate those clusters. But at the same time, it introduces new challenges, in the sense that the traffic may be slightly different between them. Finding the right spot, the right configuration for all of them, is much more complicated.

Kris Jenkins: (06:51)

But you're splitting it logically, so hopefully there's something common to all your ingestion things that will sort of work out. And something common to all your consumer-end stuff.

Anton Rodriguez: (07:02)

Yeah. There are things we can use, and we use them for sure, but there are also things which make things more complicated. Like for example, the accounts, our customers. We have really big customers, which are slightly difficult to move. And we have other customers which are smaller. So, putting everything together and making sure they work without affecting one another requires some configuration. We need to have Kafka quotas in place. We need all our infrastructure to be able to react to changes in real time and to adapt to those situations.

Anton Rodriguez: (07:43)

And that's another of the challenges we have with all those Kafka clusters of our cell architecture. At any moment, we can add new clusters or we can decommission some of them. So all our tooling, all our processes need to be aware of that, because a cluster is not going to be permanent. At any moment, we can just decide to decommission it and get a new one. And all the traffic can be affected by that. Our customers shouldn't notice all those changes under the hood. So that also introduces some challenges in how we operate, which are not so typical in other places.

Kris Jenkins: (08:29)

Yeah. Yeah. And meanwhile, you've got all your customers, presumably some of them at any one time, they are expecting to be able to see in real time the data that will solve their problems and their production issues. So you've got to be really transparent underneath their surface, right?

Anton Rodriguez: (08:44)

Yeah. It has to be. Of course, they want to go to New Relic and have those [inaudible 00:08:51] and see the data we are receiving. But even more important for us is that most of our customers have alerts on receiving the data. So if any specific software stops reporting data, they want to receive an alert saying, "Hey, your service is not reporting data." So we need to have all the logic in our pipelines to detect those types of situations and generate those alerts. And we need to do it at the scale of three billion messages per second. So, yeah.

Kris Jenkins: (09:19)

Yeah. And in real time, right?

Anton Rodriguez: (09:22)

Yeah. It's near real time. We don't have such strong requirements in terms of latency. Thinking, for example, of an alert, it doesn't matter if the alert arrives a couple of seconds later. For most of our customers, that's just okay. So we have some room we can play with in terms of latency. There are other places, other people, other companies with stronger requirements on latency. In our case, it's more about managing all that throughput, all those messages. We have some requirements in latency, but they're not as strong as in other places.

Kris Jenkins: (10:06)

Well, you say that, but I sometimes think you can divide the computing world up by what they consider a long time. Right? Two seconds, that's pretty fast. Some people think 200 milliseconds is slow. Some people think a week is fine, and we're not in that place.

Anton Rodriguez: (10:22)

Yeah. Well, a week for sure is not okay for us, and I'm not sure about our SLA, specifically.

Kris Jenkins: (10:31)

But one, two seconds where it [crosstalk 00:10:33] ...

Anton Rodriguez: (10:33)

Yeah. I think for, yeah. For most of our, for example, I was working a lot with the logging team, and in a couple of seconds, we have all the logs already ingested. Most of the time, it works pretty well. I know it's slightly faster than other competitors. Yet, as I said, I think it's just not the most important thing for our customers. It's more about reliability, and being able to define and to process all that information, to really have a nice UI which is able to make sense of all that data. Then really, how fast we ingest, that's important, but it's not the first thing. That's my opinion. To be honest, I don't work with our customers. Normally, I'm much more focused on our platform.

Kris Jenkins: (11:25)

From down in the mines.

Anton Rodriguez: (11:26)

Yeah. So, and I am. Of course, we are a heavy user of New Relic because we use it for monitoring.

Kris Jenkins: (11:33)

I was going to ask.

Anton Rodriguez: (11:35)

What?

Kris Jenkins: (11:35)

So do you? You must have a huge monitoring problem of your own. Do you eat your own dog food and how does that work out?

Anton Rodriguez: (11:45)

Totally, totally. And I have to say, that's nice, because the good thing is, we can monitor as much as we want. In other companies, I have had that thing of, hey, this is expensive. Please don't send so many metrics. That's not happening here at New Relic. We can use as much as we want. We have very good monitoring in place, which allows us to operate Kafka at scale. Some of those pieces are already open source, so other people can use them too. And we are also working to open source even more of them. That's something I like a lot.

Anton Rodriguez: (12:24)

And yeah, one of the things also is, before making any change in our production environment, we do it in our own monitoring environment. So when we change something, we are using it, and we are confident enough to deploy it in production because we see that it's working for us, and because we are our biggest account. It's going to work for [inaudible 00:12:52].

Kris Jenkins: (12:52)

Yeah. Yeah. It's always a great way to get a feedback loop if you can be the user of your own service, right?

Anton Rodriguez: (12:59)

Yeah. Every day. I recommend it for everyone, if you can do it.

Kris Jenkins: (13:03)

Maybe I should put myself on the podcast for the same reason.

Anton Rodriguez: (13:06)

I think, please, please. You already asked that.

Kris Jenkins: (13:09)

That would be weird.

Anton Rodriguez: (13:13)

Maybe you can ask [inaudible 00:13:14] to the US?

Kris Jenkins: (13:15)

Oh, God. Yeah. That would change things up. We'll have him back next year, once he's got his feet settled somewhere else. See what he's been up to. So you must, along the way, you must have had some new insights into how to do this monitoring stuff at scale.

Anton Rodriguez: (13:32)

Yeah. One of the interesting things that I wanted to share is eBPF and Pixie.

Kris Jenkins: (13:39)

eBPF.

Anton Rodriguez: (13:41)

eBPF, yeah. If you're not familiar with it, eBPF is a new Linux capability. It's something we have in the kernel. It's pretty new. I mean, it's not from this year, but it's pretty new yet for many engineers. Basically, it allows us to retrieve information directly from the kernel with a very low overhead in a very safe way. In the past, when we wanted to have information directly from the kernel, it was kind of risky because the Linux kernel, it's a core component. If we break something there, we are going to create a big mess. But with eBPF, it's just slightly different. We can access almost all that information in a very efficient way and without causing any type of problem in the kernel.

Kris Jenkins: (14:31)

What kind of stuff can you read out of the kernel from eBPF?

Anton Rodriguez: (14:36)

Basically, almost everything you have in the kernel. You can detect each time an application opens a file or opens an external connection. Or for example, some of the things we are doing right now: each time an application, a Kafka application, a broker or a client, is receiving a network packet, eBPF is able to retrieve that packet, send it to another program to see what we are doing in Kafka, and then continue the process.

Anton Rodriguez: (15:10)

And we can do that very efficiently. So it's not affecting latency or the throughput or anything like that. And then what we did with Pixie is make this software able to understand Kafka, the Kafka protocol, to not only have monitoring, but monitoring specifically for Kafka, to understand what's happening. So things which typically we cannot have easily with Kafka, because we need extra software and we need to really understand how it works. Now with Pixie, it's really easy. Things like being able to measure the consumer lag, so how many messages we have pending to be consumed by a consumer.
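To give a feel for what "understanding the Kafka protocol" from captured bytes involves, here is a minimal sketch that decodes the start of a Kafka request: a 4-byte size prefix followed by the request header (API key, API version, correlation ID, client ID). It is not Pixie's implementation; the function name and the sample frame are made up for illustration, and only the version 1 header layout is handled.

```python
import struct

# A few well-known Kafka API keys (from the public protocol guide).
API_NAMES = {0: "Produce", 1: "Fetch", 8: "OffsetCommit", 11: "JoinGroup",
             12: "Heartbeat", 13: "LeaveGroup", 14: "SyncGroup"}

# size (int32), api_key (int16), api_version (int16),
# correlation_id (int32), client_id length (int16); all big-endian.
HEADER = struct.Struct(">ihhih")


def parse_request_header(payload: bytes) -> dict:
    """Decode the size prefix and request header of one captured Kafka request."""
    if len(payload) < HEADER.size:
        raise ValueError("payload too short for a Kafka request header")
    size, api_key, api_version, correlation_id, id_len = HEADER.unpack_from(payload, 0)
    client_id = None  # a length of -1 means a null client ID
    if id_len >= 0:
        client_id = payload[HEADER.size:HEADER.size + id_len].decode("utf-8")
    return {
        "size": size,
        "api": API_NAMES.get(api_key, f"api_key={api_key}"),
        "api_version": api_version,
        "correlation_id": correlation_id,
        "client_id": client_id,
    }


# Build a fake Fetch request header as it might appear on the wire.
client = b"billing-service"
body = struct.pack(">hhih", 1, 11, 42, len(client)) + client
frame = struct.pack(">i", len(body)) + body

header = parse_request_header(frame)
print(header)
```

Because the client ID and the request type travel in every request, a tool that sees the raw traffic can attribute it to an application without any change to that application.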

Kris Jenkins: (15:54)

Really?

Anton Rodriguez: (15:55)

Yeah.

Kris Jenkins: (15:55)

You can get ...

Anton Rodriguez: (15:55)

We can get that.

Kris Jenkins: (15:56)

That from a kernel level, too?

Anton Rodriguez: (15:57)

Yeah. Directly from the kernel. In the past, when we wanted to do this, we always needed an extra [inaudible 00:16:07]. There are open source things like Burrow from LinkedIn or the Kafka [inaudible 00:16:12] exporter from Lightbend, but this is just something you deploy in Kubernetes. With Pixie, it doesn't matter where your consumers are, where your applications are, or what technology they're using. It's able to retrieve all the information and make sense of it in the context of Kafka.

Anton Rodriguez: (16:31)

And so we get to use it in a very nice way. So for me, I discovered this like, I don't know, nine months ago, when it was open sourced to the Cloud Native Foundation. And for me, it was mind-blowing, all the things you can do which were not possible in the past. And how it's just not the typical monitoring we have for Kafka. It's just totally different because it's like the normal inventory. So I'm very excited with this, providing a lot of ideas to the Pixie team to build new use cases. But I am even more excited to share it with the Kafka community, to see what ideas the Kafka community has to use it in their own projects and use cases. Because I'm pretty sure, because it's so new, it's so different, we're going to find quite interesting ideas out there.

Kris Jenkins: (17:21)

Yeah. I'm going to need you to. Maybe this will stimulate some ideas. I'm going to need you to draw the line for me, because I think of it like kernel space is down here and application space is up there. Draw the technical line for me that gets me from monitoring the kernel to actually seeing how many pending messages I've got in the consumer, because I can't draw that line.

Anton Rodriguez: (17:40)

Yeah. That's a good question. So basically, we're using Pixie. Pixie, as I said, it's a new platform. Part of it is open source, part of the Cloud Native Foundation. What we do is deploy Pixie in Kubernetes. What Pixie is going to do is use eBPF in all the Kubernetes nodes to understand what is happening and send all that information, all those metrics, to a shared database. So with Pixie, we have information about what is running in Kubernetes, how much CPU they're consuming, the throughput, all that information.

Kris Jenkins: (18:25)

Open file handles, all that stuff.

Anton Rodriguez: (18:27)

Yeah. Whatever we want, really. I mean, not whatever, but almost. Many, many metrics; we have a lot of information. And because it's not depending on the applications themselves, it's using Kubernetes as a standardization layer. Those applications can be in Java, in Go, in Rust, in whatever. But Pixie, because it's receiving all that information directly from the kernel using eBPF, is able to send all that information. And you have like a standardization of the information for all the applications.

Kris Jenkins: (19:03)

Right.

Anton Rodriguez: (19:05)

And then we have in Pixie, you have all that information, but you have also the possibility to create scripts, to create your own dashboards and to modify and aggregate the metrics you are receiving. So basically, what we are doing in Pixie is adding scripts, specifically for Kafka to make metrics aware of the context of Kafka and show specific information for Kafka. Things like [foreign language 00:19:38] or [foreign language 00:19:39] or just the discovery of clients.

Anton Rodriguez: (19:42)

One of the problems I have right now is, because we have so many clusters and so many clients, it's not always easy to identify the applications consuming from or producing to Kafka. With Pixie, you just deploy it in Kubernetes, and it's going to list all the applications consuming from or producing to Kafka and their throughput. And you're going to have all that information with just a couple of commands.

Kris Jenkins: (20:10)

But, okay. So you say consumer lag. Does that mean we're getting the difference between the topic's offset and the consumer's offset or?

Anton Rodriguez: (20:19)

Yeah, it's slightly similar. That's how we typically measure consumer lag. What we are doing with Pixie is tracking each time a producer is sending a message and tracking each time a consumer is reading that message. And it's calculating the difference in time between both actions. So if we have a producer sending messages and the consumer, because it's reading a lot, is not able to read at the same rate, it's taking longer and longer and longer. Pixie is able to say, "Hey, now the maximum time a message is waiting to be consumed is whatever." And it's able to provide the lag in time, which is usually much more useful than the offsets we typically have in Kafka.
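The idea of lag measured in time rather than in offsets can be sketched with a small in-memory tracker. This is a simplified illustration, not how Pixie computes it; the `LagTracker` class, its method names, and the `(topic, partition, offset)` key are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LagTracker:
    """Tracks how long messages wait between being produced and consumed."""
    # (topic, partition, offset) -> time the message was produced
    produced_at: dict = field(default_factory=dict)
    # Observed waits, in seconds, for messages already consumed
    waits: list = field(default_factory=list)

    def on_produce(self, key: tuple, ts: float) -> None:
        self.produced_at[key] = ts

    def on_consume(self, key: tuple, ts: float) -> None:
        produced = self.produced_at.pop(key, None)
        if produced is not None:
            self.waits.append(ts - produced)

    def max_lag_seconds(self, now: float) -> float:
        # Messages still waiting count too: their wait keeps growing until read.
        pending = [now - ts for ts in self.produced_at.values()]
        return max(self.waits + pending, default=0.0)


tracker = LagTracker()
tracker.on_produce(("orders", 0, 100), ts=0.0)
tracker.on_produce(("orders", 0, 101), ts=1.0)
tracker.on_consume(("orders", 0, 100), ts=0.5)

# Offset 101 has been waiting since t=1.0 and is still unread at t=4.0.
print(tracker.max_lag_seconds(now=4.0))
```

An offset-based lag of "1 message behind" says little on its own, whereas "the oldest unread message has waited 3 seconds" maps directly onto the latency a customer would notice.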

Kris Jenkins: (21:13)

Yeah. I can see that. So is that tracking every single message or is it you just sort of sampling?

Anton Rodriguez: (21:22)

It's tracking every single message, but it's mostly using part of the Kafka protocol. Each time we have a consumer and the broker, they are sending control messages. So what we are doing in Pixie is tracking those messages to get that information. We don't need to track all the data, because that could be a lot; it's specifically tracking those messages. This is very interesting, for example, for rebalances. When we are adding a new service reading from Kafka, it typically needs to coordinate with the existing services in what we call the consumer group. And that's what we call a rebalance. And it can be quite costly for Kafka, in the sense that the coordination may affect the throughput.

Anton Rodriguez: (22:19)

So when that's happening, all those clients are sending messages to the broker. What we do is retrieve those messages to understand what's happening and to be able to really provide information and context about that. And that's something that, in my experience, has been quite hard to do until now, because you need to have information on the broker side and on the client side. Putting it all together is not easy, but with Pixie, because it has access to both things in the same way, it's much easier.
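One way to picture this: when a consumer group rebalances, its members send JoinGroup and SyncGroup requests to the broker, so a burst of JoinGroup requests for one group is a reasonable signal that a rebalance is in progress. The sketch below counts such requests per group in a time window. The event tuples and the function name are hypothetical; the API key numbers are the ones in the public Kafka protocol guide.

```python
from collections import Counter

# Kafka API keys for the group-coordination requests.
JOIN_GROUP = 11
SYNC_GROUP = 14


def join_requests_per_group(events, window_start: float, window_end: float) -> Counter:
    """Count JoinGroup requests per consumer group inside a time window.

    `events` is an iterable of (timestamp, api_key, group_id) tuples, a
    hypothetical shape for requests observed on the wire.
    """
    counts = Counter()
    for ts, api_key, group_id in events:
        if window_start <= ts < window_end and api_key == JOIN_GROUP:
            counts[group_id] += 1
    return counts


observed = [
    (10.0, JOIN_GROUP, "billing"),   # a new member joins
    (10.1, JOIN_GROUP, "billing"),   # existing members rejoin
    (10.2, JOIN_GROUP, "billing"),
    (10.4, SYNC_GROUP, "billing"),   # assignments handed out
    (12.0, JOIN_GROUP, "search"),
    (99.0, JOIN_GROUP, "billing"),   # outside the window
]

print(join_requests_per_group(observed, window_start=0.0, window_end=60.0))
```

Correlating these counts with a throughput drop in the same window is the kind of broker-plus-client view that is hard to assemble from application metrics alone.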

Kris Jenkins: (22:50)

Yeah. And you don't need to make sure that you're aligning languages presumably on each side. Right? It's completely transparent to the application layer.

Anton Rodriguez: (22:59)

Exactly. It is. It doesn't matter if you have a consumer in Ruby or Java or Go; they're going to work in the same way, because you are really not taking a look inside those applications, but at how they communicate with Kafka.

Kris Jenkins: (23:15)

Yeah.

Anton Rodriguez: (23:15)

So it's a totally different approach. And that's why I think it's so interesting, because we are looking at a typical problem from a totally different perspective.

Kris Jenkins: (23:25)

Yeah. Yeah. That's nice. I can see, especially for a company. How many people are you? You must be ...

Anton Rodriguez: (23:31)

I don't want to lie, so.

Kris Jenkins: (23:34)

Give me a ballpark.

Anton Rodriguez: (23:35)

I'm not sure. I really don't know. I think in engineers, we should be around 2000, 3000. That's the number I have in my mind, but it could be totally wrong, to be honest. I'm really focusing on the technical side of things and not paying too much attention. I'm pretty sure they have repeated that to me many, many times, but I'm not able to put that information in my head.

Kris Jenkins: (24:01)

But you're definitely in the scale where you can't make every engineer, realistically, you can't say everybody does it this way so that we can monitor you.

Anton Rodriguez: (24:07)

Yeah.

Kris Jenkins: (24:07)

You have to have a solution that's transparent to that.

Anton Rodriguez: (24:11)

Yeah.

Kris Jenkins: (24:12)

That's nice. You've been, you discovered this nine months ago, you say, and you have been putting it into production.

Anton Rodriguez: (24:20)

We don't have it in production right now, basically because of how we have our own services. We are in the middle of a migration, and because of that, we are taking some time to put that in production. At this moment, what we are doing, at least from my side, is exploring the technology. What I have been doing is having regular meetings with the Pixie team to test this technology, and things like, "Yeah, I have this problem: the teams using Kafka sometimes have rebalances and it's not easy to see what's happening." So the Pixie team says, "Okay, let's try to use eBPF and Pixie to solve this."

Anton Rodriguez: (25:10)

And they are creating new solutions for those use cases. So for me, my approach to Pixie right now is more on the side of evolving the technology: making sure it works with Kafka, finding things which are typically pain points for me, and making Pixie solve them. That's my approach right now, more than putting anything in production. That's also related to our scale right now: it's just much more complicated for us to put it in production in all our clusters at our scale than probably for another company with less traffic, which could do it more easily. It should be easier for them.

Kris Jenkins: (25:59)

Yeah, sure. But you're planning to, you think this is part of the future at New Relic?

Anton Rodriguez: (26:05)

Yeah, for sure. I mean, we are using Pixie right now, not only for Kafka, for many other things. So there are many engineers really excited about this in New Relic and in other places. And I think also, just in the context of Kafka, we are going to find more and more new use cases and things which are really hard to monitor in a different way. And we can easily solve them with Pixie. So yes, I'm really excited about it. I think it's just pretty new yet, and it needs some time to be more mature and to work in more places.

Anton Rodriguez: (26:45)

But for sure, I'm totally sure it's going to be part of the future. And I'm sure once more engineers know about it, they are going to find more use cases. Maybe with Pixie, maybe directly with eBPF, maybe with other tools, but it's just a tool you want to have in the tool set when you are running things in production. It's really, really helpful.

Kris Jenkins: (27:15)

Okay. So for the sake of people who want to get started with this, let me make sure I've got this clear in my mind. You've got eBPF, which is the kernel level monitoring piece.

Anton Rodriguez: (27:23)

Mm-hmm (affirmative).

Kris Jenkins: (27:24)

Then above that, you've got Pixie, which is both gathering database ...

Anton Rodriguez: (27:30)

Yeah.

Kris Jenkins: (27:30)

Exploration tool.

Anton Rodriguez: (27:31)

Pixie is the service. Basically, it's making eBPF easy to use for everyone, specifically in the context of Kubernetes. So basically, you deploy Pixie in Kubernetes, and it's going to put eBPF in every node and collect all the metrics in every node. So you don't have to go to that low level. And then it provides you a UI where you can have all those metrics and build your dashboards. But at the same time, you can also create your own scripts in that UI, and you don't have to really know about eBPF or C or the Linux kernel. It's something very similar to a Python script. You can write your own there to do whatever you want: process information, transformations, things like that.

Kris Jenkins: (28:27)

Give some gory details on that. So what does it actually look like if I want to write a script?

Anton Rodriguez: (28:33)

It's pretty easy. You go to the UI; on the right side, you have like [inaudible], so you just click on that. You make the changes. Like, I want to receive that particular metric, or I want to aggregate those metrics together into one, or I want to calculate the averages. Really, I mean, whatever you want to do, it's super flexible. In our case, we do that too. Okay, I'm going to retrieve the metrics between the Kafka broker and the services. And with that, for example, for the consumer lag, we are going to measure the difference between them to calculate the consumer lag in seconds. That's just one script. You can go to the UI, open it, see it. And if it doesn't work for you or you want to modify it, you can easily change it.

Kris Jenkins: (29:22)

Yeah.

Anton Rodriguez: (29:23)

And then submit it, and that's all you have to do. And now, it's working with whatever modifications you wanted. So, it's super flexible.

Kris Jenkins: (29:31)

Okay. And it can retrieve from more than one node. You're not, it's a general overview that you can write these scripts.

Anton Rodriguez: (29:38)

Yeah, you can. Yeah.

Kris Jenkins: (29:41)

Okay. So I know you're going to be speaking about this at Kafka Summit. Depending on when the podcast is released, the Kafka Summit's April 25th, 26th, this is in 2022. So hopefully, that's in your future, listeners. What else are we going to learn from that? Don't give me the whole talk, but what have we missed out?

Anton Rodriguez: (30:03)

Well, in the talk, we're going to, first of all, show some of the challenges when we are monitoring Kafka, things like discovering the clients, things like the consumer lag, things like the rebalances. So explaining them in detail, even for people who are not so familiar with those things. And then in the second part of the talk, we are going to introduce eBPF: how it works, why it's important, what we can do with it. Not only monitoring, but also other things like network or security. And then what we are going to do is show a demo of Pixie, measuring all those things.

Anton Rodriguez: (30:47)

The nice thing is the demo, it's much more straightforward. It's just much easier to see the demo than to explain Pixie. So we want to spend like 20 minutes or so during the talk on the demo, which I think is really exciting for engineers, to see what they can do, or how easy it is to access that information and to build those things right now. Because in the past, building some of those things was quite challenging and difficult. You needed a lot of background in eBPF and C and all that stuff. But now, every developer out there can do that and that's ...

Kris Jenkins: (31:26)

Yeah, yeah. I have delved into the kernel once or twice, and I think I still have the scars to prove it.

Anton Rodriguez: (31:33)

Yeah. I mean, even more challenging than it was. I was surprised when I started with eBPF. I wanted to understand how it works under the hood, and I was doing some PoCs and reading a lot of documentation about it. And it's a whole new world. For someone who is more focused on Kafka and Java, it's just too hardcore. At least it was for me.

Kris Jenkins: (31:56)

Yeah. Yeah. I think I'm happier in application space. Occasionally, you have to dive down, but.

Anton Rodriguez: (32:04)

Yeah, it's a very interesting space. One of the things with eBPF is, eBPF is running like a [inaudible 00:32:14] machine inside the kernel. I have the chain, all that stuff. And so many things are pretty similar to Kafka, Java. I mean, it's just not so different, because you have that virtual machine. You are also subscribing to events from the kernel. You are processing those events in streaming to return them to user space, to do more processing. So yeah, there are many things which are really similar concepts, but there are others which are just totally new for me. How you work with the kernel. Even when I was younger, I was compiling the kernel on my laptop and all that stuff, but right now it's just not ...

Kris Jenkins: (32:59)

Me too. I'm watching it crash, right.

Anton Rodriguez: (33:01)

Yeah.

Kris Jenkins: (33:04)

Yeah. Cool. So just, I have to ask this to complete the circle. Do you think one day we'll see the output of Pixie going natively into New Relic? Will it be integrated in the future?

Anton Rodriguez: (33:14)

It's already partially integrated.

Kris Jenkins: (33:18)

Oh, really?

Anton Rodriguez: (33:18)

Yeah. I mean, Pixie is a totally independent project; it's part of the Cloud Native Foundation. You don't have to use New Relic at all if you want to use Pixie. But if you want to use Pixie with New Relic, if you want to combine all the features and services we have in New Relic with Pixie, you can also do it. I mean, many of the engineers working on Pixie work for New Relic, so I think it makes a lot of sense. So, I really like that model where you have the open source thing with the core, and everyone can use it and explore it. At the same time, if you want to have additional features, you can go with New Relic as a service. Or if you just don't want to host it by yourself, then you have the option with New Relic. I think that's a very good combination. It's similar to what Confluent is doing with Confluent Cloud.

Kris Jenkins: (34:14)

Yeah, exactly. You want to be able to get the core of it and access it, but you want the option of not having to worry about it so much. Right?

Anton Rodriguez: (34:22)

Exactly.

Kris Jenkins: (34:23)

Yeah. Very interesting. Well, so last question. If someone wants to get started, what's the first place they should go?

Anton Rodriguez: (34:32)

You go to the Pixie website, or to the repository, which has all the information. It's really easy to do the setup and to start working with it. So really, it's not difficult. Anyone can do it. And I think it's the big first step. As I said, the tooling has been designed to provide a really good developer experience. So you can deploy it in every cloud and start to play with it very easily, in the context of Kafka or in the context of any other application you want to use. So yeah, it's only that. And also, you can ping me if you have questions. I'm always happy to chat about Pixie, Kafka and all that stuff. I'm quite active in the Kafka community. I'm really happy to have those conversations if someone wants to ping me directly.

Kris Jenkins: (35:32)

Okay. In that case, we'll put your contact details on the show notes.

Anton Rodriguez: (35:36)

Sure.

Kris Jenkins: (35:36)

Along with mine, if you want to get in touch with me for any reason too. Anton, this is one of those talks where I just want to go and quit the rest of the day's work and actually go and play with it, and see what it's like. But I might have to wait a while.

Anton Rodriguez: (35:52)

Let me know. I really would like to have your feedback on it. It will be awesome.

Kris Jenkins: (35:55)

Yeah, it'd be cool. In the meantime, I look forward to meeting you in person at Kafka Summit, and it'd be nice. And if you're listening to this in the distant future, I'm sure this video will be up. If you want to go and see us in action, we'll have that up on YouTube or wherever before long. Anton Rodriguez, got to pronounce that correctly. Thank you very much for joining us on Streaming Audio and I'll see you around.

Anton Rodriguez: (36:18)

Thank you very much. This has been a blessing.

Kris Jenkins: (36:21)

Cheers. Well, thank you, Anton. He's given me another reason to look forward to Kafka Summit, because I'd really like to see that in action. As I said, the Summit is April the 25th and 26th in 2022, in my backyard of London. If that's in your future, I hope you'll consider joining us. And if it's in your past, well, YouTube will probably have his talk up by now, so you can see it for yourself. If you want another way to get in touch with either of us, then reach out on one of the usual channels. You'll find both our contact details in the show notes.

Kris Jenkins: (36:56)

If you're listening to this podcast, we would appreciate a review so that other people can find us. And if you're watching on YouTube, now's a great time to reach for the thumbs-up icon, the notification icon, and the comment box. Before we go, let me remind you that if you want to climb the ladder from your first Kafka application to processing events at the kind of ludicrous scale New Relic do it, we'll teach you everything we know at Confluent Developer. That's developer.confluent.io.

Kris Jenkins: (37:28)

If you're a beginner, we have getting-started guides. And if you want to go further, there are blog posts, recipes and in-depth courses. You'll need a Kafka instance to make the most of it. So consider signing up for an account at Confluent Cloud using the code PODCAST100. That will give you $100 of extra free credit, which at my exchange rate is worth having. And with that, it remains for me to thank Anton Rodriguez for joining us and you for listening. I've been your host, Kris Jenkins, and I'll catch you next time. (silence)

New Relic runs one of the larger Apache Kafka® installations in the world, ingesting circa 125 petabytes a month, or approximately three billion data points per minute. Anton Rodriguez is the architect of the system, responsible for hundreds of clusters and thousands of clients, some of them implemented in non-standard technologies. In addition to the large volume of servers, he works with many teams, which must all work together when issues arise.

Monitoring New Relic's large Kafka installation is critical and of course challenging, even for a company that itself specializes in monitoring. Specific obstacles include determining when rebalances are happening, identifying particularly old consumers, measuring consumer lag, and finding a way to observe all producing and consuming applications.
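
To make "measuring consumer lag" concrete, here is a minimal sketch of the usual definition: per partition, lag is the log end offset minus the consumer group's committed offset. This is not New Relic's or Pixie's implementation. The function name, the plain-dictionary inputs, and the topic name `metrics` are all hypothetical, and treating a partition with no committed offset as fully unread is an assumption made only for this example.

```python
from typing import Dict, Tuple

# (topic, partition) pair, e.g. ("metrics", 0). Names here are illustrative.
TopicPartition = Tuple[str, int]


def consumer_lag(
    end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return per-partition lag: log end offset minus committed offset."""
    lag = {}
    for tp, end in end_offsets.items():
        # Assumption: a partition with no committed offset is treated as
        # fully unread, so its lag equals the log end offset.
        committed = committed_offsets.get(tp, 0)
        # Clamp at zero in case the two snapshots were taken at different times.
        lag[tp] = max(end - committed, 0)
    return lag


# Hypothetical snapshot: partition 1 has never committed an offset.
end = {("metrics", 0): 1500, ("metrics", 1): 900}
committed = {("metrics", 0): 1200}
lag = consumer_lag(end, committed)
print(lag)
print(sum(lag.values()))
```

In practice the two offset maps would come from the brokers or from a client library; the sketch only shows the arithmetic once those numbers are in hand.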

One way that New Relic has improved the monitoring of its architecture is by directly consuming metrics from the Linux kernel using its new eBPF technology, which lets programs run inside the kernel without changing source code or adding additional modules (the open-source tool Pixie enables access to eBPF in a Kafka context). eBPF is very low impact, so doesn’t affect services, and it allows New Relic to see what’s happening at the network level—and to take action as necessary.

