In this week's Streaming Audio, we are doing a deep dive about performance, and we are going really deep. This is not for the faint of heart. We are diving right down to the language level, and then we're diving even further into the heart of the JVM to see what makes the JVM a fast, smooth platform to write code on these days.
I'm going to confess, it's been a long time since I last took a compiler optimization class back at university. And I remember some of it, and there's plenty I've forgotten. Well, if I had to get someone to come in and get me back up to speed, this week's guest is exactly the lecturer I would've wished for.
It's Gil Tene. He is the CTO of Azul, for reasons that will become completely obvious to you. And he's also a co-founder of Azul, and we'll get into that story too. Before we begin, Streaming Audio is brought to you by our education site, Confluent Developer, and our cloud service for Apache Kafka, Confluent Cloud.
I'll tell you more about those at the end. But for now, I'm your host, Kris Jenkins. This is Streaming Audio. Let's get right down into it. Joining us on Streaming Audio today we have Gil Tene. Gil, welcome to the show.
It's good to have you here.
It's great to be here. Yeah.
Yeah. We have something in common, or at least we used to, being a CTO and co-founder of a company. So-
Yeah, it looks like you've moved on to doing smarter things than that.
... I've opted for a different kind of stress in my life, I think.
But tell me, how did you get started in that? What led you to co-found a company?
It's probably an interesting story ... Sorry, this was Azul. Azul was formed, I think, out of the ashes of the dot-com crash. I'd worked in networking and security and all kinds of cool things in the late '90s, and happened to use a lot of Java, build a lot of things in Java, back at the time.
And one of the things that was very apparent at the time is Java took over the world very quickly. In a matter of only a handful of years, we went from it not being a thing at all to being the thing all future applications were going to be built in. And with practical, actual business applications being there came the emergence of what we now think of as that dinosaur, Java EE [inaudible 00:02:48]-
Yeah. God, I remember those days.
... And since I started playing around with the very first versions of Java, before all of those, I evolved with it. I ended up building cool UI things with it, event-driven architectures with it, back in the '90s, and accidentally built an app server, because that's what everybody did before [inaudible 00:03:08]
Yeah. Yeah. And everyone was trying to figure out how to do this well, right?
Yeah. And it was a very rapid evolution. Also a huge emergence of true open source sharing of code and frameworks and ideas, where the best ways to build stuff were often based on libraries other people put out rather than vertical commercial products.
But one of the things that was very clear is we were getting a lot of this server-side parallelism that came from the natural world: transactions, people talking to web servers, operations, lots of things to do at the same time. And we were running them on computers that had two or four cores.
And that mismatch, and the need for massive numbers of computers just to run a scalable application, seemed at the time like an opportunity. We set off to address data center compute needs with large CPU-and-memory machines that could run this kind of stuff well, and that's what Azul was actually formed for. We built hardware. Our machines grew to 800-plus cores and a terabyte of memory in the mid-2000s.
Oh, jeez! You were almost building mid-level super computers back then.
Oh, yeah. Oh, yeah. Supercomputers were small compared to us. But our machines were not trying to be fast on a single problem; they were trying to do lots of things at the same time, with software that was doing lots of things at the same time. We didn't try to break one thing into many problems. Nobody had a problem finding 10,000 things to do at the same time. That was just people talking to you.
You just needed to do them. And Java was a natural place to do this. We had some interesting success on the hardware side, but the world shifted quite rapidly there too. Virtualization came about, hypervisors started slicing machines into pieces, and multi-core machines started appearing everywhere.
And we evolved to what today is the cloud, right, where you don't need special hardware to do this. Commodity hardware powers the cloud pretty well. But along the way we picked up a bunch of cool techniques. At the time when we started off, it was, "How do I get a big honking piece of hardware that can run hundreds of JVMs at the same time and do a good job?" Where each JVM could be large and not stall or pause or be hurt by its neighbors.
And that's what we put a lot of work into. We built some JVM technology for that, and that's really what we formed a company for. And then from there we took what was good and evolved and evolved and found what people did with it. We ended up building JVMs that were able to handle lots of memory without stalling or pausing, because that seemed like an obvious need.
How would you run a 50 gigabyte heap or a 200 CPU thing if you were stalling for seconds or minutes at a time at those scales? That was obviously not going to happen. We built stuff to solve it. We thought everybody was.
Yeah. That must have been a colossal amount of work, making the object-oriented, garbage-collected world work seamlessly, right?
Yeah. I mean, garbage collection was certainly one of those annoying problems that was in the way. It wasn't our top problem; it was number three or four in the way. We took what we could, and we used other people's stuff where we could, and we built what we had to: a garbage collector that could work at the scale things were running.
It just wasn't available for others, so we built it. We honestly thought that there were three or four other teams doing the same thing at the same time. But we ended up with what is today the C4 collector. And it's had an interesting decade and a half of being kind of the supreme garbage collector from the point of view of being able to handle scale and concurrency and all that.
And in recent years you can see the trend of people doing what it does. Newer garbage collectors in the Java world are evolving to do the same thing that C4 has been doing for a decade plus. And we'll have multiple collectors that do the same things eventually.
Okay. Teach me something about what it does. Because I know university undergraduate level garbage collection techniques, so update my knowledge a bit please.
Well, I have this interesting talk called Understanding Garbage Collection that you can probably find on YouTube somewhere. And it evolved from exactly that question. I was trying to explain C4 and lay down some basics. I tried five minutes of intro; that didn't work. Ten minutes; that didn't work. Eventually it turned out 45 minutes of intro works, but we don't have that amount of time.
Let me try to hand-wave very quickly. Garbage collection is an interesting but very simple problem. A garbage collector needs to find all the dead objects and get rid of them, and keep the live ones around. And it needs to do it at the rate that you create new objects. It needs to get rid of old objects that are no longer in use to make room for the new.
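To make the tracing idea he's hand-waving at concrete, here's a toy sketch (names and the tiny object graph are purely illustrative, nothing like a production collector): objects form a graph, anything reachable from the roots is live, everything else is garbage.

```java
import java.util.*;

// Toy tracing "collector": walk the object graph from the roots,
// marking everything reachable as live. Whatever isn't marked is dead.
class ToyHeap {
    static class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();
        Obj(String name) { this.name = name; }
    }

    // Breadth-first walk from the roots, collecting every reachable object.
    static Set<Obj> markLive(List<Obj> roots) {
        Set<Obj> live = new HashSet<>();
        Deque<Obj> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            Obj o = work.pop();
            if (live.add(o)) work.addAll(o.refs); // mark once, then trace its refs
        }
        return live;
    }

    public static void main(String[] args) {
        Obj a = new Obj("a"), b = new Obj("b"), c = new Obj("c"), d = new Obj("d");
        a.refs.add(b);   // a -> b, and a is a root
        c.refs.add(d);   // c -> d, but nothing roots c
        Set<Obj> live = markLive(List.of(a));
        System.out.println(live.contains(b)); // true: b is reachable via a
        System.out.println(live.contains(c)); // false: c and d are garbage
    }
}
```

The "at the rate you create new objects" part is what makes the real problem hard; this sketch only shows the live/dead distinction.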
There are many techniques. We're just now coming up on almost 50 years of academic papers on techniques and ways to do this and really cool algorithms. And terminology has evolved over the years. What used to be called parallel is now called concurrent, and things like that.
But fundamentally you need to find all the live stuff, you need to somehow get rid of the dead stuff, and you need to make room for the new stuff. Because if you don't move things around, you'll end up with Swiss cheese memory and no room to put your stuff. It's, "I need room for this 10K array. I have lots of memory, but nothing 10K-sized is free in there." So-
It becomes like the disc defragmentation problem, right?
... Exactly. Memory naturally gets fragmented in all applications that have variable object sizes. If you don't think you have variable object sizes, try parsing JSON. You have objects of all sizes, and over time you'll end up with Swiss cheese memory. Compaction is an inevitability. You have to compact memory.
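The Swiss-cheese effect he's describing can be shown with a toy simulation (the 20-slot "heap" and block sizes are made up for illustration): plenty of memory is free in total, but no single free run is big enough, and compaction fixes it.

```java
// Toy illustration of heap fragmentation: a 20-slot "heap" where blocks
// are allocated and freed until total free space is plentiful but no
// single free run is big enough. Compaction slides live blocks together.
class Fragmentation {
    // heap[i] holds a block id, or 0 if the slot is free
    static int[] heap = new int[20];

    static void alloc(int id, int start, int size) {
        for (int i = start; i < start + size; i++) heap[i] = id;
    }
    static void free(int id) {
        for (int i = 0; i < heap.length; i++) if (heap[i] == id) heap[i] = 0;
    }
    static int largestFreeRun() {
        int best = 0, run = 0;
        for (int slot : heap) {
            run = (slot == 0) ? run + 1 : 0;
            best = Math.max(best, run);
        }
        return best;
    }
    // Compaction: move every live slot to the front, preserving order,
    // leaving all the free space contiguous at the end.
    static void compact() {
        int dst = 0;
        for (int slot : heap) if (slot != 0) heap[dst++] = slot;
        while (dst < heap.length) heap[dst++] = 0;
    }

    public static void main(String[] args) {
        // allocate five 4-slot blocks, then free every other one
        for (int id = 1; id <= 5; id++) alloc(id, (id - 1) * 4, 4);
        free(2); free(4);
        // 8 slots are free in total, but the largest hole is only 4 slots
        System.out.println(largestFreeRun()); // 4: an 8-slot request would fail
        compact();
        System.out.println(largestFreeRun()); // 8: the free space is contiguous again
    }
}
```

This is exactly the disk-defragmentation analogy from the question: the total free space never changed, only its shape.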
You need to somehow move the live stuff together to make the empty space contiguous, or at least contiguous in large parts. And compaction is one of those things that people have tried to avoid for decades, or to delay in many ways for decades. We eventually compact. That's what a full GC is, and what things that fall back to full GC are, and that's what other techniques are.
But traditionally it was all about delaying this. This is a terrible thing. It gets worse the bigger my application, the bigger my heap, the bigger my live set is, so let's not do it now. Let's do it later, and later, and later.
And we have many techniques to try to do other things instead of this, until it gets bad enough that you can't avoid it and you have to do it. Our change in approach was to say, "No, let's just do that hard thing, and only that hard thing, and do it all the time." C4 actually stands for the Continuously Concurrent Compacting Collector. That's four Cs.
Snappy. I like it. Yeah, okay.
All it knows how to do is compact. The only way it recovers memory is by compacting.
Oh, right. Okay.
And it took on the hard problem of, "How do I compact without stopping your application?" So concurrent compaction is the fundamental thing that we did. And we weren't the first to do compaction or concurrent compaction by any means, but we're probably the first to do a practical one at scale.
There's an actual peer-reviewed academic paper on the algorithm if people want to read it. And there are probably 17 other ways to do concurrent compaction well. There are at least five that I know of that are published, and a couple of others that are actually in implementation.
But for some reason people didn't build these for real, because of perceived overheads or complexity, or, the most common one, "Do I really need this? It seems to work okay right now. Yeah, people are complaining, but it's working."
It was really about being able to address scale: modern computers with gigabytes, sometimes tens of gigabytes, of memory, and CPUs generating 10 gigabytes per second of new stuff, so you need to keep up with it. And those are the numbers I just talked about. That's what a small AMI or instance on a cloud with eight or 16 cores can do. It doesn't take a supercomputer to hit these numbers anymore.
Yeah. Yeah. Our sense of scale has changed massively in the past 20 years. Right.
That garbage collector was certainly a really cool innovation, one that has lasted all the way to here, and one that I'm happy to see lots of other things in Java now mimicking, which is great.
But founding a company and the CTO thing, that really has more to do with, "I run into a lot of ideas that I thought other people should do, and advised other people to do, or helped other people do."
But every once in a while you run into stuff that you won't let any other people do because you want to do it. And I think that's how you fall into that trap of actually founding a company and being a CTO.
How many of you were there co-founding it, out of interest?
There were three of us. Scott Sellers, our CEO at Azul today, is one of them. And another colleague of mine, Shyam Pillalamarri, who I worked with at Nortel and Shasta Networks before, was another co-founder. And we're all very technical engineering minds, or at least we were at the time.
Yeah. Presumably at least one of you has been forced into management and/or sales.
Yeah, that's another problem with success. You end up having to do that part of the business. I'm very happy to have a strong partner that does those things. I try to stay out of that part of the business and stay on the technology side.
Yeah. Yeah. You've accidentally hit on the reason why I'm no longer a CTO and co-founder, because I would've got sucked into management.
Well, one of the things I very carefully have done as a CTO is made sure structurally that I have nobody reporting to me. I've worked-
You're allowed to do that.
... Yeah. I've worked successfully with a very good, strong engineering organization that is run professionally and well. And my ability to work up and down and across that organization on collaboration and direction, and to get my hands dirty working on architecture, is very important to me. To do that, I've avoided the burden of managing people.
A smart move in my opinion. With apologies to all the managers listening, it's not for all of us. I guess then to understand what your working life is like we need a better understanding of what Azul is doing today, because you're no longer building supercomputers. What's your actual day-to-day business?
In the late 2000s we shifted from being a hardware-focused company, where we made some pretty sweet hardware, to a software company running primarily on Intel at the time. Intel CPUs had gotten good enough, and actually their designs started looking more and more like our machines, to the point where we didn't need to build custom hardware anymore.
Commodity hardware was getting good enough and was starting to show the same kind of scale problems that we had solved in our hardware. We made that very difficult pivot from a hardware company into a pure software company. This happened in the late 2000s, and we introduced software-only products right around 2010, 2011.
And they formed into what we called Zing at the time, what is today called Platform Prime or Azul Platform Prime, which is our premier JVM, a JVM that is fast and can handle scale, and doesn't stall or pause when it does those things.
And that was an interesting shift, because we had a lot of customers using our hardware successfully at the time. And we were pleasantly surprised that they stayed with us through the transition, and we were able to give them both the great hardware we had before and transition them to just virtual appliances, at the time on VMware ESXi 6.0, and eventually to just a JDK that runs on any Linux environment.
Okay. Without getting too sidetracked, that's interesting. How did you bring them along for the ride?
It turned out they liked our product. We actually kept supporting our hardware for quite a few years. I remember having to go talk to our last hardware-support customer in New York and give them the news that, "No, we will not renew hardware support for another year, eight years in."
And the machines were good. Lots of eCommerce on the internet was running on them. Lots of financial-sector things were running on them. And people needed what they did; when we had software that could do the same on commodity hardware, they shifted to that.
But really, being able to handle that kind of work at scale, smoothly enough, without pausing or doing strange things like cutting things into lots of little pieces just to scale, was hard at the time, and people just chose to use our platform for it.
It's still to a very large degree what people do with it, but now they do it with microservices, and with infrastructure software like Cassandra and Kafka and other things like that.
Oh, you said the magic word. We're going to bring in Kafka at this point. Where does Azul link to Kafka?
Azul as a company does JVMs. We've expanded, by the way, beyond just the technical excellence and speed and metrics and that kind of stuff with our Zing and Platform Prime JVMs, to supporting OpenJDK in general. We make one of the premier distributions out there that people can use for free or get commercial support for. It's called Zulu, and it's one of the popular ones out there.
Sorry, just to pause you on that. A distribution of OpenJDK, is that similar to how you get a distribution of Linux, where the core is the same but you've got different bits of flavor on top?
Yes, very much so. OpenJDK is a source code project. If you have bits, somebody built them. That's a distribution. Just like there's the Linux kernel source code project, but most people don't get the source code and build it themselves; they get a distribution of it from somewhere.
There are plenty of OpenJDK distributions. Azul has been running one of the popular ones since 2013. Zulu is now the longest-standing OpenJDK distribution that is available for all the platforms and all the versions. And we have a free one. People use it. We have tens of millions of downloads a quarter on this thing.
I'm sure I've downloaded an OpenJDK with the word Zulu in the file name in years gone by and not actually thought about it.
I'm pretty sure we've ended up as the JDK inside the Confluent Docker Hub things.
It shows up everywhere. And that's just the free distribution. We also offer commercial support for that. And then we have our better-metrics JDK, which is the Platform Prime JDK. What's common to all of these is that they're JDKs, JVMs that run Java stuff; any application built in Java needs one of these to run.
And the link to Kafka is pretty simple. Kafka runs on the JVM, at least the core of it does. The actual clusters and the brokers run on JVMs; they're Java-based applications, even if parts of them are written in Scala and stuff like that. Scala is just a JVM language. And then Kafka clients could be in any language potentially, but there's a lot of Java-client, JVM-based use of Kafka as well.
Anywhere you see Kafka, you see JVMs. And anywhere you see JVMs, you need to find one and pick one to run on. We make some pretty good JVMs, hence the link, right? Specifically in the area of Kafka, and this by the way is true for lots of infrastructure software built in Java, the last decade has been great for Java in the sense of going from just the thing people build applications in to the thing most infrastructure is built in.
When we started off in the 2000s, people were building applications and app servers in Java, but databases were built in C and C++, right? Message systems were built in C and C++. Search stuff was done in C and C++. That's not the case anymore.
If you look under the hood of any large piece of infrastructure, whether it's Hadoop, or the largest databases on earth, or messaging systems, et cetera, usually you see a Java-based core. Most infrastructure, most heavy-lifting infrastructure out there right now, is powered by JVMs. And being a good, efficient, fast JVM is very useful for that.
Yeah. What do you think it takes to make the best JVM for that kind of workload? Because we've talked about garbage collection, we've talked a bit about metrics, right?
Yeah. I look at it as: what makes one JVM better than another? We don't argue about features and what the language should be like. That's a point of agreement across everybody, right? We develop it together in OpenJDK. Java 11 is Java 11. And new Java will be new Java. There's no fragmentation there.
But when you come to measure how this thing executes, with metrics: how fast is it? How much throughput am I getting? What is my response time? What is my consistency of response time? What are the outliers like? What kind of machine do I need to run this if I need this much load handled to be happy? Those are metrics. You can measure them. And you can run on a slow one, a medium one, a pretty good one, or a great one. I mean, you get to choose, right?
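The "consistency of response time" and "outliers" he lists are usually read off a latency percentile distribution rather than an average. A minimal sketch of that kind of report (class name, sample values, and the nearest-rank method are all just illustrative; in practice people reach for a library like HdrHistogram, which Gil Tene himself wrote):

```java
import java.util.Arrays;

// Minimal latency-percentile report: the kind of numbers you'd compare
// across JVMs. The samples below are made up for illustration.
class LatencyReport {
    // Nearest-rank percentile over a sorted copy of the samples.
    static double percentile(double[] samplesMs, double pct) {
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // 99 fast requests and one big stall: the average barely moves,
        // but the tail percentile exposes the outlier.
        double[] samples = new double[100];
        Arrays.fill(samples, 1.0);   // 1 ms typical response
        samples[57] = 250.0;         // one pause-sized outlier
        System.out.println(percentile(samples, 50));   // 1.0
        System.out.println(percentile(samples, 99));   // 1.0
        System.out.println(percentile(samples, 100));  // 250.0
    }
}
```

The point of comparing JVMs on these numbers, per the discussion, is that two JVMs can be functionally identical while one has a much worse tail.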
And they'll all work functionally and they'll all do the same thing. It just comes down to whether you want the performance because you're really into speed, or you need the speed for your business, or, most likely, it's about how much it costs.
Most businesses run on more than one machine. If you see a business running on 53 machines, that's because 52 wasn't enough. And what does not enough mean? Well, they probably tried to run it on 40 and people screamed at them and weren't happy, so they added machines until people stopped screaming.
That somehow makes it sound like a Halloween project, but okay.
Yeah. Well, but in my opinion, that is in reality how people manage capacity. Capacity-
As soon as you've broken out of the "we can only run on one machine" stage, you just find the comfortable number, right?
... Yeah. And luckily we live in a decade where nobody's surprised that we can run on more than one machine. That's not magic, that's normal. That's how everybody is supposed to build. But once you do that, the question of how many machines do I need is a cost question, right?
If I can keep my store running, and my customers happy and buying, and I need 130 machines to do that, that's great. But if somebody comes around and gives you a JVM that cuts that down to 80, well, you might take a look. And if your AWS or Google Cloud or Azure bill drops with that, that's pretty good too.
The inverse is also true. Sometimes we find really cool technology and we're willing to pay with a lot of spend in order to get it, and we get productivity and time to market and all the good things of teams being able to deliver something for the business quickly.
The best is when you can do both, when you can actually have good productivity and good development cycles and good time to market, but also be able to run in an efficient way, in a way that doesn't spend as much money for the same capacity.
Yeah. How do you actually achieve that though?
Well, you asked what it takes to make a great JVM. Usually it starts with needing a good engineering team that spends a lot of time thinking about how to make this JVM good. And that means you look at the metrics and you work to improve them. You do it diligently and nonstop. And I mentioned the garbage collection thing before; that was actually pretty well solved for us almost a decade ago.
We've been incrementally making it better and better. And yeah, we've grown our maximum from 300 gigabytes to 20 terabytes, and our minimum from a couple gigabytes to a couple hundred megabytes. But that's kind of a solved problem for us. Our things don't pause, right?
It's about speed then. How fast can I get this to run while not pausing? And that's where, for example, we've invested heavily in JIT compiler technology, new JIT compiler technology, over the last several years, which lets us build optimizations that simply produce better sets of instructions to run the same things and get more out of the same exact CPU.
And when we started off, we thought we'd squeeze a few percent out, but it turns out there was a lot more to squeeze. We leveraged lots of good project work from LLVM into our JIT compiler, and we applied a lot of our own strong engineering.
And the outcome of that, when we run Kafka, is that you can run Kafka on our Azul OpenJDK build, or you can run Kafka on our Zing, Platform Prime build. And it will run in exactly the same way, but Prime will carry 40-50% more messages per second through the same broker and the same piece of hardware. Because it's fast, because the code is optimized, because we've spent several years doing that and keep spending the time to do that.
Okay. I want to drill into something specific there then. I'll define what a JIT is and you can tell me if I'm right or wrong. The idea of the JIT is: the code has been compiled, it's running, and you can find some more efficient set of instructions as you're running, Just-In-Time, to speed up the performance of that particular piece of code.
You find you're hitting a for loop 10,000 times and you think, "Oh, well, this is clearly something we're doing a lot. More efficient code would speed this up." Is that a fair definition?
Yes. And think of that iteratively. Most JVMs today start off with interpreted code. You have bytecode; you run it one bytecode at a time. You don't want to be doing that with anything you're actually spending money on, okay? You just do it with stuff you don't run a lot.
And there are various techniques to try and make it fast from the start. But fundamentally, the word HotSpot, which is the name of the technology in the core JVM we all use, coming originally from Sun and Oracle and OpenJDK (it's still at the heart of our Zing JVM as well), captures the notion that you look for the hot spots in code and you optimize those.
Java has been doing that for 25 years now, in various ways, right? That is, you look at the code, you look at what's running hot, and then you pay attention to that and you optimize it.
At run time.
At run time. That's the Just-in-Time part. Well, I've got a lot of things. Which one am I going to optimize, and how? Now, there are two parts to this. First of all, when it started off, it was: we don't have the power to optimize it all, so let's just optimize the important stuff. The hot stuff is the important stuff.
That's one reason to do a JIT: only spend the effort to optimize what is needed. But the other reason that evolved over time, which is very powerful, is that with a Just-In-Time optimizer, you get to optimize for what is actually happening rather than what might happen. You can profile and observe the code behavior and say, "Yes, I'm running this loop 10,000 times, but this loop has an if statement in it. And it looks like only one side of the if statement is ever being used. And that other side, that's the one that will run next year."
Theoretically, when it says, "If the year is 2023, then do this," that could happen next year, but right now it's not happening. Let's highly optimize assuming that this side will happen and that one won't. Now, you've got all kinds of interesting speculative optimizations that you can do here. You could say, "Okay, do this, and make that side slower."
You can go all the way to saying, "I don't think that's ever happening. Let's optimize assuming it doesn't happen." And why don't I think it's happening? Well, it's been running 10,000 times and it hasn't happened, so I'm guessing it's not going to happen. Maybe I can get faster code based on that guess. It's a guess, but I can make faster code. That's speculative optimization.
And the key to speculative optimization is to speculate, you have to be able to recover from being wrong. Right? If you can say, "Well, if I was wrong, oops, throw the code away, let's make new code," then you're allowed to make the speculative guesses that let you get faster code. If you leave speculation behind, you leave half of the speed behind. I'm hand waving here. It could be 30%, it could be 80%. But a lot of our speed comes from the ability to optimize speculatively based on actual experience in this JVM on these numbers.
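His speculate-then-recover loop can be modeled in a few lines of Java. This is a toy simulation of the mechanism, not how a real JIT emits code: the "compiled" fast path drops the rare branch entirely and keeps only a guard; when the guard fires, we "deoptimize" by falling back to the general slow path.

```java
// Toy model of speculative optimization: the "compiled" code assumes the
// rare branch never happens; a guard detects when the assumption breaks
// and control falls back to the general-purpose path ("deoptimization").
class Speculation {
    static class DeoptException extends RuntimeException {}

    // General version: handles every case, including the rare one.
    static int slowPath(int year) {
        return (year == 2023) ? rareCase() : commonCase(year);
    }

    // Speculated version: "compiled" assuming year != 2023. The guard is
    // the only trace left of the rare branch; its body was dropped.
    static int fastPath(int year) {
        if (year == 2023) throw new DeoptException(); // guard: assumption broken
        return commonCase(year);
    }

    static int commonCase(int year) { return year + 1; }
    static int rareCase() { return -1; }

    static int run(int year) {
        try {
            return fastPath(year);
        } catch (DeoptException e) {
            // "throw the code away, make new code": recover via the slow path
            return slowPath(year);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(2022)); // 2023: fast path, the guess held
        System.out.println(run(2023)); // -1: guard fired, recovered correctly
    }
}
```

The key property, matching what he says next about correctness: the speculation can be wrong, but the observable result never is.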
Right. But you still maintain correctness.
Yes, correctness is key. To optimize speculatively, you have to be able to deoptimize: detect that your assumption was wrong, that this code is not safe, is not correct, and not run it. And you can do that in all kinds of interesting ways in the code. "Okay, I've got an if, and it went left. Oops! I didn't think it would do that. Let's throw the code away." But that's the code detecting it on its own.
But you can actually apply optimizations that are environmental, things like: I assume that there's no other class implementing this method, therefore this is the only implementation in the universe and I can inline it with no checks. But oops, somebody loaded a class that implements this method too, and now that code is not right. Before the class is loaded, throw away the code.
That's an example where the code is fast without any checks, because something environmental, the JVM, is detecting things that break assumptions in the code and throwing away the code before those assumptions are broken. And there are plenty of interesting examples and techniques. For example, in Zing and in our Prime JVM, we have a compiler called Falcon, by the way.
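The environmental invalidation he describes, "before the class is loaded, throw away the code", can be sketched as a toy state machine (all names here are illustrative; real JVMs do this inside class loading and the code cache): while only one implementor of an interface exists, the devirtualized code is valid; registering a second implementor invalidates it before any call can go wrong.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of class-hierarchy-based speculation: while only one
// implementor of an interface is loaded, a call site can bind straight
// to it ("inline with no checks"). Loading a second implementor
// invalidates that compiled code before it could ever run wrong.
class ClassHierarchySpeculation {
    interface Greeter { String greet(); }

    static final List<Class<? extends Greeter>> loaded = new ArrayList<>();
    static boolean compiledCodeValid = false;

    static void loadImplementor(Class<? extends Greeter> c) {
        loaded.add(c);
        if (loaded.size() > 1) compiledCodeValid = false; // invalidate on load
    }

    static void compileAssumingUniqueImplementor() {
        if (loaded.size() == 1) compiledCodeValid = true; // safe to devirtualize
    }

    static class English implements Greeter { public String greet() { return "hello"; } }
    static class French  implements Greeter { public String greet() { return "bonjour"; } }

    public static void main(String[] args) {
        loadImplementor(English.class);
        compileAssumingUniqueImplementor();
        System.out.println(compiledCodeValid);  // true: devirtualized code is safe
        loadImplementor(French.class);          // assumption broken at load time
        System.out.println(compiledCodeValid);  // false: code thrown away first
    }
}
```

Note the ordering: the invalidation happens as part of loading the second class, so there is no window where the unchecked inlined code could observe the wrong method.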
And Falcon does some pretty aggressive optimizations, including what we call speculating on the finality of values. We all know that static final something-equals-something is a constant. I can take that constant and let the compiler know: that's the number. Constant-propagate it, throw away code that assumes otherwise, all that. That's safe. That's correct.
But what if I have something that could change, but doesn't? And what if saying it's a constant could give me a bunch of speed, but it could change? Well, you can actually apply things to find opportunities to say this is in practice final, truly final, effectively final. We've got all kinds of nice terminology for those. Optimize assuming it is, and when we detect a change to it, before the change happens, the code will disappear.
Right. So you are kind of guessing what the programmer should have put in an ideal world and allowing yourself to optimize for that.
Yeah, definitely. In some cases it's, "Hey, you forgot to say it's final. Guess what? It looks like it's final, so we're going to guess it's effectively final." In other cases you said it's final, but under the Java semantic rules it could change.
For example, instance final fields in Java classes are not truly final, because reflection can change them, and deserialization, for example, will change them, and Unsafe can change them, and things do change them. Final fields do get changed.
They don't all get changed. Most of the time they don't get changed. But if you try to optimize by saying a final field is a constant, you'll very quickly find out that your program doesn't run. The trick is knowing where it's valid and where it's not, and how to recover when it stops being valid, so you optimize in all the places you can, but don't optimize the ones you shouldn't.
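The "final fields do get changed" point is directly demonstrable with standard reflection (the Config class and values are made up for illustration). This is exactly why a compiler that blindly folded the field to its first value would produce a wrong answer:

```java
import java.lang.reflect.Field;

// Why a JIT can't blindly treat an instance final field as a constant:
// reflection (as deserialization frameworks do) can rewrite it.
class FinalFields {
    static class Config {
        final int limit;
        Config(int limit) { this.limit = limit; }
    }

    public static void main(String[] args) throws Exception {
        Config cfg = new Config(10);
        System.out.println(cfg.limit); // 10

        Field f = Config.class.getDeclaredField("limit");
        f.setAccessible(true);         // permitted for *instance* final fields
        f.setInt(cfg, 99);             // the "final" field just changed

        System.out.println(cfg.limit); // 99: folding 10 here would've been wrong
    }
}
```

A speculating compiler can still treat `limit` as a constant, per the discussion, as long as the runtime invalidates that code before a write like this one completes.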
This sounds phenomenally complex. You must be balancing the Falcon engineering against reading the Java spec to death.
One answer is yes. But to be fair, a lot of the techniques I just talked about are not ones that only we do at Azul. They've been done across the Java world. Speculative optimization based on recorded assumptions has been done for quite a while. Java is one of the leading platforms that does it, and has been since the late '90s.
And a lot of the fact that Java lets us write clean, object-oriented code, with good encapsulation and good speed, has to do with that. For example, the fact that we get to encapsulate our fields with getters and setters, and do that without getting hurt on speed, is because all JVMs will speculate that a getter has a unique implementation and inline it, and do that thing I talked about where, if a class overrides it, they will deoptimize and all that.
But in practice, people don't override getters, so we get to optimize them. And that's not something Azul created, it's something the entire Java world uses. We just get to do more of this, right? And the reason we get to do more of this is we've invested in a JIT platform, our Falcon JIT, that lets our people be very productive.
The number of new optimizations we can bring to market, the rate at which we get faster: those are things we invested in, and it's paid off. When we started off we weren't 50% faster on Kafka, but over the last three or four years we've gotten faster and faster and faster, and we've opened quite a gap by now.
Wow. Do you have something like, and I'm speculating here, a custom language for writing optimizations in Falcon?
It's not really custom. Falcon uses LLVM. LLVM is a very popular compiler and optimization project. It's used for many languages; Clang and Rust and others are built on it. And its core is built in C and C++; that's the language people write it in.
But within it, and this is true for all compilers, there's an intermediate representation that you translate programs, or compilation units, into. And within that intermediate representation, you do all kinds of transforms and analysis and manipulation.
There's a language that compiler people use when talking to each other, and a whole bunch of useful libraries they use to manipulate all that. I mean, it's written in C and C++ at the end, but the closest thing to a language there is the intermediate representation, which acts, in effect, as a language.
Right. Okay. [inaudible 00:37:52]
And that's the thing you manipulate. Now, you do that at one level, and there are semantic things that you bring from a higher level, like specific language things that you could do, like the knowledge that Java has rules about class loading, and virtual methods, and the ability to analyze the universe around you to figure out what you can optimize.
Those become more specific to a problem or language. But we build that in and on top of the LLVM engine. And we actually upstream a lot of code to LLVM that in our opinion enabled it to be a good JIT compiler. The ability to do optimizations in the presence of garbage collection, in the presence of the need to deoptimize code, those were things that were kind of theoretically there.
But in reality, if you tried to apply them, no optimizations would work. Making it so it's practical to do those things and heavily optimize is something we worked on for several years. And LLVM in my opinion is a good engine for building other JITs in, not just for Java. We just focus on the Java one.
Interesting. Okay. I think that's as low level as I fear to tread. I'm glad you're [inaudible 00:39:11]
Oh, it's very far away from the Kafka subject and the messaging subjects that we've talked about.
Well, let's climb back up the stack, because it must be the case that different user space applications could benefit from different kinds of optimizations, right? But is that something you get into? Do you end up saying, if you're mostly a request-response HTTP server, we should be optimizing this thing, versus a CQRS pattern over there?
It's absolutely true that different applications and different patterns have different optimizations they can benefit from, and some optimizations are good for this, not that. But it's not as big as the patterns at that level. It's more that there are idiomatic things that happen in the Java language or in the Scala language, or that happen when you do ring buffers and message processing or not.
And so there are patterns and code that will happen, and opportunities to optimize in them, because often the patterns and code are built in a language that could generically allow for a much wider set of things to happen. But if you could just prove certain things you could do better.
For example, you could have a collection that could hold any type, but it happens to only hold strings. Now, if it was declared that way, that'd be great, but Java has type erasure. At the bottom level, we don't even know this is a collection of strings. But you could profile it, speculate for it, do all kinds of things for it, and in practice only have to run the cases that deal with strings or integers or whatever that is.
And minimize the paths that are generic, you could get faster code. At the level of other things, there are things that are just the code could be more efficient. There are things that are more systemic, like how objects are allocated, how memory is compacted together or not, locality of fields, patterns of fetching memory before you initialize it.
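The type-erasure situation can be sketched in plain Java. This is a hypothetical illustration, not how any particular JIT is implemented: at runtime the JVM only sees a collection of `Object`, so a speculated fast path has to be guarded by a type check, much like the guarded code a profiling JIT would emit:

```java
import java.util.ArrayList;
import java.util.List;

// A toy sketch of the erasure situation Gil describes: the declared type
// is erased, so the JVM sees only Objects. Profiling may show every element
// is a String, letting the JIT emit a String fast path behind a guard.
public class ErasureSpeculation {
    static int totalLength(List<?> items) {
        int total = 0;
        for (Object o : items) {
            if (o instanceof String) {            // speculated common case
                total += ((String) o).length();
            } else {                              // rarely-taken generic path
                total += String.valueOf(o).length();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        words.add("kafka");
        words.add("jvm");
        System.out.println(totalLength(words));   // 8
    }
}
```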
All kinds of things like that, you can certainly twist some things and get more efficiency. There are good examples in very heavy patterns we use everywhere, like serializers and deserializers of any sort, be it Java serialization, JSON, XML, whatever those are, where there are patterns of dictionary lookups and other things that happen, where that pattern could probably be made 20, 30% faster if you recognize it.
But usually the recognition is sort of at the code analysis level of a method, not so much the program. The stuff that is program level, those do affect things like the garbage collector. How does that interact with those patterns? Say you do a lot of dereferencing of weak references. Most applications don't do that, but when you do it, it ruins the life of a garbage collector that assumed it won't happen.
There are all kinds of things around those, or vice versa. You do that a lot and that's why it's slow, so there's a faster way of doing it. But at the end you can think of it as from the JVMs point of view, we have to run the code we're given in exactly the way it's meant to run from a functional perspective. And our job is to try and make it run fast and cheap and smooth, and to try and ...
When I say smooth, I mean at the same speed as much as we can all the time. It's never that way, but you reduce the bumps and the hiccups. And the real world is one where you run into real things, you spend time trying to be better at them. You either are told you have a problem and you fix it. You find a gap and you find out how to close it, or you just are told, "This is what I'm running. Let's spend time thinking about how to make it better."
And we've worked with a lot of our customers over many years. One of my favorite things to get from customers is workloads. Because if we get their workloads into our labs, into our regression tests, then we can study them and get better at them. And the more representative of the workloads, the better. One of our customers and a partner that we've had for many years is LMAX.
LMAX has done a lot of really cool things in open source, including the Disruptor, which is a very common pattern that people use in many applications. Even Log4j uses the Disruptor these days. And we've been running LMAX performance tests in our labs for many years, so we are really fast at LMAX code. Because guess what? We run it, we pay attention to it, we make it faster.
And since lots of people use that code and as the core to many other things, because they do a lot of open source libraries, we get to be faster to things that use them as well. That's an example. We generally track in the order of 400 to 500 different workloads that we actually test performance on, try and find what we could do better on.
In some cases it's, "Can I make this faster?" In some cases it's, "Well, we've got a little weakness here. Can I close the gap?" And over time our engineers simply make it better, their job is to find opportunities in this to make it better. And others' job is to bring in more workloads to that corpus so we can figure out what to be better at, right?
Yeah. It sounds a little bit like the world of virus detection, right? You need to go out into the wild and find those problems while you're simultaneously fixing them.
Yes, but without the urgency.
Yeah. You don't quite have the zero-day problem, which is nice.
If we don't find it, then it'll take a few more weeks or months or a couple years before we get faster at it. Nobody's going to break. Nobody's going to die.
Yeah, sounds like a much less stressful world.
Exactly. It's a much nicer place to do the [inaudible 00:45:36] than with black hats and white hats on security issues. We do some of that too, by the way, on the OpenJDK side. But from a performance point of view, performance is fun in the sense that you get to make things better, but if you don't, nobody gets hurt.
Yeah. I always think the thing with performance is like it's one of the best areas of life in that I wish I could look at my monthly mortgage bill and study it really hard and make it 10 times cheaper. But that's never going to happen. With computers, you can get that.
You can. And I've seen optimizations that make things 10 and 100 times cheaper. But in reality, to do the kind of work we do, you need to be good with a lot of half percents. It's an accumulation of a lot of good work rather than we found that one thing that makes it 10 times faster. [inaudible 00:46:36]
I was hoping for a silver bullet.
Yeah. Often if you look structurally at the application itself, you can ask high level questions of, "Why are you doing that? You don't need to do that. Instead of making 5,000 database calls, how about if you make one?" You can find 10Xs there.
But if you're going to make those 5,000 calls, because that's what we have to do, and you're going to make them more efficient and all that, you can get very impressive tens of percents of improvement. But usually you don't find 10X that way. To find 10X, you have to look at code that is probably pretty silly. Right?
And most compilers will eliminate that 10X pretty quickly. One of the examples I like to give is compilers can prove away loops, right? For i equals zero to 100, i++. Well, we can analyze what i will be at the end.
We don't need to run the loop, right? And if nobody looks at any value in the middle, we don't need to do this. Or the way a compiler thinks about this is, "I'm going to say I ran this loop really, really fast. Prove me wrong." If you can't tell the difference, then it's correct, and I was really fast, right?
Yeah. If you didn't want the side effect, then let's pretend it never happened.
Exactly. We do a lot of that. Compilers in general do a lot of that. But the opportunities to do those is usually you don't find real 10Xs like that.
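The loop Gil mentions can be written out as a minimal sketch. A compiler that proves nothing observes the intermediate values of `i` can replace the loop with its closed form, which is what "prove me wrong" amounts to:

```java
// A toy illustration of loop elimination: the loop and the closed form
// are observationally identical, so an optimizer may substitute one for
// the other without anyone being able to tell the difference.
public class LoopElimination {
    static int sumLoop(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;                  // what the source code says to do
        }
        return sum;
    }

    static int sumClosedForm(int n) {
        return n * (n - 1) / 2;        // what the optimizer can prove it equals
    }

    public static void main(String[] args) {
        System.out.println(sumLoop(100));        // 4950
        System.out.println(sumClosedForm(100));  // 4950
    }
}
```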
Okay. Then I'll need to rely on you adding up those half percents for me.
Yeah. If we found a way to make Kafka 10 times faster than on any other JVM on the same piece of hardware, we'd be very happy. But we think that 40% is good enough for people to pay attention.
This whole idea of workloads must connect somehow to your cloud product. Does that give you a certain synergy, looking for new workloads to optimize?
Yeah. We've expanded what our JVMs do with all kinds of cool things over the years. The garbage collector was definitely one of the early big moves. We've invested a lot in JIT compilers and did a lot of firsts in the JIT world as well.
But one of the cool things we added recently is something we call the Cloud Native Compiler, which is a facility that lets JVMs do their job better. What a Cloud Native Compiler does is basically it allows a JVM to outsource the job of JITing and optimizing to something that is potentially better than the JVM at doing it.
Right. Take me through that.
When we started off, we ran on big machines. Remember, we made monster hardware then. Our customers would run big workloads with demanding things, so they had serious strong hardware for that. And when you have 32 and 64 cores, then the job of optimizing the code, it takes up a few cores and crunches it and comes up with faster codes, that seems reasonable and easy to do.
But in reality, in most applications we have to optimize thousands of methods to get to fast code. And optimization has various levels of effort that you can apply. And the more effort you spend, the more you're stealing away from the actual workload.
Now, if your machine is big and fat and has room, that's not a problem. But what if you're running in a two core container, or a four core, eight core container, and this container is supposed to use this capacity to do the message passing jobs, or the gateway jobs, or whatever processing your application is doing? You end up having to trade off the amount of work you do to optimize against the work itself.
Now, obviously over a period of a month, if you invest a lot in optimization now it'll pay off, because you're running faster codes, it's cheaper. But those first minutes or hour could be painful if you're stealing away resources to do the optimization that'll only make the code faster later.
You're competing with the workload itself and it also takes time to get to that optimal code and you're slow until you get there. There are always these trade-offs in JIT compilations. How hard do you optimize? How fast do you make the code, at what cost? And that trade-off is inherent. All JIT compilers have to make the trade-off.
Yeah, you've got a limited amount of computing power. Are you going to spend it running your slow but fast enough method, or are you going to spend it figuring out a faster way and then come back later?
Exactly. And that's where people do tiering. They start with only optimize this much, then what's left and still hot, do more and more. But at some point you say, "That's enough. I can't do this here. I can't afford to." The Cloud Native Compiler lets you break that trade-off.
A JVM is a JVM, a cloud compiler's job is to optimize. A JVM could have 2, 4, 8 vCores, whatever it has, and a cloud compiler could have 1,000. And it could have 1,000 now, it could drop down to five later, it could have 5,000 later. Our Cloud Native Compiler is a Kubernetes application.
It's a farm of JIT compilers that serves JVMs. And JVMs can find it and it's a shared resource across them. And when a JVM has a thousand things to compile, it could get that thing to compile for them as fast as it can with the resources it has, because those are only needed very temporarily.
And then the JVM has fast code, and it moves on, and that thing can do the work for another JVM and another JVM. Or if nobody needs the capacity, it could shrink it down and it could bring it up.
So literally, the two core process you're running is going to say, "Here's something I'd like to optimize. I'm going to outsource the work of optimizing while I carry on running." And then you can stitch that optimized version back in later when it shows up.
Yes. In fact, that's exactly what's happening in the JVM already. Our Falcon compiler is running in the JVM. The JVM noticed the code is hot. It nicely asked the Falcon compiler threads to please optimize it while it's running. And when they're done, they give it the code, it installs the code and now it's fast.
We just extricated that work to an external service. It's doing exactly the same thing. Every single optimization, every single piece of data, every conversation it has is the same, except that it's happening across a service with some very sophisticated, interesting caching technology to make the communication very effective.
But the JIT optimization we're doing in the Cloud Native Compiler is exactly the same as would be done locally, or can do all the same things. In fact, it's exactly the same but turned up to the highest optimization level, because we can afford it.
Yeah, because you're using a dedicated optimization machine.
In fact, one of the things we're looking forward to doing in the Cloud Native Compiler is doing more and more costly things. Because even we have had to trade off this cost versus benefit thing, and our Falcon has levels 0, 1, 2, 3. You can think of it much like -O0, -O1, -O2, -O3 in compilers. I want to see an -O4, an -O5, an -O6, an -O11.
And those are kind of the optimizations you would never dream of doing locally in one JVM for yourself. Now, there are two things that make it possible for us to do that. One of them is we've extricated the resource. That thing could be a lot more powerful than the JVM that it's optimizing for, and it only needs that capacity temporarily.
Since it's an elastic service, it can afford to ramp up 1,000 compilations, then ramp them down and not pay for the empty resources that it's not using. But even more importantly, when I run a JVM, and it decides it wants to optimize this method this way with that profile, because in this world only these things have happened, that JVM is usually not alone.
There's usually 100 others just like it right next to it that are also experiencing the same thing and wanting this same optimization. And this Cloud Native Compiler has already done one of those for your neighbor.
Ah, so you can cache that work of optimization and reuse it. That's nice.
And because we can reuse the optimization across hundreds of instances that would normally do this themselves, we can afford to spend hundreds of times more compute capacity to crack that optimization if needed.
Right. You're getting economies of scale in the optimization itself.
Exactly. That's nice. This is caching of analysis work to apply to things that need the exact same analysis. And luckily, applications don't run in one instance and do a unique thing.
Most of our applications today run on clusters where we have tens or hundreds of things doing the same things at the same time, so their optimizations are shared, their profiles are shared, their learnings are shared.
And we can actually learn from the profile of one JVM to optimize another, rather than have one JVM learn what it needs to optimize. And after 10,000 things have been slow, it can now decide I want them fast, we can actually apply profiles across JVMs as well.
Right. Yeah. But presumably, we want all of them to run exactly the same across our cluster, right?
Yes. The trick is, how do we know which ones are the ones that we want to have run exactly the same?
Yeah, I can imagine.
The ones in the same cluster, that's what we mean by the same, right? We're load balancing them, they're running the same stuff. But the same application cluster in another data center, that might need a different optimization, because maybe the cluster name is different, and maybe that's a constant that got propagated into the code.
There could be differences, and then you have different applications. HashMap.get() is a method. Many different applications use it and you need to optimize it in many different ways for the application that's running them. Recognizing that this optimization applies in that JVM is an interesting trick.
We have a lot of really cool technology around the conversations and the state questions that happen to decide whether an optimization applies. But at the end, you can think of it as there's a conversation between a JVM and a compiler. The conversation starts with, "I have some code. I need you to make it fast. I don't really understand much. Here it is."
And then the compiler interrogates the JVM, it says, "Okay, great. You have this code. What can you tell me about the profile, the frequency that this happens or that happens? You're calling this method here. Give me that method code too. Maybe I want to inline them together." [inaudible 01:00:32]
Oh, so they actually have a discussion?
Yes, it's a conversation. Because the JVM, without the compiler, is kind of stupid. It doesn't know what to do, it just knows the code is running and it wants it fast. It doesn't know what to ask. The compiler is asking a bunch of questions to figure things out.
That's a back and forth conversation. It usually involves hundreds of back and forths before you get this really nice nine deep inline thing optimized for your profile, right? But it's not because the JVM knew it needs that, it's because the compiler discovered that by having a conversation.
That conversation is being had with an external compiler. And if you think about it, that conversation has a lot of questions and facts in it. If you repeat exactly the same conversation, you're going to get exactly the same outcome of optimization, or more specifically, we've carefully built our Falcon compiler to be deterministic.
If it sees the same questions and answers, it will come out with the same code, right? Lots of compilers aren't deterministic, but you kind of need that. You need it for quality and for QA testing, but it's also very useful for caching.
If I can ask all the questions and I get the same answers, I can see, "For these 300 questions, have I done this before?" And if I have, then here it is. And it turns out that with the right work to normalize things and make things repeatable across runs, you can get pretty nice hit rates.
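The caching idea can be sketched with a toy memo table. This is a hypothetical illustration, not Azul's actual protocol: because the compiler is deterministic, a normalized transcript of the questions and answers can serve directly as the cache key:

```java
import java.util.HashMap;
import java.util.Map;

// A toy sketch of deterministic-compile caching: same question/answer
// transcript implies same output, so the transcript is a valid cache key.
public class CompileCache {
    private final Map<String, String> cache = new HashMap<>();
    int misses = 0;   // counts how many times the expensive compile ran

    // "transcript" stands in for the normalized questions and answers;
    // the returned string stands in for the compiled code.
    String compile(String transcript) {
        return cache.computeIfAbsent(transcript, t -> {
            misses++;                         // pay the compile cost only once
            return "code-for:" + t.hashCode();
        });
    }

    public static void main(String[] args) {
        CompileCache c = new CompileCache();
        String first  = c.compile("hot=yes;elementType=String;inlineDepth=9");
        String second = c.compile("hot=yes;elementType=String;inlineDepth=9");
        System.out.println(first.equals(second)); // true, served from cache
        System.out.println(c.misses);             // 1
    }
}
```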
Yeah. Yeah. I can see that. That's going to take away the cost, not just in one JVM, but over hundreds of them. That's really cool. So-
It takes the cost away, it takes away the time too. You warm up really quickly because you already have the answers, right?
... That ties back to one last question I wanted to ask you, because I feel like I could pick your brains for days. Something that really concerns us here is the real-time nature of data, right? A message comes in, I want to send that message on quickly, without the cost of performance optimization getting in my way. What's the overhead of making the JVM do the work you intended it to?
I think that the sensitivity to latency, or to consistency of latency, is an interesting thing in general. You have a message, you want to send it, you want it to be fast, you want that JIT compiler to have been producing the fastest code for sending it along. But you also want it to be more than just fast, you want it to be always fast, consistently fast, fast the vast majority of the time, whatever level of consistency you're willing to pay for.
Now, this turns into a need whenever you have real-time, semi-real-time things. I'm not talking real-time systems like a brake system for a car, I'm talking about human real time, human business real time. There's a person waiting.
Yeah, faster than the blink of an eye or faster-
Their happiness level, their frowny face level depends on how quickly you respond. And if I'm doing this with 100 people and one gets angry, that's enough. That one is angry, and then another one gets angry. We don't need perfection, we don't need ...
I mean, most people don't need true perfection, but you usually want high percentiles of goodness, like 99, 99.9% is happy, and we'll deal with unhappy ones. To do that, you need the responsiveness of whatever you do to be good at a high number of nines.
And when your message system is a key part of that responsiveness, if you've got a service that does its job by messaging with other services and then composing what they get back and responding, then the consistency of message latency affects the consistency of the service itself, and eventually people's happiness or misery.
The ability to consistently be fast is important. The reality is nobody is ever consistently fast. We're always going to have noise. We'll call it jitter, or hiccups, or stalls, or freezes, or whatever it is, but they're there. And how much of it that's there matters.
Now, with optimization, you get speed. But you also have these deoptimization can make you slow and then fast again. That can create all kinds of interesting inconsistencies. And there's a whole slew of things around how do you learn to optimize well in a consistent way.
I've got a hundred JVMs, each one of them is very hopeful when it started. And it thought this is a good optimization, then it found out it's not, then it run slow code. Then it decided to try again and it did a better job. And being able to learn that experience and share profiles of what not to do is useful.
We have cool things that do that. But probably more important than even the consistency of speed in compilation is the consistency of speed at the JVM execution level. This really is where garbage collection is king, or rather, garbage collection is historically the king of disruption. Before concurrent-
... Yeah. Garbage collection is the reason you get Stop the World, right?
... Yeah. Before concurrent collectors like ours, and some newer ones that are getting better, Java has been known for Stop the World Garbage Collection. And what Stop the World really means is Java is really, really fast between terrible, terrible slowdowns. It's either going fast or it's doing nothing. And it's going fast and then it's doing nothing.
But it has these annoying periods of doing nothing. And how long those periods are and how frequent they are is important. Because if your JVM can do 10,000 messages a second, but every 20 seconds it freezes for half a second, there's going to be a half second of doing nothing on that JVM that everybody is going to feel.
And it's really nice that you're fast the rest of the time, but that half second is terrible. And if you're doing batch streaming, just background processing, that half second doesn't matter. But if you're doing anything that has an SLA, a response time, where you need to put an ad on a screen, or respond to somebody's message, or just put up content, and you're half a second late, that could affect a lot of things. That could cascade.
The ability to run without stalls and freezes, without hiccups, is useful at the messaging level, because all the applications that share that messaging system will then see a smoother message delivery mechanism, will see much better percentiles on their latencies. And yeah, they could have their own issues with why they saw a pause, but it's not going to be the message system.
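The arithmetic behind Gil's half-second example is worth making explicit. A quick back-of-envelope calculation, using his numbers of 10,000 messages a second with a half-second freeze every 20 seconds:

```java
// Back-of-envelope math for the Stop the World example Gil gives:
// 10,000 messages/second, with a half-second stall every 20 seconds.
public class StallMath {
    public static void main(String[] args) {
        double rate   = 10_000;  // messages per second
        double stall  = 0.5;     // seconds frozen per pause
        double period = 20.0;    // seconds between pauses

        // Share of wall-clock time spent doing nothing.
        double stalledFraction = stall / period;
        // Messages that arrive, and queue up, during a single stall.
        double delayedPerStall = rate * stall;

        System.out.println(stalledFraction);  // 0.025, i.e. 2.5% of the time
        System.out.println(delayedPerStall);  // 5000.0 messages per stall
    }
}
```

So even though the JVM is only frozen 2.5% of the time, every one of those 5,000 queued messages takes a visible latency hit, which is exactly why the pain shows up in the high percentiles rather than the average.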
This is one of the things that we find makes people running Kafka choose to use our JVMs, not just for their raw speed. The raw speed is great, reduces cost and all that. But for the service levels, for the 90th percentile, the 99th percentile, the three nines to be much better, so that the services running across them experience those better nines.
And that often turns into not just better service levels but, more importantly, probably a much more dramatic reduction in cost. Because there is another way to get very good service levels, and that's to dramatically overprovision your hardware. You keep it running at 2% utilization so those glitches get absorbed across machines and stuff.
But if you can run consistently, you can push each machine harder before it starts glitching and making things unhappy, and that, again, translates into cost. Whether it's just raw throughput, that you can run a lot more messages per second on a machine, so how many machines you need drops, or how many messages per second you can run on a machine without the people running services screaming at you saying, "This sucks. I'm not using your message system."
That's the level that manages your capacity. Usually you find out what the level is at which people scream at you, then you add some padding. That margin can be smaller in practice, in the real world, when your JVM is good. And we're in business making good JVMs that make those things run smoother and need fewer resources as a result.
Nice. I'm glad you are. And I feel like I've had the most interesting JVM internals lecture of my life.
Wow. Well, thank you. And this was a nice, long conversation. Again, thanks for having me.
Thank you very much for coming on and teaching us lots of things. We're going to go away and compile lots of show notes for this one.
Yeah, thanks. Looking forward to seeing the outcome.
Gil Tene, thank you very much for joining us. It's been a pleasure.
Thank you very much for joining us, Gil. That is some hard, deep-down-in-the-weeds work, really nitty-gritty programming. That kind of stuff is not easy to do, and it's even harder to explain it well and explain it clearly. Gil, thank you very much for bringing us back up to speed.
Before we go, Streaming Audio is brought to you by Confluent Developer, which is our tech site that teaches you everything you need to know about Apache Kafka and real-time systems in general. We've got tutorials, we've got architectural guides, we've even got the back catalog for this podcast.
Take a look at developer.confluent.io. In the meantime, if you want to get your own Kafka cluster up and running, and leave the JVM management to someone else, take a look at our cloud service at confluent.cloud. You can sign up in minutes and you can have Kafka running reliably in no time at all.
And if you add the code PODCAST100 to your account, you'll get some extra free credit to run with. And with that, it just remains for me to thank Gil Tene for joining us, and thank you for listening. I've been your host, Kris Jenkins, and I will catch you next time.
Java Virtual Machines (JVMs) impact Apache Kafka® performance in production. How can you optimize your event-streaming architectures so they process more Kafka messages using the same number of JVMs? Gil Tene (CTO and Co-Founder, Azul) delves into JVM internals and how developers and architects can use Java and optimized JVMs to make real-time data pipelines more performant and more cost effective, with use cases.
Gil has deep roots in Java optimization, having started out building large data centers for parallel processing, where the goal was to get a finite set of hardware to run the largest possible number of JVMs. As the industry evolved, Gil switched his primary focus to software, and throughout the years, has gained particular expertise in garbage collection (the C4 collector) and JIT compilation. The OpenJDK distribution Gil's company Azul releases, Zulu, is widely used throughout the Java world, although Azul's Prime build can run Kafka up to forty percent faster than the open version, on identical hardware.
Gil relates that improvements in JVMs aren't yielded with a single stroke or in one day, but are rather the result of many smaller incremental optimizations over time, i.e. "half-percent" improvements that accumulate. Improving a JVM starts with a good engineering team, one that has thought significantly about how to make JVMs better. The team must continuously monitor metrics, and Gil mentions that his team tests optimizations against 400-500 different workloads (one of his favorite things to get into the lab is a new customer's workload). The quality of a JVM can be measured on response times, the consistency of these response times including outliers, as well as the level and number of machines that are needed to run it. A balance between performance and cost efficiency is usually a sweet spot for customers.
Throughout the podcast, Gil goes into depth on optimization in theory and practice, as well as Azul's use of JIT compilers, as they play a key role in improving JVMs. There are always tradeoffs when using them: You want a JIT compiler to strike a balance between the work expended optimizing and the benefits that come from that work. Gil also mentions a new innovation Azul has been working on that moves JIT compilation to the cloud, where it can be applied to numerous JVMs simultaneously.
If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.