October 7, 2020 | Episode 122

Creating Your Own Kafka Improvement Proposal (KIP) as a Confluent Intern ft. Leah Thomas

Tim Berglund (00:00):

Leah Thomas was a recruiting intern at Confluent who decided she liked computer science better, changed majors, got a second internship, and delivered KIP-450 this past summer. That's a Streams KIP adding some new windowing functionality. Yeah, that's pretty cool.

Tim Berglund (00:14):

I talked to her today about the interning process, learning a big new code base, and of course, sliding windows and streams. We've had some audio problems on a couple of episodes recently, and this is one of them. So please bear with us on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.

Tim Berglund (00:39):

Hello, and welcome to another episode of Streaming Audio. I am as always your host, Tim Berglund, and I'm joined in the virtual studio by my once and future colleague software engineer Leah Thomas. Leah, welcome to the show.

Leah Thomas (00:52):

Thanks for having me, Tim.

Tim Berglund (00:54):

You bet. Now, I say once and future, because you have been an intern here at Confluent twice, and you're going to be a full-time employee soon, is that right?

Leah Thomas (01:06):

Yeah, that's right. I feel like I've gone around the block in all the different roles, so I'm excited to start the full time.

Tim Berglund (01:13):

Yeah, very happy about that. So in fact, by the time this airs it will be a fait accompli, so.

Leah Thomas (01:21):

Yes.

Tim Berglund (01:21):

That is fantastic news, I'm glad to hear it. Anyway, I wanted to talk to you today about the interning process, and you've got firsthand experience in two very different roles, and so that's what... I'd like to make this about what interning is like, and in particular, I think you have something very unique to offer to people who want to get involved in Kafka as contributors. Which is over the past few months, you've had a front-row seat to the process of getting to know the Kafka code base or at least a part of it, and so I kind of want to talk about what that's been like, how you've done it and maybe give some tips on how people can do that.

Tim Berglund (02:06):

So, there's a lot of people who are just users and don't want to make modifications, but for people who do, you're a newly minted expert in how to do that.

Leah Thomas (02:16):

Yeah, I definitely feel like jumping straight into the addition process was really cool, and getting to see how you can change the offering to be what you want it to be, and really make improvements from either inside of Confluent or outside of Confluent. Whatever ideas you have, you can work with other people in the community and really make it happen, which was really cool to see.

Tim Berglund (02:39):

Absolutely. Yeah, I guess you did it as a Confluent intern, but people can do this, when it comes to Apache Kafka, you don't even have to know what Confluent is.

Leah Thomas (02:46):

Exactly.

Tim Berglund (02:48):

[crosstalk 00:02:48] contribute, so, of course, you're listening to this podcast, you probably sort of know. Anyway, you are a recent graduate, tell us about your degree program, your major and after that, we'll get into [crosstalk 00:03:00]...

Leah Thomas (03:00):

Sure. I just graduated in May 2020, from Georgetown University with a dual degree in Computer Science and Women's and Gender Studies. And you kind of previewed this before, but I interned last summer, summer of 2019, with Confluent as well, but I was actually the recruiting intern. So I had a computer science minor at that point and gender studies major and was thinking that I wanted to go into tech, but not necessarily a super technical role, so I did the recruiting internship and really liked it and liked the recruiting team.

Leah Thomas (03:34):

But I was friends with a bunch of the engineering interns and I saw what they were working on, and I was like, I think that that's what I want to do instead. I think I want to be solving technical problems and having technical solutions.

Leah Thomas (03:47):

I love working with people, but working with people in the recruiting aspect was a lot harder than I thought it would be, and it wasn't quite what I wanted my future career to be. So I came back this summer as a software engineering intern, switched to the Streams team. So my last year of school I added a computer science major. So I switched my minor to a major and have a dual degree now.

Leah Thomas (04:14):

And then did the Streams internship and now I'm coming back full time starting next week on Kafka Streams. So I'm really excited.

Tim Berglund (04:21):

That's awesome. I suppose I should say we are recording this on Thursday, September 24th. So the following Monday, I feel like [inaudible 00:04:30] 8th, Monday the 28th is a happy day. That's just fantastic news.

Leah Thomas (04:38):

Yes, thank you.

Tim Berglund (04:40):

That transition from recruiting to the Streams team is really interesting. And I guess you said that recruiting was in one sense harder than you expected and also didn't really feel like the thing that, you didn't sense a calling to do it.

Leah Thomas (05:00):

Yeah. I always thought that I was a huge people person, but then I found when I was having to talk to people about jobs all day long, I didn't want to talk to people anymore. So, I guess I'm not quite the people person that I thought I was, but I think it was cool to see the recruiting side for sure. Learned a lot of valuable things about that end of the process that I think has helped me and will continue to help me. And I also made really great connections at Confluent, I feel I was... because I did technical recruiting last summer, I was able to meet engineers and already know engineers to some degree when I started my software engineering internship this summer. So, definitely a less traditional path, but I'm pretty happy with how it turned out.

Tim Berglund (05:46):

Yeah. There's so much that you said in there that I think is really important, and I do want to talk about Kafka code soon, but just indulge me. Number one, yeah, recruiting is really hard, and the whole, I thought I was a people person turns out not. I thought, early in my career that I wasn't a people person, and that I wanted to be a developer because computers were easy to deal with and people were hard. I think I was right about that, I think people are harder than any sort of computing device, but I kind of found out later that I was more of a people person and eventually my career took the trajectory it did where it's this combination of the two.

Tim Berglund (06:31):

But the process you went through, I just recently read a book called Range, subtitled, Why Generalists Triumph in a Specialized World, by David Epstein. A really good book and the basic idea was early in life, and that's defined in different ways depending on what you're talking about, but kind of high school, college, early career-type years. He's making the argument that it's good to go through what he calls a sampling period of trying some different things, and so this process you just described is 100% Epstein compatible, I think I want to do this, let me try this; because you don't know, right?

Leah Thomas (07:18):

Yeah, you've no idea.

Tim Berglund (07:20):

Yeah. And so you try it and you're like, well, no, that wasn't actually quite it, I can do it, but it doesn't feel good, let me try this other thing, because I know, and everybody listening heard, "Well, I did a recruiting internship and then I joined the Streams team." Right. Well, okay, those are different. But that's exactly the point, is that, since you don't... it's really difficult to create knowledge about, well, frankly, who you are and what you're good at, and where you can make the greatest contribution. And these processes, I didn't think I was a people person, but I was or thought I wasn't, but I, whatever. You just have to try stuff, and I think it's really cool that you did that and even changed your major. Well, added a major, to be clear.

Leah Thomas (08:09):

Yeah. I definitely think that I was lucky that I had the opportunity to try something out, and then make that switch. And I was able to... I think a lot of people do this with boot camps now if they're switching into coding, which I think is a great opportunity for people, and I think that it's cool that people can pivot at different parts of their career. But I think I was lucky that I had space in my schedule to change a minor to a major and make it work while I was still in school, and really leverage my degree to give me that, as you said, that path, I really like engineering now. But like you said, career trajectories change,

Leah Thomas (08:46):

And so I think that the background that I have will set me up for whatever I'd want to do 20 years down the road.

Tim Berglund (08:54):

Very true, very true. Also just want to note, Georgetown grad endorses bootcamps, I liked that part, that just happened.

Leah Thomas (09:02):

Yes. For whatever credibility I have to do that, for sure.

Tim Berglund (09:07):

That's funny, I was, just yesterday at a COVID era wedding, a very small wedding and everything is done, I was running live streaming is why I was there.

Leah Thomas (09:18):

Okay. Nice.

Tim Berglund (09:20):

And there was a conversation among a bunch of young people. Is college a good idea? Is college a bad idea? And a recent college grad, who actually happens to be my daughter was making the, "No, you don't have to go to college," argument. Wait a second, you just went to college. But the bootcamp approach is super cool, a whole different subject, but I'm glad you mentioned that, because that is usually a nontraditional, often second career or no college right out of high school kind of pathway, and that's great that those things exist and have credibility now. I couldn't be happier.

Leah Thomas (10:01):

Yeah. I definitely agree.

Tim Berglund (10:02):

Now, in your internship this summer... Oh, by the way, last summer, did you help people get hired? I feel like we should know that.

Leah Thomas (10:10):

I mean, I think I did, but I didn't put a candidate all the way through the process, which I was also like, well, that's kind of sad. But I did a rotational program last summer, so I did sourcing for a little bit, and then I did, technical sourcing. And then I did university recruiting, and then I did the scheduling. I was a, what... why am I blanking on the term? Coordinator.

Tim Berglund (10:34):

Coordinator.

Leah Thomas (10:35):

Yeah. So I think that, while I didn't have my own candidate go all the way through, I think I helped with the candidate experience. That's what I'll say.

Tim Berglund (10:47):

That is so important because even if people don't get hired or aren't interested or whatever, job hunting is a big deal. You want it to be as unstressful as possible.

Leah Thomas (10:58):

Yeah, absolutely.

Tim Berglund (10:58):

Then coordinator, huh boy, [crosstalk 00:11:00] so guilty every time I work with the coordinator. I'm like, yeah, I have to change that again. I'm sorry.

Leah Thomas (11:06):

Yeah. That's how it goes.

Tim Berglund (11:08):

But you know this pain firsthand.

Leah Thomas (11:10):

I do.

Tim Berglund (11:10):

But let's talk about your work on Streams. You did a KIP. I mean that's huge. First of all, in case there's anybody who doesn't know, tell them what a KIP is and then tell them what your KIP was.

Leah Thomas (11:23):

Sure. So a KIP is a Kafka Improvement Proposal. It's the process where somebody can make changes to the Kafka open source code and it mostly focuses on the public API changes. So anything that changes the way users interact with Kafka and in our case Kafka Streams, has to go through the KIP process, to make sure people in the Kafka community accept it and that the community as a whole agrees that this is the right decision to make.

Leah Thomas (11:53):

So people aren't just deprecating random methods and making all these additions without people knowing what's going on.

Tim Berglund (12:00):

Good, it's got to be debated openly, there's a Wiki page that kicks off the KIP and sort of a template, and then a mailing list discussion. So, in the open.

Leah Thomas (12:10):

Yeah. And then once the mailing list discussion dies down, you open a vote and it has to be open for a minimum of, I think, 72 hours. And you have to have at least three Kafka contributors give it approval. And then there's a different amount for if people say they don't want you to do it, but I've heard that that doesn't happen too often. That any-

Tim Berglund (12:31):

That's just for creating a KIP, that we need that much?

Leah Thomas (12:36):

Yeah. Well, no, that's for getting your KIP approved, that you need three Kafka committers to say that they are on board.

Tim Berglund (12:46):

Got it.

Leah Thomas (12:46):

Yeah, so my KIP was KIP-450, which was implementing sliding windows in Kafka Streams. So sliding windows are a different form of windowed aggregation for the data that comes through Kafka Streams. Right now we have time-based windows and session windows. And I just implemented sliding windows, which is almost a conglomeration of the two of them, the two existing windows.

Leah Thomas (13:10):

And so I worked on obviously the KIP and the KIP discussion, which was pretty lengthy for this one because it is a new feature that ended up being a little more complicated than I initially thought it would be, because of storing all the data we need to store and then outputting to the user different data as well as the data gets processed through Kafka Streams.

Leah Thomas (13:36):

So the KIP discussion lasted a while, but we hammered out a lot of the implementation details during that, which was super helpful for me. And I'm really grateful to my team for being really involved in the KIP discussion as well. And then after that it was implementing it and then doing the whole testing suite. So adding all of the unit and integration tests, and then also I ended up adding it to the benchmark and soak tests for Confluent internally.

Tim Berglund (14:05):

Oh, nice. I didn't know that last part. That's super cool. That refers to, just to separate the two, this KIP is an Apache Kafka KIP all part of the open source code and all done through the open KIP process, and the feature is an AK feature, but also we do benchmarking and our own soak tests internally since we use Kafka in our on prem product, and in cloud and stuff. So we beat on it in ways that are just ours.

Leah Thomas (14:38):

Yes, exactly.

Tim Berglund (14:40):

So let's, if you don't mind, I want to talk about windowing a little bit. This was, you said sliding

Leah Thomas (14:47):

Yes.

Tim Berglund (14:48):

What was the KIP number?

Leah Thomas (14:50):

450. And it should be coming out in 2.7.

Tim Berglund (14:55):

Okay, good, good. And yeah, I am aware of rumblings of discussion of timing of 2.7, how's that for being noncommittal? I didn't say anything, just there. But we're thinking about when to release that. And I don't know, that's a PMC decision when that gets made, by we, I mean, them. I guess I'm always looped in on that stuff because there's usually a video of me standing in front of a river or in some city somewhere in the world talking about it. So I'll have to know.

Tim Berglund (15:28):

But good. And I can pretty much guarantee you KIP 450 will be in that video, because this is a major change to Streams.

Leah Thomas (15:36):

Wow. Well, that's exciting for me. It's kind of a fun plug.

Tim Berglund (15:40):

There you go. I can't see how it would not be there. And if, whoever organizes that list tries to not put it there, I will tell them to put it there.

Leah Thomas (15:50):

Thanks.

Tim Berglund (15:52):

Tell us about windowing, and I want to say hopping windows and sliding windows are the two closest, but I might be wrong, and tell me if I am?

Leah Thomas (16:01):

No, you're right.

Tim Berglund (16:01):

Okay, good. Tell us what hopping windows are.

Leah Thomas (16:06):

So hopping windows are a fixed time window. So over your Kafka Stream, you can set a certain amount of time, say 10 minutes, and your data will be processed within that 10 minutes. And with a hopping window, the size is always fixed, and then you set some advance. So you can have a 10-minute window that advances by two minutes. So say you have a window from 9:00 A.M. to 9:10, then the next window would be at 9:02 to 9:12, et cetera. So, it kind of hops two minutes along, based on the advance time.

Leah Thomas (16:42):

And these windows are basically preset, so you can figure out where the windows are going to fall based on the size of the window and the advance of the window. So no matter where the data comes in, your windows will always be at set timestamps, because the way the windows are calculated of where they fall is based on the advance and the size.
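A minimal sketch of the hopping-window arithmetic described above, in plain Python rather than the Kafka Streams API. The names `hopping_windows_for` and `at` are made up for illustration, and the sketch assumes windows are aligned to time zero with an inclusive start and an exclusive end:

```python
MINUTE_MS = 60_000


def at(hour, minute):
    """Hypothetical helper: a time of day as milliseconds since midnight."""
    return (hour * 60 + minute) * MINUTE_MS


def hopping_windows_for(timestamp_ms, size_ms, advance_ms):
    """Return every [start, end) hopping window that contains the timestamp."""
    # Earliest aligned window start whose window still covers the timestamp.
    start = max(0, timestamp_ms - size_ms + advance_ms) // advance_ms * advance_ms
    windows = []
    while start <= timestamp_ms:
        windows.append((start, start + size_ms))
        start += advance_ms
    return windows


# A record at 9:05 with a 10-minute window advancing by 2 minutes.
windows = hopping_windows_for(at(9, 5), 10 * MINUTE_MS, 2 * MINUTE_MS)
```

With these inputs the record lands in five windows, including 9:00 to 9:10 and 9:02 to 9:12. The window boundaries depend only on the size and the advance, never on the data.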

Leah Thomas (17:05):

So you can approximate a sliding window by using hopping windows with an advance of one millisecond. So that's like you're creating a window essentially for every possible millisecond in time, which creates a ton of windows. And [crosstalk 00:17:23]-

Tim Berglund (17:23):

Sounds like the definition does not scale?
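To put a rough number on why that does not scale, here is a back-of-the-envelope sketch in plain Python. The function name is hypothetical, and it only counts how many hopping windows a single record can fall into:

```python
MINUTE_MS = 60_000


def hopping_windows_per_record(size_ms, advance_ms):
    """Upper bound on how many hopping windows one record falls into."""
    return -(-size_ms // advance_ms)  # ceiling division


two_minute_advance = hopping_windows_per_record(10 * MINUTE_MS, 2 * MINUTE_MS)
one_ms_advance = hopping_windows_per_record(10 * MINUTE_MS, 1)
```

A 10-minute window with a 2-minute advance puts each record in 5 windows. The same window with a 1-millisecond advance puts each record in 600,000 windows.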

Leah Thomas (17:24):

Yes. And so being really inefficient and just not... hopping windows just aren't very elastic. And so sliding windows is a way to essentially get every combination of the data in your stream that fits within a certain window size by using a sliding window. So sliding windows aren't fixed in time. So like I said, with hopping windows, you can predict where they're going to fall on the time axis, you can't do that with a sliding window, a sliding window exists along the time axis, based on where the data comes in.

Leah Thomas (18:05):

So if you have data coming in at 9:05 and 9:30, the windows you're going to get are different than if you had data coming in at 9:02 and 9:40.

Tim Berglund (18:17):

So the width of the window is not fixed?

Leah Thomas (18:20):

Well, the width of the window is fixed, the start and the end point of the window are not fixed. Those are dependent on the data.

Tim Berglund (18:27):

Got it, got it. And that matters if you have data that's bursty, because that example you just gave was bursty where there was something at 9:03 and then 9:30, it's quiet for a while.

Leah Thomas (18:41):

Yeah.

Tim Berglund (18:42):

So, that's an important factor, am I right? That's a question. Is that an important factor?

Leah Thomas (18:48):

Yeah, I think it is, because if you think about the hopping window approximation of sliding windows in that example where you have the bursty data, you're creating a bunch of windows in between. Whereas with sliding windows, you only create windows when you need them, and so you're only creating the windows that specifically are necessary for those two data points coming in.

Tim Berglund (19:11):

Got it, got it.

Leah Thomas (19:12):

So the sliding windows are defined not necessarily by size, but they're defined by the time difference, which is the maximum amount of time between two data records for them to fall within the same window, which we use that phrasing because there was some discussion about inclusivity and exclusivity for the window bounds. And so to try to get rid of any confusion we use time difference instead of size since it isn't strictly size, it's more about how close the records are.

Tim Berglund (19:45):

Right. Okay.

Leah Thomas (19:47):

It's a little bit in the weeds, but every sliding window will essentially be the same size, you will just have fewer windows if you have fewer data, and you will have more windows if you have more data. So it is a lot more efficient, it ends up being more efficient, and you only store the windows that you need and only create the windows that you need.

Leah Thomas (20:08):

So if you're looking for that kind of aggregation, sliding windows is a lot better way to go than trying to make it work with the current hopping windows implementation.
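A small batch sketch of the sliding-window idea described above, in plain Python. It is not the Kafka Streams algorithm, which processes records one at a time against a window store. It assumes, following the description of time difference here and the general shape of the KIP-450 design, that both window bounds are inclusive, that each record gets a window ending at its timestamp, and that a window starting one millisecond after a record exists only if it contains another record. The names `sliding_windows` and `at` are made up for illustration:

```python
MINUTE_MS = 60_000


def at(hour, minute):
    """Hypothetical helper: a time of day as milliseconds since midnight."""
    return (hour * 60 + minute) * MINUTE_MS


def sliding_windows(timestamps_ms, time_difference_ms):
    """Map each non-empty window (start, end), both inclusive, to its records."""
    records = sorted(timestamps_ms)
    windows = {}
    for t in records:
        candidates = [
            (t - time_difference_ms, t),          # window that ends at this record
            (t + 1, t + 1 + time_difference_ms),  # window that starts just after it
        ]
        for start, end in candidates:
            contents = [r for r in records if start <= r <= end]
            if contents:  # windows with no data are never created
                windows[(start, end)] = contents
    return windows


bursty = sliding_windows([at(9, 5), at(9, 30)], 10 * MINUTE_MS)
closer = sliding_windows([at(9, 5), at(9, 8), at(9, 30)], 10 * MINUTE_MS)
```

Under these assumptions the two bursty records at 9:05 and 9:30 produce two windows in total, and adding a record at 9:08 brings that to four. The number of windows follows the data rather than the clock.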

Tim Berglund (20:19):

And what's the cost of those extra windows? Why not have them laying around? What data structures are being allocated when you would have created all of them with hopping windows?

Leah Thomas (20:33):

Yeah. So if you create a bunch of overlapping windows, you have to store them in the window store. And then with hopping windows, if they have data inside of them, they're going to be output to the user. So in theory, if you have a bunch of windows that are overlapping, it means that they're probably, there's going to be some set of them that have the same data inside. So it's essentially a repeat window with different start and end times, but everything inside is the same, and the user is... So those are all being stored in the window store, and then also all being emitted as partial results to the user.

Leah Thomas (21:09):

So I think it just ends up clogging up, ends up using more space in the window store, and then also ends up clogging up what the user really wants because you're getting a bunch of information that you probably don't really need.

Tim Berglund (21:22):

Right. There's, whatever topic is holding the output of that window, I guess it's possible in the topology that extra things are happening to that before it actually gets persisted. But you're still at some point emitting things that might be a lot of noise and not really interesting because you had to set this pathologically short advance time.

Leah Thomas (21:45):

Yeah, exactly. And because sliding windows are essentially defined backward in time, you're able to, if you as a user want to query the window store and you want to find everything that happened, you know that something came through at 9:10 and you want to know what happened, you can find the window that ends at 9:10 and see everything that happened in that window. So you can query essentially back in time and see what has been going on.
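The "query back in time" idea can be sketched with a dictionary standing in for the window store. This is plain Python, not the Kafka Streams store API. The sketch assumes the stand-in store is keyed by record key and window start time, and the key `sensor-1`, the counts, and the function name `window_ending_at` are all hypothetical:

```python
MINUTE_MS = 60_000


def at(hour, minute):
    """Hypothetical helper: a time of day as milliseconds since midnight."""
    return (hour * 60 + minute) * MINUTE_MS


def window_ending_at(store, key, end_ms, time_difference_ms):
    """Look up the aggregate of the sliding window that ends at end_ms, if any."""
    # A sliding window that ends at end_ms starts one time difference earlier.
    start_ms = end_ms - time_difference_ms
    return store.get((key, start_ms))


# Stand-in window store: (record key, window start) -> count of records.
store = {
    ("sensor-1", at(8, 55)): 1,
    ("sensor-1", at(9, 0)): 3,
}

count = window_ending_at(store, "sensor-1", at(9, 10), 10 * MINUTE_MS)
```

Here the window ending at 9:10 with a 10-minute time difference starts at 9:00, so the lookup returns the count stored for that window, and a lookup for an end time with no stored window returns `None`.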

Tim Berglund (22:13):

Got it. And you mentioned the window store. Tell us about that a little bit, because that's what came to mind when you said hopping window with one millisecond advance time, I was like, Oh crap, that is going to be a problem. What's the window store and what's it there for?

Leah Thomas (22:30):

Yeah. So there are different types of window stores, and the one that we, I think use, maybe don't quote me on this, but we use the RocksDB store pretty frequently. And it's just a way to essentially, I mean, to store your data obviously. And then you're able to query it as a user, but also get access to it within the Kafka Streams program. So while sliding windows are processing, we query the window store ourselves, in the sliding windows algorithm, and see what has already been created and stored in there, and then we also add to it.

Leah Thomas (23:08):

And then the user can also from outside of the Streams application, well, you're inside of the Streams application but outside of the sliding windows algorithm, they can also query the window store and get access to everything that's been created and stored in it.

Tim Berglund (23:26):

So there's an API to get those old windows that are there lying around?

Leah Thomas (23:29):

Yeah. There's an API to access the old windows, it's essentially a bunch of different fetch methods. And I think they're all, for the most part, serialized bytes, but depending on the type of window store, you store different things. So the window store stores a windowed object. And then also a little bit of other information, and then other types of stores obviously they don't need to store window objects, so they store just more generic items as well.

Tim Berglund (23:58):

Right. And I guess those are there in case messages... I've recently been trained by Matthias on these, look these words, Matthias Sax is a Streams Committer. "I was going to say messages might come in late, but that's the wrong word, Tim. Messages might come in out of order."

Leah Thomas (24:17):

Yes.

Tim Berglund (24:17):

So if the message timestamp indicates the actual event time, but the message belongs in a previous window and it's just out of order in the topic, which is totally possible, right? You can have that. Then you need to go basically fetch that old window and say, Oh wait, we need to update the aggregations in it.

Leah Thomas (24:37):

Yes, exactly. Those out of order records definitely make calculating the windows more complicated. But by having the window store we're able to see everything that was created previously. We just need to know which one to pick, and then we're able to update the windows with, yeah, like you said, the out of order data.
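A toy illustration of why keeping old windows around matters for out-of-order records. To stay short it uses simple fixed, non-overlapping windows and a dictionary as the stand-in store; the real sliding-windows algorithm has to decide which of several overlapping windows to update and is considerably more involved. It also ignores any limit on how long old windows are kept. The names `window_start`, `count_by_window`, and `at` are made up for illustration:

```python
from collections import defaultdict

MINUTE_MS = 60_000


def at(hour, minute):
    """Hypothetical helper: a time of day as milliseconds since midnight."""
    return (hour * 60 + minute) * MINUTE_MS


def window_start(timestamp_ms, size_ms):
    """Start of the fixed window that contains the timestamp."""
    return timestamp_ms // size_ms * size_ms


def count_by_window(arrivals_ms, size_ms):
    """Count records per window, processing them in arrival order."""
    store = defaultdict(int)  # stand-in window store: window start -> count
    emitted = []              # every updated result, in the order produced
    for timestamp_ms in arrivals_ms:
        start = window_start(timestamp_ms, size_ms)
        store[start] += 1     # fetch the (possibly old) window and update it
        emitted.append((start, store[start]))
    return dict(store), emitted


# The 9:04 record arrives after the 9:12 record, so it is out of order.
store, emitted = count_by_window([at(9, 1), at(9, 12), at(9, 4)], 10 * MINUTE_MS)
```

The late-arriving 9:04 record finds the 9:00 window still in the store and bumps its count to 2, so the last emitted update is for the older window rather than the newest one.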

Tim Berglund (24:56):

This is one of the reasons why the vast majority of people want an API that does this, and they want people like Leah to solve these problems because that's tough and the edge cases in there get unpleasant quickly. And you want that solved one time and not lots and lots of times in one API.

Leah Thomas (25:18):

Exactly.

Tim Berglund (25:21):

Cool. Okay. So, that is KIP 450. Thanks for that explanation, this is one of those things, you made it make sense, which is not easy because windows are one of those things that they like diagrams.

Leah Thomas (25:39):

They do, they really need diagrams.

Tim Berglund (25:41):

They really, and I think when it comes to Sliding Windows, I'm feeling like that's going to be an animation, whenever there is a video that gets made about those, I don't think that's going to be a still picture, you kind of need a thing that moves to really make that land. But I actually I'm happy to confess, I haven't kept up with the KIP, and I knew that it was happening and we talked about it a little bit before, but I don't have a super crisp understanding of it, and you helped me clarify that there, so thank you.

Leah Thomas (26:10):

Good. Well, I'm glad it was at least semi helpful. But you're right, the visualizations definitely helped with the windowing.

Tim Berglund (26:16):

Yes. How about just the process of jumping into the code base? I mean, obviously, you're familiar with at least this part of the Streams codebase now. And there's a lot more, so whatever the scope is, what was that like to sit down with it and how did you start learning it? Tell us about that.

Leah Thomas (26:39):

Yeah. I mean, it was definitely overwhelming, because the code base is huge. And at first I was like, I don't even know how to find, there's so many files in here, how do I even find the class that I want? Where are they? And so getting some basic IntelliJ tricks was really helpful and understanding the search function and how to connect all the classes together with the IntelliJ shortcuts was really helpful, and I definitely needed that right off the bat to be able to explore how everything was connected, and how the existing types of windowing worked.

Leah Thomas (27:16):

So definitely spending some time looking through the code and seeing what all the different parts were, and trying to do a deep dive was where I started, which I thought was pretty helpful. But I feel the later stage of actually creating the testing was where I was able to understand more of how Kafka interacts with Streams. When I was doing the sliding windows, it was pretty much just Streams focused. Obviously there were some Kafka components mixed in, but I felt like I was pretty separated from the core ideas of Kafka.

Leah Thomas (27:53):

So when I was creating the testing, I used the live debugging feature in IntelliJ, I'm not sponsored by IntelliJ, but I do really like it, and I could really jump through the code and see what was being created when and how it was being created and who was creating it. And see how Kafka Streams worked with core Kafka, live on the ground as it was all happening, and seeing that come together, I feel was the, I guess the cherry on top for starting to understand how Kafka works as a codebase.

Tim Berglund (28:30):

Right. And if I may ask, how would you rate your familiarity with core Kafka? And I know not the codebase, because you were new to the whole codebase when you came into the internship, but-

Leah Thomas (28:42):

Yeah.

Tim Berglund (28:43):

What did you know about Kafka when you started this, when you're jumping in cold to this big giant thing?

Leah Thomas (28:47):

Oh man. I mean from the recruiting pitch, I knew that it was data streaming.

Tim Berglund (28:54):

Nice.

Leah Thomas (28:54):

And so I would rate maybe a three, I tried to read the book before I started a little bit, so I had some idea of producers and consumers and topics, but in terms of how the code worked, yeah, not very high understanding, pretty low.

Tim Berglund (29:15):

That's awesome though. If you think about it, and you just did it, so maybe you don't see it this way, but looking at what you did, you're jumping really into the Streams codebase and that's a part of Kafka, it's this client library, it's outside of the core thing. But if it's, you know there are producers and consumers and messages and that's kind of your abstraction that you have for Kafka. That's a pretty high-level abstraction, we can agree. There's still just a lot more to learn there. That's just impressive that you jumped into Streams and made that contribution, and learned Kafka internals at the same time.

Leah Thomas (29:53):

Yeah. Yeah. I feel like I definitely have a lot to learn still, I think there's just, I think that familiarity, spending time with it just really increases knowledge obviously. So I think that the next couple of months as I continue to work on it, I think that my knowledge will probably grow exponentially, which I'm pretty excited about to understand more fully how both sides of Kafka work.

Tim Berglund (30:20):

Absolutely. Now, so you mentioned IntelliJ, which I would say obviously, with a Java codebase, you're going to have some interaction with it. And it's kind of hard to tell, there was a Twitter conversation last week about this. There is I think a little bit of Java work being done in Visual Studio Code, it's like, you can, but pretty much everybody does it in IntelliJ. And so you did that and you used that, some people showed you ways of exploring a codebase and finding maybe implementations of an interface or where to locate a class and just basic stuff like that to navigate.

Leah Thomas (30:57):

Yeah. Yeah, which was super helpful, obviously.

Tim Berglund (31:03):

Yeah. Those are good things. And just as I think guidance for new developers, don't underestimate both the cost and the value of learning an IDE well, it's a big investment. There's a lot of stuff to know. And when you're new and everything is new and confusing and you feel like you don't know anything, and there's all that. You may not be aware of the IDE learning that's happening, but it's huge. And if you ever pick up a new IDE in your career, say you get started with IntelliJ, then the next one you learn, you're like, Oh crap, all the mental model is different, every single shortcut is different, everything. Why is this so complicated?

Leah Thomas (31:45):

Yeah. Exactly.

Tim Berglund (31:48):

Because you just slowly learned the first one, but it's a huge investment and it's hugely useful.

Leah Thomas (31:53):

Yeah, definitely.

Tim Berglund (31:53):

But with that, as you're learning the code, what combination of it... what part of that experience was, we'll call static analysis, just reading the code. And what part was at runtime? Did you ever run a test, and set a breakpoint, step through things, what'd you do?

Leah Thomas (32:13):

I spent a lot of time running tests and setting breakpoints and stepping through things. The algorithm itself, we hashed out a lot in the KIP process, and so when that was implemented, I didn't find too many bugs when I was running the tests. But there were some things that I just felt like I didn't understand and some results I was getting that were weird. And so I did step through a lot, and just so many breakpoints and so much data. And I think there were times that I didn't utilize the IDE where I could have created a smaller data set and made it a lot less work when I was stepping through. But I ended up just kind of taking the long route, which I don't think I would do in the future. But it felt easier at the time to not have to change anything.

Leah Thomas (33:04):

But definitely a lot of stepping through, because I found that in the static code, there's a fair number of abstractions, right? And so things are connected somewhat differently at runtime because there are actual values coming in, and things have been created. And so stepping through at runtime was really helpful, and it took me to parts of the codebase that I hadn't been to when I was just doing the static coding. So the further dive into how Streams was communicating with Kafka and all the things that were happening in the middle, and any of the deserializer issues that I didn't know about when I was doing the static coding, all came out at runtime.

Leah Thomas (33:49):

Which was like actually really cool to see, and I love the IntelliJ runtime debugger, it's pretty helpful. And I felt like it was, yeah, really cool to see the windows being processed step by step by step and see them working the way that they worked in my head. And it actually all coming together, which was very validating and very gratifying as well.
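
The smaller data set Leah mentions can be as small as three records. The following is a minimal sketch in plain Java, not the Kafka Streams implementation: it assumes the KIP-450 idea that each record defines a window ending at its timestamp, plus a window starting one millisecond after it if a later record falls inside. The class and method names are hypothetical, both window bounds are treated as inclusive, and grace periods and records earlier than the time difference are ignored.

```java
import java.util.List;
import java.util.TreeMap;

/** Toy model of sliding-window counting for a single key; not the Kafka Streams code. */
final class SlidingWindowSketch {

    /** Returns window start -> record count, treating both window bounds as inclusive. */
    static TreeMap<Long, Integer> countWindows(List<Long> timestamps, long timeDifferenceMs) {
        TreeMap<Long, Integer> counts = new TreeMap<>();
        for (long t : timestamps) {
            // Window that ends at this record: [t - d, t].
            long leftStart = t - timeDifferenceMs;
            counts.put(leftStart, countBetween(timestamps, leftStart, t));

            // Window that starts just after this record: [t + 1, t + 1 + d].
            // It is only kept if at least one record falls inside it.
            long rightStart = t + 1;
            int inRight = countBetween(timestamps, rightStart, rightStart + timeDifferenceMs);
            if (inRight > 0) {
                counts.put(rightStart, inRight);
            }
        }
        return counts;
    }

    /** Counts the records whose timestamp lies in [from, to]. */
    private static int countBetween(List<Long> timestamps, long from, long to) {
        int n = 0;
        for (long t : timestamps) {
            if (t >= from && t <= to) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        long timeDifferenceMs = 5L;
        List<Long> timestamps = List.of(10L, 12L, 17L);
        countWindows(timestamps, timeDifferenceMs).forEach((start, count) ->
            System.out.printf("[%d, %d] -> %d%n", start, start + timeDifferenceMs, count));
    }
}
```

With records at 10, 12, and 17 and a time difference of 5, this model yields five windows, which is few enough to follow by hand while stepping through in a debugger.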

Tim Berglund (34:13):

Yes. And it's kind of funny, because even with Java as a not-dynamic language, it's not like you have, at least not typically, Java code where you've got new kinds of classes being created and method-missing handlers trapping surprise method calls from strings that are taken as input from the outside world. Yeah, and that stepping through is super important, even with Java not being a particularly dynamic language, right? It's weird in a Java project for somebody to use some code manipulation library to create new methods that [inaudible 00:34:56] weird, I guess Spring does it all over the place.

Tim Berglund (34:57):

But it's not everyday code in Java the way it would be, I'm thinking of Groovy, where you have explicit mechanisms where you can call a method on an object, and if the method doesn't exist there's a handler that you can go make up a method, stuff like that. Good luck with static analysis, that's hard to understand what the code is doing without stepping through it.
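
Java does have a standard mechanism in this spirit, and a small sketch may make the point concrete. The example below uses the JDK's `java.lang.reflect.Proxy` so that a single handler answers every method call on an interface that no class implements. The `Greeter` interface and the handler's behavior are made up for illustration, and nothing here is taken from Kafka or Spring.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

final class DynamicProxySketch {

    /** Hypothetical interface; no class in this file implements it. */
    interface Greeter {
        String greet(String name);

        String farewell(String name);
    }

    static Greeter newGreeter() {
        // The handler receives every interface call at runtime and decides what to return.
        // This sketch only handles the one-argument methods declared on Greeter.
        InvocationHandler handler = (proxy, method, args) ->
            method.getName() + ":" + args[0];
        return (Greeter) Proxy.newProxyInstance(
            Greeter.class.getClassLoader(),
            new Class<?>[] {Greeter.class},
            handler);
    }

    public static void main(String[] args) {
        Greeter greeter = newGreeter();
        System.out.println(greeter.greet("Leah"));
        // The concrete class was generated by the JDK at runtime, so no source file defines it.
        System.out.println(greeter.getClass().getName());
    }
}
```

Reading the source alone shows only the interface and the handler, while a debugger shows which generated class actually receives the call.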

Tim Berglund (35:23):

Even so in Java though, in a non-trivial project like Kafka, there's an abstract class and you think you know which one is actually instantiated in runtime, and you're wrong until you step in, you just don't know, you got to look. So that-

Leah Thomas (35:39):

Yeah. It's hard to look for what you don't know exists too. I can't search for something if I don't know what I'm searching for, so that runtime was really helpful for me being less familiar with what will happen at runtime.

Tim Berglund (35:53):

Exactly, and even when you start to get there, you're like, okay, here's the package with the implementations of that interface, and so it's one of these, and you still can't quite tell, and then you step through, and you're like, Oh no, okay, there was another package, with other implementers, okay. That's why you just kind of have to look at what it's doing.
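
A tiny illustration of why the static type is not enough, with every name here hypothetical: the call site only ever sees the `Store` interface, while a factory chooses the concrete class from configuration. Printing `getClass()` or inspecting the variable at a breakpoint is what reveals the actual implementation.

```java
final class RuntimeTypeSketch {

    /** Hypothetical interface with more than one implementation. */
    interface Store {
        String describe();
    }

    static final class InMemoryStore implements Store {
        @Override
        public String describe() {
            return "in-memory";
        }
    }

    /** A wrapper implementation, the kind that is easy to miss when reading code statically. */
    static final class CachingStore implements Store {
        private final Store inner;

        CachingStore(Store inner) {
            this.inner = inner;
        }

        @Override
        public String describe() {
            return "caching(" + inner.describe() + ")";
        }
    }

    /** The choice depends on configuration, not on anything visible at the call site. */
    static Store create(boolean cachingEnabled) {
        Store base = new InMemoryStore();
        return cachingEnabled ? new CachingStore(base) : base;
    }

    public static void main(String[] args) {
        Store store = create(true);
        // The declared type is Store; the runtime type is whatever the factory picked.
        System.out.println(store.getClass().getSimpleName());
        System.out.println(store.describe());
    }
}
```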

Leah Thomas (36:09):

Yeah, exactly.

Tim Berglund (36:09):

One thing occurred to me, you said you did it the hard way, you took a long way around and you're not going to do that in the future. And I thought this would be a time for the narrator to say, "She would in fact take a long way around in the future."

Leah Thomas (36:25):

It's probably still true.

Tim Berglund (36:27):

It's true, it's true. I mean, you're going to be super good at Kafka and Kafka Streams, and you're going to know that codebase better than everybody but a hundred people in the world. But the next non-trivial project you jump into, you'll be wiser. But of course, I think the one thing that I would say, not just to you, but to anybody who is new as a developer or jumping into a significant project that you didn't write for the first time. I've talked to Gwen recently about joining Kafka now versus joining Kafka eight years ago.

Tim Berglund (37:09):

It's a little harder now. Leah's job versus eight years older version of Leah, this Leah has it a little worse, because it's just a much, much more complex project, it's a much more mature project.

Leah Thomas (37:27):

Yeah, there's a lot more there, yeah.

Tim Berglund (37:29):

But it feels the same way every time. At least as I remember from when I tried to read the code of large projects early in my career, versus later in my career. That same feeling of lostness and I don't know anything, and probably not good enough to do this job, all those kinds of things you have when you're just completely lost reading a big piece of new code. I don't think they ever stop, and maybe there are people, and tweet me if this is the case, if you're like, I never felt like that. I think that's psychologically interesting if you never did, but.

Leah Thomas (38:07):

I'm sure there's somebody out there who doesn't feel like that at all, or won't admit to it at least.

Tim Berglund (38:13):

[crosstalk 00:38:13], that's fine if you won't admit to it, I totally get that, you don't have to tweet that stuff or talk about it on the podcast if you don't want to, but yeah there's confidence and then there's, I don't know, clinical terms for when you don't doubt yourself and maybe you should a little bit, but yeah, that feeling is not one that goes away.

Leah Thomas (38:38):

Yeah, I definitely think, I think that that is I feel like something that's... For me, coming in from a little bit of a less traditional path. And I feel like, having my first software engineering internship after most of my peers have had multiple software engineering internships.

Leah Thomas (38:56):

People like talking about that feeling was really helpful for me and for people to be open about, yeah, the first few months I was super confused, there's so much going on, I didn't really know. It took me a while to ramp up and understand and being open and transparent about how long it takes to be comfortable in something, I feel like is really helpful in... it was helpful for me to gauge where I'm at and where I should be at to kind of check myself against the goals and expectations I have to not be setting them horribly high, but also holding myself to a high standard that there are parts of this codebase that I want to understand, but also I was there for 12 weeks, I can't understand everything.

Leah Thomas (39:40):

So, talking to people about the idea of jumping into something with the complete unknown, and then looking at something new, even after you've been working on a part of Kafka for so long and being like, yeah, I don't know what that is, was helpful in me feeling a lot more comfortable about where I'm at.

Tim Berglund (40:00):

Such a good point, because you have the... turning the minor into the major thing where you had this other career path, and you're like no, I don't want that, I want this, and you did it, you made the decision, you did it, which is all amazing. But that's important to call out again for people who are on the junior end of their career, or even not.

Tim Berglund (40:25):

Now, there's this story you can tell yourself about, Oh, this is probably harder for me because I wasn't doing this two years ago. If I had just started as a computer science major in my freshman year, this would all be easy. And it's not. There's nothing easy about jumping into a codebase like Streams, and it never feels different. It's always great, I mean, you get into it and you master it and you have all those feelings of success, learning how to work with it and actually build things in it, but not the first time you look at code like that. It's almost as bad as days when you write time zone code or just general date and time code.

Leah Thomas (41:00):

Oh.

Tim Berglund (41:02):

Yeah. Those are still, just like you think that maybe this is the wrong career choice for me, and then [crosstalk 00:41:11] sort of some self-care after you do that. It's fine, again, it's like that for everybody, it's just terrible code. Okay, so we're almost at time, but I wanted to give you a chance to talk about Soak and performance testing and I don't know how much you got into it, but just from your perspective, what happens there? It sounds, it's a cool thing.

Leah Thomas (41:31):

Yeah.

Tim Berglund (41:31):

If you want to talk about it.

Leah Thomas (41:34):

Let's see, which was shorter? I think Soak was shorter. Because Soak you really just are creating kind of the application of a huge stream that's running for as long as we can make it run, and so for that, all I did was add a really basic Sliding Window aggregation to the topology, and then it's in there with everything else, and now when we run it to make sure that everything is working smoothly and run it for long periods of time to make sure it can last, Sliding Windows are also being tested.

Leah Thomas (42:08):

So I haven't heard, I don't know if it's broken with Sliding Windows yet, I don't think it has, but it's, yeah, a little flag system. If there's anything wrong in what I wrote with Sliding Windows, hopefully we would find it in Soak, and then as we create more things with Streams, if anything that we add breaks Sliding Windows, in theory Soak would help us identify some of those issues.

Leah Thomas (42:34):

And then the benchmark test, I actually want to expand some, so maybe I'll work on that sometime later this year. But it's about making sure that Streams is working, like the throughput is as high as we want it to be, and that the Streams application isn't slowed down by running Sliding Windows, that something is not getting clogged up or the throughput is low, or we're losing threads or anything like that when we're doing this Sliding Window aggregation. So, it's another benchmark test.

Leah Thomas (43:07):

The other ones, you can do other types of windowing in benchmark tests, I think you can do suppression in benchmark tests, but what I'd really like to do is, there's essentially two implementations of Sliding Windows, one that uses a forward iterator for the window store and one that uses a reverse iterator through the window store. And the reverse iterator is essentially the default implementation, and I think that it's more efficient, but what I'd really like to do is be able to test it on the benchmark test and see what the most efficient way to run Sliding Windows is, to potentially tweak the implementation.
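
The kind of comparison Leah describes can be sketched without Kafka at all. The example below is a deliberately naive stand-in: a `TreeMap` plays the role of a window store, the two methods walk it in opposite directions, and a simple timing loop reports the average cost of each. All names are hypothetical, and the timings say nothing about the real Streams implementations. A serious measurement would use a proper harness such as JMH.

```java
import java.util.TreeMap;
import java.util.function.ToLongFunction;

final class IterationBenchmarkSketch {

    // Written on every run so the JIT cannot discard the work being timed.
    static volatile long sink;

    /** Sums the values while walking the map from the oldest key to the newest. */
    static long sumForward(TreeMap<Long, Long> windows) {
        long sum = 0;
        for (long value : windows.values()) {
            sum += value;
        }
        return sum;
    }

    /** Sums the values while walking the map from the newest key to the oldest. */
    static long sumReverse(TreeMap<Long, Long> windows) {
        long sum = 0;
        for (long value : windows.descendingMap().values()) {
            sum += value;
        }
        return sum;
    }

    /** Returns the average nanoseconds per call over the given number of runs. */
    static long averageNanos(ToLongFunction<TreeMap<Long, Long>> fn,
                             TreeMap<Long, Long> windows, int runs) {
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            sink = fn.applyAsLong(windows);
        }
        return (System.nanoTime() - start) / runs;
    }

    public static void main(String[] args) {
        TreeMap<Long, Long> windows = new TreeMap<>();
        for (long start = 0; start < 100_000; start++) {
            windows.put(start, start % 7);
        }
        // Warm-up runs, so the measured runs are less dominated by JIT compilation.
        averageNanos(IterationBenchmarkSketch::sumForward, windows, 50);
        averageNanos(IterationBenchmarkSketch::sumReverse, windows, 50);

        System.out.println("forward ns/run: "
            + averageNanos(IterationBenchmarkSketch::sumForward, windows, 200));
        System.out.println("reverse ns/run: "
            + averageNanos(IterationBenchmarkSketch::sumReverse, windows, 200));
    }
}
```

Both directions must produce the same aggregate. Only the cost differs, which is why it has to be measured rather than assumed.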

Tim Berglund (43:48):

Awesome, yeah. It's, I guess, important to check and see which one actually runs faster, but it's nice that that infrastructure is in place, and the Soak test sounds especially cool. That's like a giant topology that's got joins and groups and reduces and processor things and just whatever, and you just hit it with data for a long time.

Leah Thomas (44:09):

Exactly.

Tim Berglund (44:10):

Love it.

Leah Thomas (44:11):

Yeah, and you just see, you hope that everything works well for a long period of time.

Tim Berglund (44:17):

Yep. And if it doesn't, you fix it.

Leah Thomas (44:19):

Exactly.

Tim Berglund (44:21):

Hey Leah, I think this has been really helpful for not just people who are new as developers and new at Kafka, but people who want to use Streams and want to understand Windowing better. So, this has been great. My guest today has been Leah Thomas. Leah, thanks for being a part of Streaming Audio.

Leah Thomas (44:38):

Yeah. Well, I've enjoyed talking to you too, Tim. And thanks for having me.

Tim Berglund (44:43):

Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST. That's 6-0-P-D-C-A-S-T to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021 and use it within 90 days after activation. Any unused promo value on the expiration date will be forfeit, and there are a limited number of codes available, so don't miss out.

Tim Berglund (45:12):

Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me @tlberglund on Twitter. That's T-L-B E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out in our community Slack. There's a Slack sign-up link in the show notes if you'd like to join.

Tim Berglund (45:33):

And while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So, thanks for your support and we'll see you next time.

Ever wonder what it's like to intern at a place like Confluent? How about working with Kafka Streams and creating your own KIP? Well, that's exactly what we discuss on today's episode with Leah Thomas. Leah Thomas, who first interned as a recruiter for Confluent, quickly realized that she was enamored with the problem solving the engineering team was doing, especially with Kafka Streams. 

The next time she joined Confluent's intern program, she worked on the Streams team and helped bring KIP-450 to life. With KIP-450, Leah started learning Apache Kafka® from the inside out and how to better address the user experience. She discusses her experience with getting a KIP approved with the Apache Software Foundation and how she dove into solving the problem of hopping windows with sliding windows instead.
