Danica Fine is a developer advocate with Confluent. She also has some recent real-world experience building data pipelines in the wild. So I wanted to talk to her about that process: lessons learned, what tools she used, what was easy, what was hard, how it all unfolded, and how it relates to the Data Pipelines course we have on Confluent Developer.
Speaking of Confluent Developer, Streaming Audio is brought to you by Confluent Developer. You may already know that. That's developer.confluent.io, a website with everything you need to get started learning Kafka and Confluent Cloud, all kinds of resources. When you do any of the examples or labs on Confluent Developer, you'll probably sign up for Confluent Cloud. When you do, use the code PODCAST100 to get an extra $100 of free usage credit. Now let's get to the conversation with Danica.
Hello, and welcome to another episode of Streaming Audio. I'm your host, Tim Berglund and I am super happy to be joined in the virtual studio today by Danica Fine. Danica is a developer advocate. She's the newest member of the developer advocate team here at Confluent. Danica, welcome to the show.
Hi, it's great to be here, Tim. Thank you.
Now we're going to talk about data pipelines today. Because data pipelines are a thing that you've done or a thing that you've talked about at Kafka Summit. And I want to get into that. I want to obviously plug the Data Pipelines course on Confluent Developer, I mean, everybody knows we're going to do that. Nobody would be surprised at the end when we say that you should watch the Confluent Developer Data Pipelines course. But before we do that, Danica, tell us about yourself. I mean, you are giving talks out in the world and doing things and being visible. But if there's somebody who doesn't know you, tell us about you, how'd you get to be here?
Yeah. So prior to this, I spent about three years working on a streaming data infrastructure team, and there we were the first group at our company to really implement a streaming pipeline. We were moving from these monolithic applications that were doing sort of micro-batch processing, and we wanted to see if we could actually make it real time. So I got a lot of experience there playing around with Kafka Streams and seeing how that would actually fit into a legacy architecture. And yeah, it was a great experience seeing what works and what doesn't in the context of a large company like that. So it was a lot of fun, and I would totally go back and do it again, especially knowing what I know now.
Right. You get to build a second one.
Yeah.
I guess that's good now that your profession is to advocate on behalf of people who build things with these technologies and build, among other things, streaming pipelines. I guess it's good that you'd want to do it again.
Yeah.
Yeah. With even the same tools. That's a good sign.
Absolutely.
So, I want to back up a step. I think it's been a while since we've had a purely pipelines-focused episode on the show. There are a lot of pieces we frequently cover different aspects of, and you could assemble them into good pipeline knowledge if you listened to the whole catalog and remembered everything in every episode, but that's tough to do. So why data pipelines? I mean, you can speak generally, you don't have to get into what you were doing at your previous employer or what the business motivations were there. But just in general, remind us of the reasons people want to do this.
Yeah, absolutely. So, prior to moving to an event-driven data pipeline, we were leveraging a micro-batching sort of architecture, and there were many benefits to moving away from that. Namely, you obviously want to react to data in real time to get the best and most up-to-date results. And then moving to something like Kafka, you get higher resilience and increased scalability; that was a major benefit for our pipeline. And then of course decoupling the source and target systems. And that's just a small number of the benefits of moving toward this. To our stakeholders, it was a no-brainer; we had to do it.
Yeah. How micro is micro in micro-batch? What kind of latency difference did you see, if you remember? It's a very specific question.
It's been a while since I looked at the numbers.
Right, yeah.
So I mean, in the micro-batching setup, we were on a millisecond latency. It was a really, really super fast application that we had going, and there wasn't really a problem with the speed. We weren't concerned about that, and we actually weren't going to get too much out of moving to an event-driven architecture in that regard. But what we did get was the scalability and the resilience, right? In the previous micro-batching architecture, we only had one instance of the application really running, and then some hot backups. But here we were able to move over to a more distributed and scalable architecture.
Yeah, yeah. You get all the fault tolerance characteristics of Kafka cluster and Kafka consumers and all that kind of stuff.
Mm-hmm (affirmative).
So that makes sense. Cool. So walk us through the whole process. Like I said, I mean abstract away the details, because you're talking about something you did at some other company, but just in general, what are the steps that you, if you had this to do again, what are the steps you go through?
Yeah. So, as I walk through this, for the viewers, you'll see that this actually aligns really, really well with the Data Pipelines course. So you should probably walk through-
Oh, hello Data Pipelines course on Confluent Developer.
Shameless plug.
This message was brought to you by Confluent Developer. It's developer.confluent.io, ladies and gentlemen. Okay, sorry. Go on.
Yeah. So for our company, the legacy architecture was not going anywhere, so we had to implement the data pipeline within that context, obviously. The first step was to be able to integrate with the legacy architecture. So we wrote a couple of applications that allowed our input data sources to be fed into Kafka. That was step one. Step two, we wrote a way to pipe that data from Kafka back into the legacy architecture as well. And then after that, yes?
I do want to drill down into that, so not so fast. But those are the connections on the pipe? So the pipe is not doing anything interesting yet, but it connects to the source and you can put data back into the legacy system.
Mm-hmm (affirmative).
And I've made the point a lot recently in some talks (it's mid-November 2021 when we're recording this, and I gave a bunch of talks last month) that streaming, event-driven technologies are a new thing, and you don't burn down the old thing and build a new event-driven thing.
Yeah.
You supplement, not supplant. So that's exactly what you're saying, which is awesome. And the getting data out of the legacy system, were you able to modify the legacy system to produce to Kafka, so you have a direct connection to the events that you own in that legacy system, or was it change data capture or some other thing? What's the connector on that pipe?
Yeah. So we weren't actually modifying the legacy system at all. You can kind of view this as almost like a-
I was dreaming that you were, it just sounded awesome. Oh no, it's fine. We'll throw some Kafka in there and produce events. We'll do what we want.
Yeah.
And then, but yeah, it's not. Yeah. Okay.
Well, the reason we weren't able to is that this legacy system has been around for decades and it wasn't going anywhere. And the-
It probably compares favorably in age to me.
So instead of altering the legacy system, we, in effect, wrote a sort of Kafka connector. There were mechanisms in place already, provided by the legacy architecture, to make a data connection and pull that data whenever you wanted. So we effectively wrote our own connector to that and allowed it to stream the data straight into Kafka for use later on.
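To make that concrete, here is a minimal sketch of what such a hand-rolled bridge might look like: poll whatever pull-based data-access mechanism the legacy architecture already provides, and produce each record into Kafka. The LegacyFeed interface, topic name, and field names are hypothetical stand-ins, not details from the project Danica describes.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.List;
import java.util.Properties;

public class LegacyToKafkaBridge {

    // Hypothetical view of one record coming out of the legacy system's data feed.
    record LegacyRecord(String id, String payloadJson) {}

    // Hypothetical wrapper around whatever pull mechanism the legacy architecture provides.
    interface LegacyFeed {
        List<LegacyRecord> poll();
    }

    public static void run(LegacyFeed legacyFeed) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                for (LegacyRecord rec : legacyFeed.poll()) {
                    // Key by the legacy identifier so updates for the same entity
                    // land in the same partition and stay ordered.
                    producer.send(new ProducerRecord<>("legacy.input.events", rec.id(), rec.payloadJson()));
                }
            }
        }
    }
}
```

The sink side of the pipe would be the mirror image: a consumer loop reading from a Kafka topic and calling whatever write hook the legacy system exposes.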
Nice, okay. It's not an actual connector proper, like using Connect.
Mm-hmm (affirmative).
Cool.
Not using Connect.
That's not a super mind-bending API, is it?
No.
A couple of methods and you're sort of good.
Mm-hmm (affirmative).
All right. So, sorry, interrupting, which is how Streaming Audio works. I ask questions. So you got that connection on that pipe, and it's basically the same thing on the other end: there's a hook somewhere in the legacy system that allows you to put data back into it.
Yes.
So that's nice actually. Being able to plug into it like that is... You know the meme with the growing brains, where you have the really tiny brain and then at the end there's this godlike figure? That's that. You don't normally get that, but it's a mature enough legacy system that it wanted to cooperate.
Mm-hmm (affirmative).
Yeah. All right. So, integration step one. And I stopped you right before step two. So tell me about step two.
Great. Step two is actually building the meat of the pipeline, the algorithm itself. So in our case, as we were proving that we could integrate our architecture with Kafka and really embrace an event-driven architecture, we chose the most difficult application to move over. And it was on purpose; in hindsight, obviously on purpose, to choose the most difficult algorithm. And so we really had a fun time building out a Kafka Streams application that was able to take the monolithic legacy application and break it down into a stream processing architecture. Very fun, again, in hindsight; very, very difficult in practice while you're in the middle of it.
But yeah, so we allowed our own team to start playing around in Kafka Streams, implementing that algorithm. And since we had those connectors that I told you about, moving data from legacy into Kafka and then from Kafka back out, other groups in the organization were suddenly able to play around with data in Kafka and see where it could work for them. So we went through this initial testing phase and built out this Kafka Streams application that proved that not only could we get the data into and out of Kafka, we could also very successfully implement what had previously been a sort of black-box algorithm and move it into Kafka Streams. And in that process, we increased the scalability and resilience of the application and achieved roughly the same latency that we were getting in the monolith. So that was a major win.
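As a rough illustration of what "breaking the monolith down into stream processing" can look like in Kafka Streams, here is a minimal topology sketch. The topic names and the individual processing stages are hypothetical; the actual algorithm Danica mentions is not public, so the transformation here is a placeholder.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class PipelineTopology {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "legacy-algorithm-rewrite");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Stage 1: read the raw events that the custom bridge wrote into Kafka.
        KStream<String, String> raw = builder.stream("legacy.input.events");

        // Stage 2: drop records the algorithm never cared about.
        KStream<String, String> relevant = raw.filter((key, value) -> value != null && !value.isBlank());

        // Stage 3: one small, named step of the (hypothetical) business calculation,
        // instead of a single opaque method call as in the monolith.
        KStream<String, String> processed = relevant.mapValues(value -> value.toUpperCase());

        // Stage 4: write results back out so the legacy-bound sink can pick them up.
        processed.to("pipeline.output.events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The point is the shape: each stage is small, named, and independently testable, which is the mindset shift Danica returns to in the lessons-learned discussion later in the episode.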
I like it. Sometimes the host has a question and then he loses his train of thought. We've got the Kafka Streams... Okay, you said, I knew it would come back, you chose the hardest part of the system to do this. Do you remember the decision process? This is always interesting to me. Was it sort of ad hoc, like, "I don't know, let's do this. This sounds good," which is a lot of how engineering proceeds? Or did you have options on the board and pick one with some rubric? How systematic was the choice of what to attack first?
So I mentioned that we were the streaming data infrastructure team, but we were housed under a specific organization that owned a handful of different applications and algorithms. This was the one that was closest to our team; we were directly reporting to someone whose group owned this algorithm and this application. So we probably could have chosen easier ones, we could have fought back. But it made sense, because we were working with the teams that owned this algorithm, that we would just partner with them and make it work.
Right. And there's a tension, because it... That's a wise choice in that the team that you are in effect serving gets a lot of value out of this.
Mm-hmm (affirmative).
They picked it. It's what they want. Therefore, the result, you can assume, is going to be valuable and visible to the stakeholder.
Yes.
And you want that. I'm asking about this because "Where do I start?" is a common question when you're doing any kind of refactoring of a legacy system, or chiseling away at the monolith and replacing it with something new. Which piece do you take first? And I think the best guidance is that you want to pick something that's visible but also easy, and they're usually opposites.
Yeah, unfortunately, yeah. Anytime someone asks me what to start with, I answer the same way. You want to pick something that is visible enough that it will be impressive and valuable to someone who sees it later on, so that a stakeholder can buy into it. Because unfortunately, as engineers, the technology we want to play with and implement and integrate isn't always, time-wise, the cheapest or most efficient option for the stakeholders. So you've got to make it worth their time as well, right?
You absolutely do. And it's a matter of negotiating competing agendas, because our agenda is hopefully to make the system technically better, more flexible, more performant, easier to staff for, because it's not all 15-year-old (or older) technology. There are all these engineering utilities that it's right for us to want to maximize, and they don't always obviously map onto value for the business.
Mm-hmm (affirmative).
So, investment decisions in the business are going to get made on the basis of value to the business. That's just how it goes. And so we have this task of trying to connect our agenda to the right, and frankly sort of legally mandated, responsibilities of the people spending the money, who have a fiduciary responsibility to spend it in a way that benefits the company. They can't just do it because this is how we want to build the system. So doing that mapping is important, and that topic always comes up a lot. Sounds like for you it was hard, but that's the thing I'd rather give on: if the only option is something that's super visible but difficult, sorry, you're going to do the hard thing.
Yep.
So, okay. So you built that, you proved it out, and that gives you momentum for, again, this agenda you have, which is to migrate this thing to an event-driven architecture. Maybe the business wants that, maybe the business doesn't, but showing, hey, this is successful in a way that brings value to the business gives you some credibility for moving forward.
Yeah. By improving that aspect of the processing pipeline, we definitely bought time, right?
Right.
To then move on to the next stages and really realize our vision.
Yeah. And what was next?
So as I said, we moved data into and out of Kafka successfully, and we migrated the algorithm, or enough of the algorithm that we could prove that it worked. The next step was to bring in the rest of the data that we needed to make the pipeline complete, and that involved some configuration data. We achieved that using Kafka Connect. So we spent a lot of time figuring out the right connectors to use, and then figuring out exactly what data we needed, because it was all hidden away in some legacy databases. So eventually set alarm-
Oh that's [inaudible 00:17:05].
Yeah, weird. And it wasn't all just packaged neatly with a bow on it. Bizarre.
High quality and yeah, right. Okay, well I guess things didn't work well at that organization compared to every other organization in the world.
Yeah.
No, data integration is terrible and it's always terrible and that's just how life is.
Mm-hmm (affirmative). But thankfully, even though it is kind of terrible, we leveraged the JDBC source connector.
Okay, I was going to ask.
And [inaudible 00:17:35] it was not as terrible as I thought it was going to be.
Good, okay.
Shameless plug, you can watch my Kafka Summit talk on this, where I go into a lot more detail on how that connector worked. And maybe some of the issues that we encountered.
That is a good idea, and it is linked in the show notes, obviously, because it's very relevant to this topic. That talk was sort of an informal part of Danica's pre-interview process, I can say. And employers, I don't want to make you wary of letting people speak at Kafka Summit, like, "Oh no, they're going to get hired away." It doesn't happen that often. It was just a really good talk, and I was like, hey, we should talk to this lady. So anyway, that's not exactly how the sequence of events went, but it was a good talk and you should watch it.
Thanks.
So Connect, you used Connect, the JDBC source connector, and not any fancy CDC stuff. Was there a-
No.
How come?
Yeah. So the database that we were using was actually an in-house brand of database, so CDC wasn't really an option. And also, the way that the database was set up, I'm not saying the data model was bad, but it wasn't perfect, so it really didn't lend itself well to traditional change data capture. And the way that we set up the query for the JDBC source connector, we were using query mode. We weren't just pulling all of the data from a bunch of tables; we really had to massage the data within the query to get what we wanted.
You needed a query that was doing the joining and the other things that needed to happen, because otherwise you'd hate life. Okay. That makes sense.
Yes. Yes. So it actually ended up being kind of a complicated query in the end, but we were able to get what we wanted, or at least as close to what we wanted or needed as we could. Over time, we revisited that and made some small changes to the database, added some columns that would make the query a lot cleaner. But it took a little bit of time.
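For readers who want to see the shape of this, here is a hedged sketch of registering a JDBC source connector in query mode through the Kafka Connect REST API. The connection URL, SQL query, column names, and topic name are made up; the config keys (connector.class, mode, query, topic.prefix, poll.interval.ms) are standard JDBC source connector settings. Bulk mode is assumed here purely for simplicity, on the idea that slowly changing configuration data can just be re-polled periodically; it is not necessarily what Danica's team used.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJdbcSourceConnector {
    public static void main(String[] args) throws Exception {
        // Hypothetical connector config: query mode with a join that massages the
        // legacy tables into the shape the pipeline needs.
        String config = """
            {
              "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
              "connection.url": "jdbc:somevendor://legacy-db:1521/config",
              "mode": "bulk",
              "query": "SELECT i.INSTRUMENT_ID, i.NAME, p.PARAM_VALUE FROM INSTRUMENTS i JOIN PARAMS p ON i.INSTRUMENT_ID = p.INSTRUMENT_ID",
              "topic.prefix": "config.instruments",
              "poll.interval.ms": "3600000"
            }
            """;

        // PUT to /connectors/{name}/config creates the connector or updates its config.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8083/connectors/legacy-config-source/config"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(config))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

In query mode, topic.prefix is used as the full topic name, and the connector runs the supplied query instead of discovering tables on its own, which is what makes this approach workable against an imperfect data model.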
Yeah. Okay. Sounds like... I've felt that pain. What was the last step? What did you do after that? So you had connectors, a first pass of stream processing, then the next level of "we need other data sources to join," and that's Connect and the JDBC source. What was after that?
Yeah. So finally, with our pipeline, in our minds, actually complete, the algorithm was there and all the data we needed to process was there. We then wanted to make sure it was good, that the data actually looked like we expected it to. Because, as a reminder, we took one of the most difficult applications we had available to us; it is a very visible application, and the people who are consuming this data are making decisions based off of it. So it needed to be correct, right? While we were building out this whole pipeline, the legacy application continued to produce its data, and we pulled that legacy data into Kafka using our initial connector. So now we had two streams of data: the legacy data and also what we were producing with our Kafka Streams pipeline. And we were able to leverage ksqlDB to join those streams and conduct some validation to see how different they were within a bucket of time, a couple of seconds or so.
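Here is a minimal sketch of that validation idea, expressed with the ksqlDB Java client: declare streams over the legacy results topic and the new pipeline's results topic, then join them within a small time window and keep the per-record difference. The topic, stream, and column names are hypothetical, and the code assumes both topics already exist.

```java
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;

public class ValidationQueries {
    public static void main(String[] args) throws Exception {
        Client client = Client.create(ClientOptions.create()
            .setHost("localhost")
            .setPort(8088));

        // Stream over what the legacy application keeps producing.
        client.executeStatement("""
            CREATE STREAM legacy_results (record_id VARCHAR KEY, result DOUBLE)
              WITH (KAFKA_TOPIC='legacy.output.events', VALUE_FORMAT='JSON');
            """).get();

        // Stream over what the new Kafka Streams pipeline produces.
        client.executeStatement("""
            CREATE STREAM pipeline_results (record_id VARCHAR KEY, result DOUBLE)
              WITH (KAFKA_TOPIC='pipeline.output.events', VALUE_FORMAT='JSON');
            """).get();

        // Join the two result streams within a couple of seconds of each other and
        // materialize the per-record difference for inspection.
        client.executeStatement("""
            CREATE STREAM result_validation AS
              SELECT l.record_id AS record_id,
                     l.result    AS legacy_result,
                     p.result    AS pipeline_result,
                     ABS(l.result - p.result) AS diff
              FROM legacy_results l
              INNER JOIN pipeline_results p WITHIN 2 SECONDS
                ON l.record_id = p.record_id
              EMIT CHANGES;
            """).get();

        client.close();
    }
}
```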
That's pretty fancy.
Yeah. It was really, really-
I mean, that's like a nice validation scenario. You could create knowledge and have [inaudible 00:21:36] confidence and all these things that you always want.
No, it was absolutely wonderful. That probably took the most time out of the entire thing, because we really wanted to make sure that our ksqlDB queries were doing what we needed them to do. And then we used the results to prove to the stakeholders that this was what we wanted. And based on those results, we were able to go back to the algorithm and tweak it, because where we had maybe missed something in one of the processors, we were able to correct that.
Absolutely.
And yeah, really prove that it worked.
How cool is that? I love it.
Yeah.
Lessons learned? Looking back, you've kind of hinted at some of those along the way, but to summarize, what are the things that you took away from this? If you had to do it over again, or if it was your job to help people understand how to do this kind of thing, what would you want them to know?
Yeah. So the first thing, I think, just in terms of Kafka Streams and streaming in general, is that you have to think in a certain way to really be successful at it. What we really learned as we were moving that algorithm out of a sort of black-box monolith and breaking it down into processing steps, whether you're doing it in Kafka Streams or even in ksqlDB, is that you really need to think about exactly what you're doing with the data in each step to be most efficient. For us, in Kafka Streams, that meant breaking it into separate processing stages. And in ksqlDB, that would be multiple queries that you'd implement.
Right.
So you really needed to take a step back and think about how to break up your algorithm to be most efficient.
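To make that lesson concrete, here is a small hypothetical example of the "multiple queries" approach in ksqlDB, again using the Java client: rather than one sprawling statement, each persistent query does one job and feeds the next. It assumes a raw_events stream has already been declared over its topic; all names are illustrative only.

```java
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;

public class StagedQueries {
    public static void main(String[] args) throws Exception {
        Client client = Client.create(ClientOptions.create().setHost("localhost").setPort(8088));

        // Stage 1: just clean and filter the raw input.
        client.executeStatement("""
            CREATE STREAM cleaned_events AS
              SELECT record_id, amount
              FROM raw_events
              WHERE amount IS NOT NULL
              EMIT CHANGES;
            """).get();

        // Stage 2: do the aggregation on the already-cleaned stream.
        client.executeStatement("""
            CREATE TABLE totals_by_record AS
              SELECT record_id, SUM(amount) AS total
              FROM cleaned_events
              GROUP BY record_id
              EMIT CHANGES;
            """).get();

        client.close();
    }
}
```

Splitting the work this way keeps each intermediate topic inspectable, which is what made the validation step earlier in the conversation possible in the first place.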
Got you.
Yeah.
Any connect lessons?
Yeah. I mean, I love Connect. I pretty much owned that component for most of that project. And even though it's a little bit tricky at first, it's invaluable, especially with a legacy architecture; you need to find a way to plug into that, and really that's what Connect allows you to do. As you're playing around with data pipelines as we were, being able to bring in legacy data and connect with the rest of the legacy architecture is just so important.
Yeah. And not have to write all that code yourself.
Yes. No, we learned that, [crosstalk 00:24:11].
Oh, go ahead.
Oh, we built sort of our own connectors in the beginning, right? To plug in to that legacy architecture. So we've done it both ways, and I will tell you that leveraging Kafka Connect was definitely the easier of the two.
Admiral Ackbar is right, it's a trap to write your own connector.
Mm-hmm (affirmative).
And that's saying something, because like you said, it's finicky. There are a lot of little configuration dials and you have to get them right. It doesn't just work. But it's a more complex problem than it seems.
Yes.
Because as developers, we look at it and we're like, "Oh, come on. I'm just reading from this database and putting it in a topic. I can do that in an afternoon." But there are corner cases, and it's good to have that done for you. How about a last question. This is a successful streaming pipeline; you know how to do this, and you know how to explain to people how to do it, but did you get a sense of boundaries? Would there be things where you wouldn't want to do this? I always think it's important, especially in our line of work, when we're trying to say, "Hey, this is a good idea. Let me show you how simple this is. It's understandable. You can do it. It's the future." That's sort of what we do. I think credibility demands we also say, "But you know, it's not everything." So what did you see there?
Yeah. We were definitely riding that high afterwards. Everything works, streaming is incredible, everyone should use it. And a lot of people in the company were coming to us saying, "Okay, your team implemented this. Obviously it's a very important algorithm. How do we use Kafka Streams now?" So we had a lot of groups sitting down with us asking, "All right, how do we implement this as an event-driven pipeline?" And it's really tough. It's really tough to say no. Some people approached us with an architecture diagram that would involve, "Oh, well, we're going to process it in Kafka Streams and then put it back into a database and then read it back out," and whatever, and I'm like, "Maybe this isn't the right way to do it." When you have a hammer, everything looks like a nail, but sometimes it's just not a nail and you should move on.
And that answer is contextual too, because that speaks to architectural constraints outside of the pipeline, where maybe the problem is perfectly pipeline-shaped, but there are constraints elsewhere in the business that just make it ugly and not worth it.
Yes. Yeah. And it's definitely difficult somewhere with so much legacy architecture. And it was difficult because a lot of these teams were looking to the future, and yes, I do believe that an event-driven architecture is the way to go when everything lines up, right?
Why would we be on this podcast if we didn't believe that? Definitely true. Yes. But the part that you did build that way was certainly a successful project. Like you said, 10 out of 10, would do again.
Mm-hmm (affirmative). Absolutely. And we did have a lot of groups that were able to implement something similar because we spent all this time proving that Kafka could work, that Kafka Streams would work, and all these technologies could successfully plug in to the architecture. So yeah, we had a lot of successes follow after that. I'm not saying we were an inspiration to the company, but we were an inspiration to the company.
My guest today has been Danica Fine. Danica, thanks for being a part of Streaming Audio.
Thank you so much, Tim. I appreciate being here.
And there you have it. Thanks for listening to this episode. Now, some important details before you go. Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io, a website dedicated to helping you learn Kafka, Confluent, and everything in the broader event streaming ecosystem. We've got free video courses, a library of event-driven architecture design patterns, and executable tutorials covering ksqlDB, Kafka Streams, and the core Kafka APIs. There's even an index of episodes of this podcast. And if you take a course on Confluent Developer, you'll have the chance to use Confluent Cloud. When you sign up, use the code PODCAST100 to get an extra $100 of free Confluent Cloud usage.
Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me at TL Berglund on Twitter. That's T-L B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video if you're watching and not just listening or reach out in our community Slack or forum. Both are linked in the show notes. And while you're at it, please subscribe to our YouTube channel, and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcast, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support, and we'll see you next time.
Implementing an event-driven data pipeline can be challenging, but doing so within the context of a legacy architecture is even more complex. Having spent three years building a streaming data infrastructure and having been on the first team at a financial organization to implement Apache Kafka® event-driven data pipelines, Danica Fine (Senior Developer Advocate, Confluent) shares insights into the development process and how ksqlDB and Kafka Connect became instrumental to the implementation.
By moving away from batch processing to streaming data pipelines with Kafka, data can be distributed with increased scalability and resiliency. Kafka decouples the source from the target systems, so you can react to data as it changes while ensuring accurate data in the target system.
In order to transition from monolithic micro-batching applications to real-time microservices that could integrate with a legacy system that had been around for decades, Danica and her team started by developing Kafka connectors to connect to the various source and target systems.
As a final tip, Danica suggests breaking algorithms into processing steps. She also describes how her experience relates to the Data Pipelines course on Confluent Developer and encourages anyone who is interested in learning more to check it out.
EPISODE LINKS