You know the world will always have databases in it and so integrating them with Kafka will always be an interesting proposition. On the show today, I talked to Adam Mayer and John Neal, they're with a company called Qlik and they make change data capture solutions. Now we've talked about CDC on the show before, but you know what? There is always a new angle, and honestly, this was an interesting conversation. I enjoyed talking to these guys about CDC, I think you'll enjoy listening in.
First, always good to point out that Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io, the website that's got everything you need to start learning about Kafka: instructional videos, event-driven design patterns, executable tutorials, long-form articles, all that stuff. And when you do any of the tutorials or exercises, you want to sign up on Confluent Cloud, and when you do, you want to use the code PODCAST100 for an extra $100 of free stuff. Now let's get to the show.
Hello and welcome to another episode of Streaming Audio. I am, as usual, your host, Tim Berglund, and I'm joined in the virtual studio today by Adam Mayer and John Neal. Adam and John are with a company called Qlik and we're going to talk about data integration and CDC today. Adam and John, welcome to the show.
Thank you, Tim. Great to be here.
Yeah, same for me as well.
Tell us a little bit about yourselves. Who are you? How did you come to be in your role? What do you do at Qlik? And if, dear listener, if you don't know what Qlik is, obviously we're going to cover that. But Adam, tell us about yourself.
Who am I and what do I do? So, hi, Adam Mayer from Qlik. I'm actually based in the UK. Been at Qlik for well over five years now which is good, it's been a fun ride and I'm in the global products team under product marketing. A lot of things I do, I'm responsible for the CDC streaming proposition first of all in Qlik and work very closely with you guys at Confluent and with my colleague, John, here.
Some of the things I've done at Qlik is help out go-to-market activities, particularly around the interesting things like IoT and GDPR as well for marketing, so it's quite varied. But generally, I've got a strong kind of technical background in computing in general, spending more years than I care to remember, 20-odd years now I think it is, so always been an avid follower of technology. I kind of cut my teeth when my dad, who used to work for IBM, brought home this big, hulking great computer with a tiny little screen on it. It was all black and there was a little flashing cursor and I remember thinking, "Oh, what's that?" And then I was kind of hooked from there.
The rest is history.
The rest is history, exactly. So that's a little bit about me and [crosstalk 00:03:02].
Thank you. I love it. John, tell us about yourself.
Well, just a little bit about me as well then. I've been at Qlik for just a couple of years now. I came in as part of the Attunity Acquisition. So the data integration solutions that we'll be talking about today are from the Attunity portfolio from back in the day. I've been working in the data integration space for 20 some odd years, so quite a long time. Been working with Kafka since 2015.
I actually wrote the first production-ready Kafka adapter for one of our esteemed competitors, who I won't mention here. I've been working kind of in the nitty gritty, I'm a full-time geek and I really enjoy the things I get to do. I work at Qlik on the department engineering team and basically diving into new things and figuring out how to make them work.
Nice. 2015 was a banner year in the Kafka community for data integration, that's the year that Connect became a thing.
Yeah, it was, and I wrote the Kafka handler actually with one of your early customers as well. So it was a direct request from a customer that wanted to integrate. And that also then forced me to integrate the Schema Registry just a short time later as that started becoming a thing. And so again, I've been around the block just a little bit here.
Nice. Yes, longer than I have. It's an honor to have you on the show. So we're talking broadly about CDC today, but there's always a lot to dig into on that topic, and maybe good to set the stage. So what we'll call legacy or traditional data management processes or integration practices or what I like to call, "the received tradition," the thing that has come down to us before these new things we're trying to do. Describe that and what's wrong, what's bad about that world. A question for either of you, jump ball.
John, do you want to jump on that first?
Sure. So I'm not sure I could really pontificate much on what's wrong per se. What I've seen and how things have evolved has been an uptick in the desire, over time, to have data streaming from more static data sources. So if we think about message oriented middleware which you could broadly say, Kafka was an open source at the time [crosstalk 00:05:47] solution, if you will. An awful lot of the data moving back and forth was of a more realtime nature. Think about precursors of IoT, think about error logs, message streaming, the kind of things that you might have seen back in the day. But things have evolved to have a greater desire to have actual business data streaming into the environments as though that business data is being developed, as it's being written into a static data source or into a database, right?
So wanting to get that data out and in motion again so that it can be acted upon in real time. An example might be, an order comes in through an order management system of some sort. In the old days, there might be reports that go off or things that people use to then process. But we might want to have a more real-time response to that, to get a workflow going automatically from that. And so, getting those changes from the database and getting them moving again into our infrastructure is where things have evolved. I guess again, not finding fault, but I'm seeing that sort of change: more and more interest in that real-time movement.
I appreciate you reframing my question in that way. I'm sort of taking a pathology based approach and you're, I think, looking through more of a developmental lens, which is a lens I try to use whenever I can. So I think that's good. It's not that it was wrong, it's just it was what we were doing at the time and now we have needs because our context has changed to do other things and so we want to stream stuff and...
Yeah. We've done a few episodes on CDC in the past. Imagine someone, though, is new to the show, new to the space, and they think I'm talking about the Centers for Disease Control. Most people in our space know what CDC is, but some don't and, dang it, those people are welcome here and I want to take care of them. So what is CDC? John, you were just hinting at it but if I had to give a definition, what would that be?
Well, CDC is an acronym, obviously, one of those three letter words. It is short for change data capture and what the technology is about is this idea of capturing changes as they occur in a database in real time. Very typically, but not always, by reading the transaction logs, sometimes it occurs by triggers that have been put in place on tables. And for database people to understand what a trigger is, basically it's an event that fires that says, "Something happened," and then some [inaudible 00:08:47] will figure out what happened and cause changes to be propagated into the change stream.
So it can vary. The commercial solutions that are out there, process by reading the transaction logs directly and are less invasive, intrusive, if you will, than say a trigger on a database or a query. "Tell me everything that's happened in the last five minutes," might be a question that would be asked by a query based solution. Okay?
And so, in the case of our solution, we're reading the transaction logs and so the moment that transactions are committing, we're seeing them, picking them up and beginning to propagate them so that there's less latency. But in the end, CDC is just about capturing changes as they occur from a database.
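To make the trigger-based and query-based approaches described above concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not how Qlik's log-based product works; SQLite is only a stand-in, and the `orders` table, the `orders_changes` change table, and the `poll_changes` helper are hypothetical names chosen for illustration. Triggers record each change, and a polling query asks for everything recorded since the last position it saw.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);

-- Hypothetical change table that the triggers append to.
CREATE TABLE orders_changes (
    seq      INTEGER PRIMARY KEY AUTOINCREMENT,
    op       TEXT,
    order_id INTEGER,
    status   TEXT
);

-- Trigger-based capture: each write to orders also writes a change row.
CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status)
    VALUES ('INSERT', NEW.id, NEW.status);
END;

CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status)
    VALUES ('UPDATE', NEW.id, NEW.status);
END;

CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status)
    VALUES ('DELETE', OLD.id, OLD.status);
END;
""")


def poll_changes(conn, last_seq):
    """Query-based step: fetch every change recorded after last_seq."""
    rows = conn.execute(
        "SELECT seq, op, order_id, status FROM orders_changes "
        "WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    # Remember the highest sequence number seen so the next poll resumes there.
    new_last = rows[-1][0] if rows else last_seq
    return rows, new_last


# Simulate application writes against the source table.
conn.execute("INSERT INTO orders (id, status) VALUES (1, 'NEW')")
conn.execute("UPDATE orders SET status = 'SHIPPED' WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")
conn.commit()

changes, last_seq = poll_changes(conn, 0)
for change in changes:
    print(change)
```

The sketch also shows the cost John alludes to: every write to `orders` now performs a second write inside the source database, and changes are only seen when the poll runs. A log-based reader avoids both by reading what the database already records in its transaction log.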
Got it. And you just differentiate between the log based and query based forms; log based being the harder to implement but more sophisticated. You see everything because you're [crosstalk 00:09:50].
Yeah, more sophisticated in terms of performance, in terms of the real-timeness. Devil's in the details with some of this stuff.
Least invasive as well, right?
What's that, Adam?
Least invasive on the kind of source systems [crosstalk 00:10:09].
Exactly, unlike say, the trigger based thing and that's database trigger, not trigger, like you see somebody ordering a croissant and you have a panic attack or something like that. Although, sometimes not all that different, trigger based CDC, not a thing you want, but again I'm making a value judgment, but that's a more likely... [crosstalk 00:10:35].
[crosstalk 00:10:35] where triggers are the only approach. There are databases that don't have logs that can be read so there are situations where it's not feasible to troll the transaction log looking for changes and so [crosstalk 00:10:50].
Can you give me an example of one?
Can you give an example of a database where you'd have to do that?
Sure. Teradata would be a prime example of a situation where, while it is a true database, under the covers there are logs that have the ability to roll back transactions and things like that. They're not accessible, so the logs aren't accessible in a way that would allow a product to read the logs directly. And so, for a solution like Teradata, some of the other data warehouse type solutions that are out there, not all, but some, then require us to find alternate ways to capture changes that might be occurring.
There you go.
But log based is definitely the preferred approach.
Certainly. Now, John, you framed the rise of CDC in terms of the rise of real-time business requirements and real-time computing requirements. People who build systems are under pressure to make things that respond right away and we could talk about the drivers in the world that have caused that but as a consumer of things, I expect my phone to wiggle right away when I need to know about something and that trickles down to the lives of developers everywhere. Now we need real time and so CDC and Kafka become interesting, because now I can get an event that I consume, that I respond to and I can make application functionality, build that application functionality on that basis.
But that's not the only driver here. How about, just to throw some sort of canned phrases around like "data silos." Now, any large organization that's a problem. The great blessing of client server, a generation, a generation and a half ago was that now we can build all these small scrappy systems that are responsive to business requirements locally. And then a few years later we're like, "Oops, okay, now there's just a thousand databases and nobody knows what's true anywhere." So all this siloed data, the days of the mainframe, it was all in one place, but kind of the lives of most of us who have grown up and who are currently working, writing software have been lives characterized by databases everywhere, which are inconsistent and siloed. So talk about that problem. And it's less obvious to me how CDC and Kafka fix that on the face of it. So how do we approach that?
True. Adam, do you want to take that? Or if not, I'm happy to.
I can take it from a higher level. You are my Kafka guru, but I think from a general business problem, it's [inaudible 00:13:36], right, because you've got so much data out there and the challenge is how do you unlock that value? Like you mentioned, mainframe is one, you also have those large sort of engineered ERP-type systems, like SAP and Oracle and those kinds of things, as well as all the other kinds of databases and different sources, you've got internal and external data. So the challenge is kind of knowing what data you've got, what insights are in there and then how you can kind of harness that. I think just from a kind of Qlik perspective, that is one kind of sweet spot that we can help with.
Not only on the kind of data integration side to actually get access to the data, bring it into platforms like Confluent to stream it for the real-time stuff, but also start joining the data, and the real power comes when you can actually use combinations of data to then start analyzing and finding the insights, making informed decisions and things like that. So just from a kind of high-level perspective, that is definitely Qlik's sweet spot in terms of helping organizations combine data a lot more easily.
I had a conversation last month. We're recording this in early November 2021, and last month I was all over Europe and a little bit in the Middle East. I was a part of a tour of marketing events that we were doing and got to talk to customers in person again, which is amazing because it's been a while and there are just conversations you have over dinner that you don't otherwise. And SAP integration came up at one point and that's a difficult thing.
So obviously, and I'm talking about this big software company like they're not listening, and they could be because this is a podcast, so with all respect. There are people who might want data out of that and into their sort of data in motion, real time, Kafka, Confluent life. And it's hard to do, because of course, if you're that vendor, you want to keep data inside your garden because that's where you believe that you can serve your customers best and, coincidentally, also where you could extract value from that. So it's not easy to get out of there, but do you guys have an SAP integration story? And by the way, I don't know the answer to that question. So this isn't like some [inaudible 00:16:03] tell us about your [crosstalk 00:16:03] don't know.
The answer is absolutely we do, so it's even better that this wasn't teed up in some fashion for you. So we do have the ability to extract data from SAP and we're able to do that natively and that is key. So we're actually able to leverage, under the covers, the SAP APIs that are needed to decode some of the data. So while SAP itself is the application, the data lives in a database, right. Historically it's been in Oracle or something like that, SQL Server, but more recently, SAP's been encouraging people to migrate toward HANA, right, and starting to store their data-
Even for their transactional stuff?
Yes. Even the transactional stuff.
So HANA becomes a native host for SAP data as well and so you think rows and columns and tables in the context of getting the data out. So in that context, well, it's Oracle, I can capture from Oracle so I could get this SAP data, and the reality is not so much. While you still have the ability to... Anybody would have the ability to sort of just query a table behind the scenes, not through the application. SAP does a lot of tricky stuff under the covers in terms of how it manages the data. So it's not evident from just a random query what the heck is stored in there. So our ability, which is somewhat unique, to be able to go after the data in SAP through the APIs brings a huge value because we're able to decode the data and present the data still in a tabular, rows-and-columns sort of form that then becomes messages in Kafka that then become something that can be acted upon downstream.
And understood by mortals, without reading the book of the dead or something, that decoding is done by your layer.
Yeah, it's just SAP think of... Some of the data that's stored there, think of tables that are stored within tables, right but you don't see the tables that are inside the tables. All you see are the tables that are on the outside. And so it becomes tricky then to try to understand what's going on under the covers.
My respect to the teams that build that layer.
It is a unique skillset.
It is a unique skillset and a unique maybe set of personal resources to bring to the table. Reminds me of something completely unrelated, but just kind of riffing on that painful integration layer stuff. I use a travel app called TripIt. Haven't used it a lot for the last year and a half, but I was on this big trip last month using it all the time and the big value prop initially was you get your itinerary email from wherever. You do your car here, your hotel here, this airline, and you just forward it to one email address that [parses 00:19:03] it and puts it into this nice singular view. Cool, that's nice. This message is not brought to you by TripIt. I'm just saying I use it and I like it, but you think there's somebody who writes the code, who parses those emails and you just want to give them a hug. That's not easy work, but that's really kind of core. So in that case, a lot of what you guys do is plumbing, reading logs.
Absolutely. As is Kafka, yeah?
I mean, I tell lay people who don't know what I do, tech lay people. I say, "Well, it's like plumbing. On a good day, you're not touching plumbing. Something has gone very, very wrong, if you're touching plumbing. You're using fixtures and that's where you want your life to be." I keep the plumbers excited about fittings and [inaudible 00:19:52] and pipes and that kind of, so it's critically important to our literal health that that works well. So plumbing is not a [crosstalk 00:19:59].
Not at all sexy from a product perspective. [crosstalk 00:20:03].
Yeah. But then you get into layers like this kind of integration where it's up the stack a little bit from pipe to, there's some really hard work that you're doing there in that kind of integration. And that's not intended to be our subject here. It's just interesting to me that that's there and it's admirable work from a developer's perspective to look at that and see that being done.
Now, plumbing is extremely important.
Yes. Other things like why this is a part of our life, John, you started out framing it in terms of real time. Maybe, guys, expand on that. Again, I always think of the received tradition and contrasting that with where we're going now and the received reporting or analytics tradition, the ETL world, was you pull stuff out at night because things are slow then. And you've got all this work to do, and you do all this pre-computing and transform things into this other kind of schema loaded into this other database, because in the morning, Crystal Reports needs to run and run this report and maybe it gets printed out and put on the desk of someone who's in an important room. Even if it's not printed out, for a long time that software maintained this idea of pagination.
So this data, we have this process that takes data from a transactional system and puts it somewhere so I can put it on pages to look at. And that, again, there's nothing bad about that. That was a game changing technology that worked, but that's not where we are now. So contrast that with where you see things being now in the analytics world and how Qlik fits into that.
Yeah. I can take, well, it fits in very nicely with the kind of vision statement, if you like, that we have at the moment. So, what you were talking about there was very much kind of oriented around batch processing. And if the last two years have taught us anything now, [inaudible 00:22:08], the need to be able to kind of react in the moment, right, and be able to kind of pivot on that dime and be more data informed in order to sort of make the right decision going in the right direction, that's absolutely paramount now and the ability to kind of feed data in real time is quite foundational to be able to do that, right. Because if you're analyzing data, like you said, in those reports, you're very much kind of looking in the rear view mirror, aren't you, in terms of what happened yesterday, what happened the day before that, and trying to base the decisions on that. The ability to keep the data in motion, which Confluent and Kafka does-
Love it when you guys say that.
Yeah. To be able to do the analysis on the freshest data that you've got then really allows you to grasp what I like to call the power of now really, and you can start acting in the moment. And then if you can have, again, sort of people and technology, not just technology, but people as well is quite a key part. But if you can have technology that can then sort of start alerting and start doing triggers, kind of threshold alerts I'm talking here, you can start notifying people that something's changed in the data and something needs attention. So those reports change from being static to much more active and then you can start creating a more active environment even, streaming dashboards and things like that potentially.
But where we see it going is going beyond just sort of what is business intelligence today, which is traditionally kind of quite static reporting on the past, but it's not really designed to compel action, whereas keeping the data in motion, if you can build on that, you can start shifting into more of a state of active intelligence, if you like. And that's the kind of vision statement where we're going in terms of being able to compel action and take action based on kind of data-driven or data-informed decisions, and the foundation of that is real-time data.
Because that action is a decision that someone is making and it is closer to the event that informs the decision, closer in time. It's not what happened yesterday or last week, but some, oh, something just happened and now I need to do something differently.
Yeah, exactly. And if you've got that foundation and then you can build upon it, you can exactly, either make that supply and map information, give the right insights to compel that action or make the decision, or take it a step further in certain use cases and actually automate the decision itself between applications or something like that. So you can start automating workflows, and one of our key, I think, key things we bring to customers is speed. The speed of doing things, the speed of getting data. So not only trying to make life a little bit easier for developers, you mentioned it's a lot of heavy work, heavy lifting going on in the background in creation of pipelines and things like that. Our solutions try and make the development life a little bit easier in reducing complex processes, shall we say, all the way through to actually getting the data, making it more human readable, understandable, and actually start showing insights and then driving the kind of compelling actions.
So it could just be something as simple as that report that you used to see and that now gets kind of emailed to you with the bit that's most important in order for you to take action, or it could be that we know what it is. We know what action needs to happen. If it's a repetitive manual process, for example, then you can start automating that [inaudible 00:26:02] as well.
Operationalizing decisions in software that people make.
Yeah, yeah. Absolutely.
Love it, love it. I see an insight that I get from Zhamak Dehghani, the founder of Data Mesh, is that those decisions are getting pushed down in the org chart. Well, like you just said, Adam, sometimes turned into software, that's pretty far down in the org chart. The individual contributor that's not even an individual, but the received tradition is that reporting informs decision makers who are in oak paneled rooms and wear suits, according to stereotype. And now that goes down the org chart and so the people who are taking action based on stuff that happened could be anywhere. They could be leaves on the tree, organizationally, people doing their jobs. Future of CDC, future of data integration. What's next? What hurts now that you think could hurt less in two years?
That's a tough question.
Yeah, we didn't put that in the notes, did we? But I just thought of it. I thought you guys probably have some idea.
I would say, I don't know, John, I'll take it and then you can either expand on it or rip it apart, but rather than just focusing just on CDC, I think one thing we talk about a lot is, what do you do with the data after it's landed? Once you've got it in motion and you put it wherever it needs to go, kind of what happens next? So, I mean, what we're doing today that we can kind of help in, outside of the streaming side of things, is the sort of data modeling, data warehouse, data lake creation, trying to help simplify that. But then you can put the rest of your [inaudible 00:27:59] data, the kind of non-streaming data, into it and get that combination of joining it together.
And then applying kind of more analysis, analytic type capabilities on top and bringing in more sort of, kind of machine learning at that level. That's something we are sort of bringing into our platform. And I think from a future perspective... So this is about not only trying to draw out the insights that kind of live in the data that may be interesting. And one of the key things about the automation process that we talk about, I'm talking more in the analytics world now, but we believe humans are an important element. There should always be ultimately a human making a decision. Particularly, I talk about AI. One example I like to bring is I definitely like the idea of AI technology helping doctors to make a decision because it's a whole load of data they've got to sort through. But I don't want a machine to tell me that I kind of need to see a doctor or what the problem is. I want the professional to tell me that.
So that kind of equates to how we approach things. So we definitely have AI in drawing out insights in the data to then sort of inform the action, and then using that to drive kind of automation processes as well from a future perspective could be interesting, but I'd like to see us continue trying to explore how we can apply that kind of AI machine learning capability down into the data integration side of things and use it to... Maybe it's like a data quality side of things. Maybe it's improving the data flow, those kinds of things.
I'm sure there are some interesting avenues to poke around with and play there. So that could be quite an interesting concept, but we're certainly using, from a today perspective, the ability to draw out information insights [inaudible 00:30:00] from data and in a more sort of, kind of prevalent way. It's always like looking for a needle in a haystack to get the data out of the analytics. And this is more literally placing the needle in your hand for those important critical insights.
So yeah, I might part just a little bit from the question of where CDC may be challenged and evolve over time. I'm not sure I have an answer on that, in that the databases that are our prime bread and butter aren't changing that rapidly, right. There aren't new ones coming on the market, on the source side where a business is running itself on an OLTP database. But they do come and things are evolving from that perspective, but where I see challenge is in streaming more broadly. As we start thinking about the fact that we're bringing together this data that our solutions are capturing from databases. Let's say that SAP data we talked about, some data from other organizational databases that we might be wanting to bring together, but then you have all this other data that may be coming into play.
I mentioned IoT earlier, right? So there's data out on the edge that may be coming in that also might impact the decisions we need to make. And in that frame of reference, then it's how we bring all that data together in a way and efficiently provide the ability to do that analysis that Adam was talking about. It's one thing to say, "I've got all this data from SAP and I'm going to move it and it's going to go into Confluent, into Kafka and that's going to pop out the other end and somebody's going to do some analytics on it." We've got a single pipeline that we're talking about. Where I see the challenges, as things are evolving, is when these pipelines become more complicated: we've got the data coming from SAP, but we've also got the data coming from, again, out on the edge somewhere.
And that stuff needs to merge together and then be easily combined in a way that we're able to derive knowledge from it, to derive action from it. And I think that that problem is one that, as things continue to evolve, is going to be one that's going to be more in front of things. Crafting an architecture, crafting a solution that isn't just put together with baling wire, cobbled together in some fashion or other, but actually has a way that all of this data comes together in a cohesive way that can easily then be exploited. That's the challenge that I see.
My guests today have been Adam Mayer and John Neal. Adam and John, thanks for being a part of Streaming Audio.
Thank you. It's been a pleasure.
Happy to be here. It was fun.
And there you have it. Thanks for listening to this episode. Now, some important details before you go. Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io, a website dedicated to helping you learn Kafka, Confluent, and everything in the broader event streaming ecosystem. We've got free video courses, a library of event-driven architecture design patterns, executable tutorials covering ksqlDB, Kafka Streams, and core Kafka APIs. There's even an index of episodes of this podcast. So if you take a course on Confluent Developer, you'll have the chance to use Confluent Cloud. When you sign up, use the code PODCAST100 to get an extra hundred dollars of free Confluent Cloud usage.
Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me at TL Berglund on Twitter. That's T-L B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video if you're watching and not just listening or reach out in our community Slack or forum. Both are linked in the show notes. And while you're at it, please subscribe to our YouTube channel, and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcast, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support, and we'll see you next time.
Getting data from a database management system (DBMS) into Apache Kafka® in real time is a subject of ongoing innovation. John Neal (Principal Solution Architect, Qlik) and Adam Mayer (Senior Technical Product Marketing Manager, Qlik) explain how leveraging change data capture (CDC) for data ingestion into Kafka enables real-time data-driven insights.
It can be challenging to ingest data in real time. It is even more challenging when you have multiple data sources, including both traditional databases and mainframes, such as SAP and Oracle. Extracting data in batch for transfer and replication purposes is slow, and often incurs significant performance penalties. However, analytical queries are often even more resource intensive and are prohibitively expensive to run on production transactional databases. CDC enables the capture of source operations as a sequence of incrementing events, converting the data into events to be written to Kafka.
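As a rough illustration of what "converting the data into events" can look like, the sketch below turns one captured change into a key/value pair of JSON bytes, the shape in which a Kafka producer would typically send it. The envelope fields (`op`, `seq`, `before`, `after`) and the `to_event` helper are assumptions made for this example; they are not Qlik's or Confluent's actual message format, and no Kafka client is used here.

```python
import json


def to_event(op, table, row_id, before, after, seq):
    """Build a (key, value) pair of JSON bytes for one captured change.

    Hypothetical envelope: the key identifies the source row, the value
    carries the operation type, a sequence number, and row images.
    """
    key = json.dumps({"table": table, "id": row_id}, sort_keys=True)
    value = json.dumps(
        {"op": op, "seq": seq, "before": before, "after": after},
        sort_keys=True,
    )
    return key.encode("utf-8"), value.encode("utf-8")


# One source row moving through its lifecycle becomes three events.
events = [
    to_event("INSERT", "orders", 1, None, {"status": "NEW"}, 1),
    to_event("UPDATE", "orders", 1, {"status": "NEW"}, {"status": "SHIPPED"}, 2),
    to_event("DELETE", "orders", 1, {"status": "SHIPPED"}, None, 3),
]

for key, value in events:
    print(key.decode("utf-8"), value.decode("utf-8"))
```

Keying each event by the source row's identity is a common design choice: with Kafka's default partitioning, records that share a key go to the same partition, so changes to a given row are read back in the order they were written.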
Once this data is available in the Kafka topics, it can be used for both analytical and operational use cases. Data can be consumed and modeled for analytics by individual groups across your organization. Meanwhile, the same Kafka topics can be used to help power microservice applications and help ensure data governance without impacting your production data source. Kafka makes it easy to integrate your CDC data into your data warehouses, data lake, NoSQL database, microservices, and any other system.
Adam and John highlight a few use cases where they see real-time Kafka data ingestion, processing, and analytics moving the needle—including real-time customer predictions, supply chain optimizations, and operational reporting. Finally, Adam and John cap it off with a discussion on how capturing and tracking data changes are critical for your machine learning model to enrich data quality.
If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.