October 20, 2022 | Episode 239

Build a Real Time AI Data Platform with Apache Kafka


Kris Jenkins: (00:00)

Hello. You're listening to Streaming Audio, and today we're talking to CTO and Apache Kafka tool builder Ralph Debusmann, who's running the technical side of a company called Forecasty, who are building out an artificial intelligence platform for predicting commodity prices. What's the price of copper going to be three months hence? Not an easy task, and not an easy job for him either, as I was reminded while we were talking: when you're the CTO of a startup, you end up doing a little bit of everything, absolutely all the different nooks and crannies you can get into. And so that's the way this podcast went. We ended up talking about a little bit of everything. How do you migrate from batch machine learning on a group of flat files to a more real time streaming approach that's going to work better when you're dealing with the real time nature of financial markets?

Kris Jenkins: (00:57)

More importantly, perhaps, how do you make that kind of migration as painless as possible for all the data scientists you need on your team? How do you bring them along gently? What tools do you build? What parts of the developer experience are missing? And if they're large ones, how do you fill that gap? Not easy questions, I think you'll agree, but Ralph has some really thoughtful answers. Before we get into it, Streaming Audio is brought to you by our free education site, Confluent Developer. More about that at the end. But for now, I'm your host, Kris Jenkins. This is Streaming Audio. Let's get into it. My guest today is Ralph Debusmann. Ralph, welcome to the show.

Ralph Debusmann: (01:42)

Thank you.

Kris Jenkins: (01:43)

You're the CTO of Forecasty.ai, and I like that URL because I can begin to guess what it is you do, Forecasty.ai, but maybe you should give us the summary.

Ralph Debusmann: (01:56)

Sure. So, I mean, what we do is actually building a forecasting platform focused now on commodities. So we are actually building it for predicting commodity prices into the future so you can actually decide on what to do with your commodities. For example, if you're a manufacturer and you procure commodities, or if you're a trader or you do hedging, you can actually see how the price will develop over the next few months or few days, you can change the frequency, and then decide what to do, to buy, to sell, to short, these kind of things.

Kris Jenkins: (02:38)

Okay.

Ralph Debusmann: (02:39)

Yeah, so that's basically it. It's a product called Commodity Desk, and we are focusing completely on this one now, even though there are two others on the web page, which are now frozen. That's what we do. And, actually, just to use the time here, we are fundraising at the moment, so if you are a VC whom we haven't yet contacted, which is improbable, but maybe, or some angel investor...

Kris Jenkins: (03:11)

You're going in with the shameless pitch early. Okay, fair enough. I respect that.

Ralph Debusmann: (03:13)

It's very, very shameless. But I have to.

Kris Jenkins: (03:17)

We all have to pay the bills.

Ralph Debusmann: (03:20)

But yeah, that's it. And, at Forecasty, basically what we have done since I started there at the beginning of this year is to make it more real time-y. Actually, the platform was very batch driven, and still is in large parts, but we have started to change it so that it becomes more Kafka-based, so to say.

Kris Jenkins: (03:50)

Yeah, this is why I wanted to get you onto the show, because the idea of doing artificial intelligence modeling in real time is really interesting.

Ralph Debusmann: (03:59)

It is.

Kris Jenkins: (03:59)

So take us into it. What kinds of AI techniques are you using to try and forecast commodity prices? Let's start there.

Ralph Debusmann: (04:09)

So, I mean, we have a bunch of models already. We first built a product called Business Desk, which was a no code platform for forecasting as a whole, so you could forecast anything, not just prices. You could do all kinds of forecasts. And for that platform we developed a large set of models which we can select from, to basically see what model matches or works best for the given commodity at hand. But these models, they are still batch driven. So basically they have to see the entire data set at once, and then they run and do the hyperparameter optimization, feature selection, all these kinds of things. And that also takes a lot of time, especially as you scale the platform. For example, we have 60 commodities on the platform now. We will have more than 200 very soon.

Ralph Debusmann: (05:08)

So when you want to do daily updates for all these commodities, daily predictions for them, and you rerun your models every day, it just gets very expensive computationally. So you have to pay a lot of money to Azure, whatever you use.

Kris Jenkins: (05:25)

Oh God, yeah.

Ralph Debusmann: (05:25)

Because it's just a lot of compute that you need. It's also not very environmentally friendly, actually.

Kris Jenkins: (05:32)

Yes. Because the naive implementation is: you get a massive data set, you run a training model on the whole thing as a big batch process overnight or over a few days, and you get a model out that will try and predict new prices, and you do that for 60 or 200 commodities.

Ralph Debusmann: (05:53)

Yep.

Kris Jenkins: (05:53)

But then that's the great thing about the financial markets: there's always new data being added, all the time, so you'd have to keep rerunning it. That's what you're saying?

Ralph Debusmann: (06:03)

Yes. And if you adopt this batch mode, it's very hard to do the training incrementally. That's what we are pushing now. I mean, we're trying to find more models which you can train incrementally and where it actually makes sense to do this. For example, there's a bunch of reinforcement learning models which basically lend themselves to being incrementally trained. We also added... I mean, that's also related to this oven behind me, I guess, because some of our forecasts... Let's go back: when the Russian invasion of Ukraine started, a lot of our forecasts were wrong. One example is nickel, which is what you need, for example, for building batteries, so Elon Musk is a good customer of that. The nickel price went up significantly when the crisis started at the end of February, beginning of March. And we just couldn't capture this spike with our models, because the models were not trained on anything like it. They hadn't seen any crisis like that in the last 20 years or something.

Kris Jenkins: (07:29)

I would be almost scared by artificial intelligence that could have predicted that coming.

Ralph Debusmann: (07:34)

Exactly. And because we wanted to be able to capture this in a better way, at least shortly before the crisis arrives, we set out to also build, on the side, a real time sentiment analysis module, which takes in data from Twitter, from LinkedIn, also from various news sources, also Reddit I think, where we combine this data, filter it, sort it, and use it for essentially improving the models. So we show a fear index for the customer as well, where the customer can see how much fear there is in the market surrounding the commodity in question.

Kris Jenkins: (08:24)

Oh crikey.

Ralph Debusmann: (08:26)

And that was actually the starting point for really bringing in Kafka and streaming into this startup. I mean, we can talk about this later because the interesting journey is also to take your data scientists by the hand and bring them on because they're actually not used to that. Most of them have heard about Kafka, for example, and stream processing, but they haven't really done it and they might be scared.

Kris Jenkins: (08:57)

Yeah, it's a different world from, I'm assuming, mostly relational databases and large flat CSVs.

Ralph Debusmann: (09:05)

Not even relational databases too much. Really, just files.

Kris Jenkins: (09:09)

Okay.

Ralph Debusmann: (09:09)

The local file system, more or less. CSV files, for example, which you convert into these Python data frames most often, and then you do your magic on the tables, essentially. So that's what they are doing. They sometimes have big problems grappling with the cloud as such. I mean, we have a cloud native platform, everything on Kubernetes, everything cloud native, open source stuff. So teaching the data scientists to embrace that approach is one thing, but the next step is actually to bring them to the streaming world. So that's what I did.

Kris Jenkins: (09:51)

How do you do that? I mean, what's the model you are moving towards from flat files into a Kafka stream processing world?

Ralph Debusmann: (10:00)

Yep. So there are multiple aspects. One aspect which I tried to cling to from the beginning was to really make it easy for them, to hide a lot of the complexity which you have with Kafka. We chose Kafka because it's the industry standard, and because I'd worked with it for the last seven years or so. So we didn't look at others like Apache Pulsar or Azure Event Hubs or anything. We wanted to have Kafka, but Kafka as such still has all these nitty gritty things. You have to understand compacted topics. If you want to go into Kafka Streams, it gets even more messy, and you can't possibly start with that when you teach people to adopt Kafka.

Kris Jenkins: (10:52)

Yeah, I wouldn't start that team-

Ralph Debusmann: (10:55)

So I made it as simple as possible.

Kris Jenkins: (10:55)

I wouldn't start that team on Kafka streams.

Ralph Debusmann: (10:57)

Yeah. Also, even the simple matter of partitions, even that is a low level thing which I try to hide. I basically told them, "First of all, just use single partition topics. Don't worry about partitions. Don't worry about the retention. Just set it to unlimited." Make it very easy for them, actually. "Don't think about compacted topics. Don't think about keys, just about values." So really trying to simplify it as much as possible for them. And then what we also did, at least at the beginning, is that we didn't want to pick up stream processing, not the stateful kind at least. So the idea was to use a real time database, one that is more powerful than the Druids of this world. Not a shameless plug, but... Yeah, we used Rockset. Rockset is actually a bit more powerful than Druid and Pinot at the moment, so you can do some joins and stuff, which you can't in those systems.
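
For readers following along, here is a minimal sketch of that "keep it simple" starting point, using confluent_kafka's AdminClient to create a single-partition topic with unlimited retention. The broker address and topic name are placeholders, not Forecasty's actual setup.

```python
# Minimal sketch of the "keep it simple" topic setup described above: one
# partition, unlimited retention, no compaction, no keys to think about.
# Broker address and topic name are placeholders, not Forecasty's setup.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "commodity-news",                  # hypothetical topic name
    num_partitions=1,                  # single partition: ordering is trivial
    replication_factor=1,
    config={"retention.ms": "-1"},     # unlimited retention
)

# create_topics() returns a dict of futures keyed by topic name.
for name, future in admin.create_topics([topic]).items():
    try:
        future.result()
        print(f"Created topic {name}")
    except Exception as err:
        print(f"Failed to create {name}: {err}")
```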

Ralph Debusmann: (12:10)

We chose that because the idea was to avoid doing too much stateful... or actually avoid stateful stream processing completely, to just have Kafka topics, simple consumer producer microservices, and then just push it or pull it all into this Rockset database without [inaudible 00:12:36].

Kris Jenkins: (12:35)

Presumably with an off the shelf connector?

Ralph Debusmann: (12:39)

Actually, Rockset allows you... It's even simpler. It has this built in native connector, so you can just tell it the topic to connect to, and then it continuously streams in the data. It's very simple, and that was the idea: to make Kafka as simple as possible, to avoid stream processing for the moment, and to go for the most powerful real time database we could get our hands on in terms of SQL functionality.

Kris Jenkins: (13:08)

So in that sense, you're using Kafka as the next step up from large flat files with more options for the future.

Ralph Debusmann: (13:18)

Exactly. We know at some point we would have to also venture into stateful stream processing with whatever software we can get. There's a huge market which has sprung up now. There's not just ksqlDB. There's also DeltaStream, there's Immerok, there are all these Flink based technologies springing up. Let's see what comes, but we wanted to keep it away from the data scientists at the beginning.

Kris Jenkins: (13:51)

So you're just teaching them, "Here's how to make a Python consumer and a Python producer," and just starting there?

Ralph Debusmann: (13:59)

Yes.

Kris Jenkins: (14:00)

Okay. How's that been going?

Ralph Debusmann: (14:03)

Pretty well. The pure Python consumer and producer stuff was very easy for them, so they did understand it. I told them about this caveat with consumer groups, so that if you want to consume again and wonder why it's not showing you the first elements again, you know it's because the consumer group has already committed the offset to a higher value. But yeah, it worked pretty well. But what we found out next was... I could have known that from the beginning, because I had similar problems at Bosch, where I was the Kafka evangelist for the entire company. What was missing was a nice user interface to Kafka, and that's still missing in a way. So this is where I'm coming to this KASH-with-a-K point.
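
For illustration, here is a minimal sketch of the kind of plain Python producer and consumer described here, including the consumer-group caveat Ralph mentions; the broker address, topic, and group id are placeholders.

```python
# Minimal sketch of a plain confluent_kafka producer and consumer, plus the
# consumer-group caveat mentioned above: once the group has committed offsets,
# re-running with the same group.id resumes there instead of starting over.
# Broker address, topic, and group id are placeholders.
from confluent_kafka import Consumer, Producer

BOOTSTRAP = "localhost:9092"
TOPIC = "commodity-news"  # hypothetical topic

# Produce a few messages: values only, no keys, as suggested above.
producer = Producer({"bootstrap.servers": BOOTSTRAP})
for line in ("nickel price update", "copper price update"):
    producer.produce(TOPIC, value=line.encode("utf-8"))
producer.flush()

# Consume them. auto.offset.reset only applies while the group has no
# committed offset yet; it does not rewind an existing group.
consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "data-science-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
try:
    for _ in range(10):                 # bounded loop for the example
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(msg.value().decode("utf-8"))
finally:
    consumer.close()
```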

Kris Jenkins: (14:57)

Yeah, because you've got this tool, KASH, this Kafka shell thing.

Ralph Debusmann: (15:02)

Exactly. Yeah. That's why I built that one. You have some nice Kafka UIs now. There's Kowl, which has now been acquired by Redpanda, I think. You have... What is it called? Of course you have the Confluent UIs, which are pretty good. You have Conduktor as well. There's a bunch of new UIs around, also an open source one simply called Kafka UI, I think. But what's actually still missing for me is a nice shell. You have all these Apache Kafka commands which you get with the Apache Kafka or Confluent distributions, like kafka-topics or kafka-console-consumer.

Kris Jenkins: (15:48)

Yeah, that whole back to shell scripts. Yeah.

Ralph Debusmann: (15:50)

All these shell scripts built.

Kris Jenkins: (15:51)

And there's kcat, which used to be kafkacat.

Ralph Debusmann: (15:56)

And there's kcat. Yeah, okay. Exactly. Which is I think by the same guy who did the C library.

Kris Jenkins: (16:01)

I believe so.

Ralph Debusmann: (16:03)

And also the Python bindings. That's also pretty nice, but kcat is really more of a cat version. It can show you... It can also be used for producing, but it's... I like it, but it's not complete. What I wanted was to build a shell based on a REPL, so based on a programming language's interpreter in a way, and I did that already with a program called Stream Punk, which I showed at Kafka Summit in London, actually, in a small session just before the party, so you might have missed it.

Kris Jenkins: (16:47)

Where's the toughest slot in a conference? Just before the party.

Ralph Debusmann: (16:49)

Yeah, the best slot, actually. But the party was nice. Anyway, the thing is that this was too ambitious a thing. What I did there with Stream Punk was I built a shell on GraalVM. GraalVM is a polyglot JVM from Oracle, basically, which allows you to run Python, JavaScript, a bunch of languages, Ruby, inside of the JVM, and also R, actually. Nice language for forecasting. And the idea of that tool was to build a polyglot shell, so it was a thin wrapper around the Java Kafka client library, and you could use it from all kinds of languages, from Python, from R. You could use different REPLs, different interpreters, on top of that, and then use that as your shell, in whatever programming language you like most, like JavaScript, Python, R, whatever. But it was too ambitious. I just didn't have the time, actually, to really build this out.

Ralph Debusmann: (18:03)

The next step now, being in a data science environment, was to come up with the same thing, actually, a thin wrapper around a basic Kafka library. And I just built a wrapper around the Python library for Kafka. It's called confluent_kafka, so that's the-

Kris Jenkins: (18:22)

Yeah, I've used that a lot. Yeah.

Ralph Debusmann: (18:24)

Yeah. You know that. What I found is that it's of course very nice, but the confluent_kafka library is in some ways overcomplicated, I found. You can't easily use it as a shell. You can perfectly well use it in your code. That's fine. But if you want to use it in a shell, where you just want very simple steps, so you want to create a topic, list the topics, have a look at the size of the topics, see the watermarks, like the lowest offsets and highest offsets, all these kinds of things, or maybe upload a file where one line of text is one message, these kinds of easy things... for this, you would have to write some wrapper code.
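
To make the "wrapper code" point concrete, this is roughly what listing topics and reading watermarks looks like against confluent_kafka directly; it's a sketch with placeholder names, not kash.py itself.

```python
# Rough sketch of the wrapper code you end up writing for shell-like tasks
# (list topics, inspect watermarks) against confluent_kafka directly.
# Broker address and topic name are placeholders.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "inspection-only",      # required by the API even just to inspect
})

# List topics and partition counts from cluster metadata.
metadata = consumer.list_topics(timeout=10)
for name, topic_meta in sorted(metadata.topics.items()):
    print(name, len(topic_meta.partitions), "partition(s)")

# Watermarks (lowest/highest offsets) per partition of one topic.
topic = "commodity-news"                # hypothetical topic
for partition_id in metadata.topics[topic].partitions:
    low, high = consumer.get_watermark_offsets(TopicPartition(topic, partition_id))
    print(f"{topic}[{partition_id}]: low={low}, high={high}, approx size={high - low}")

consumer.close()
```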

Ralph Debusmann: (19:15)

I did that basically with kash.py. So kash.py is actually using the power of confluent_kafka, of this Python library, makes it a little bit more accessible so that you can actually use it as a shell. It's very easy. Commands are just... You just create a cluster object, basically, and then you can list the topics, the groups, ACLs, and do all kinds of things there just with a very minimalistic parameter set, so you don't have to worry about any objects. It's just extremely minimalistic, so that it's easy to use.
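
Purely as an illustration of that minimal surface, the usage looks roughly like the sketch below; the exact function and method names may differ from the real kash.py API, so check the project's README rather than treating this as documentation.

```python
# Purely illustrative sketch of the shell-like, minimal-parameter style
# described above. The real kash.py API may use different names and
# arguments; treat every call here as a placeholder, not documentation.
from kash import Cluster  # hypothetical import path

c = Cluster("local")                       # cluster config referenced by name
print(c.topics())                          # list topics
print(c.groups())                          # list consumer groups
print(c.acls())                            # list ACLs
c.create("commodity-news")                 # create a topic with minimal parameters
c.upload("nickel_tweets.txt", "commodity-news")  # one line of text = one message
```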

Kris Jenkins: (19:55)

In a way, it's really a user experience project. Rather than programming a library that gives you more power, it's about making that power more accessible.

Ralph Debusmann: (20:06)

Yeah, that's it, actually. So it kind of tries to make Kafka more accessible in that sense, in a way. Because you can then use that thin wrapper on top of the Kafka library to build scripts, which are then much nicer looking, because you don't have to worry about creating objects and all kinds of things. And the confluent_kafka library also requires you to dive deep into the resulting objects which you get back. kash.py is just a bit more streamlined, and that allows you to write scripts much quicker than before.

Kris Jenkins: (20:51)

This paints a picture of a very friendly CTO, if you're writing these UX projects for the data scientists.

Ralph Debusmann: (20:58)

In a startup you can-

Kris Jenkins: (21:00)

Yeah.

Ralph Debusmann: (21:02)

You couldn't do this in a large organization. It's a small-

Kris Jenkins: (21:07)

One of the joys of being a programmer at a startup is you get to do a bit of everything. You get to jump on what the real problem is.

Ralph Debusmann: (21:14)

Yeah, that's it. I wouldn't like to just be a CTO in the normal sense, where you are completely detached from the actual coding.

Kris Jenkins: (21:26)

Yeah, you're just making management decisions about coding.

Ralph Debusmann: (21:28)

Exactly. But I think you can't really lead a team, especially this sort of team, I found, when you can't connect to what they're actually doing. So I was actually having a great time with these guys. When I came in at the beginning of this year, they kind of saw, "Oh, we have a sparring partner, we have someone we can talk to who understands what we are doing." They really got great ideas, and they excelled, actually. And I was really happy to see how they would excel after being given someone to talk to, in a way.

Kris Jenkins: (22:14)

Yeah, and if you can give people in that position the right tools, they can do so much more.

Ralph Debusmann: (22:19)

Yeah, and if you could also... Oh, sorry.

Kris Jenkins: (22:24)

No, I was just wondering how many people we're talking about. How many data scientists, how many pure technical people in your team?

Ralph Debusmann: (22:31)

It's actually six technical people basically outsourced in India.

Kris Jenkins: (22:37)

Okay.

Ralph Debusmann: (22:38)

Another shameless plug, the last one. 12 are from Mphasis, and four are from [inaudible 00:22:45], which is a consultancy, really great. What they did is to really begin thinking on their own and having great ideas for how to improve our infrastructure. Extremely well. So they started thinking about GitOps, really advanced kinds of things.

Kris Jenkins: (23:09)

Oh cool. And how many data scientists are we talking on the AI side?

Ralph Debusmann: (23:13)

It's five. It's also not too many. It's a pretty small team.

Kris Jenkins: (23:22)

That feels like... My kids are in school, and I'm thinking about class sizes. How many teachers per student?

Ralph Debusmann: (23:30)

Yeah, yeah, yeah.

Kris Jenkins: (23:31)

It's a bit like they each have their own dedicated programmer. That's a very nice ratio for them.

Ralph Debusmann: (23:38)

Yeah, it's pretty nice, because usually you get these bigger teams.

Kris Jenkins: (23:46)

What kind of stuff has that enabled them to do?

Ralph Debusmann: (23:52)

One particular thing I built the kash.py tool for was also to deal with historical data. Because it's one thing to have consumer and producer things where you're just looking at the current data flow, but you also want to incorporate historical data. In our case, for example, we had all the Twitter news about nickel from the last seven years or something, and you want to upload it in one file, in one go, to Kafka basically, as the historical data. At first we ran a script on our local machine, a producer basically going through this large text file of 150 megabytes or whatever and sending each message from the local machine into the cloud Kafka cluster we had. I think we used Confluent Cloud at the time as well. And it took ages, and whenever your laptop was being used for something else, it took longer. Or maybe you would have to switch it off or something. It was just cumbersome to run it from the local machine.

Kris Jenkins: (25:08)

For 150 megabyte file?

Ralph Debusmann: (25:11)

Yeah, it took a lot of time.

Kris Jenkins: (25:11)

Or do you mean 150 gigabytes?

Ralph Debusmann: (25:14)

No, it was just megabytes.

Kris Jenkins: (25:14)

I wonder what's going wrong there.

Ralph Debusmann: (25:14)

It was pretty slow. Maybe we should also have had a better connection, I don't know, but it took some time. So the idea was to have this shell in the cloud. That was the original idea. So we built kash.py to be used in the cloud. Or KASH. kash.py is, I think, the name now, because KASH is already taken as a name by a Kotlin shell. It's a Kotlin shell, I've seen that at the top.

Kris Jenkins: (25:45)

Oh, okay. Naming things remains hard, right?

Ralph Debusmann: (25:51)

Yeah, so the idea was to kind of take a shortcut: bring the entire file of historical data to the Kubernetes cluster, then log in to the pod on the Kubernetes cluster and push it from there, so that you would have this shorter distance. Basically, you would have the file already in the cloud and it would be easier to send it over to Kafka. And we wrote this simple upload function in kash.py, which allows you to simply take a text file. Basically, the command is upload, file name, topic name, and then off you go.
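
What that upload command boils down to can be sketched with plain confluent_kafka: read a text file line by line and produce one message per line. The file name, topic, and broker address are placeholders.

```python
# Sketch of what an "upload file -> topic" helper boils down to: one line of
# text becomes one Kafka message. File, topic, and broker are placeholders.
from confluent_kafka import Producer

def upload(file_name: str, topic: str, bootstrap: str = "localhost:9092") -> int:
    producer = Producer({"bootstrap.servers": bootstrap})
    count = 0
    with open(file_name, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            producer.produce(topic, value=line.encode("utf-8"))
            count += 1
            if count % 1_000 == 0:
                producer.poll(0)   # serve delivery callbacks, keep the local queue from filling
    producer.flush()               # block until everything has been delivered
    return count

# e.g. upload("nickel_tweets.txt", "commodity-news")
```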

Kris Jenkins: (26:38)

That's the kind of thing that sounds easy, but if you're not comfortable with the programming of it, just having a one-liner to do it makes a difference.

Ralph Debusmann: (26:46)

Yeah.

Kris Jenkins: (26:47)

It's like back to developer experience-

Ralph Debusmann: (26:50)

Yeah, I was talking over you again, sorry. But the thing is, I think this tool is actually useful outside of Forecasty as well. It's on GitHub, also. Oh, I can put in another shameless plug: it's on GitHub as kash.py. You have to kind of cut me off. But it's already beginning to be very useful. It also includes ACLs. The only thing I need to add now is support for schemas like Avro and Protobuf, because there was an omission I made intentionally at the beginning: to not think about schemas, and to just use JSON payloads. Which is also dangerous, because if you don't do it right, then you can easily corrupt your topics, right? If you send one message with a bad payload, then...

Kris Jenkins: (27:53)

Yeah, you just make life harder for all the readers going forward.

Ralph Debusmann: (27:57)

That's one of the first things I would like to bring in, bring back in soon. Because without schemas, it's... Yeah.

Kris Jenkins: (28:06)

Yeah. I can see how... Stream Punk sounds like it was a bit too abstracted out, with bring-your-own programming language, and maybe KASH gets it right. There's always a balance to strike between flexibility and being specific for the job right now.

Ralph Debusmann: (28:22)

Yeah, yeah. It was also too much work to do, because you would have to build not just this wrapper around the Kafka Java client library, but you would also have to build wrappers and write documentation for all these bindings for the different programming languages, which is too much.

Kris Jenkins: (28:44)

Yeah. That's the kind of approach that takes a heck of a lot of work, because if you've got five programming languages, and four of them are excellently supported with excellent documentation, then you'll still get complaints about the fifth one all the time.

Ralph Debusmann: (28:59)

Exactly, yeah. This is why I kind of froze that for the moment. Yeah, that's the thing. It was cool. In the Kafka Summit presentation, I basically used Python and R, so I used both languages in one, so to say.

Kris Jenkins: (29:17)

In one script?

Ralph Debusmann: (29:18)

Yeah, because you can use GraalVM kind of for mixing and matching these languages. So whenever you have one language for which you have a library, which is not available for the other language, you can use that, which is-

Kris Jenkins: (29:30)

Pretty mad new tech.

Ralph Debusmann: (29:32)

It's nifty.

Kris Jenkins: (29:34)

It sounds really cool, and it also sounds a little bit Frankenstein.

Ralph Debusmann: (29:37)

Yeah.

Kris Jenkins: (29:40)

Stitching together different parts. How does that work? Out of interest, is that working because there are JVM flavored Rubies and Pythons out there? Or is it somehow actually running the real Python and the real Ruby?

Ralph Debusmann: (29:57)

No, it's JVM flavored. I think it started out with JavaScript, actually. So this was called Renew or something; I think that was where it started. But it's all basically reimplementations of these languages on the JVM, essentially.

Kris Jenkins: (30:14)

Okay.

Ralph Debusmann: (30:16)

Which also means... I mean, it's also not perfect, because the Python part, for example, is not complete. So I think you still can't use pip, the installation tool. For installing libraries, you have to build them using other commands, and only a subset of the existing libraries is really supported.

Kris Jenkins: (30:40)

More of a dialect than the same language.

Ralph Debusmann: (30:43)

Yeah. It's a bit more complete for R, I think. But I also found it pretty slow. So yeah, it's not complete, but it's a cool idea. And I thought building a Kafka shell around this would be cool, but it was kind of too much. Too Frankenstein to really get a lot of interest out of it.

Kris Jenkins: (31:06)

Yeah. Yeah. Too many options are sometimes a deadly thing.

Ralph Debusmann: (31:10)

Yeah.

Kris Jenkins: (31:10)

Yeah. So going back to the world of Forecasty and Kafka, at the moment you've got a kind of smart pipe to feed your analytics engine in Rockset.

Ralph Debusmann: (31:24)

Yep.

Kris Jenkins: (31:24)

But what's the future plan for that?

Ralph Debusmann: (31:28)

Well, we are actually doing a lot of other things now based on Kafka, thanks to this kind of head start which we have. We're also building a notification engine now, which is also, of course, based on microservices talking together with Kafka. So basically you get alerts about price drops, for example. When the price of a commodity goes down sharply, then you get a notification on your phone or by email. And this is all based on Kafka now. We also want to kind of attack the main forecasting, in a way that we also use Kafka there. That's currently not being done. So at the moment, there's still the batch engine running, and we are trying to make it more amenable to incremental updates, which you can also do without using streaming, and then add Kafka on top.
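
A toy sketch of that alerting idea: one small service consumes price updates and produces a notification event whenever a commodity drops by more than a threshold. The topic names, JSON payload shape, and threshold are assumptions for illustration only.

```python
# Toy sketch of a price-drop alerting microservice: consume price updates,
# emit a notification event when a commodity falls by more than a threshold.
# Topic names, JSON payload shape, and threshold are illustrative assumptions.
import json
from confluent_kafka import Consumer, Producer

DROP_THRESHOLD = 0.05                      # 5% drop triggers an alert (assumption)
last_price: dict[str, float] = {}

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "notification-engine",
    "auto.offset.reset": "latest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["commodity-prices"])   # hypothetical input topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())        # e.g. {"commodity": "nickel", "price": 23100.0}
    name, price = event["commodity"], float(event["price"])
    previous = last_price.get(name)
    if previous and (previous - price) / previous >= DROP_THRESHOLD:
        alert = {"commodity": name, "from": previous, "to": price}
        producer.produce("price-alerts", value=json.dumps(alert).encode("utf-8"))
        producer.poll(0)                   # serve delivery callbacks
    last_price[name] = price
```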

Kris Jenkins: (32:29)

Right. A kind of two phase approach to move it over.

Ralph Debusmann: (32:32)

Yeah. Because you have to always be incremental if you come from some existing piece of architecture.

Kris Jenkins: (32:39)

Because that's the trick with... To my knowledge, most AI models are built around the idea that we batch train at once. Do you end up using different algorithms to do the incremental stuff or is it just refining the process?

Ralph Debusmann: (32:55)

Yeah, partly. I mean, this is all time series AI, basically. For most of them, you can modify the algorithms in a way that you can just add new data and keep the model which you've developed up until this point in time intact. You can also drop the hyperparameter optimization and feature selection and just use what you have done before. So you can reuse your time series models up to a certain point and add more data. That's possible for some of the existing ones which we have. You just have to do it in the right way. But, as I said, there are also other models based on reinforcement learning, and there's also a Python library called River which you can use for more incremental time series machine learning. So it's different ways: different models, and reimplementing the same models which you already have.
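
As a rough illustration of that incremental style, here is a minimal sketch using the River library: the model is updated one observation at a time, so new market data can be folded in as it arrives, for example straight from a Kafka consumer loop. The features and model choice are assumptions, not Forecasty's actual pipeline.

```python
# Minimal sketch of incremental (online) learning with River: the model is
# updated one observation at a time instead of being retrained on the full
# history. Features and model choice are illustrative, not Forecasty's setup.
from river import linear_model, metrics, preprocessing

model = preprocessing.StandardScaler() | linear_model.LinearRegression()
mae = metrics.MAE()

def on_new_observation(features: dict, price: float) -> float:
    """Called for each new data point, e.g. from a Kafka consumer loop."""
    prediction = model.predict_one(features)   # predict before learning
    mae.update(price, prediction)
    model.learn_one(features, price)           # fold the new point into the model
    return prediction

# Example: two consecutive daily updates for a commodity (made-up numbers).
print(on_new_observation({"usd_index": 103.2, "fear_index": 0.7}, 23100.0))
print(on_new_observation({"usd_index": 104.0, "fear_index": 0.9}, 22100.0))
print("running MAE:", mae.get())
```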

Kris Jenkins: (34:04)

Yeah.

Ralph Debusmann: (34:05)

That's kind of it. Because we actually also want to go into day trading, we need to go to the incremental approach. Because day trading means you have to be really fast.

Kris Jenkins: (34:18)

Yeah, super fast.

Ralph Debusmann: (34:19)

It can be like two seconds. That's the next step. And it'll be based on Kafka and this whole setup.

Kris Jenkins: (34:28)

Yeah, I can see for your kind of business model, day traders being eager adopters of this. But you've got to react in minutes because fortunes are made and lost in minutes on that.

Ralph Debusmann: (34:39)

Exactly. We're not there yet, because it started out as a procurement solution, actually, where the latency requirements [inaudible 00:34:48].

Kris Jenkins: (34:48)

People who are looking three months ahead on the price of copper, that kind of thing.

Ralph Debusmann: (34:52)

Exactly. Yeah.

Kris Jenkins: (34:55)

So we all seem to be back again at the journey from a long batch process to up-to-the-minute real time processing.

Ralph Debusmann: (35:02)

Yeah.

Kris Jenkins: (35:04)

Well, good luck with the future, Ralph. It sounds like a fun journey.

Ralph Debusmann: (35:09)

It is. Yeah. Absolutely. It's really a lot of fun to kind of instill new ideas in people and then see how they excel. So kudos to my team.

Kris Jenkins: (35:23)

Yeah. Absolutely. A good note to end on. Ralph, thanks very much for joining us.

Ralph Debusmann: (35:27)

Thanks a lot also for allowing me to pull off these shameless plugs.

Kris Jenkins: (35:32)

Yes, yes. We'll be billing you later for those. You'll get in trouble in the end.

Ralph Debusmann: (35:35)

No worries. We'll be the biggest company in the world anyway at some point.

Kris Jenkins: (35:40)

You have to get one more in. I'm going to stop before you do another one. Thanks Ralph.

Ralph Debusmann: (35:44)

Thank you.

Kris Jenkins: (35:44)

Thank you Ralph. I will be fining him for putting so many shameless plugs in. That was a bit much. But on the other hand, I have some sympathy. When you're the CTO of a startup, you're kind of the midwife of a newborn technical baby, and being a midwife is sometimes messy. We know that. Much cleaner will be your experience of learning Kafka if you head to Confluent Developer, our free education site that has a wealth of courses covering everything from writing your first Python consumer to stateful stream processing with Kafka Streams. Check it out at developer.confluent.io. And if you have the knowledge already, but you need the cluster, then take a look at our cloud service at Confluent Cloud. You can sign up and have a Kafka cluster running reliably in minutes. And if you add the code PODCAST100 to your account, you'll get some extra free credit to run with. And with that, it just remains for me to thank Ralph Debusmann for joining us and you for listening. I've been your host, Kris Jenkins, and I will catch you next time.

Is it possible to build a real-time data platform without using stateful stream processing? Forecasty.ai is an artificial intelligence platform for forecasting commodity prices, imparting insights into the future valuations of raw materials for users. Nearly all AI models are batch-trained once, but precious commodities are linked to ever-fluctuating global financial markets, which require real-time insights. In this episode, Ralph Debusmann (CTO, Forecasty.ai) shares their journey of migrating from a batch machine learning platform to a real-time event streaming system with Apache Kafka® and delves into their approach to making the transition frictionless. 

Ralph explains that Forecasty.ai was initially built on top of batch processing; however, updating the models with batch data syncs was costly and environmentally taxing. There was also the question of scalability: progressing from the 60 commodities on offer today to their eventual plan of over 200. Ralph observed that most real-time systems are streaming-based data platforms with stateful stream processing, using Kafka Streams, Apache Flink®, or even Apache Samza. However, stateful stream processing requires resources, such as a team of stream processing specialists, to solve the task.

With the existing team, Ralph decided to build a real-time data platform without using any sort of stateful stream processing. They strictly keep to out-of-the-box components, such as Kafka topics, the Kafka Producer API, the Kafka Consumer API, and Kafka connectors, along with a real-time database to process data streams and implement the necessary joins inside the database.

Additionally, Ralph shares the tool he built to handle historical data, kash.py, a Kafka shell based on Python; discusses the issues the platform needed to overcome for success; and explains how to make the migration from batch processing to stream processing painless for the data science team.


