Hello. You're listening to Streaming Audio, and today we're talking to CTO and Apache Kafka tool builder Ralph Debusmann, who's running the technical side of a company called Forecasty, who are building out an artificial intelligence platform for predicting commodity prices. What's the price of copper going to be three months hence? Not an easy task, and not an easy job for him. I remembered, as we were talking, that when you're the CTO of a startup, you end up doing a little bit of everything, absolutely all the different nooks and crannies you can get into. And so that's the way this podcast went. We ended up talking about a little bit of everything. How do you migrate from batch machine learning on a group of flat files to a more real-time streaming approach that's going to work better when you're dealing with the real-time nature of financial markets?
More importantly, perhaps, how do you make that kind of migration as painless as possible for all the data scientists you need on your team? How do you bring them along gently? What tools do you build? What parts of the developer experience are missing? And if they're large ones, how do you fill that gap? Not easy questions, I think you'll agree, but Ralph has some really thoughtful answers. Before we get into it, Streaming Audio is brought to you by our free education site, Confluent Developer. More about that at the end. But for now, I'm your host, Kris Jenkins. This is Streaming Audio. Let's get into it. My guest today is Ralph Debusmann. Ralph, welcome to the show.
You're the CTO of Forecasty.ai, and I like that URL because I can begin to guess what it is you do, Forecasty.ai, but maybe you should give us the summary.
Sure. So, I mean, what we do is actually building a forecasting platform focused now on commodities. So we are actually building it for predicting commodity prices into the future so you can actually decide on what to do with your commodities. For example, if you're a manufacturer and you procure commodities, or if you're a trader or you do hedging, you can actually see how the price will develop over the next few months or few days, you can change the frequency, and then decide what to do, to buy, to sell, to short, these kind of things.
Yeah, so that's basically it. It's a product called Commodity Desk, and we are focusing completely on this one now, even though there are two others on the web page, which are now frozen. That's what we do. And actually, I mean, just to use the time here, we are fundraising at the moment, so if you are any VC whom we haven't yet contacted, which is improbable, but maybe, or some angel investor...
You're going in with the shameless pitch early. Okay, fair enough. I respect that.
It's very, very shameless. But I have to.
We all have to pay the bills.
But yeah, that's it. And, at Forecasty, basically what we have done now since I started there at the beginning of this year is to make it more real time-y. Actually, the platform was very batch driven, and still is in large parts, but we have started to change it, that it becomes more Kafka based, so to say.
Yeah, this is why I wanted to get you onto the show, because the idea of doing artificial intelligence modeling in real time is really interesting.
So take us into it. What kinds of AI techniques are you using to try and forecast commodity prices? Let's start there.
So, I mean, we have a bunch of models already. We first built a product called Business Desk, which was a no-code platform for forecasting as a whole, so you could forecast anything, not just prices. You could do all kinds of forecasts. And for that platform we've developed a large set of models which we can select from, to see what model matches or works best for the given commodity at hand. But, I mean, these models are still batch driven. So basically they have to see the entire data set at once, and then they run and do the hyperparameter optimization, feature selection, all these kinds of things. And that takes a lot of time, especially as you scale the platform. For example, we have 60 commodities on the platform now. We will have more than 200 very soon.
So when you want to do daily updates for all these commodities, daily predictions for them, and you rerun your models every day, it just gets very expensive computationally. So you have to pay a lot of money to Azure, whatever you use.
Oh God, yeah.
Because it's just a lot of compute that you need. It's also not very environmentally friendly, actually.
Yes. Because the naive implementation is: you get a massive data set, you run a training job on the whole thing as a big batch process overnight or over a few days, you get a model out that tries to predict new prices, and you do that for 60, soon 200, commodities.
But then that's the great thing about the financial markets: there's always new data being added all the time, so you'd have to keep rerunning it. That's what you're saying?
Yes. And if you adopt this batch mode, it's very hard to do the training incrementally. That's what we are pushing now. I mean, we're trying to find more models which you can train incrementally, and where it actually makes sense. For example, a bunch of reinforcement learning models, which basically lend themselves to being incrementally trained. We also added, I mean, that's also related to this oven behind me, I guess, because some of our forecasts, let's go back, when the Russian invasion of Ukraine started, a lot of our forecasts were wrong. One example is nickel, which is what you need, for example, for building batteries, so Elon Musk is a good customer of that. The nickel price went up significantly when the crisis started at the end of February, beginning of March. And we just couldn't capture this spike with our models, because the models weren't trained for it. They hadn't seen any crisis like that in the last 20 years or so.
I would be almost scared by artificial intelligence that could have predicted that coming.
Exactly. And because we wanted to be able to capture this in a better way, at least shortly before the crisis arrives, we actually set out, on the side, to build a real-time sentiment analysis module, which takes in data from Twitter, from LinkedIn, also from various news sources, also Reddit I think, where we combine this data, filter it, sort it, and use it for essentially improving the models. So we show a fear index to the customer as well, where the customer can see how much fear there is in the market surrounding the commodity in question.
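As a sketch of the kind of roll-up Ralph is describing: Forecasty's actual formula isn't public, so the function below, its name, and the simple averaging are all assumptions, just to show the shape of turning per-source sentiment scores into a single fear index.

```python
def fear_index(scores):
    """Roll per-source negative-sentiment fractions (0..1) up into a
    single 0-100 fear index by simple averaging.

    Hypothetical: Forecasty's real aggregation is not public."""
    if not scores:
        return 0.0
    return round(100 * sum(scores.values()) / len(scores), 1)

# Hypothetical per-source scores for one commodity, e.g. nickel
print(fear_index({"twitter": 0.8, "reddit": 0.6, "news": 0.7}))  # → 70.0
```

In a streaming setup, each source would feed a Kafka topic and a small consumer would recompute this index as new messages arrive.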
And that was actually the starting point for really bringing in Kafka and streaming into this startup. I mean, we can talk about this later because the interesting journey is also to take your data scientists by the hand and bring them on because they're actually not used to that. Most of them have heard about Kafka, for example, and stream processing, but they haven't really done it and they might be scared.
Yeah, it's a different world from, I'm assuming, mostly relational databases and large flat CSVs.
Not even relational databases too much. Really, just files.
The local file system, more or less. CSV files, for example, which you convert into these Python data frames most often, and then you do your magic on the tables, essentially. So that's what they are doing. They sometimes have big problems grappling with the cloud as such. I mean, we have a cloud native platform, everything on Kubernetes, everything cloud native, open source stuff. So teaching the data scientists to embrace that approach is one thing, but the next step is actually to bring them to the streaming world. So that's what I did.
How do you do that? I mean, what's the model you are moving towards from flat files into a Kafka stream processing world?
Yeah, so there's multiple aspects. One aspect which I tried to cling to from the beginning was to really make it easy for them, to hide a lot of the complexity which you have with Kafka. We chose Kafka because it's the industry standard, and because I worked with it for the last seven years or so. So we didn't look at others like Apache Pulsar or Azure Event Hubs or anything. We wanted to have Kafka, but Kafka as such still has all these nitty-gritty things. You have to understand compacted topics. If you want to go into Kafka Streams, it gets even more messy, and you can't possibly start with that when you teach people to adopt Kafka.
Yeah, I wouldn't start that team-
So I made it as simple as possible.
I wouldn't start that team on Kafka Streams.
Yeah. Also, even the simple thing of partitions, even that is a low-level thing which I try to hide. I basically told them, "First of all, just use single partition topics. Don't worry about partitions. Don't worry about the retention. Just set it to unlimited." Make it very easy for them, actually. "Don't think about compacted topics. Don't think about keys, just about values." So really trying to simplify it as much as possible for them. And then what we also did is, at least at the beginning, we didn't want to pick up stream processing, not the stateful kind at least. So the idea was to use a real-time database, which is more powerful than the Druids of this world. Not a shameless plug, but... Yeah, we used Rockset. Rockset is actually a bit more powerful at the moment than Druid and Pinot, so you can do some joins and stuff, which you can't in those systems.
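Those beginner-friendly defaults could be captured in a tiny helper. The function below is an illustration, not anything from Forecasty's codebase; the dict it returns is shaped like the arguments you would hand to confluent_kafka's admin `NewTopic`, but nothing here requires a broker.

```python
def simple_topic_config(name, replication_factor=3):
    """Topic settings matching the 'keep it simple' advice: one partition,
    unlimited retention, plain delete cleanup, no keys to think about.
    (Hypothetical helper; shaped like confluent_kafka NewTopic arguments.)"""
    return {
        "topic": name,
        "num_partitions": 1,                  # hide partitioning entirely
        "replication_factor": replication_factor,
        "config": {
            "retention.ms": "-1",             # unlimited retention
            "cleanup.policy": "delete",       # no compacted topics yet
        },
    }

cfg = simple_topic_config("nickel-news")
print(cfg["num_partitions"], cfg["config"]["retention.ms"])  # → 1 -1
```

With real infrastructure you would pass these values into `AdminClient.create_topics`; the point is that the data scientists never have to see the knobs.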
We chose that because the idea was to avoid doing too much stateful... or actually avoid stateful stream processing completely, to just have Kafka topics, simple consumer producer microservices, and then just push it or pull it all into this Rockset database without [inaudible 00:12:36].
Presumably with an off the shelf connector?
Actually, Rockset allows you... It's even simpler. It has this built-in native connector, so you can just tell it the topic to connect to, and then it continuously streams in the data. It's very simple, and that was the idea: to make Kafka as simple as possible, to avoid stream processing for the moment, and go for the most powerful real-time database we could get our hands on in terms of SQL functionality.
So in that sense, you're using Kafka as the next step up from large flat files with more options for the future.
Exactly. We know at some point we would have to also venture into stateful stream processing with whatever software we can get. There's a huge market which has sprung up now. There's not just ksqlDB. There's also DeltaStream. There's Immerok. There are all these Flink-based technologies now springing up. Let's see what comes, but we wanted to keep it away from the data scientists at the beginning.
So you're just teaching them, "Here's how to make a Python consumer and a Python producer," and just starting there?
Okay. How's that been going?
Pretty well. The pure Python consumer-producer stuff was very easy for them, so they did understand it. I told them about the caveat with consumer groups, so that if you want to consume again and you wonder why it's not showing you the first elements again, it's because the consumer group has already set the offset to a higher value. But yeah, it worked pretty well. But what we found out next was... I could have known that from the beginning, because I had similar problems at Bosch, where I was the Kafka evangelist for the entire company. What was missing was a nice user interface to Kafka, and that's still missing in a way. So I'm coming to this kash-with-a-K point.
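The consumer-group caveat Ralph mentions can be shown without a broker. The toy class below is a stand-in for a single-partition topic with per-group committed offsets; `MiniLog` and its methods are invented for illustration, but the behavior mirrors what a real Kafka consumer with a committed `group.id` does.

```python
class MiniLog:
    """In-memory stand-in for a single-partition topic plus committed
    consumer-group offsets, to illustrate the re-consume caveat."""

    def __init__(self):
        self.messages = []
        self.committed = {}  # group.id -> next offset to read

    def produce(self, value):
        self.messages.append(value)

    def consume(self, group_id):
        """Return everything after the group's committed offset, then
        commit, like auto-commit in a real consumer."""
        start = self.committed.get(group_id, 0)
        batch = self.messages[start:]
        self.committed[group_id] = len(self.messages)
        return batch

log = MiniLog()
log.produce("nickel: 21000")
log.produce("nickel: 48000")

assert log.consume("fear-index") == ["nickel: 21000", "nickel: 48000"]
# Same group again: nothing new — the "where did my data go?" moment
assert log.consume("fear-index") == []
# A fresh group id starts from the beginning again
assert log.consume("fear-index-v2") == ["nickel: 21000", "nickel: 48000"]
```

In real confluent_kafka code the equivalent levers are `group.id` and `auto.offset.reset` in the consumer config.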
Yeah, because you've got this tool, kash, your Kafka shell thing.
Exactly. Yeah. That's why I built that one. You have some nice Kafka UIs now. There's Kowl, which has now been acquired by Redpanda, I think. You have... What is it called? Of course you have the Confluent UIs, which are pretty good. You have Conduktor as well. There's a bunch of new UIs around, also an open source one simply called Kafka UI, I think. But what's actually still missing for me is a nice shell. You have all these Apache Kafka commands, which you get with the Apache Kafka or Confluent distributions, kafka-topics or kafka-console-consumer.
Yeah, that whole back to shell scripts. Yeah.
All these shell scripts, yeah.
And there's kcat, which used to be kafkacat.
And there's kcat. Yeah, okay. Exactly. Which is, I think, by the same guy who did the C library, librdkafka.
I believe so.
And also the Python bindings. That's also pretty nice, but kcat is really more of a cat version. It can show you... It can also be used for producing, but it's... I like it, but it's not complete. What I wanted was to build a shell based on a REPL, so based on a programming language interpreter in a way, and I did that already with a program called Stream Punk, which was... I showed that at Kafka Summit in London, actually, in a small session just before the party, so you might have missed it.
Where's the toughest slot in a conference? Just before the party.
The next step, now being in a data science environment, was to come up with the same thing, a thin wrapper around a basic Kafka library. And I just wrote a wrapper around the Python library for Kafka. It's called confluent_kafka, so that's the-
Yeah, I've used that a lot. Yeah.
Yeah. You know that. What I found is it's of course very nice, but the confluent_kafka library is in some ways overcomplicated. You can't easily use it as a shell. You can perfectly use it in your code. That's fine. But if you want to use it in a shell, where you just want to have very simple steps, so you want to create a topic, list the topics, have a look at the size of the topics, see the watermarks, like the lowest offsets, highest offsets, all these kinds of things, you maybe want to upload a file where one line of text is one message, and these kinds of easy things... for this, you would have to write some wrapper code.
I did that basically with kash.py. So kash.py is actually using the power of confluent_kafka, of this Python library, makes it a little bit more accessible so that you can actually use it as a shell. It's very easy. Commands are just... You just create a cluster object, basically, and then you can list the topics, the groups, ACLs, and do all kinds of things there just with a very minimalistic parameter set, so you don't have to worry about any objects. It's just extremely minimalistic, so that it's easy to use.
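The design idea, one cluster object with flat, minimal-parameter methods, can be sketched like this. This is a hypothetical stand-in, not kash.py's actual API: the class names and the `FakeBackend` stub are invented so the sketch runs without a broker.

```python
class MiniCluster:
    """Hypothetical thin wrapper in the spirit of kash.py: one object,
    flat methods, plain values in and out, no nested result objects."""

    def __init__(self, backend):
        # backend: anything exposing list_topics() and watermarks(topic);
        # in real life this would wrap confluent_kafka's clients
        self._backend = backend

    def topics(self):
        return sorted(self._backend.list_topics())

    def size(self, topic):
        low, high = self._backend.watermarks(topic)
        return high - low


class FakeBackend:
    """In-memory stub standing in for a real cluster connection."""

    def list_topics(self):
        return ["prices", "news"]

    def watermarks(self, topic):
        return (3, 10) if topic == "prices" else (0, 0)


c = MiniCluster(FakeBackend())
print(c.topics())        # → ['news', 'prices']
print(c.size("prices"))  # → 7
```

The point is the ergonomics: a data scientist types `c.topics()` instead of navigating AdminClient futures and metadata objects.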
In a way, it's really a user experience project. Rather than programming a library that gives you more power, it's about making that power more accessible.
Yeah, that's it, actually. In this sense, it really tries to make Kafka more accessible in a way. Because you can then use that thin wrapper on top of the Kafka library to build scripts, which are then much nicer looking, because you don't have to worry about creating objects and all kinds of things. The confluent_kafka library also requires you to dive deep into the result objects which you get back. kash.py is just a bit more streamlined, and that allows you to write scripts much quicker than before.
This paints a picture of a very friendly CTO, if you're writing these UX projects for the data scientists.
In a startup you can-
You couldn't do this in a large organization. It's a small-
One of the joys of being a programmer at a startup is you get to do a bit of everything. You get to jump on what the real problem is.
Yeah, that's it. I wouldn't like to just be a CTO in the normal sense, where you are completely detached from the actual coding.
Yeah, you're just making management decisions about coding.
Exactly. But I think you can't really lead a team... Especially this sort of team, I found. You can't really lead the teams when you can't connect to what they're actually doing. So I was actually having a great time with these guys. When I came in at the beginning of this year, they saw, "Oh, we have a sparring partner, someone we can talk to who understands what we are doing." They got really great ideas, and they did. They excelled, actually. And I was really happy to see how they would excel after being given someone to talk to.
Yeah, and if you can give people in that position the right tools, they can do so much more.
Yeah, and if you could also... Oh, sorry.
No, I was just wondering how many people we're talking about. How many data scientists, how many pure technical people in your team?
It's actually sixteen technical people, basically outsourced in India.
Another shameless plug, the last one. Twelve are from Mphasis, and four are from [inaudible 00:22:45], which is a consultancy, really great. What they actually did is really begin thinking on their own, and having great ideas for how to improve our infrastructure. Extremely well. So they started thinking about GitOps, really advanced kinds of things.
Oh cool. And how many data scientists are we talking on the AI side?
It's five. It's also not too many. It's a pretty small team.
That feels like... My kids are in school, and I'm thinking about class sizes. How many teachers per student?
Yeah, yeah, yeah.
It's a bit like they each have dedicated programmers. This is a very nice ratio for them.
Yeah, it's pretty nice, because usually you get these bigger teams.
What kind of stuff has that enabled them to do?
One particular thing I built the kash.py tool for was to deal with historical data. Because it's one thing to have consumer-producer things where you're just looking at the current data flow. But you also want to incorporate historical data. In our case, for example, we had all the Twitter news about nickel from the last seven years or something, and you want to upload them in one file, in one go, to Kafka, as the historical data. We first started to run a script on our local machine, a producer basically going through this large text file of 150 megabytes or whatever, and sending each message from the local machine into the cloud Kafka cluster we had. I think we used Confluent Cloud at the time as well. And it took ages, and whenever your laptop was used for something else, it took longer. Or maybe you would have to switch it off or something. It was just cumbersome to run it from the local machine.
For 150 megabyte file?
Yeah, it took a lot.
Or do you mean 150 gigabytes?
No, it was just megabytes.
I wonder what's going wrong there.
It was pretty slow. Maybe we could also have had a better connection, I don't know, but it took some time. So the idea was to have this shell in the cloud. That was the original idea. So we built kash to be used in the cloud. kash.py is, I think, the name now, because kash is already taken as a name by a Kotlin shell. It's a Kotlin shell, I've seen that at the top.
Oh, okay. Naming things remains hard, right?
Yeah, so the idea was to take a shortcut: bring the entire file of historical data to the Kubernetes cluster, then log into a pod on the Kubernetes cluster, and push it from there, so that you would have this shorter distance. Basically, you would have the file already in the cloud, and it would be easier to send it over to Kafka. And we wrote this simple upload function in kash.py, which allows you to simply take a text file. Basically, the command is upload, file name, topic name, and then off you go.
That's the kind of thing that sounds easy, but if you are not comfortable with the programming of it, just having a one-liner to do it helps.
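An upload helper like the one Ralph describes might look something like this sketch: one line of text becomes one message. The function name matches his description, but the signature and the pluggable `producer` argument are assumptions made so the code stands alone; a real version would construct a `confluent_kafka.Producer` internally.

```python
def upload(path, topic, producer):
    """One line of text -> one Kafka message, in the spirit of kash.py's
    upload(file, topic) command (exact signature is an assumption).
    `producer` is anything with produce(topic, value=...) and flush()."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:  # skip blank lines rather than sending empty messages
                producer.produce(topic, value=line.encode("utf-8"))
                count += 1
    producer.flush()  # block until everything has actually been delivered
    return count
```

Running this from a pod in the same cloud as the cluster is exactly the shortcut discussed above: the file is already close to the brokers, so the per-message round trips stop dominating.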
It's like back to developer experience-
I think that's this tool... Yeah, I was talking over you again, sorry. But the thing is, I think this tool is actually useful outside of Forecasty as well. It's on GitHub, also. I can put in another shameless plug: it's on GitHub, kash.py. You have to kind of cut me off. But it's already beginning to be very useful. It also includes ACLs. The only thing I need to add now is support for schemas like Avro and Protobuf. That was an omission I intentionally made at the beginning, to not think about schemas and just use JSON payloads. Which is dangerous on the one hand, because if you don't do it right, you can easily corrupt your topics, right? If you send one message with a bad payload, then...
Yeah, you just make life harder for all the readers going forward.
That's one of the first things I would like to bring in, bring back in soon. Because without schemas, it's... Yeah.
Yeah. I can see how... Stream Punk sounds like it was a bit too abstracted out. You can bring your own programming language. Maybe kash... There's always a balance to strike between flexibility and being specific for the job right now.
Yeah, yeah. It was also too much work to do, because you would have to build not just this wrapper around the Java library, the Kafka Java client library, but you would also have to build wrappers and write documentation for all these bindings for the different programming languages, which is too much.
Yeah. That's the kind of approach that takes a heck of a lot of work, because if you've got five programming languages, and four of them have excellent support and excellent documentation, then you'll still get complaints about the fifth one all the time.
Exactly, yeah. This is why I froze that for the moment. Yeah, that's the thing. It was cool. In the Kafka Summit presentation, I basically used Python and R. So I used both languages in one, so to say.
In one script?
Yeah, because you can use GraalVM kind of for mixing and matching these languages. So whenever you have one language for which you have a library, which is not available for the other language, you can use that, which is-
Pretty mad new tech.
It sounds really cool, and it also sounds a little bit Frankenstein.
Stitching together different parts. How does that work? Out of interest, is that working because there are JVM flavored Rubies and Pythons out there? Or is it somehow actually running the real Python and the real Ruby?
Which also means, I mean, it's not perfect either, because the Python part, for example, is not complete. So I think you still can't use pip, the installation tool. For installing libraries, you have to build them using other commands, and only a subset of the existing libraries is really supported.
More of a dialect than the same language.
Yeah. It's a bit more complete for R, I think. But it's also... I found it pretty slow. So yeah. It's not complete, but it's a cool idea. And I thought building Kafka tooling around this would be cool, but it was kind of too much. Too Frankenstein to really get a lot of interest out of it.
Yeah. Yeah. Too many options are sometimes a deadly thing.
Yeah. So going back to the world of Forecasty and Kafka, you've got, at the moment, a kind of smart pipe to feed your analytics engine in Rockset.
But what's the future plan for that?
Well, we are actually doing a lot of other things now based on Kafka, thanks to this head start which we have. We're also building a notification engine now, which is also, of course, based on microservices talking together with Kafka. So basically you get alerts about, for example, when there's a price drop. When the price of that commodity goes down a lot, then you get a notification on your phone or email. And this is all based on Kafka now. We also want to attack the main forecasting in a way that we also use Kafka there. That's currently not being done. So at the moment, there's still the batch engine running, and we are trying to make it more amenable to incremental updates, which you can also do without using streaming. And then adding Kafka on top.
Right. A kind of two phase approach to move it over.
Yeah. Because you have to always be incremental if you come from some existing piece of architecture.
Because that's the trick with... To my knowledge, most AI models are built around the idea that you batch-train all at once. Do you end up using different algorithms to do the incremental stuff, or is it just refining the process?
Yeah, partly. I mean, this is all time series AI, basically. For most of them, you can modify the algorithms in a way that you can just add new data and keep the model which you've developed up until this point in time intact. You can also drop the hyperparameter optimization or feature selection and just use what you have done before. So you can reuse your time series models up to a certain point and add more data. So that's possible for some of the existing ones which we have. You just have to do it in the right way. But as I said, there are also other models based on reinforcement learning, and there's also a Python library called River, which you can use for more incremental time series machine learning. So it's different ways: different models, and reimplementing the same models which you already have.
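The "keep the model, fold in new data" idea can be shown with a toy online forecaster. The `learn_one`/`predict_one` method names follow River's convention, but this exponentially weighted moving average is just an illustration of incremental training, not one of Forecasty's models or River's own implementations.

```python
class EwmaForecaster:
    """Toy incremental time-series model with a River-style
    learn_one/predict_one interface. Each observation updates a small
    piece of state; no retraining over the full history is needed."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha  # how strongly the latest point dominates
        self.level = None

    def learn_one(self, y):
        # Fold one new observation into the state, O(1) per update.
        if self.level is None:
            self.level = y
        else:
            self.level = self.alpha * y + (1 - self.alpha) * self.level
        return self

    def predict_one(self):
        return self.level


m = EwmaForecaster(alpha=0.5)
for price in [100.0, 110.0, 105.0]:  # e.g. daily nickel prices
    m.learn_one(price)
print(m.predict_one())  # → 105.0
```

Hooked to a Kafka consumer, `learn_one` would simply run once per incoming message, which is what makes this style of model a natural fit for streaming.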
That's kind of it. Because we actually also want to go into day trading, we need to go to the incremental approach. Because day trading means you have to be really fast.
Yeah, super fast.
It can be like two seconds. That's the next step. And it'll be based on Kafka and this approach.
Yeah, I can see for your kind of business model, day traders being eager adopters of this. But you've got to react in minutes because fortunes are made and lost in minutes on that.
Exactly. We're not there yet. Because it started out as a procurement solution, actually, where the latency requirements [inaudible 00:34:48].
People who are looking three months ahead on the price of copper, that kind of thing.
So we all seem to be back again at the journey from a long batch process up to the minute real time processing.
Well, good luck with the future, Ralph. It sounds like a fun journey.
It is. Yeah. Absolutely. It's really a lot of fun to kind of instill new ideas in people and then see how they excel. So kudos to my team.
Yeah. Absolutely. A good note to end on. Ralph, thanks very much for joining us.
Thanks a lot also for allowing me to pull off these shameless plugs.
Yes, yes. We will be billing you later for those. You'll get in trouble in the end.
No worries. We'll be the biggest company in the world anyway at some point.
You had to get one more in. I'm going to stop before you do another one. Thanks, Ralph.
Thank you Ralph. I will be fining him for putting so many shameless plugs in. That was a bit much. But on the other hand, I have some sympathy. When you're the CTO of a startup, you're kind of the midwife of a newborn technical baby, and being a midwife is sometimes messy. We know that. Much cleaner will be your experience of learning Kafka if you head to Confluent Developer, our free education site that has a wealth of courses covering everything from writing your first Python consumer to stateful stream processing with Kafka Streams. Check it out at developer.confluent.io. And if you have the knowledge already, but you need the cluster, then take a look at our cloud service at Confluent Cloud. You can sign up and have a Kafka cluster running reliably in minutes. And if you add the code PODCAST100 to your account, you'll get some extra free credit to run with. And with that, it just remains for me to thank Ralph Debusmann for joining us and you for listening. I've been your host, Kris Jenkins, and I will catch you next time.
Is it possible to build a real-time data platform without using stateful stream processing? Forecasty.ai is an artificial intelligence platform for forecasting commodity prices, imparting insights into the future valuations of raw materials for users. Nearly all AI models are batch-trained once, but precious commodities are linked to ever-fluctuating global financial markets, which require real-time insights. In this episode, Ralph Debusmann (CTO, Forecasty.ai) shares their journey of migrating from a batch machine learning platform to a real-time event streaming system with Apache Kafka® and delves into their approach to making the transition frictionless.
Ralph explains that Forecasty.ai was initially built on top of batch processing, however, updating the models with batch-data syncs was costly and environmentally taxing. There was also the question of scalability—progressing from 60 commodities on offer to their eventual plan of over 200 commodities. Ralph observed that most real-time systems are non-batch, streaming-based real-time data platforms with stateful stream processing, using Kafka Streams, Apache Flink®, or even Apache Samza. However, stateful stream processing involves resources, such as teams of stream processing specialists to solve the task.
With the existing team, Ralph decided to build a real-time data platform without using any sort of stateful stream processing. They strictly keep to the out-of-the-box components, such as Kafka topics, Kafka Producer API, Kafka Consumer API, and other Kafka connectors, along with a real-time database to process data streams and implement the necessary joins inside the database.
Additionally, Ralph shares the tool he built to handle historical data, kash.py—a Kafka shell based on Python; discusses issues the platform needed to overcome for success, and how they can make the migration from batch processing to stream processing painless for the data science team.
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.