January 6, 2021 | Episode 137

Event Streaming Trends and Predictions for 2021 ft. Gwen Shapira, Ben Stopford, and Michael Noll


Tim Berglund:

Before 2021 really gets rolling, we wanted to venture some predictions for how it would go in the world of event streaming. I figured everybody's 2020 predictions in general were so on target that Gwen Shapira, Ben Stopford, Michael Noll, and I should try our hand at prognostication and punditry, too. Hopefully, this won't be too funny or too tragic a year hence. Anyway, your first step is to listen to today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the Cloud.

Tim Berglund:

Hello and welcome to another episode of Streaming Audio. I am, as usual, your host Tim Berglund, and we're coming up on the end of the year. It is, as I record this, Monday, November 16th. It's going to be rather later than that when this episode airs, but it's within striking distance. I am looking longingly at the boxes of Christmas decorations, some of which, if I just crane my neck a little bit, I can actually see from where I'm seated in my office, and that's exciting. US Thanksgiving is next week, I think. Yeah, this is an exciting time.

Tim Berglund:

But given that we're coming up to the end of the year, I wanted to talk about some predictions for next year. Everybody made predictions for this year, and they were all horribly, horribly wrong. So to do better this year, I wanted to invite three colleagues to the show, Ben Stopford, Michael Noll, and Gwen Shapira. Ben, Michael, Gwen, welcome to the show.

Ben Stopford:

Thanks, Tim.

Gwen Shapira:

It's great being here.

Michael Noll:

Hi, everyone. Thank you, Tim.

Tim Berglund:

You bet. Now, if you're very new to the podcast, Ben and Michael work in the office of the CTO here at Confluent, and if you don't know Gwen, you should Google Gwen. She has, I would say, some footprint in the Kafka community. This is the Gwen Shapira, maybe you heard of her, kind of thing. But she works as an engineering manager running a team of developers related to Confluent Cloud. Gwen, is that where it is now?

Gwen Shapira:

Yeah, certainly.

Tim Berglund:

Okay. Anyway, we're going to talk about predictions for next year, and I just can't think of three people in the Kafka world I would rather talk to about this. This is really your time. They can be Kafka-related predictions; in fact, I think I would be gratified if we generally focused on Kafka-related predictions, but they don't have to be just that. The broader software development, software architecture world, are we still going to be locked in our homes in a year? I'll entertain predictions on that, not that any of us is an expert on that question. But what have you got? Gwen, I'm going to start with you.

Gwen Shapira:

I mean, this is going to be the biggest year in Kafka history in ages, right? It's just so exciting with all the big KIPs coming up, and Kafka is basically getting a whole new architecture, right?

Tim Berglund:

It really is. It was, I think, August of 2018. We'll have to link it, but the first time we talked about KIP-500 on this podcast.

Gwen Shapira:

Oh, wow. It's been a while. [crosstalk 00:03:24]

Tim Berglund:

It was the end of '19.

Gwen Shapira:

Okay. So we haven't been that slow about it.

Tim Berglund:

No, no. That was complete [inaudible 00:03:24]. There was not a KIP-500 in 2018. It was August of '19 that I believe Jason and Colin came on and talked about it.

Gwen Shapira:

I want to predict that by the end of 2021, and probably way before the end, I will be able to run a Kafka cluster with 10 million partitions on it.

Ben Stopford:

I'm going to predict that by the end of 2020, I can run a Kafka cluster with 2 million partitions in it. Maybe not 10. I don't know if we can do 10, but definitely should have two by the end of the year. And it will be a fairly shaky cluster.

Tim Berglund:

But it'll be there.

Gwen Shapira:

[crosstalk 00:04:06] I meant in production. Not shaky clusters.

Ben Stopford:

Well, yeah, it's going to get [inaudible 00:04:12], isn't it, I think, to start with, where it is. But yes, I think that's a great prediction, Gwen. And I think that kind of unlocks some of the restrictions that people sometimes hit from having too many partitions. It reminds me, actually, of another project we have, which is just coming out. I'm just going to drop this in here, slightly unrelated, but one of our colleagues wrote a parallel consumer, and that also helped [crosstalk 00:04:50].

Tim Berglund:

Yeah.

Ben Stopford:

It's kind of interesting. We should get him on the podcast, Antony Stubbs, to talk about this. It's quite an interesting piece of code. It's relatively simple; it allows you to process consumed data in parallel. But the really nice thing it does is it changes the level of parallelism in the consumer from being a partition down to being, let's say, a key. There are other modes, but probably the most interesting one takes you down to keys. You have key-based ordering, so you can actually have more concurrency than you have partitions. And I think it's the combination of these things; we're kind of eroding away some of these issues that we've seen over the years with having to think ahead of time about how many partitions you need in Kafka. So that's pretty exciting, too.
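To make the key-level parallelism Ben describes a bit more concrete, here is a rough sketch built on a plain Kafka consumer. It is illustrative only and is not the API of the parallel consumer library he mentions; the topic name, worker count, and the offset-handling shortcut are all assumptions made up for the example.

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.*;
import java.util.concurrent.*;

// Illustrative only: dispatch records to one single-threaded executor per key hash,
// so ordering is preserved per key while different keys are processed concurrently.
public class KeyOrderedConsumer {

    private static final int WORKERS = 16;
    private final List<ExecutorService> workers = new ArrayList<>();

    public KeyOrderedConsumer() {
        for (int i = 0; i < WORKERS; i++) {
            workers.add(Executors.newSingleThreadExecutor());
        }
    }

    // Records with the same key always land on the same single-threaded executor,
    // which serializes them; unrelated keys run in parallel on other executors.
    private void dispatch(ConsumerRecord<String, String> record) {
        int worker = Math.floorMod(Objects.hashCode(record.key()), WORKERS);
        workers.get(worker).submit(() -> process(record));
    }

    private void process(ConsumerRecord<String, String> record) {
        System.out.printf("key=%s value=%s%n", record.key(), record.value());
    }

    public void run() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "key-ordered-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders")); // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    dispatch(record);
                }
                // A real implementation would only commit offsets once the in-flight
                // work for those offsets has completed; omitted here for brevity.
            }
        }
    }

    public static void main(String[] args) {
        new KeyOrderedConsumer().run();
    }
}
```

The hard part, which the actual parallel consumer library handles and this sketch deliberately glosses over, is tracking and committing offsets for work that completes out of order.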

Gwen Shapira:

Yeah. So there's exactly, I think, a Scala client that already does a different level of parallelism than the standard consumers, beyond the number of partitions.

Tim Berglund:

Yes. [crosstalk 00:05:56] What's that library called? It's the one that Wix developed and open sourced.

Gwen Shapira:

Yeah, exactly. Greyhound. It was like a totally non-Kafka-esque name.

Tim Berglund:

Yeah, yeah, yeah. [crosstalk 00:06:07] And we've talked about that in the last three months on this podcast. Folks, it's going to be in the show notes. As per the usual, when your host struggles to remember things that he's talked about only weeks in the past, they show up in the show notes later. So we'll get there, and with an apology.

Gwen Shapira:

Good thing that someone on the production team has a good memory.

Tim Berglund:

Well, I actually do end up just going through the archives and searching. It's just I can't do that in real time. I always wonder. I've talked about this before, but one of my favorite podcasts is EconTalk, and I was just listening to a recent episode this morning. And he's been doing this like 15 years, once a week, right? And I don't know if it's a production thing. It seems like he just has this encyclopedic memory of every person he's ever talked to. Something will come up spontaneously, and he'll be like, "Oh, yeah, these two people have been on the show. Actually, she's been on like three times." I'm like, "How do you... " I can't remember what I had for breakfast. Anyway. It's impressive. Michael, what do you think? What do you see coming up? And we're going to drill into all these things. I just kind of want to get stuff on the table.

Michael Noll:

I would predict that at least one of the two prior predictions by Ben and Gwen will be wrong. [inaudible 00:07:18]

Gwen Shapira:

It's such a low level of trust here.

Ben Stopford:

We weren't that specific about the actual numbers. If it's like two million and one partitions, we're still claiming victory.

Tim Berglund:

But Michael, ground rules for our predictions. This is something from Bob Metcalfe, writing in the '90s. I used to read a column of his, I think in InfoWorld. He made some extremely specific prediction about when the dot-com bubble was going to burst. It was a day, like in April or something, and he said it's better to be wrong than vague. So, in the spirit that it is better to be wrong than vague, try again.

Michael Noll:

Well, first I would say, I think even next year, like this year, Kafka will continue to be used for COVID tracking. But hopefully, next year, the urgency for such tracking will lessen very quickly. We are not the experts in this, so I will not make a prediction on that front, but I do remember that Kafka has been used for COVID tracking, which I found quite interesting.

Tim Berglund:

Yes.

Michael Noll:

But to your point, I would say we will see more and more usage of event streaming in general, and I don't just mean more Kafka, but more event streaming and more event streaming functionality. And what do I mean by that? What I'm seeing at the moment is that more and more other systems, adjacent technologies, are adding their own event streaming-like functionality, so becoming a bit more Kafka-like. Examples are traditional database systems that are working on streaming SQL, and traditional messaging systems that are adding streaming data structures or stream queues to their functionality. We've already seen that Redis has a stream data structure. I think this is also great validation for what Kafka pioneered, this concept of event streaming. And I think we'll see more and more of this next year, not just from Kafka but also from people, let's say, at the perimeter of Kafka, if that makes sense.

Gwen Shapira:

So Michael, will Kafka also become more database-like, if databases become more Kafka-like?

Michael Noll:

Possibly. Yeah, possibly. There is already ksqlDB, right? Which is essentially the idea that we bring together the two related worlds of event streaming and relational databases. If I should continue on this front, I can maybe just say, I'm Confluent's representative in the US national body that works on the SQL standard, so the database standard, and over the past few months we have been collaborating with other companies, I will forget several of them, I'm sure, like Oracle, IBM, Microsoft, Amazon, Google, et cetera, but also newer companies like Snowflake, on proposing a streaming SQL standard. Or to be technically more precise, a streaming SQL extension to the SQL standard, just like at the moment there is a collaboration on adding [inaudible 00:10:34] to the SQL standard. I think this is an example of other technologies realizing the value, and maybe also seeing the customer demand or user demand, for these event streaming features.

Tim Berglund:

The thing that strikes me in that is that somehow, Michael, I didn't know you were the representative on that body. That seems like the kind of thing that I would organically just have heard of, and I'm delighted to learn that here in front of some hundreds or thousands of listeners. That's excellent, and I think that's a good perspective.

Gwen Shapira:

Tim, what's your prediction on the podcast listenership for the end of 2021?

Tim Berglund:

Oh, okay. I'm not going to give you ... I'm going to violate my own rule, better be wrong than vague. No, no, I'm going to try to come up with something very specific. Podcast listening in general has taken a hit in the pandemic because it's a commute activity for a lot of people. So it's not down, but the growth rate is ... well, it was down initially, but the growth rate is not what we would like it to be. So I think we'll come back up to kind of like a 20-30 percent year-over-year growth rate, annualized, by the fourth quarter of next year, assuming a vaccine works out the way people are talking about the preliminary results, and the immunological details of the virus are like other viruses of its kind and not like popular media articles are representing it to be. That is, this is a thing that we can actually fix, and we won't be foraging for nuts and berries in Q4 of next year. So I think we'll be back to like a 25 percent annualized growth rate in podcast listening next year.

Gwen Shapira:

Nice.

Michael Noll:

In other words, Tim, you just said, I'm paraphrasing, of course, that you're looking forward to an effective vaccine because then the podcast audience will increase.

Tim Berglund:

Yeah. I feel like cause and effect, that linear cause and effect, is maybe a little oversimplified. But that is-

Ben Stopford:

One question. I actually listened to the podcast significantly more this year than I did previously.

Tim Berglund:

Okay. All right. Good. I thank you. You're helping.

Ben Stopford:

[inaudible 00:12:53] that I worked out, for some reason this helps. I can say, "Alexa, play podcast, Streaming Audio, from Confluent." And it works. Which is awesome.

Tim Berglund:

I need to set that skill up. Also, to everyone who's listening not on headphones, we're sorry that we just activated your Amazon Home Assistant, and we'll have a link in the show notes about how to set up podcasts on that device.

Michael Noll:

It's funny that you mention this, Tim, because I can deactivate the lighting in Ben's living room when we're having a meeting, because he has Alexa in the living room.

Tim Berglund:

I'm usually wearing headphones, so don't bother trying, but you could say "work lights," which is the name of the thing that would turn off all my nice office lighting. Anyway, Ben, have I gotten a prediction from you? I think you responded to Gwen, but what are you seeing?

Ben Stopford:

Yeah, I didn't. So I think probably a few things. Obviously, as Gwen said, there's some really awesome stuff coming out around ZooKeeper removal and tiered storage. If I was going to go for a hard prediction on [inaudible 00:14:05], to sort of complement Gwen's, it would be that, yeah, you can double the size of a Kafka cluster in a matter of a few seconds and have all that data [crosstalk 00:14:20]

Gwen Shapira:

I'll definitely drink to that.

Ben Stopford:

For me, tiered storage is wonderful because it gives you this lovely price efficiency of using S3, but the real killer feature from my perspective is the auto-scaling, or at least the ability to scale quickly by having this two-tier architecture. [inaudible 00:14:42] supervising, I'm really looking forward to that.

Michael Noll:

Ben, maybe for the viewers, it's a bit similar to what Tim said before, but in the opposite way: I think you need to connect the dots for the viewers who are just listening in on how you think we will get there.

Tim Berglund:

Yes. Why is tiered storage associated with that kind of scalability?

Ben Stopford:

Yeah, sure. So basically, when you add tiered storage to Kafka, it moves it to a two-tier architecture, so the front tier is holding the most recent data, the active segments, and then the backing tier is holding the majority of the log. What that really means is, if you want to expand the size of a cluster, you can do that today. You can just add new nodes, but those don't have any of this historical data, so you have to run that Kafka reassign-partitions tool in order to get that data to move onto those new nodes. And that all works wonderfully, so you have this sort of elasticity today, but it just takes quite a long time to move all that data.

Ben Stopford:

You sort of [inaudible 00:15:54] elasticity, but it's not brilliant, because it takes a while to move data around, whereas with tiered storage, you actually don't need to do that. You can just run the same tool, but all it's actually going to do is move these active segments. In testing with pretty chunky data sets, we've seen that expansion happen in a matter of a few seconds, just moving these active segments, and then the new nodes are just reading and catching up from the data that's held in S3 or whatever the object store is. So yeah, I think that's going to be really exciting from my perspective, this elasticity. You get this two-tier architecture, but we also get all the efficiency of having the log data held locally on disk so we can do those very fast reads and writes, faster replication, and we're not really sacrificing much in terms of the system architecture.
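To connect that to something hands-on, below is a rough sketch of the reassignment step Ben describes, using the Kafka Admin API rather than the command-line tool; the topic name and the new broker IDs are made up for the example. With tiered storage, the same operation only has to move the active local segments, which is why expansion can drop from hours to seconds.

```java
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.TopicPartition;

import java.util.*;

// Illustrative only: move one partition onto newly added brokers (ids 4 and 5 are hypothetical).
public class ExpandCluster {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            Map<TopicPartition, Optional<NewPartitionReassignment>> moves = new HashMap<>();
            // Reassign orders-0 to the new brokers 4 and 5.
            moves.put(new TopicPartition("orders", 0),
                      Optional.of(new NewPartitionReassignment(Arrays.asList(4, 5))));

            // Kick off the reassignment; with tiered storage only the hot segments follow.
            admin.alterPartitionReassignments(moves).all().get();

            // Poll until no reassignments remain in progress.
            while (!admin.listPartitionReassignments().reassignments().get().isEmpty()) {
                Thread.sleep(1_000);
            }
            System.out.println("Reassignment complete");
        }
    }
}
```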

Ben Stopford:

For me, we've thought about this a lot. It's kind of like the perfect architecture in many ways. It gives you more of the benefits that you want for most workloads. I'm super excited about that. Does that cover it, Michael? Is there anything you'd like to add?

Michael Noll:

No, I think that's exactly it. I would maybe just say in summary, at least from when I talk to people about this, that most people associate the addition of tiered storage with being pretty much a storage thing. So yeah, you get more storage, or you get cheaper storage because you can tier it now. But because Kafka is a storage system, this is not the only use for it. As Ben just said, it also makes Kafka more scalable and more elastic along the way, so I think that is a great improvement for Kafka in general.

Tim Berglund:

Michael, I was thinking that very thing, that tiered storage is kind of sold as a storage solution. The presenting problem is, "Okay, everybody, we're going to do long retention periods and Kafka's our system of record, because we believe in this database inside out vision, and that all works best if you keep data around for longer, and gee golly, this is expensive." And so tiered storage sort of flies in to the rescue to get a lot of that data to a lower cost storage tier. It feels like it entered the world and gets sold as the storage solution, but lo and behold, it's actually also a scale solution, because there's all that state that you just don't need to move. There's a fix that-[crosstalk 00:18:34]

Gwen Shapira:

It's so much more than just those two, actually. It's kind of funny how much good architecture you get for the price of one, because reads from tiered storage are going [inaudible 00:18:47], which is more, I think, than the [inaudible 00:18:51]. But more important, it doesn't go via the page cache at all. Which means that if you do historical reads, the page cache stays 100 percent intact. Nobody touches it. And if you size everything correctly, you can basically have everything that is not tiered already in the page cache. Everything that is tiered will be read via the network, not from the page cache. Basically, it's almost like an entirely disk-less architecture, in a way, which is, I don't know, kind of nifty. I think that it solves a lot of concerns people used to have around latency.

Tim Berglund:

Yes. I mean, not entirely disk-less, because you still have log segments on disk for the hot set, but it's a relatively more disk-less-

Gwen Shapira:

If your hot set fits in memory, you never really have to read from the disk, even though it's there.

Tim Berglund:

Okay.

Ben Stopford:

The main thing for the disk is really just to give you that ... one of the reasons you want this even in a database, and in-memory databases can be quite difficult to get to work in real-time use cases, is that if your dataset grows, you kind of want to be able to overflow onto disk so you can degrade gracefully rather than just running out of memory. So with the approach that Gwen's describing, you are still right: it's the disk that gives you this nice, graceful degradation if you do go beyond the memory that you have available, which you don't get in an in-memory architecture. But yeah, the nice property of [inaudible 00:20:41] is you're not paying too much for that, so you're getting in-memory performance, but this nice degradation also.

Tim Berglund:

Yes. I want to drill into, Michael, your prediction of more event streaming. You mentioned Redis is getting a feature there. I think you mentioned another database that has sprouted a stream-y feature?

Michael Noll:

There are actually quite a few now that are doing this. Snowflake, for example, is currently re-architecting their streaming functionality. I'm using my own words now, so if anyone from the Snowflake team is listening to any of this, I apologize. What they're doing-

Tim Berglund:

We'll have Michael's [crosstalk 00:21:24] account in the show notes. You can at him.

Michael Noll:

Thank you, Tim. What they're doing is, for Snowflake, they're using streaming, at the moment, pretty much exclusively for streaming ETL for data that is flowing from external systems into a Snowflake table, and this is also what they presented in this working group that works on a streaming SQL standard. That is still far away from what we in the Kafka community would consider to be event streaming. But they're now building their own stream processing system from scratch.

Michael Noll:

That is an example that I could name. But also other companies like Oracle, I know they already have some stream-y functionality in some of their products, all of which has been presented in one way or another in this working group that I mentioned. And there is this ubiquitous interest in adding these features to all these systems, so I would predict that 2021 will be a very interesting time for that working group, because everybody comes with their own preconceived ideas and ways of thinking about how, from their vantage point, streaming SQL should be added to the SQL standard.

Ben Stopford:

I was going to say the other thing is, I'm not sure if this is going to happen by the end of 2021, but I do think there's going to be a lot more emphasis around the storage of events in databases and support for that. What we've got at the moment is a bunch of different technologies coming at the idea of events from different angles. So you've got time series databases, which are very much optimizing around, unsurprisingly, time series, which is your typical events. There's actually quite a close correlation in some ways to stream processing, which is obviously also event-based.

Ben Stopford:

Then you've got the whole event sourcing community, which is out there originally building databases with events as the read model, and then, as Michael mentioned earlier, you've got these other databases. The ones that I tend to think of are Mongo or Redis or RethinkDB, when that was a bigger thing, and again, it's a different type of database and event-based approach, because it's like a traditional table that emits events. So it's like, events aren't really first-class citizens. But I think what we're hopefully going to start to see more of is databases that combine these various different patterns. Events are stored, you have this concept of history, an immutable journal of every single state change, which allows you to always make sure that you know what's actually happened in the system. But then also starting to communicate those between different processes.

Ben Stopford:

When I think about event sourcing, an event sourcing database, it's very much about a single database, and it's the same, actually, when you look at something like Mongo change streams. It's very much about a single database, all your data in one place, whereas event streaming takes the event-based approach and says, "Well, look. You can move this data around from one microservice to another microservice." And some of the weirdest places you've seen this come up is where we're getting sort of almost pipelines, where people are using streaming technologies to transfer data so it can appear in a different form in another part of the organization. You're starting to see companies do this with data warehousing tools, also. So you've seen the streaming approach, [inaudible 00:25:15] a transformation [inaudible 00:25:16], and then that sort of materializes somewhere for a microservice to read. You're also seeing the same thing happening in some of these data warehousing pipelines. Netflix is an example of one that's been doing that recently.

Ben Stopford:

I think we're going to start to see actually more of these event-based databases. A lot of data warehouses are event-based as well, at least in some contexts. And then maybe even one day there's going to be a resurgence of my favorite topic, which is the bitemporal database, which in my mind is widely overlooked by everyone. There seems to be only a small number of people who think that bitemporal databases are a good idea. I just happen to be one of them.

Tim Berglund:

Soon they'll appreciate your genius.

Ben Stopford:

Well, one day I'm going to be vindicated, yes. That's-

Michael Noll:

Yeah, that's what Ben keeps on saying, Tim.

Ben Stopford:

But occasionally I'll just bore Michael [inaudible 00:26:10] bitemporal databases.

Tim Berglund:

Michael puts himself on mute and checks his email.

Ben Stopford:

Michael's actually interested in this, because it does relate very closely to the way event streaming databases work, because they are actually temporal databases. In fact, a lot of the future work that Mike's been working on, as I'm sure he would like to tell you, is about how you handle this notion of a historic table and a table that gives you an "at this moment" view, and how you handle that in a streaming context. So I think we're going to see a lot more development of that in particular, not just in the streaming space, but actually in some of these other associated spaces. Maybe one day they'll be bitemporal, and there'll be bitemporal tables as well.

Tim Berglund:

Could you remind us what a bitemporal database is? I believe we talked about that on this podcast last time you were on it, but ...

Ben Stopford:

Yeah, it's basically a database table that holds every single version of a record. So it holds data at an event level, but the events have to be the whole facts, so if you write a customer record and then you update the customer record, the database will contain two records, one insert, one update, so it shows the original one and then the second one, the updated one. So you kind of get this record of everything that's happened. In a temporal database, you can basically query at a point in time. You can let those events build up, and you can say, I can run a query as of now or as of half past four, whatever the time might be.

Ben Stopford:

In a bitemporal database, you take the same model and you add a second notion of time. You have the time it was inserted into the database, but then you have some notion that's sometimes called business time. It's just a second time axis. And the reason this is important is that it allows you to backdate changes, so you can basically have an append-only data model, where you're always appending new records and you're keeping the entire history. But in the way it will be viewed, you can sort of backdate and say, "Well, actually, you know what, we had a mistake in that data that was added at half past four. We want to be able to change that, but we obviously can't change it because it's an immutable dataset." So this allows us to basically add an amendment, an update, which you can backdate to a point in time.

Ben Stopford:

And that sort of gives you this notion of synthesized mutability whilst maintaining the immutable data structure of a log. If you're into event sourcing, it's quite an interesting, different paradigm. I think there's only one sort of commercial database out there, and it's pretty small, that does this, although there are sort of extensions to some other SQL databases, which-
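As a small, concrete illustration of the two time axes Ben describes, here is a minimal sketch in Java. It is not any particular database's model; the record fields, timestamps, and query shape are all assumptions made up for the example. Every version is appended with both a business-valid time and a recorded (transaction) time, and an "as of" query filters on both.

```java
import java.time.Instant;
import java.util.*;

// Illustrative only: an append-only list of record versions with two time axes.
public class BitemporalExample {

    // Each row records when the fact became true in the business ("validFrom")
    // and when we learned about it ("recordedAt"). Rows are never updated in place.
    record CustomerVersion(String customerId, String address,
                           Instant validFrom, Instant recordedAt) {}

    private final List<CustomerVersion> log = new ArrayList<>();

    void append(CustomerVersion v) {
        log.add(v); // append-only, like a log
    }

    // "What did we believe at `knownAt` about the customer's address as of `asOf`?"
    Optional<CustomerVersion> asOf(String customerId, Instant asOf, Instant knownAt) {
        return log.stream()
                .filter(v -> v.customerId().equals(customerId))
                .filter(v -> !v.recordedAt().isAfter(knownAt))   // only facts known by then
                .filter(v -> !v.validFrom().isAfter(asOf))       // only facts valid by then
                .max(Comparator.comparing(CustomerVersion::validFrom)
                        .thenComparing(CustomerVersion::recordedAt));
    }

    public static void main(String[] args) {
        BitemporalExample db = new BitemporalExample();
        Instant jan1 = Instant.parse("2021-01-01T00:00:00Z");
        Instant feb1 = Instant.parse("2021-02-01T00:00:00Z");
        Instant mar1 = Instant.parse("2021-03-01T00:00:00Z");

        db.append(new CustomerVersion("c1", "Old Street", jan1, jan1));
        // In March we learn the address actually changed back in February: a backdated correction.
        db.append(new CustomerVersion("c1", "New Street", feb1, mar1));

        // "As of mid-February, with what we knew in mid-February" still sees Old Street...
        System.out.println(db.asOf("c1", feb1.plusSeconds(86_400), feb1.plusSeconds(86_400)));
        // ...but "as of mid-February, with what we know now" sees the corrected New Street.
        System.out.println(db.asOf("c1", feb1.plusSeconds(86_400), mar1));
    }
}
```

The correction is appended rather than applied in place, which is the "synthesized mutability on an immutable log" idea in miniature.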

Tim Berglund:

You're making me think of Datomic and completely unrelatedly, the VMS file system. I need to stop thinking about that.

Ben Stopford:

Yeah, Datomic is actually quite ... they're the only people I know that actually do this, a little company that is basically a Clojure shop. But yes, it's an interesting paradigm.

Tim Berglund:

Yeah. It's a very Clojure-y sort of thing, just because of the way it thinks about state and mutability and things.

Ben Stopford:

But it's also very similar to the way that we eventually went to event streams. Michael might like to talk about this, because a lot of the future that we see is tables that have this sort of duality between point-in-time queries and a history view, whether that's running a query that returns historical results or something that streams changes to you. Having that very neat model that encapsulates the whole thing, I think that's a big part of making not just event streaming but also databases what they should be. In today's day and age, you're going to need to be able to deal with data that's moving from place to place. This idea that you can just lock it all up in one place is obviously no longer true or really applicable to most digital organizations.

Tim Berglund:

Right, right. That last statement is entirely consistent with, I think, our shared worldview about where data infrastructure and the associated application architectures are going for the next generation, not just 2021. But I think what I heard, and Gwen and Michael, correct me if you heard something else, is Ben said 2021 is definitely the year in which bitemporal databases become mainstream, and also Linux on the desktop.

Michael Noll:

I can only hope, Tim, that this will not be true, that this will be a wrong prediction, because otherwise, I can't stop Ben pointing out again and again that he was right.

Tim Berglund:

Yeah, that's going to be relationally difficult.

Gwen Shapira:

[crosstalk 00:31:25] My question is, if we really believe that databases that combine multiple paradigms are becoming mainstream, maybe we can make a prediction [inaudible 00:31:35] that these databases will be mainstream at the end of 2021. It kind of means that we need new data architecture patterns. Do you think we'll get those by the end of 2021? You have those [inaudible 00:31:50], as a canonical data warehouse. What would be the canonical building blocks of a database that combines streaming and stateful storage? [crosstalk 00:32:02] feel like writing a book, but it seems like the only way to [inaudible 00:32:08] prediction true is to make sure we have those design patterns by the end of 2021.

Ben Stopford:

I think some of them are already out there. This whole database thing has been around for a long time. So I think that that's fairly well known. I think the big difference is the realtime view. That's obviously never been in databases, and if you're going to have events as your data model, then it would be [inaudible 00:32:34].

Gwen Shapira:

And also the Push model was [crosstalk 00:32:38].

Ben Stopford:

That's what I mean, sorry. Using the wrong term. It's that the database is only half done: not only do you want to run queries on data that's resident, you want to be able to create streams which you can push into other places, define materialized views which can be turned into streams, and whatnot. If there's a prediction for me, it would be, as Gwen said, this kind of multimodal database, but certainly I would say that by the end of '21 we're going to start to see more than a few databases with events as their primary storage model. And yes, maybe there'll be some patterns, certainly around the thing that tends to get tricky around that: time becomes even more tricky than it does in event streaming, in some regards, because you might have a different notion of how you want to represent a table if it's a reference table versus if it's a fact. Like in a data warehouse, often facts will be versioned entities. They'll be events, whereas the dimensional tables won't be; they'll just be like traditional tables with point-in-time queries.

Gwen Shapira:

[inaudible 00:33:55] brings up this really big trend that I think is kind of something like, if you live long enough, you just see things moving back and forth between two extremes. It used to be that you had a database, Oracle, DB2, whatever, do it all. Like try to be everything for everyone. And then people were more like, "Oh, we need the right tool for the job, and we need MongoDB, and we need Kafka, and we need all those different database types." And it seems like maybe now the pendulum is swinging back. Databases become more Kafka-like; Kafka becomes more database-like, [inaudible 00:34:30], in the convergence part of the pendulum.

Michael Noll:

Yeah, maybe. But I think it's slightly different in the sense that, in my opinion, what we're talking about here is a permanent addition to the tool set of every database that wants to stay relevant from 2020 onwards. I think this idea of fast data being important and mission critical for business, that is something that will not go away wherever the pendulum is currently swinging. I think what Gwen is probably highlighting, and Gwen, tell me if I'm wrong, is that the way and the extent to which these existing data systems are incorporating this event streaming functionality, that will vary. And that will probably vary over time as well.

Gwen Shapira:

Even more, maybe the adoption. You're right. Every business will want to do this stream processing. If I'm a business, do I decide to [inaudible 00:35:41] use a tool that is best of breed, say Kafka and Kafka Streams, Kafka and ksqlDB, or will I use Oracle's version of event streaming, whatever it may be, because I have a bunch of Oracle people in my company and they're used to working with Oracle's systems, tools, conferences, events, marketing, et cetera, and we'll just gravitate in that direction? It's an interesting one. I guess if you live long enough, you end up where you've seen it all. I guess so. I don't have a good prediction around that.

Tim Berglund:

I thought it was, if you live long enough you either die a hero or live long enough to see yourself become the streaming database. I'm asking you for your predictions; I'm not making predictions. But I like the direction of that one. I'm just thinking maybe a year from now we're going to see more databases on the margins embrace that feature set, maybe more features like that will creep into mainstream databases, and I think there's a drumbeat that's building, and there's going to be a more strongly emerging consensus a year from now that this is a thing a database needs to do.

Tim Berglund:

My guess is it'll feel, when I'm starting to think about Christmas trees and things and starting to play my Christmas playlist in 2021, which is technically October, it'll feel like, "Okay, this is an inevitability." And if you're not doing it and you don't have plans for it, you're behind. And there'll be some products, but they'll be ... I don't know that it will be mainstream, but it's going to feel like, "Oh, wow, this is happening. We need to get on board."

Tim Berglund:

I don't know why I felt the need to interact with that one, but-

Michael Noll:

Part of the answer to Gwen's question is probably that up to a certain level of scalability, you can tack streaming functionality onto an existing database. Once you reach a certain threshold, however, and you know, in an always-on world where so much data is generated every second, you're reaching a level of scalability that would be very hard for you to implement when the original design didn't account for it from the very beginning.

Michael Noll:

I think this will also limit the extent to which an existing database will be able to implement streaming functionality, and I think an example is the temporal tables that Ben described earlier. I'm not talking about the bitemporal ones; for those, it might be even worse. But the temporal ones, you can think of them, and I'm using my own words now, so if you've ever worked on temporal tables or worked on the standard that introduced temporal tables, please forgive me, but temporal tables were kind of like a hack to add streaming, or the history for a table, to a database that only knows about tables.

Michael Noll:

And apart from, in my opinion, a very crude syntax for dealing with those temporal tables, the other downside is that they are really not fast. You have to update two databases at the same time, two tables at the same time, the way you interact with them has certain constraints, et cetera. This is anything but fast in a database, compared to a system like Kafka that was built for streaming. So I think that, going back to Gwen's question, that will certainly limit the extent to which these existing databases can implement streaming functionality. At the same time, not everybody's running at large scale, right?

Gwen Shapira:

I'm thinking about a Leatherman, right? If you are okay having a lot of convenient and somewhat mediocre tools, you use a Leatherman, but then if you actually have an actual maintenance job to do, it's probably not the right tool for the job. You will get out the specialty screwdrivers and specialty knives and specialty [inaudible 00:39:52] and everything that is actually a good fit for what you're trying to do. And I think that's kind of true for almost anything in software as well. If you try to be everything to all people, you're probably not the right thing for any specific thing.

Ben Stopford:

The other thing I'd add to that is that Kafka is, at its heart, kind of middleware. That's where its roots are. It's definitely different to middleware that came before it. But middleware is something that by definition sits in the middle, and you've got stream processing, which is helping the process of this data moving around. I think any database that takes the event streaming route or goes in that direction is always going to be a bit limited, because it's always going to be very specific to that one database. So if you use Mongo change streams, you can sort of synthesize an event streaming system, but you're going to have to have Mongo everywhere. And then you can't do that, because it doesn't really have great guarantees for processing data, and it's going to make assumptions, like you're going to hold all the data and everything will stay there, and all these kinds of things.

Ben Stopford:

The devil's, I think, in the details, and although at a 50,000-foot view you might kind of look and say, "Well, they look kind of the same, don't they?", in practice they're really very, very different beasts. I think the path of convergence is interesting, just because, basically, what's happening I think is these different technologies are all moving towards very different styles of implementation which provide realtime data pushed to users in different ways.

Tim Berglund:

It seems like the transition to considering systems, or understanding systems, events-first is well underway and, to me, seems fairly well-rooted, like a generational paradigm shift. "I'll be retired, and maybe a couple of decades from now young people will be saying, 'Oh, this old event streaming thing that we inherited from our foremothers and forefathers, this is no good anymore.'" That kind of thing. It's going to take a while for that to happen, but it seems like it's a generational thing. And I think what we're talking about here is the community, including open source and including vendors, really struggling to come to terms with what's the right tool set for this. It's happening. Events are happening. I don't know how to deny that. What is the best way to represent them to applications and application developers? Well, that's tough.

Michael Noll:

And I would also say, for the people that are listening in right now, the listeners of the podcast, you folks are part of this journey. You are using these various technologies, whether it's Kafka and Confluent or something else, in order to get your job done. So you're part of this journey. You're part of the people that see firsthand how this architectural and generational change is taking place. So that should be something exciting.

Tim Berglund:

And those personalities are definitely here. My guests today have been Gwen Shapira, Michael Noll, and Ben Stopford. Gwen, Michael, and Ben, thanks for being a part of Streaming Audio.

Ben Stopford:

Thanks, Tim.

Gwen Shapira:

Thank you, Tim.

Michael Noll:

Thank you, Tim.

Tim Berglund:

Thanks, everyone.

Tim Berglund:

Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST—that's 6-0-P-D-C-A-S-T—to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. And any unused promo value on the expiration date will be forfeited, and there are a limited number of codes available, so don't miss out.

Tim Berglund:

Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me @tlberglund on Twitter. That's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out in our community Slack. There's a Slack signup link in the show notes if you'd like to join. And while you're at it, please subscribe to our YouTube channel and to this podcast wherever fine podcasts are sold. And if you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support, and we'll see you next time.

Coming out of a whirlwind year for the event streaming world, Tim Berglund sits down with Gwen Shapira (Engineering Leader, Confluent), Ben Stopford (Senior Director, Office of the CTO, Confluent), and Michael Noll (Principal Technologist, Office of the CTO, Confluent) to take a guess at what 2021 will bring. 

The experts share what they believe will happen for analytics, frameworks, multi-cloud services, stream processing, and other topics important to the event streaming space. These Apache Kafka® related predictions include the future of Kafka cluster partition counts and the removal of restrictions that users have hit in the past, such as having to plan partitions ahead of time and having consumer concurrency capped by the number of partitions.

Ben also highlights ZooKeeper removal and Tiered Storage: KIP-500 removes Kafka's ZooKeeper dependency so that Kafka manages its own metadata, and Tiered Storage means a cluster can be expanded in a matter of seconds, because only the active segments have to move while the bulk of the log stays in object storage.

Michael expects a continued need for COVID-19 tracking as well as event streaming capabilities spreading into adjacent systems. Ben believes that scalable Tiered Storage for Kafka will bring real elasticity and benefit most workloads. Gwen predicts dramatically larger clusters (10 million partitions) and that databases and event streaming will continue to converge, calling for new data architecture design patterns by the end of next year.

