It's time for another release of Confluent Platform: 6.0. There's some important stuff in here. Tiered Storage, self-balancing clusters, the preview of the all-important Cluster Linking feature, things you're really going to want to check out. I was able to get back out into the wilderness again to record this, so you may hear a little stream in the background of today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud. Confluent Platform 6.0 is, as they say, action-packed. And there's a unifying theme to some of these things I'm going to be talking about. You may have heard us talking about a thing called Project Metamorphosis in the past few months. Project Metamorphosis is our vision of what event streaming in the cloud ought to be.
If you look at a mature event streaming platform such as Confluent Cloud is becoming and then you look at a truly cloud-native system, you put those things together, and you get really something where the whole is greater than the sum of its parts. It's not just, here's a bunch of nice features. But you really get something that we think is transformative. And when we talk about Project Metamorphosis, it's usually cloud first since we're talking about our cloud service and not Confluent Platform. But in 6.0, a lot of those features are starting to come to Confluent Platform so if you run on-prem, you start to benefit from these things too. It's pretty exciting. And as I'm recording this in August of 2020, there are four metamorphosis themes that we've unveiled so far and those all participate in 6.0 one way or the other.
Those are elastic, cost-effective, infinite, and global, and we'll dive into each one of those more as I touch on the related features. All right, let's start with Tiered Storage. This is super cool. Now, you're running Confluent Platform and you write something into a topic and that's stored durably, potentially forever, for whatever your retention period is. Stored where, though? Sometimes you actually have to remind yourself that it gets stored, in the simplest form, on a disk that's owned by a broker. And, of course, several brokers, there's replication happening, but it's written to local disk. And I know in your deployment, there could be all kinds of complex things. Maybe you're running in the cloud and that's an EBS volume that's attached. There's all kinds of qualifiers here, but bottom line, there is something that looks like disk to the system that's running that Kafka broker. And traditionally, that's where topics are stored. Log files on disk.
Tiered Storage now gives us the option of deploying a lower cost secondary tier of storage. So you have, say, your newest data that's newer than some certain threshold that's still on those, we'll call them local disks for purposes of simplicity, and the data that's older than that threshold will get written to this other lower cost store. Think of it like a cloud blob store. And, in general, the lingua franca of blobstores is the Amazon S3 API. All of them are basically compatible with that, and there's various on-prem blobstore solutions that you can deploy that still look like S3 to the outside world. So something like an S3 Blobstore is that secondary tier and then the rest of it is still on local broker storage. So you got a few things there. You've got effectively infinite storage. Infinite, of course, is a touchy word. There's still only so many hard disks in the physical universe that we're aware of.
But we round off. I mean, in photography, infinity is 20 feet. So just give me this one. It's basically infinite storage. You don't have to deploy more brokers to get more storage. You have a farm of disks. You have S3, you have Google Blobstore, you have something like that where you're really not going to fill it up. Also, you've reduced infrastructure costs. Kafka brokers can get pretty big in terms of the disks you can stuff into them, but it's still a broker. It's a Confluent Platform node. There's a machine. There's all that infrastructure cost associated with that, and you'd rather be able to now make a disk farm, and that's storage, as we know, at a much lower cost per byte when it lives off in a blobstore. So you get much cheaper infrastructure costs for all of that data.
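To make the threshold idea concrete, here is a minimal Python sketch of splitting log segments between local broker disk and a cheaper remote tier by age. It is only a toy model: `Segment`, `split_tiers`, and `hotset_hours` are hypothetical names invented for illustration and are not Confluent Platform settings or APIs.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int   # first offset held in this log segment
    age_hours: float   # hypothetical: hours since the segment was closed


def split_tiers(segments, hotset_hours):
    """Split segments into (local, remote) using an age threshold.

    Segments newer than the threshold stay on broker disk; older ones
    are assumed to live in the lower-cost object store.
    """
    local = [s for s in segments if s.age_hours <= hotset_hours]
    remote = [s for s in segments if s.age_hours > hotset_hours]
    return local, remote


segments = [
    Segment(0, 72.0),
    Segment(1000, 30.0),
    Segment(2000, 5.0),
    Segment(3000, 0.5),
]
local, remote = split_tiers(segments, hotset_hours=24.0)

print([s.base_offset for s in local])   # [2000, 3000] stay on local disk
print([s.base_offset for s in remote])  # [0, 1000] move to the remote tier
```

Growing retention in this model only lengthens the `remote` list; the amount of data on broker disk is bounded by the threshold, which is the point being made about not adding brokers just to add storage.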
You also have more elasticity. If you're storing things for a longer period of time, you don't have to worry about deploying more brokers. You can just grow your blobstore usage. And if that is a cloud blobstore, then that is truly elastic. You don't think about that scaling. You're not thinking about deploying the servers that babysit the disks or managing anything like that. It's just, as they used to say, crunch all you want, we'll make more. So you've got a little bit more elasticity there in your deployment. Next up, self-balancing clusters. Now, let's just remind ourselves what a Confluent Platform cluster is and, again, let's just use some simplifying terms. It's a bunch of computers that are connected together that have disks on them and what gets stored on those disks, well, topics, specifically topic partitions, and those partitions are distributed among the nodes in the cluster.
Now, when I add a node to the cluster, which is a good thing, this is a scalable system, I can do that. By default, if you're just thinking about Apache Kafka and I add a node to that Kafka cluster, that's great when I create new topics. It's available for partitions to be assigned to it. It's going to participate in the life of the cluster. But all of the data I had on, say, the first three, it does not magically get reassigned to that new node. I could do that. There's a way to manually do that, but that's a bummer. Self-balancing clusters in Confluent Platform 6.0 is a way of making that happen automatically. And that manual rebalance process is an included Apache Kafka command. People do this in the real world. It's just a manual step. And you've got to write a little bit of JSON, think about what partition is going to go where, and run it.
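As a rough illustration of that manual step, the sketch below builds the kind of JSON plan that Apache Kafka's `kafka-reassign-partitions` tool reads from its `--reassignment-json-file` option (a `version` field plus a list of `topic` / `partition` / `replicas` entries). The round-robin placement, the `build_reassignment` helper, the topic name `orders`, and the broker IDs are all assumptions made up for this example; this is not how Self-Balancing Clusters chooses placements.

```python
import json


def build_reassignment(partitions, brokers, replication_factor):
    """Build a reassignment plan that spreads replicas round-robin.

    partitions: list of (topic, partition) tuples to move.
    brokers: broker IDs eligible to host replicas, including new ones.
    """
    plan = {"version": 1, "partitions": []}
    for i, (topic, partition) in enumerate(partitions):
        # Start each partition on the next broker and wrap around.
        replicas = [
            brokers[(i + r) % len(brokers)] for r in range(replication_factor)
        ]
        plan["partitions"].append(
            {"topic": topic, "partition": partition, "replicas": replicas}
        )
    return plan


# Hypothetical cluster: brokers 1-3 existed, broker 4 was just added.
partitions = [("orders", p) for p in range(4)]
plan = build_reassignment(partitions, brokers=[1, 2, 3, 4], replication_factor=3)
print(json.dumps(plan, indent=2))
```

Writing that output to a file and handing it to the reassignment tool is the manual workflow being described; even in this tiny case you have to decide the placement yourself, which is the work that self-balancing takes over.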
I don't exactly get out of bed in the morning thinking I want to write JSON config files. I get out of bed in the morning thinking I want to write YAML config files, and I know you do too. But honestly, this works, but it's very manual and there's just a lot to do. So with 6.0, now this is an automated process. I click self-balancing in Control Center and away I go. Confluent Platform 6.0 includes ksqlDB 0.10. CP is, as usual, keeping pace with the rapid development of ksqlDB. Some cool new features in 0.10 include pull queries. Now, it's funny. I've been talking about pull queries for almost a year now, but they've been a preview feature. They are now GA, they're generally available in 0.10, therefore in Confluent Platform 6.0. And what a pull query is: when I create a table in KSQL, say, there's some stream and I group by a key and aggregate. That forms a table. There's a key and then a value that's the output of my aggregate function. That table is queryable by key. That's what a pull query is. It's like a regular database query.
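The idea of a table that is built from a stream and then looked up by key can be sketched in a few lines of Python. This is an in-memory toy, not the ksqlDB API: `CountTable`, `apply`, and `pull` are hypothetical names, and the aggregate is assumed to be a simple count per key.

```python
from collections import defaultdict


class CountTable:
    """Toy materialized table: key -> count of events seen for that key."""

    def __init__(self):
        self._counts = defaultdict(int)

    def apply(self, key):
        # Each incoming stream event updates the aggregate for its key.
        self._counts[key] += 1

    def pull(self, key):
        # Point-in-time lookup by key; returns None for unknown keys.
        return self._counts.get(key)


table = CountTable()
for user in ["alice", "bob", "alice", "carol", "alice"]:
    table.apply(user)

print(table.pull("alice"))  # 3
print(table.pull("dave"))   # None
```

The `apply` calls stand in for the continuously running group-by-and-aggregate, and `pull` stands in for the lookup by key: it returns the current value once and finishes, rather than streaming every subsequent change.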
Now, in this streaming database, we've got the ability to do that same thing. Embedded Kafka Connect is now also a generally available feature. It's the thing we've been talking about since about last fall, but it's been in preview and it's there for you now. So now you can, with a CREATE CONNECTOR statement, actually spin up a Kafka Connect connector, and that can be in the embedded Connect instance, inside ksqlDB itself, or if you're already running Connect standalone, you've already got a Connect cluster, you can say, "Hey, KSQL, I want you to use that Connect cluster that I'm already administering and loving and caring for." And I just get to configure it with SQL rather than posting pieces of JSON to the REST API. Again, much cooler to write some SQL than to write a piece of JSON config.
And I don't want you to miss the significance of these two features becoming generally available things. I mean, honestly, step back and look at it. To build a streaming application with the current tools, there's a lot of stuff. There are several distributed systems that you have to stand up and manage, and pretty much zero people want that. If you want to just make the application work, you didn't also want to use six different distributed systems to make it happen. So the more functionality ksqlDB integrates within itself, like this Connect integration, for example, and its native stream processing capability, that's two distributed systems that just get checked off the list right there. So it's a really important thing, and as ksqlDB grows, I think you're going to find it becomes more and more your go-to for how you get stream processing done as it lets you cross off more and more distributed systems elsewhere in your system that you don't have to operate and probably, as I said, didn't wake up this morning aspiring to operate.
Confluent Platform has also grown some new REST admin APIs. This is tremendously good news for the scripting and automation that you probably want to do. And up to now, this has been a little bit of a mix and match of different things. Maybe this thing I have to do through this command-line tool, this thing I have to do through Control Center, maybe this thing had an API, but we're bringing more and more of that into documented REST APIs so that there is a uniform, scriptable, automated way to get this stuff done. This doesn't make Control Center an unimportant part of your life. If you're a Confluent Platform user, you're still really going to want to use Control Center. It's the purpose-built admin and monitoring application for Confluent Platform, and there's all kinds of great things about it, but you also want to automate things. We are adults. We would like a REST API, and increasingly there is a better and better REST API with every release of Confluent Platform.
What can you actually do? Well, here are a few things. I can describe, list, and configure brokers; create, delete, describe, and list topics; delete, describe, and list consumer groups. Can't create a consumer group, of course. To create a new consumer group you have to deploy a new consumer group, but I can look at them and that's a good thing. Same thing, basic CRUD operations on ACLs. And I can also take a look at partition reassignments, all the stuff that you're going to want to do, all in the REST API. We have a new preview feature, which is Cluster Linking. And this is really, really important. Now, traditionally, when you want to deploy one Confluent Platform cluster in one place and another in another, so these could be different regions in your cloud provider, these could be different physical data centers of yours, maybe a physical data center of yours and a cloud provider, whatever that story is, they're not all in the same data center with a low-latency, high-availability connection between them.
Historically, to do that, you use Kafka Connect. You stand up a Connect cluster and Connect will consume from one topic that you want to replicate in another data center and then produce it there. So two downsides to that. One is, now I have to think about that Connect cluster or that connector in that Connect cluster. It's just an extra moving piece to babysit. The other downside is topic offsets. If I'm consuming from one data center and producing to another, that topic over there, presumably I would like applications over there to be able to use it. Or maybe I have an application that's in some third place that I want to be able to consume from this data center and then produce to the other, and I'm assuming those topics are the same. If the offsets are different, it's difficult to fail over from one data center to the other. So getting offsets to match is a classic problem in this approach.
With Cluster Linking now, this is running at the broker level. This is running in what we call Confluent Server, the broker in Confluent Platform, and it's happening at the broker level, this linking of one cluster to another, with offsets preserved as those topics are linked between clusters. You've got much stronger guarantees about whether messages have been consumed and consumed only once, all that kind of stuff. It basically simplifies your thinking about hybrid cloud and multi-cloud deployments, and really just what I like to call multi-datacenter deployments in general. And of course, Confluent Platform always gives you the most recent version of Apache Kafka. In this case, that's 2.6. So if you want to know about 2.6, you'll want to check out the blog post on the Confluent blog that's all about Apache Kafka 2.6. There's another video where I'm talking about some of those features in detail that's in front of a different stream elsewhere in Colorado. So be sure and check that out.
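The offset problem with consume-and-produce replication can be shown with a small Python model. This is a sketch under stated assumptions, not Kafka or Connect code: logs are plain lists of `(offset, value)` pairs, `replicate` is a hypothetical helper, and the source topic is assumed to have already aged out its first 100 records.

```python
def replicate(source_log):
    """Copy records by consuming and re-producing them.

    source_log: list of (offset, value) pairs.
    The destination assigns its own offsets starting at 0, so the
    source offsets are not carried over.
    """
    return [(new_offset, value) for new_offset, (_, value) in enumerate(source_log)]


# Hypothetical source topic: retention already removed offsets 0..99.
source = [(100 + i, f"event-{i}") for i in range(5)]
dest = replicate(source)

# A consumer group on the source has committed offset 103 as its next read.
committed = 103
print(dict(source).get(committed))  # event-3
print(dict(dest).get(committed))    # None: that offset does not exist here
```

After a failover the consumer's committed position points at nothing on the destination, so it has to be translated or reset, with the risk of skipping or re-reading records. Preserving offsets across the link removes that translation step.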
So there you go. You've got Tiered Storage, game-changing feature, if I may use that somewhat cliched term. It really is, I think you'll find, for what you can do. ksqlDB 0.10, some cool new stuff becoming generally available there, and a preview of Cluster Linking. You'll want to start using that if you are doing any kind of hybrid cloud, multi-cloud, multi-DC deployment. Get your hands on it. You'll really want to start using that now because it's going to be the way this gets done in the future. So get busy, download it now, check these features out. As always, we would love to hear from you in Community Slack, on Twitter. Always want to know what you're building and hear those stories. Thanks a lot.
Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST, that's 60PDCAST, to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021 and use it within 90 days after activation. And any unused promo value on the expiration date will be forfeit and there are a limited number of codes available so don't miss out. Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me at @tlberglund. Or you can leave a comment on a YouTube video or reach out in our Community Slack. There's a Slack signup link in the show notes if you'd like to join. And while you're at it, please subscribe to our YouTube channel and to this podcast wherever fine podcasts are sold. And if you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support and we'll see you next time.
The feature-rich release of Confluent Platform 6.0, based on Apache Kafka® 2.6, introduces Tiered Storage, Self-Balancing Clusters, ksqlDB 0.10, Admin REST APIs, and Cluster Linking in preview. These features enhance the platform with greater elasticity, improved cost effectiveness, infinite data retention, and global availability so that you can simplify management operations, reduce the cost of adopting Kafka, and focus on building event streaming applications.
If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.