Gilles Philippart

Software Practice Lead

Self-managed streaming systems come with many disadvantages. Learn how a fully managed platform automates data ingestion, cluster management, security, performance, governance, and more.


Design the Streaming Platform

Now that we’ve gathered the data requirements, let’s talk about how to go about designing our Data Streaming Platform.

In the previous module, we talked a lot about the data and how it can influence the design of your platform. We also said that a platform is more than just the infrastructure. Just as grease is essential to the smooth functioning of a bike’s mechanical parts, capabilities are essential to the smooth functioning of a streaming platform. In both cases, the goal is to reduce friction and enable efficient performance. The better the capabilities, the faster you will reach the production readiness qualities we’re striving for: security, reliability, performance, operational excellence, and cost optimization. Let’s have a look at which capabilities we need from the platform:

To begin with, it must ingest and store the data durably and provide ways to process it in real time. Next, you need an excellent development experience to create applications, and a large choice of connectivity options to move the data across the enterprise. Upon entering mission-critical territory, you need powerful management and monitoring capabilities. Of course, it also needs to be secure, to prevent unauthorized access, and resilient in the face of failures or disasters, keeping the data highly available and recovering quickly. It must also ensure performant data access and exhibit elasticity of scale to meet ever-changing business needs.

Ultimately, as time passes, the underlying technology will need maintenance, including updates and migrations, and a committed team must stand ready to deliver steadfast support and tackle any failure. And finally, it’s crucial to have a robust governance capability: one that enables autonomy while ensuring the safety and stability of the platform, the discoverability and sharing of data across teams or with external entities, and the ongoing quality and trustworthiness of that data.

To build a platform with those capabilities, you need to choose between two approaches.

You can either choose the self-managed way, running the infrastructure yourself on premises or in a public cloud, or you can go the fully managed route with Confluent Cloud. Let’s see how they differ. At the heart of both lies Apache Kafka, so you get Kafka’s core benefits: the ability to ingest your data, store it durably, and process it with a powerful streaming API like Kafka Streams. You also have the Connect API, which allows you to connect Kafka to other data sources.
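To make the processing layer concrete, here’s a minimal Kafka Streams sketch that filters one topic into another. It’s only a sketch: the topic names, the priority field, and the bootstrap address are all hypothetical.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OrderFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-filter-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read every order, keep only the high-priority ones, write them to a new topic.
        KStream<String, String> orders = builder.stream("orders");
        orders.filter((key, value) -> value.contains("\"priority\":\"HIGH\""))
              .to("high-priority-orders");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```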

Next, you have Development and Connectivity. First, there are hundreds of connectors available to connect the different data sources across your company to Kafka. Over the years, Confluent has built great tools to make it easier to handle and process the data:

ksqlDB, non-Java clients, the REST Proxy, and the Schema Registry to share schemas across teams, to name just a few. These are source-available, and you can install and use them free of charge.

As you move up the ladder of capabilities, it becomes hard to find off-the-shelf components in the self-managed column. There is a range of potential solutions to fill the gap, but they are mostly do-it-yourself solutions, which will be costly in terms of integration and maintenance effort. When you reach production, you’ll need to install and manage the platform, ideally in an automated way. You will also need to manage and update assets like environments, clusters, topics, schemas, ACLs, and accounts. You will also need to build monitoring dashboards and spend time aggregating logs, configuring JMX exporters, and ingesting Kafka metrics into your enterprise monitoring solution. As you do that, you’ll probably end up thinking that it would have been great to have APIs and out-of-the-box connectors to bridge Kafka with those tools. Well, that’s exactly what Confluent has done with Confluent Cloud: we’ve built a set of APIs to administer the platform, collect metrics, and integrate with third-party tools in one fell swoop.
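To make the do-it-yourself effort concrete, here’s a minimal sketch of the kind of provisioning script you end up writing and maintaining yourself, using Kafka’s AdminClient. The topic name, sizing, and bootstrap address are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicProvisioner {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical

        try (AdminClient admin = AdminClient.create(props)) {
            // Provision a topic with explicit partition count, replication, and retention.
            NewTopic topic = new NewTopic("payments", 6, (short) 3)
                    .configs(Map.of("retention.ms", "604800000")); // 7 days
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```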

We have also developed an officially supported Terraform Provider to manage all of these assets with infrastructure as code. Going further, those APIs will be very handy when the time comes to build your own self-service capability, allowing any team to access the platform under your specific rules. When it comes to performance and elasticity, Apache Kafka in the self-managed option presents some serious challenges, as it was not initially designed to run in the cloud. On-premises deployment complicates resource provisioning and deprovisioning, while cloud deployments still demand meticulous infrastructure management and tie you to a specific provider. Kafka’s speed comes at the cost of highly performant, and expensive, instance-local storage. There is a practical limit to how much you can store on a single Apache Kafka broker.

Once that limit is hit, you have to provision additional brokers and pay for more resources than otherwise necessary. You also risk cluster downtime, data loss, and a possible breach of data retention compliance. This is especially an issue when retaining historical data, because it’s hard to scale compute and storage independently. It becomes the operators’ responsibility to distribute, limit, and balance throughput and storage between their internal tenants to avoid running out of storage capacity or throughput.

Cluster imbalances may arise, requiring prompt detection and partition reassignment to maintain equilibrium.
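In the self-managed world, that reassignment is on you, either through the kafka-reassign-partitions tool or programmatically through the AdminClient. A minimal sketch of moving a single partition (the topic name and target broker IDs are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class MoveOnePartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical

        try (AdminClient admin = AdminClient.create(props)) {
            // Move partition 0 of "orders" onto brokers 2, 3, and 4 to relieve a hot broker.
            Map<TopicPartition, Optional<NewPartitionReassignment>> move = Map.of(
                    new TopicPartition("orders", 0),
                    Optional.of(new NewPartitionReassignment(List.of(2, 3, 4))));
            admin.alterPartitionReassignments(move).all().get();
        }
    }
}
```

Multiply that by every hot partition on every cluster, and the operational cost becomes clear.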

You might have to do this several times as access patterns change or new applications come online. With Confluent Cloud, you can scale the platform with just one knob: the CKU, which is just a number. CKU stands for Confluent Unit for Kafka (yes, the letters are in a different order), and it determines the processing capacity of your cluster. Increasing the CKU count provisions more capacity, letting you scale in seconds or minutes instead of, sometimes, hours or days. When facing unbalanced clusters, Confluent Cloud has a self-balancing feature which doesn’t require any tuning and will keep your cluster perfectly balanced. On the storage front, Confluent has built Infinite Storage, which separates compute and storage resources and can transparently offload your old data to your cloud provider’s object store, such as Amazon S3 or Google Cloud Storage.

Furthermore, historical data in the remote object store is accessed via a separate path to avoid cache interference with hot data retrieval, so Infinite Storage actually increases the performance of your consumers if you have a large amount of historical data. By the way, Infinite Storage is handled transparently by Confluent Cloud: you don’t have to do anything special in your applications, you just need to set the topic retention to infinite.
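In Kafka terms, infinite retention is a single topic config, retention.ms set to -1, which disables time-based deletion. A minimal sketch with the AdminClient (the topic name and bootstrap address are hypothetical, and a Confluent Cloud cluster would additionally need SASL_SSL credentials in the properties):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class InfiniteRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            // retention.ms = -1 turns off time-based deletion, i.e. infinite retention.
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "-1"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}
```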

Now, security and resiliency are probably the two most crucial concerns when building a platform. Kafka supports multiple authentication and authorization methods, and we’ve published a course on Kafka Security, so go and watch it if you want to learn more. Kafka also comes with ACLs, which give you a few basic constructs to restrict access to various Kafka resources.
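For instance, granting a single principal read access to a single topic looks like this with the AdminClient (the principal and topic name are hypothetical):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class GrantTopicRead {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical

        try (AdminClient admin = AdminClient.create(props)) {
            // Allow the "analytics" user to read the "orders" topic from any host.
            AclBinding binding = new AclBinding(
                    new ResourcePattern(ResourceType.TOPIC, "orders", PatternType.LITERAL),
                    new AccessControlEntry("User:analytics", "*",
                            AclOperation.READ, AclPermissionType.ALLOW));
            admin.createAcls(List.of(binding)).all().get();
        }
    }
}
```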

If you’re in a regulated industry, your regulator may require an audit log to prove that you’ve granted the right permissions to the right people. In that case, you should record who tried to do what, when they tried, and whether or not the system gave permission to proceed.

That’s not something Kafka provides out of the box, so you’ll have to build it yourself from the raw Kafka logs. Resiliency is the second really important thing to get right if you’re building a mission-critical system.

The most common approach is to replicate your data in real time to a standby cluster, usually in a different availability zone or region. With Apache Kafka, you can do that with MirrorMaker 2, an open-source tool that leverages Kafka Connect to replicate data between clusters.
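MirrorMaker 2 is driven by a properties file and started with the connect-mirror-maker.sh script that ships with Kafka. A minimal sketch of one-way replication, with hypothetical cluster aliases and addresses:

```properties
# Define the two clusters and how to reach them (aliases and hosts are hypothetical).
clusters = primary, standby
primary.bootstrap.servers = primary-kafka:9092
standby.bootstrap.servers = standby-kafka:9092

# Enable one-way replication of every topic from primary to standby.
primary->standby.enabled = true
primary->standby.topics = .*

replication.factor = 3
```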

Well, at Confluent we had both security and resiliency in mind when we built features like Cluster Linking, RBAC and Audit Logs.

Cluster Linking brings byte-for-byte topic replication and is much easier to set up than MirrorMaker 2, as you don’t need to run separate Kafka Connect clusters.

Cluster Linking makes it easy to build hybrid-cloud and multi-cloud architectures for use cases like high availability, disaster recovery, data sharing, and more.

RBAC goes above and beyond ACLs by offering greater flexibility in managing access to Kafka resources.

Stream Governance in Confluent Cloud brings you powerful features like broker-side schema validation and data quality rules to enforce the quality of data before it enters the cluster.
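Broker-side schema validation is a per-topic switch. On clusters that support it (Confluent Server and, in Confluent Cloud, Dedicated clusters), the Confluent-specific topic config is confluent.value.schema.validation; a minimal, hypothetical sketch:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class ValidatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical

        try (AdminClient admin = AdminClient.create(props)) {
            // Brokers reject records whose value schema isn't registered in Schema Registry.
            NewTopic topic = new NewTopic("orders-validated", 6, (short) 3)
                    .configs(Map.of("confluent.value.schema.validation", "true"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```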

Finally, ease of maintenance and support can make the difference between a successful platform, which accelerates the implementation of business use cases, and a liability, which drags everyone down and consumes top expertise and precious engineering time.

When you opt for Confluent Cloud, the world’s leading Kafka experts handle your maintenance and support.

Confluent is responsible for the availability, reliability, and uptime of your Kafka clusters.

Patches, upgrades, and more are performed systematically and safely, with minimal or no impact.

You just have to worry about governing the platform and extending it further to reap the benefits instead of spending time and effort maintaining the infrastructure.

So, consider all these roadblocks when opting to manage Apache Kafka yourself.

Confluent Cloud comes with a lot of features that make it easier to create a global, complete, and cloud-native data streaming platform that is well governed and fully managed.

If you aren’t already on Confluent Developer, head there now using the link in the video description to access other courses, hands-on exercises, and many other resources.