Learn how to maximize streaming platform uptime, handle data loss risks, and quickly remediate problems. Adjust capacity, control costs, and prepare for growth.
We’ve seen how to build a data streaming platform, but the real work begins right after go-live.
In this module, I will tell you what you need to know to secure, monitor, support and grow the platform.
One of the main objectives when operating a data streaming platform is to maximize the uptime.
Monitoring, alerting, and troubleshooting are the foundations to maintain a reliable and high-performing data streaming platform.
So, the first step is to monitor key metrics like throughput, latency, error rates and resource utilization.
Plug your favorite monitoring tool right into Apache Kafka via JMX or use the Metrics API if you’re using Confluent Cloud. If you don’t have a monitoring tool yet, the Prometheus and Grafana combo is a popular choice to get started.
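To make that concrete, here’s a minimal Java sketch that reads a single broker metric over JMX. The broker host, the JMX port, and the choice of MBean are assumptions for illustration only; in practice you’d let your monitoring agent or the Prometheus JMX exporter collect these continuously rather than poll by hand.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerThroughputCheck {
    public static void main(String[] args) throws Exception {
        // Assumes the broker exposes JMX on port 9999 (hypothetical host and port).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1.example.com:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // One-minute rate of incoming messages across all topics on this broker.
            ObjectName messagesIn = new ObjectName(
                    "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
            Object rate = mbsc.getAttribute(messagesIn, "OneMinuteRate");
            System.out.println("MessagesInPerSec (1-min rate): " + rate);
        }
    }
}
```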
Next, determine thresholds for each key metric and configure automated alerts when these thresholds are exceeded. Prometheus has a decent alert manager to do that if that’s your tool of choice.
When troubleshooting, you will need to gather and cross-analyze broker log files, error messages, and performance metrics.
It may sound obvious, but running retrospectives frequently, with post-mortems for example, will allow you to spot gaps in your incident management process or in the metrics you collect, and to identify new dashboard panels and alerts you need to add.
When you use Confluent Cloud, this is done by Confluent, not your teams.
In case you didn’t know, after hardware and infrastructure issues, operational issues are the biggest potential cause of data loss.
Software bugs, operator errors and misconfigurations can happen and wreak havoc.
The key to prevention is to proactively detect anomalies.
It is strongly recommended to monitor sensitive operations which can change the data, for example, changing the replication factor of a topic.
You should alert in real-time when undesired changes are detected.
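If you’re self-managing Apache Kafka, one way to start is a small check built on the AdminClient that compares each topic against the configuration you expect. This is only a sketch: the bootstrap server and the expected replication factor of 3 are assumptions, and a real setup would feed the result into your alerting system rather than print to the console.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class ReplicationFactorCheck {
    public static void main(String[] args) throws Exception {
        final int expectedRf = 3; // hypothetical platform-wide standard

        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1.example.com:9092");

        try (Admin admin = Admin.create(props)) {
            Set<String> topics = admin.listTopics().names().get();
            // allTopicNames() requires Kafka clients 3.1+
            Map<String, TopicDescription> descriptions =
                    admin.describeTopics(topics).allTopicNames().get();

            for (TopicDescription d : descriptions.values()) {
                int rf = d.partitions().get(0).replicas().size();
                if (rf != expectedRf) {
                    // In a real setup this would raise an alert instead of printing.
                    System.out.printf("ALERT: topic %s has replication factor %d, expected %d%n",
                            d.name(), rf, expectedRf);
                }
            }
        }
    }
}
```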
When you detect an anomaly, it’s crucial to be able to swiftly remediate or at least mitigate the problem.
If you’re using Apache Kafka, GitOps and security policies with ACLs are a good starting point to reduce the risk when you’re getting started.
But this approach relies solely on human expertise.
As you evolve and grow the platform, many changes will be made by an increasing number of developers or operators, and inconsistencies across hundreds of configuration points may become much more difficult to spot. Confluent has leveraged its operational experience to build a very robust audit, alert and mitigation mechanism which can analyze all sensitive operations in real time.
When it comes to security and governance, try to keep it simple.
For the security aspect, start by configuring authentication, authorization and encryption.
We have a course on how to implement security with Apache Kafka and another one for doing the same with Confluent Cloud.
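As a reference point, here’s roughly what the client side of that looks like for a Java producer connecting over SASL_SSL. The bootstrap endpoint and the API key placeholders are assumptions you’d replace with your own; authorization is then enforced server-side for this principal.

```java
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.config.SaslConfigs;
import org.apache.kafka.common.serialization.StringSerializer;

public class SecureClientConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder endpoint; use your own cluster's bootstrap servers.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1.example.com:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Encrypt traffic in transit and authenticate with credentials (placeholders below).
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
        props.put(SaslConfigs.SASL_MECHANISM, "PLAIN");
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        + "username=\"<API_KEY>\" password=\"<API_SECRET>\";");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... send records as usual; the broker authorizes each request for this principal.
        }
    }
}
```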
Here are a few other tips:
Use a Service Account to create and manage clusters; it’s also required to implement the GitOps approach.
Use a Service Account per application; do not share the same credentials across apps.
Also, don’t forget to disable anonymous access.
If you need a greater degree of security and privacy, as is frequently required in government, healthcare, finance, and many other industries, you can manage the encryption keys yourself with Confluent Cloud’s “Bring Your Own Key” encryption.
It’s also recommended to apply Role-Based Access Control (RBAC) to logical namespaces via topic prefixes.
For example, ‘accounting.*’ would refer to all accounting topics. This way, you can prevent non-accounting applications from writing to those topics.
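If you’re on open-source Apache Kafka, the closest equivalent to this prefix-based scoping is a prefixed ACL, which we mentioned earlier. Here’s a sketch using the AdminClient; the principal name, host, and bootstrap server are placeholders, and it assumes the broker denies access when no ACL matches.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class AccountingPrefixAcl {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1.example.com:9092");

        try (Admin admin = Admin.create(props)) {
            // A PREFIXED pattern on "accounting." covers every topic starting with that prefix.
            ResourcePattern accountingTopics =
                    new ResourcePattern(ResourceType.TOPIC, "accounting.", PatternType.PREFIXED);
            // Allow only the hypothetical 'accounting-service' principal to write to those topics.
            AccessControlEntry allowWrite = new AccessControlEntry(
                    "User:accounting-service", "*", AclOperation.WRITE, AclPermissionType.ALLOW);

            admin.createAcls(List.of(new AclBinding(accountingTopics, allowWrite))).all().get();
        }
    }
}
```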
On the governance side of things, try to keep it simple too.
At level 3, you just need to do the bare minimum.
Keep track of who’s doing what on the platform: who owns which applications, what topics they read from and write to.
Also, manage those topic prefixes we mentioned before, to keep things tidy.
When you’re onboarding new applications, it's essential to work closely with the teams and gain a deep understanding of their applications.
The worst situation is when the platform team doesn’t want to know anything about which applications run on the platform or what they do. Platform operators must talk to the application teams to understand their applications and the challenges they face.
Ensure developers regularly update their Kafka clients and libraries to avoid issues with brokers caused by outdated client versions.
Review client-side configuration parameters as most defaults are optimized for low latency and not high throughput.
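For example, a producer tuned for throughput typically raises linger.ms and batch.size and enables compression. The values below are illustrative assumptions, not recommendations; the right numbers depend on your workload.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ThroughputTunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1.example.com:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Defaults favor low latency; these example values trade a little latency for throughput.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);            // wait up to 20 ms to fill batches
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);    // larger batches (default is 16 KB)
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");  // compress batches on the wire

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... send records as usual
        }
    }
}
```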
Also, provide clear guidelines for client and topic configurations such as partition count, replication factor, retention and compaction policies, or compression settings. It really helps when all teams understand the various processing semantics: at-least-once, at-most-once, and exactly-once; for example, how consumers will see messages that are part of an ongoing transaction depending on their isolation level.
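On the semantics point, the consumer’s isolation level is a one-line setting. Here’s a minimal sketch of a consumer that only sees records from committed transactions; the group id, topic name, and bootstrap server are made up for the example.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadCommittedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1.example.com:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "accounting-reporting");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // read_committed: only see records from committed transactions (default is read_uncommitted).
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("accounting.invoices"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
        }
    }
}
```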
It’s often helpful to have a few off-the-shelf recipes ready for application teams to use, depending on their throughput and latency requirements as well as the criticality of the workload.
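Such a recipe can be as simple as a topic template with pre-agreed partition count, replication factor, and retention. The sketch below creates one with the AdminClient; every value in it is an illustrative assumption for a hypothetical high-throughput, non-critical workload, not a recommendation.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class HighThroughputTopicRecipe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1.example.com:9092");

        try (Admin admin = Admin.create(props)) {
            // Example recipe: 12 partitions, replication factor 3, 3-day retention, delete cleanup.
            NewTopic topic = new NewTopic("accounting.clickstream", 12, (short) 3)
                    .configs(Map.of(
                            TopicConfig.RETENTION_MS_CONFIG, String.valueOf(3L * 24 * 60 * 60 * 1000),
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE,
                            TopicConfig.COMPRESSION_TYPE_CONFIG, "producer"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```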
Before the holiday sales rush, you want to make sure that you can keep up with customer demand by increasing capacity.
But once it’s over, you don’t want that highly provisioned cluster sticking around, costing you money. Teams working with Apache Kafka in a self-managed setup usually have a hard time doing this safely and without downtime.
It’s quite complex to size and provision brokers and networks.
And afterwards, you must always rebalance partitions across nodes for optimal performance.
You can use the eventsizer.io calculator we mentioned earlier, but it will just give you guidance, and won’t carry out the changes for you.
Most of this has to be done manually even though there are tools in the open source community to alleviate some of the pain.
So, if you need to onboard new applications, spend time studying the throughput and volume requirements, and then do another round of compute and network capacity estimation.
You will need to closely monitor the costs of the platform too, as they can rise quickly with all the new cloud resources and more importantly, additional network traffic.
When you’re using Confluent Cloud, you don’t have to do much ahead of time; you just turn the CKU (Confluent Unit for Kafka) knob up or down.
The cluster will automatically balance itself to maintain optimal performance.
For the cost aspect, it’s pay-as-you-go, meaning you only pay for what you use. You can also choose an annual commitment for a better discount. Oh, and there’s also a billing API to understand where you spend your money, which is always a good thing!
Now, growing your platform is not just a technology problem; you also need to think about people and process. For example, training, support, and funding, but also self-service. You want to keep the data streaming platform easy to operate without it getting in the way of teams building customer experiences or data pipelines.
To achieve that, you want the data streaming platform to be a deliberate shared service, not an accidental one.
Control it tightly and clearly document the onboarding process for new applications.
Each small effort you make early on will pay off over time, and can make a big difference as you grow the data streaming platform, for example if you hand it over or must split the cluster.
When the time for growth arises, it’s crucial to get buy-in from the relevant stakeholders. They will need to come up with a way to fund the growth, and more often than not, the platform will require:
a new operating model for multi-tenancy, increased security, more governance, more self-service, more support and of course, a way to allocate costs.
If you aren’t already on Confluent Developer, head there now using the link in the video description to access other courses, hands-on exercises, and many other resources.