Software Practice Lead
Strategies for handling cloud volatility in cloud environments. Learn to configure Kafka clients and Kafka Streams for error handling, and techniques for application resilience.
Developers who run applications in the cloud for the first time are often surprised by the volatility of the cloud environment.
In this module, we’ll dive into the various types of issues you can encounter and see how you can deal with them in your applications.
First, you should assume that servers will restart.
This happens everyday in the cloud when clusters are rolled for an upgrade.
You can also experience unexpected peaks in latency, IP addresses can change, certificates can expire, and network packets going across the internet are lost more frequently than in most on-premise data centers.
The key to surviving those cloud events is to build and configure your streaming applications in a way to handle these kinds of problems gracefully.
If you’re using Kafka Streams, most transient failures like network partitions, lead broker changes, or producer fencing will be retried and will only generate warnings most of the time.
Now, the default retry and timeout settings in the Kafka client aren't sufficient to survive across a normal cloud maintenance event like a broker restart or the assignment of a new ip address. The last thing you want is for these normal operations to trigger unnecessary internal alerts, so it’s best to increase those settings.
Don’t use certificate pinning when using Confluent Cloud, but if you absolutely must – maybe it’s required by your infosec policy – then pin the root certificate authority which delivers the Confluent Cloud certificates instead.
Unexpected errors when consuming, processing or producing data must be dealt with explicitly.
When consuming, there might be cases where you cannot even deserialize the data. It’s called a ‘poison pill’.
No matter how many times the consumer tries, it will alway fail to consume the message. When using Kafka Streams you can specify a Deserialization Exception Handler to deal with this kind of errors.
Either use the “Log and Stop” handler or the “Log and Continue” one. You can even write your own, for custom treatment.
If you choose the log and stop strategy, which is the safest course of action if you need to process all the messages, then you would have to manually skip offsets. It’s recommended for your team to learn how to do that before going to production as it could hit anytime, for example, if you have a rogue producer which doesn’t play by the rules and doesn’t use the schema registry.
Now, past the deserialization step, when it comes to processing the message, maybe the data you’ve received has some invalid content when you do further validation.
For example, an email address in the “UserId” field isn’t well formed or maybe it’s an invalid social security number. In this case, you might want to route the message to a Dead letter queue for further analysis.
You might even go one step further and create a retry topic and a retry application. Of course, if you retry on a separate track, you run the risk of processing messages out-of-order.
If you want to guarantee that order, special care must be taken. There’s a blog on confluent.io which describes how to do that. The link will be available in the course guide below.
To get rid of these kinds of unexpected errors, Confluent Cloud has two great features: Broker-side schema validation and Data Quality rules.
With Broker-side Schema validation, the broker checks that the message has a valid schema ID from the Schema Registry.
Note that the Schema Validation does not perform data introspection.
Confluent's Stream Governance suite now includes Data Quality Rules to better enforce data contracts, enabling users to implement customizable rules that ensure data integrity and compatibility and quickly resolve data quality issues.
With Data Quality Rules, schemas stored in Schema Registry can be augmented with several types of rules.
Domain Validation rules validate the values of individual fields within a message based on a boolean predicate. Domain validation rules can be defined using Google Common Expression Language (CEL), which implements common and simple semantics for expression evaluation.
Event-Condition-Action rules trigger follow-up actions upon the success or failure of a Domain Validation rule.
Users can execute custom actions based on their specific requirements or leverage predefined actions to return an error to the producer application or send the message to a dead-letter queue.
Finally, Complex Schema Migration rules can simplify schema changes by transforming topic data from one format to another upon consumption.
This enables consumer applications to continue reading from the same topic even as schemas change, removing the need to switch over to a new topic with a compatible schema.
When deploying Kafka Streams applications, make sure to configure standby replicas for fault tolerance and availability.
Standby replicas are tasks in additional application instances started on different nodes.
You just have to set the num.standby.replicas configuration parameter to 1 or more.
Just to be clear, these standby replicas just keep their state store in sync with the changelog topic, they won’t perform any processing.
The benefit is that you get a much faster restoration from large changelog topics when your Kafka Streams application crashes.
We’ve mentioned the idempotence word several times already during this course.
Here’s what it means:
An idempotent operation can be performed many times without causing a different effect than only being performed once.
An idempotent producer ensures that duplicate messages are not created when producing messages to a topic.
It is achieved through broker-side message deduplication and producer configuration.
An idempotent consumer can safely process the same message multiple times without causing unintended side effects.
Exactly Once Semantics is a framework that allows stream processing applications such as Kafka Streams to process data through Kafka without loss or duplication.
This ensures that computed results are always accurate.
The combination of the idempotent producer and the transaction API in Kafka is what makes Exactly Once possible.
Now, here’s a few things worth mentioning:
An idempotent producer will write messages to Kafka only once, even under error conditions.
On the processing side, Kafka Streams use transactions to commit offsets and write processing results to kafka atomically, which means that either both fail or both succeed.
A few important caveats though which are often poorly understood:
Transactions extend from the data read from Kafka, any state materialized to Kafka by the Kafka Streams application, to the final output written back to Kafka.
They do NOT extend to external services or datastores called during the message processing.
See, even though the processing will be semantically done only once, it can be retried internally multiple times under error conditions.
So, you should be very careful if you’re performing external side-effects such as a call to a REST API or an insert into a database.
If your consumer times out or is restarted before committing the offsets, the transaction will time out or be aborted and the consumer will try to process the same messages again. This will have no impact from the Kafka perspective as the record won’t be visible to other kafka clients but your own application might call those third party services several times.
If this external service itself isn’t idempotent, then you need to implement that idempotence logic in your own Kafka Streams application.
Note that when you use Exactly Once, there are a few cases which aren’t covered, for example:
a consumer rewinding offsets to the beginning of a topic.
or, producers not being configured to be idempotent.
or even, streams with existing duplicates, where messages were produced before exactly-once was available.
Now, let’s see how we can deal with those kinds of duplicates.
If you want to avoid unwanted side-effects, you can do the following:
First, assign a unique business key to each message, for example a transaction id.
Then, create a state store in which you will store the processed messages from the input topic.
Finally, skip incoming messages if you can find that their business key in the state store.
Be careful about how much data will be stored in this state store, you might want to clean it on a regular basis if you know you won’t see any more duplicates for a particular business key.
An out-of the box solution is to use one of the Kafka Streams or ksqlDB recipes available on developer.confluent.io to filter out duplicates from a stream.
When it comes to testing applications, we’ve already covered in the automation module how you can leverage testing tools such as Kafka test support classes and TestContainers.
We’ve seen many situations where developers would focus only on the green-path kind of tests.
For example “given these 3 messages in this input topic, I should have this aggregated result or these 3 slightly different messages in this output topic”.
It’s fine to get your feet wet, but to be complete, you should also test your application under failure scenarios for example:
Broker restart or slow broker, so your consumer could time out during a poll before receiving a message.
Rebalancing with partition reassignment, maybe your app was already in the middle of processing, so what should it do?
Network partitions can be tested too, to check whether your application has the right acknowledgement configuration.
When running integration tests, try to spin up a multiple nodes cluster and several instances of your application for a more realistic environment.
It’s also a good practice to run performance tests on a dedicated environment after each significant application change like a topology update or a kafka client upgrade.
If you want to know more about testing, check out the “Testing Kafka Streams” blog post on confluent.io.
We mentioned in a previous module how to configure security for the platform and a few guidelines when onboarding apps.
I’m going to mention it once again, it’s a good practice to use different service accounts, one for each application.
Also, use ACLs or RBAC to control and restrict access to various Kafka resources.
If you want to know more about security, checkout the courses on developer.confluent.io.
In a production environment, you also need to monitor Kafka applications to identify and respond promptly to potential failures.
For consumers, you want to keep an eye on the Consumer Lag and the commit rate metrics.
For producers, you’ll be tracking Request rate, Response rate and “Outgoing byte rate” instead, to gauge throughput and Request Latency Average for Latency.
Getting the observability story right will enable optimal performance, help detect issues promptly, and maintain the overall reliability and stability of the Kafka ecosystem even as the production environment changes.
When it comes to documenting your application, there’s the documentation in the code approach, often using comments or meaningful variable and function names. There’s also the documentation in the repo or wiki approach, which is more visible but harder to maintain.
What if you could marry up the two and also get a nice “resilience benefit” as a bonus doing so?
By default, in Kafka Streams topologies, processor nodes use generated names.
They get this name based on the type of operation and the position in the topology.
If you change the order of the operations, this could change the generated names.
For most topologies, it’s not problematic but if you have stateful operators or repartition topics, this name shifting could prevent you from doing a rolling re-deployment of your updated topology. Check out the “Naming Kafka Streams DSL Topologies” documentation for more information.
In a Kafka Streams application, it’s now a recommended practice to give meaningful names to the processors in the topology and to the state stores.
Let’s have a look at an example.
When you call the “describe” method on the Topology below, you will get the result on the right.
The output is text normally, but I converted it into a diagram using an online tool which uses Graphviz.
Now, If we give the various operations a meaningful name, and describe the topology again, we will get the result on the right.
As you can see, this is a big improvement to quickly understand what a topology does.
So, if you generate the diagram in your build pipeline, you can get a nice visual documentation for free.
As developers, we all know and love the OpenAPI specification for describing, producing, consuming, and visualizing RESTful web services.
What if we had the same thing in the data streaming space?
Well, AsyncAPI is an open-source initiative that provides a specification to describe and document asynchronous applications in a machine-readable format.
With Confluent Cloud, you can now programmatically export your topics and schemas in AsyncAPI format to document your event-driven APIs.
Ok, before wrapping up, let’s see a few additional tips when going to production with data streaming applications.
First of all, you should upgrade your Kafka clients frequently as older versions can be a major source of incidents.
If you can, always use well-supported clients: prefer the officially supported Confluent clients or if you can’t, use libraries based on librdkafka which is also supported by Confluent.
Remember to check the default config of non-java Kafka clients. This can be surprising sometimes, if you’ve only used the java client so far.
That includes:
the auto-commit setting,
when the client performs those commits,
the transaction isolation level,
or even the hashing method for partition keys.
if you’re running on Kafka 2.5+, make sure to use `exactly_once_v2 for higher throughput and improved scalability.
Ok, last one!
Try enabling the Kafka Streams Optimizations flag, which can improve performance when you’re using the DSL.
If you aren’t already on Confluent Developer, head there now using the link in the video description to access other courses, hands-on exercises, and many other resources.
We will only share developer content and updates, including notifications when new content is added. We will never send you sales emails. 🙂 By subscribing, you understand we will process your personal information in accordance with our Privacy Statement.