Many Kafka workloads, whether consisting of financial information, healthcare records, or personally identifiable details, have demanding data privacy and integrity requirements. These could be in accordance with corporate policies, industry standards, national and international regulations, or a combination of the above. In order to be confident that your data is protected from eavesdroppers throughout its journey, both in transit and at rest, you'll need to use encryption. (The only way that you might be able to get away with not using encryption is if your Kafka system fully resides in a secure and isolated network, and you don't have to answer to any authorities or auditors.)
Encryption uses mathematical techniques to scramble data so that it is unreadable by those who don’t have the right key, and it also protects the data’s integrity so that you can determine if it was tampered with during its journey.
The simplest encryption setup consists of encrypted traffic between clients and the cluster, which is important if clients access the cluster through an unsecured network such as the public internet.
The next thing to consider is encrypting traffic between brokers, as well as between the brokers and ZooKeeper (if your deployment uses ZooKeeper). Even private networks can be breached, so you want to be sure the traffic on your private network is resistant to eavesdroppers and anyone who wishes to tamper with it while it is in motion.
Another thing to consider is your data at rest, which will be extensive, since Kafka makes data durable by writing it to disk. You need to think about encrypting this stored data to protect it from anyone who gains unauthorized access to the filesystem on the brokers.
Finally, there are other ways users could gain unauthorized access to your data, including data residing in memory that could appear in a heap dump, as well as data in logs.
Next, we will cover the three encryption strategies (in transit, at rest, and end to end) in turn, beginning with encryption in transit, the only one for which Kafka provides direct support.
An out-of-the-box Kafka installation doesn't use encryption; it sends everything as easily intercepted plaintext. Fortunately, as we discussed in the Authentication Basics module, it's relatively simple to enable the SSL or SASL_SSL security protocol in order to encrypt data in transit with TLS. For this, you'll need either a self-signed certificate (generally sufficient for internal development environments) or one signed by a certificate authority (a must for production environments).
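As a minimal sketch, enabling a TLS listener on a broker and pointing a client at it might look like the following (the hostnames, file paths, and passwords are placeholders, not values from this course):

```properties
# server.properties — add a TLS-enabled listener (paths and passwords are placeholders)
listeners=SSL://broker1.example.com:9093
ssl.keystore.location=/var/private/ssl/broker1.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit

# client.properties — connect over TLS, trusting the CA that signed the broker certificate
security.protocol=SSL
ssl.truststore.location=/var/private/ssl/client.truststore.jks
ssl.truststore.password=changeit
```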
In the Authentication with SSL and SASL_SSL module, we demonstrated how, in addition to brokers providing certificates to clients, you can also require clients to provide certificates to brokers. This is accomplished by enabling the SSL security protocol and setting ssl.client.auth=required in the broker configuration, and it is sometimes referred to as mutual TLS, or mTLS. Conversely, if all you want to do is encrypt traffic and you don't need to verify client certificates (which reduces the scope of your certificate management duties), you can set ssl.client.auth=none.
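To illustrate, here is a sketch of the additional settings mutual TLS requires on each side (file names and passwords are again placeholders):

```properties
# server.properties — verify client certificates against the broker's truststore
ssl.client.auth=required
ssl.truststore.location=/var/private/ssl/broker1.truststore.jks
ssl.truststore.password=changeit

# client.properties — present a certificate during the TLS handshake
ssl.keystore.location=/var/private/ssl/client.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
```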
TLS uses private key and certificate pairs, which are exchanged during the TLS handshake. Each broker needs its own pair, and if client authentication is enabled, each logical client does too. Note that if you want to enable TLS for inter-broker communication, add security.inter.broker.protocol=SSL to your broker properties file.
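One common way to create a broker's key pair is with the JDK's keytool utility. The sketch below generates a keystore for a single broker; the alias, distinguished name, validity period, and passwords are illustrative, and the resulting certificate would then either be signed by your CA or, if self-signed, distributed directly to truststores:

```bash
# Generate a private key and certificate for broker1 (self-signed until a CA signs it)
keytool -genkeypair \
  -keystore broker1.keystore.jks \
  -alias broker1 \
  -keyalg RSA -keysize 2048 \
  -validity 365 \
  -dname "CN=broker1.example.com" \
  -storepass changeit -keypass changeit
```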
Keep in mind that enabling TLS can have a performance impact on your system, because of the CPU overhead needed to encrypt and decrypt data.
Apache Kafka doesn't provide support for encrypting data at rest, so you'll have to use the whole-disk or volume encryption capabilities of your infrastructure. Public cloud providers generally offer this; for example, AWS EBS volumes can be encrypted with keys managed in the AWS Key Management Service (KMS). For on-premises deployments, you might consider platforms like Vormetric or Gemalto (Thales).
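For instance, with the AWS CLI you might create an encrypted EBS volume for a broker's log directory like this (the availability zone, size, and key alias are hypothetical):

```bash
# Create an encrypted EBS volume using a customer-managed KMS key
aws ec2 create-volume \
  --availability-zone us-east-1a \
  --size 500 \
  --volume-type gp3 \
  --encrypted \
  --kms-key-id alias/kafka-data
```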
By this point in the course, you've likely set up some certificates, encrypted your data at rest, and set strict filesystem permissions. However, you may wish to go even further and encrypt your data from start to finish, so that it remains encrypted even in places like heap dumps and logs. For this, you'll need end-to-end encryption, which in the context of Kafka means encrypting each message as it is serialized by the producer and decrypting it as it is deserialized by the consumer, typically using symmetric keys stored in a key management service (KMS). End-to-end encryption provides the greatest amount of security, since the brokers never see the unencrypted contents of messages.
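To make this concrete, here is a minimal sketch of an encrypting serializer in Java. The EncryptingSerializer class, the key-injection approach, and the IV-prepending wire format are all illustrative choices, not a standard Kafka API; a real implementation would fetch its key from a KMS and be paired with a matching decrypting deserializer:

```java
import org.apache.kafka.common.serialization.Serializer;

import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

// Illustrative sketch: encrypts each record value with AES-GCM during serialization,
// so brokers only ever see ciphertext.
public class EncryptingSerializer implements Serializer<String> {
    private final SecretKey key;          // in practice, fetched from a KMS
    private final SecureRandom random = new SecureRandom();

    public EncryptingSerializer(SecretKey key) {
        this.key = key;
    }

    @Override
    public byte[] serialize(String topic, String data) {
        try {
            byte[] iv = new byte[12];     // fresh nonce for every message
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
            byte[] ciphertext = cipher.doFinal(data.getBytes(StandardCharsets.UTF_8));
            // Prepend the IV so the deserializer can recover it
            return ByteBuffer.allocate(iv.length + ciphertext.length)
                             .put(iv)
                             .put(ciphertext)
                             .array();
        } catch (Exception e) {
            throw new RuntimeException("Message encryption failed", e);
        }
    }
}
```

Prepending a fresh IV to every message keeps each record self-contained, so a consumer only needs the shared key to decrypt it. Note that Kafka normally instantiates serializers by class name with a no-argument constructor; a production version would load its key inside configure() rather than taking it as a constructor argument (passing an instance directly to the KafkaProducer constructor, as assumed here, also works).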
In addition to end-to-end encryption, you should adopt a key rotation policy, since clients will come and go and changes will be made to your system. Rotation ensures that, in the event of a security breach, the number of compromised messages is limited to those produced since the keys were last rotated.