Software Practice Lead
Learn about data requirements including data quality, modeling, security, and more, as well as which to prioritize when building real-time data systems and apps.
Data has become the single most valuable asset in today's business world. A good practice before considering going into production is to gather your data requirements. By conducting this analysis, you will gain a better understanding of the kind of platform you need to build.
In this module, our initial focus will be on the data quality. We will then discuss additional factors such as:
Data Modeling, Format, Ingestion Rate, Retention, Locality, Security, Privacy, and Compliance, and finally, Availability.
For each of those, I’ll give you a few tips to get your bearings. The importance of these factors may vary across industries of course. For example, on Data Availability, a few missed messages may be acceptable for an advertising company,
but it might be a serious problem for highly regulated industries such as finance or healthcare.
When creating a streaming platform, you must be careful of not ingesting bad data. Non-sense input data leads to non-sense output data. It’s called the Garbage-in Garbage-out principle. This can lead to poor decision-making and business assessments in the long run. Let’s have a quick refresher on what we mean by data quality. Completeness measures whether all required data elements are available.
For example, if a piece of data represents the sale of a product, it wouldn’t be complete without the price or the quantity.
Consistency measures how much the data matches up with other data sources on the same subject.
For instance, a customer may have changed their email address, but the new address is only updated in one of the databases.
Timeliness measures how up-to-date the data is, and how well it reflects the current state.
For example, in finance applications, the stock price from several hours ago isn’t timely enough to make an informed investment decision,
most of the time you need real-time data.
Accuracy measures how close the data is to the true value.
In a driving application, a speed sensor reporting a speed limit being trespassed by 5 miles/hour would be accurate (although not very precise).
Relevance measures how useful the data is to the intended purpose.
Let’s say you’re building a customer analytics report for VIP customers.
Data pertaining to regular customers would be considered irrelevant and would require further filtering.
Now, validity refers to the correctness of the value.
For example, a time period with a start date greater than the end date or using 13 as a month number would be good examples of invalid data.
Precision: measures how granular or detailed the data is.
Let’s take a timestamp. Sometimes, it’s expressed in seconds, milliseconds or nanoseconds since nineteen-seventy.
When it’s expressed in seconds, it might not be precise enough for some use cases, for example for sensor data.
Let’s have a look at a few tips now. Before you start ingesting data into your system, you should assess its quality, that is how complete, consistent, timely, accurate, precise, relevant and valid it is. Try to estimate the impact that a low data quality would have on your business. A good way to assess the quality with concrete numbers is to run analytics on the data source:
for example an email address is collected in 80% of the cases, but only two thirds of the values seem to be valid entries.
Is that sufficient for your business case?
Use tools like greatexpectations.io to build a comprehensive report on the data quality. If the quality is subpar, maybe there’s a way to complement it with another datasource
or you can work with the data producer to try and improve its collection and validity rates?
Enforcing and monitoring data quality at every ingress point is also another important practice.
This means making sure that all data is validated before it's ingested into your system. To do that, you can enforce the use of schemas and build a data validation layer inside your application to ensure that the incoming data meets your standards. Actually, Confluent Cloud has built such a validation layer with declarative Data Quality Rules which come with the Stream Governance Advanced package. Monitoring and alerting can also be used to notify you when data deviates too much from the baseline.
Our third data quality advice is to start treating your data as a product to build trust. It means considering data as a precious resource that requires meticulous management at every stage of its existence.
Document your data sets in a centralized location, with a description, who owns it, how to access it, and where it originates from.
Add a few tags to the mix like “contains_pii” or “marketing” or “sales” to boost discovery and productivity. All of the above will make the life of developers and data analysts much easier when they need to use your data.
Finally, try to keep track of downstream users and collect feedback from them to identify pain points
and improve your data just like you would with a software product.
If there’s a significant change that you must perform, communicate it clearly and give plenty of time for your downstream consumers to adapt.
By the way, Confluent has built some great features to help you do all of that,
check out our stream governance course on developer.confluent.io to learn more.
In Kafka, data is represented as a sequence of bytes, that’s what producers send and that’s what consumers receive.
The meaning of those bytes is up to you. The first way we recommend ensuring data quality is to create schemas for the data in the streaming platform.
Without schemas, the data exchanged between systems may be more difficult to understand and interpret as the consumer of a data stream would have to read the source code of the producer to make sense of the data format or inspect the data and try to infer the meaning.
By using schemas, we can guarantee that data is correctly formatted and can be consumed by the downstream systems. Schemas will change over time.
This evolution must be done in a compatible way and properly communicated otherwise you might accidentally break consumers.
Lastly, it’s crucial for developers and data engineers to easily discover and share schemas, promoting better collaboration and reuse of existing data structures and formats. Our first tip on data format is that you should standardize on a data format.
There are multiple options when it comes to choosing a data serialization format: JSON, Avro and Protobuf. It might sound obvious, but pick one that is well supported by your different programming languages. Be careful not to choose a format at odds with your enterprise data exchange standard otherwise you’ll have to build conversion layers at every system boundary. Next up, enforce data contracts between producers and consumers.
By enforcing schema validation, it can help ensure that the data being exchanged across the system is correct and follows a specific structure.
The schema registry provides a centralized location for storing and managing schema information. In the end, you should select a serialization format which is easy to evolve, storage efficient and has good tools support.
JSON Schema is really just a specification.
In our mind, Avro and Protobuf come on top. They are both binary formats that offer a number of advantages over JSON. For one thing, they're more compact, which means they take up less storage space and can be transmitted more quickly over a network.
They also have your back when it comes to schema evolution, this means you can tweak the structure of your data without breaking existing code.
We recommend keeping the key as simple types like a string UUID, or a long ID. Keys should only serve to ensure partition ordering, and the business key of the object should be in the value data contract. Avro does not guarantee deterministic serialization for maps or arrays, and Protobuf and JSON schema formats do not guarantee deterministic serialization for any object, so using those as keys could break partitioning.
In a data streaming environment, Data Modeling is the process of designing the structure of real-time events to facilitate processing and improve system performance. Events can help modern businesses react to change immediately by providing a record of what has happened,
instead of relying on periodic queries performed only a few times a day. By modeling events, their attributes, and their relationships,
you will provide a clear and concise picture of how a system must behave in response to the other events. It’s a good way to design flexible and adaptable systems which are easy to change over time. Check out our courses on Confluent Developer to help you get started with event modeling, event streams design but also CQRS and Event Sourcing.
Our first data modeling tip is to capture metadata for better insights.
For example, business metadata is data that adds business context to other data.
Let's say a retail company wants to develop an event-driven system to track their store inventory. Maybe they have events like "New shipment received," "Item sold," and "Item returned.". To extract better insights from the data, each event should include important information like the product and shipment ID,
or how many items they received or sold, and where the store is located. With Kafka, you should use the message payload to store that business metadata. Perhaps you need to collect technical metadata instead, like, the system the data originates from or the user session id, for lineage and audit purposes or to facilitate routing of the data downstream. In this case, it’s best to use the record headers instead of the payload.
Our second suggestion is to add a unique identifier to each message. This will be helpful to make your applications idempotent,
meaning that if the app encounters the exact same message after successfully processing it the first time,
it will ignore that duplicate the second time. We’ll discuss idempotence in greater detail later when we discuss the Applications pillar.
Lastly, try to standardize field and topic names.
It's important to prefix topics with your domain name. This helps create a clear and unique namespace for your topics, avoiding conflicts and confusion. For example, you might use 'financing_users_v1' instead of just 'users.'
If you do that, it will be very straightforward to aggregate two clusters together for example or set up topic mirroring the day you need it. Consistency in naming not only helps improve readability and maintainability of your code and data schemas,
it will also make authorization simpler later on.
Another key aspect is implementing semantic versioning for your data schemas. When you make a change that's backward incompatible,
you should increment the major version number. Minor version changes, on the other hand, should be both backward and forward compatible.
By using this versioning system, you can clearly communicate the nature of the changes made to the structure of your data. Cybercriminals are always on the lookout for opportunities to exploit data breaches
and gain access to sensitive information such as financial data, personal identification information, or confidential business data. Data Security is the protection of the data from unauthorized access, use, disclosure, modification or destruction. Many industries, including government agencies, financial and healthcare industries,
are subject to regulations and laws such as PCI, GDPR, and CCPA,
which require encryption of sensitive data both at rest and in transit.
Data privacy is the right of individuals to control the collection, use, and disclosure of their personal information. For example, GDPR and CCPA state that individuals have the right to have their personal data erased. Compliance is a critical aspect of data security and privacy,
and it involves adhering to laws,
regulation and industry standards that relate to data protection.
It's essential that organizations understand the laws and regulations that apply to their operations,
and take steps to protect sensitive data and maintain compliance to avoid legal liabilities and reputational damage.
Now, a few tips on Security, Privacy and Compliance. First, it's important to encrypt or tokenize PII data both at rest and in transit. This ensures that any sensitive information is unreadable to anyone who does not have access to the encryption keys. Confluent Cloud provides Client-Side Field Level Encryption to simply tag a field,
define an encryption policy, and begin streaming encrypted data securely. Secondly, building the ability to delete PII data is also crucial for data privacy. With GDPR protecting EU citizens and residents,
it's essential that organizations are able to delete personal information upon request. This applies to everyone in the world, not just within the EU. Finally, limiting and controlling access to sensitive and personal data for internal stakeholders
and developers is another important aspect of protecting the data. A best practice is to implement a least privilege approach:
every user or process should have only the minimum access necessary to perform their job functions. This can be achieved through the use of ACLs, Access Control Lists, with Kafka. In Confluent Cloud, we also provide RBAC functionality to greatly simplify the management of who can access restricted data.
An important question you must ask yourself is what kind of volume of events will be generated by the data sources,
and how fast do you need to process it?
The Data Ingestion Rate requirement is just that.
It measures how quickly a stream processing system can receive and process incoming events from various sources. Throughput refers to the amount of data that can be processed by a system over a given period of time,
typically measured in messages or bytes per second. A high throughput Kafka system can handle large amounts of data quickly and efficiently,
making it a good choice for applications with high data volumes. Latency, on the other hand, refers to the time delay between when a message is produced and when it is consumed by a consumer. With Kafka, you can optimize for one or the other, but not both at the same time. That’s because high throughput demands message batching,
which increases latency but low latency requires individual messages to be processed in real-time. The performance of Kafka depends on various factors,
including the hardware it is running on, the configuration of the cluster, the network infrastructure,
the size of the messages being processed, and the workload patterns. Finally, keep in mind that huge volumes of data demand more processing and parallelization,
which means more partitions, more brokers to handle them, and more machines or CPU for the processing nodes. So, before you go to production, make sure you understand the latency requirements for each data set. As an example, use cases such as analytics, fraud detection, or application monitoring have different requirements for latency,
with some requiring near real-time processing and others being more tolerant to delays.
In most situations, you should try to strike a balance between throughput and latency,
and there is an array of techniques to help you do that. Use workload placement to process data closer to where it resides, to avoid pulling data over a long distance. Another effective approach is to use asynchronous message publishing to reduce latency. Increasing partitions count can also help distribute the load and improve the throughput. Likewise, message batching and compression can improve the throughput at the expense of some latency. A good practice is to increase capacity for unexpected expansion. Data ingestion can experience unpredictable spikes in volume due to changes in data sources, business events, or increased user activities.
So make sure to size up by 20% your resources like compute, storage and bandwidth so that you have enough room to grow and avoid service degradation or disruption. Beware that adding more nodes to a cluster could increase latency as a producer would need to wait for more nodes to have a copy of the data.
Our last tip is to keep an eye on the latency of the platform with monitoring and alerting. If it is on the rise it’s time to investigate the performance degradation and maybe add more processing power. Data retention refers to the length of time that messages are retained in a Kafka topic before being deleted.
Kafka allows you to set a retention period for data, after which it will be automatically deleted.
It's important to choose an appropriate retention period based on your data storage requirements and compliance regulations. Longer retention periods can provide more historical data for analysis,
but can also result in higher storage costs and longer recovery times in the event of a failure.
Shorter retention periods, on the other hand, can reduce storage costs and improve recovery times,
but can also limit the amount of historical data that is available for analysis. This retention setting can be expressed in terms of storage size or in terms of time elapsed.
The retention could be infinite if needed.
It’s defined by the broker configuration and can be set on a per-topic basis. It's worth noting that the length of time data is retained varies depending on factors such as legal requirements, industry regulations, and company policies.
For example, session data from websites are kept only for a few days, whereas medical records stored by a healthcare company would probably be retained for years.
Our first tip is that you should determine how long you need to retain each data set before it is be deleted or archived.
When doing so, bear in mind that legislation like GDPR protects individuals' privacy and prevents the unnecessary retention of their personal data.
This means that organizations must have processes in place to regularly review the PII data they hold
and ensure that any PII data that is no longer required is deleted or anonymized.
Make sure to consider storage costs as they can increase with more data.
Also clusters with more data will recover less quickly than others.
Next, try to estimate the amount of data you want in your platform on your go-live day but also in a year. This will help you plan the required CPU and storage capacity for your system. A good way to do so is to use a tool like https://eventsizer.io. If you’re wondering, that's just if you use Apache Kafka.
With Confluent Cloud, all of that is done for you via a simple knob.
We’ll talk more about this in a later module.
If you’ve configured your topic retention using time, it’s always a good idea to also configure your maximum size as a safeguard and vice versa. We’ve seen cases in the past where producers were publishing messages with timestamps in the future, hence preventing the cleaning process.
Processing massive volumes of data can take a toll on your network and systems.
Moving gigabytes of data between nodes and systems consumes a huge amount of bandwidth, slows other operations, and consumes a lot of time in the process.
Data locality solves that challenge by moving the significantly lighter processing code to the data instead. To illustrate this, the sensors in a single autonomous car can record up to 19TB of data per hour.
It would make little sense to send all of that data across the wire to the cloud, to do some computation and report back.
It would be much more efficient instead to run the computation inside the car and send an aggregated metric on every tick. In other cases, moving the data isn’t even possible.
Some countries have laws which require that the data pertaining to their citizens must be stored and processed inside the country. The first data locality tip is that you should consider whether you can move the computation near the data.
Most of the time, having data cross an availability zone or a region isn’t a good idea as Cloud Service Providers make you pay more for data going out.
Doing so can save you a significant amount of money but also help be compliant with data residency laws if applicable. Another useful technique is custom partitioning, which ensures related data is stored on the same partition for more efficient access and processing by the consumers.
Finally, organizations can also take advantage of Cluster linking or Kafka Connect to move data to the appropriate locations, allowing for more efficient processing and analysis. A very common use case is data integration.
Data integration combines data from different sources to build a unified view for analysis and decision making. With Data Pipelines, data is moved in real time as it changes, while it is current and relevant.
In doing that, the data becomes more timely in the target system. Data pipelines bridge data from disparate sources, in a decoupled way, from say, databases to cloud data warehouses.
It makes the data more consistent, all systems have the same view on the data. There is a course on how to build a streaming data pipeline if you want to dig deeper.
The first data integration tip is to make sure that the data moves quickly enough. This requires estimating the throughput and latency requirements of the platform, to ensure that it can handle the desired volume of data and process it in a timely manner. Next up, you need to identify early which connectors you will need for your data sources.
The Kafka community offers a few freely available connectors, but if you require enterprise-grade connectors, you may want to consider using Confluent Cloud, which provides over 150 enterprise-grade connectors. Another important aspect to consider is monitoring and alerting for your data pipelines.
Make sure you’re notified when pipelines fail or when lag builds up, so that you can take appropriate action and resolve issues as quickly as possible. Data availability is a critical requirement in event streaming systems.
It refers to the ability to access and retrieve data when needed.
Businesses can run without interruptions with 24/7 data availability. To ensure availability with Kafka, the data is stored on disks and replicated across multiple nodes in the cluster. This way, even if one node fails, the data can still be accessed from other nodes, ensuring continuous processing and analysis. Unfortunately, disasters or cyber attacks can strike at any time, so in a mission-critical environment, you must design your platform to protect against that risk otherwise the consequences can be catastrophic.
We will talk about this in more detail in the Business Continuity module. You will often hear the term RPO and RTO when discussing disaster recovery. RPO stands for Recovery Point Objective and represents how much data the platform can afford to lose. RTO stands for Recovery Time Objective, and is how long the platform can stay unavailable. For example, an RPO of ten minutes means that it’s fine to lose up to ten minutes worth of data but no more.
For example it could be orders being made by customers or sensor data.
The term "Zero RPO" means that no amount of data loss is acceptable. Now, let’s have a look at a few tips for data availability. But before we do that, imagine this scenario: you’re the head of a data-driven business and your team relies heavily on various data sets to carry out their day-to-day tasks. One day, the worst happens - a system failure occurs, and your data is no longer available. Panic sets in as you try to figure out how to quickly get back online.
But fear not, as you had the foresight to implement a plan to address such a situation.
Your first step before going to production was to assess the impact an outage would have on your business. How long could the business function without the affected data sets - for an hour, a day, or a week? Determining the Recovery Point Objective for each data set will help you prioritize which data sets you must recover first. Next, you came up with a plan to handle the immediate issues that could be fixed within a few minutes
and also figured out how to solve the bigger problems that might take a few hours or even days to fix. Short-term mitigation involves using temporary solutions, such as increasing resources like memory or CPU,
turning off less critical applications, reassigning partitions, or even restarting the platform to keep the business open while the problem is being fixed. We’ll see how to do that in a later module when we discuss business continuity. Last suggestion, if one of your downstream systems like a database failed and you lost data,
you can use your data pipelines to recover it, all you need to do is rewind and replay the event log.
If you aren’t already on Confluent Developer, head there now using the link in the video description
to access other courses, hands-on exercises, and many other resources.
We will only share developer content and updates, including notifications when new content is added. We will never send you sales emails. 🙂 By subscribing, you understand we will process your personal information in accordance with our Privacy Statement.