Learn how the Data Portal and Apache Flink® in Confluent Cloud can help developers and data practitioners find the data they need to quickly create new data products.
Hey, I’m Gilles from Confluent.
Have you ever been in a situation where there are so many topics in your Data Streaming Platform that you spend a lot of time searching for the data you want to process?
If so, you might have a Data Discoverability problem.
Stick around and I'll show you how the Data Portal in Confluent Cloud can fix it.
Data Portal Overview
If you have only a handful of topics, you probably know their names, which team owns them,
and you can probably find the fields you're looking for just by browsing through them.
But once you get into the dozens, it becomes quite hard to remember which one is which, and it's cumbersome to open each topic just to see what the data looks like or whether it contains PII.
As a developer, I've often spent a lot of time trying to find the
right data that I needed to consume in my applications.
I remember documenting which data my team produced in Wiki pages.
Unfortunately, no one knew where to find the documentation in the
first place, and it often went out of date quickly, so it wasn't ideal.
But today, there's a better way.
I'm going to show you how the Data Portal can make the lives of developers and data
practitioners much easier when they want to discover which data is available for consumption.
Let's get into a quick demo.
I have created a fresh Basic Cluster with the Advanced Stream Governance package.
I've created several topics and produced data in each one to
simulate a Data Streaming Platform after a few months in production.
The Data Portal is located in the sidebar for quick access.
Let's open it.
It displays the recently added and recently modified topics.
If you have deployed connectors,
there's also a section here which displays source and sink topics grouped by connector.
You can quickly tell whether all your connectors are running properly, or whether data isn't going through because an error occurred or the connector has stopped.
Let's view all the topics by clicking on this link.
First, you need to select an environment, for example Staging, QA, or Production.
I can filter by tag or business metadata.
We will get to that in a moment.
I can also filter by cloud provider, region, cluster, creation date, modified date, and even retention period.
I also have some sorting options.
I want them sorted by name.
Finally, I can type a few characters in the textbox
and the matching topic names will show up in the dropdown.
I can click the preview icon to get a sneak peek of each topic and check the
last message produced and the associated schema.
Of course, I can click on any tile below to do the same.
Query with Flink
But having just the last message might not be enough to get a good sense of what the data looks like.
So I'm going to show you how easy it is to query the data with Flink SQL.
If I click on the "Query" button, I can write a Flink query.
I'm not limited to just querying the topic I've selected.
I can join two topics together, for example:
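Here's a minimal sketch of what such a join could look like; the column names are hypothetical, so swap in the fields from your own schemas:

```sql
-- Sketch only: joins the two demo topics on a hypothetical offer_id field.
-- Replace the column names with the ones defined in your schemas.
SELECT
  a.customer_id,
  a.activity_type,
  o.offer_name
FROM insurance_customer_activity AS a
JOIN insurance_offer AS o
  ON a.offer_id = o.offer_id;
```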
Note that I am producing data to both topics as I speak, so this query is streaming the results to the UI in real time.
Requesting access to data
Of course, you'll need to be granted access to the data before being able to query it.
The required role is DeveloperRead.
It might have been assigned to you by your environment admin – and that
was the case in the example I've shown before – but if it hasn't,
it's very easy to request access to a particular topic from the UI.
Let me show you how.
For example, let's say William from the Audit team
wants to check out the data in the insurance_offer topic.
He can request access directly by sending a message to the topic owner, who will receive an email.
In this case, the topic owner is me.
Once I've received the notification email, I can go to the Access Requests tab in the settings,
select the line, and click the "Approve access" button.
Okay, time to get started organizing and documenting!
Create tags via UI
I'm going to create a tag called PII and assign it to a topic.
To add a tag to a topic, click on a tile and then View Topic next to the topic name.
You can now select the tag that you want to assign to this topic.
Let's add a description too while we're at it.
If I go back to the Data Portal home page and click on the PII tag, it will display only the topics that have this tag.
Add Business Metadata
Tags are great for categorizing, but at times you would like a way to add some additional information to your topics.
That's exactly what "business metadata" is for.
The difference from tags is that business metadata has attributes in the form of key-value pairs.
You can give a value to each key when assigning the business
metadata to an entity such as a topic, a schema or even an environment.
First, let's head over to the environment and create a "business metadata" object called "Domain" with name, team-owner, and slack-channel attributes.
Next, we're going to assign it to a topic.
In the data portal, I'm going to search for the insurance_customer_activity topic.
Okay, let's see: "name" is the "Contracts" domain, "team-owner" is "Contracts Engineering", and "contracts-eng" is the Slack channel where the team can be reached.
If I go back to the Data Portal page, I can now filter by this business metadata.
For example, I can show only the topics that belong to the "Contracts" domain.
Likewise, I can display only the topics that belong to the "Offers" domain.
Create business metadata and tags via Terraform
Now, it would be a bit cumbersome to add dozens of tags and business metadata for each environment and assign them to topics or schemas via the UI.
There are two additional ways to do that.
You can use the Confluent Cloud API directly, for example with cURL in a shell script or from a Python program.
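For instance, a tag like PII could be created with a cURL call along these lines; this is only a sketch, and the /catalog/v1/types/tagdefs path, the endpoint placeholder, and the Schema Registry credentials are assumptions to verify against the Confluent Cloud API reference:

```bash
# Sketch only: <SR_ENDPOINT>, <SR_API_KEY> and <SR_API_SECRET> are placeholders for your
# Stream Governance (Schema Registry) endpoint and API key pair.
curl -s -u "<SR_API_KEY>:<SR_API_SECRET>" \
  -H "Content-Type: application/json" \
  -X POST "https://<SR_ENDPOINT>/catalog/v1/types/tagdefs" \
  -d '[{"name": "PII", "description": "Contains personally identifiable information"}]'
```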
But it's even better to take a declarative approach with the Confluent Terraform Provider.
It's the recommended approach in a production setup and it
works particularly well as a CI/CD step.
I have used the template available in the Terraform Provider GitHub repo to
import all Kafka resources from my cluster into a Terraform file.
Here are my insurance related topics plus all other topics.
I can now create additional tags or business metadata,
associate them to topics and run terraform apply.
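To give you an idea, here is a minimal sketch of what that could look like, assuming the provider's confluent_tag, confluent_tag_binding, and confluent_business_metadata resources and a provider block that already carries the Confluent Cloud and Schema Registry credentials; the IDs and the entity_name format are placeholders to verify against the provider documentation:

```hcl
# Sketch only: provider configuration (Confluent Cloud + Schema Registry credentials)
# is assumed to be defined elsewhere; verify argument names against the provider docs.

variable "schema_registry_id" {} # e.g. "lsrc-xxxxx"
variable "kafka_cluster_id" {}   # e.g. "lkc-xxxxx"

resource "confluent_tag" "pii" {
  name        = "PII"
  description = "Contains personally identifiable information"
}

resource "confluent_tag_binding" "insurance_offer_pii" {
  tag_name    = confluent_tag.pii.name
  entity_type = "kafka_topic"
  # Assumed qualified-name format: <Schema Registry cluster ID>:<Kafka cluster ID>:<topic name>
  entity_name = "${var.schema_registry_id}:${var.kafka_cluster_id}:insurance_offer"
}

resource "confluent_business_metadata" "domain" {
  name = "Domain"

  attribute_definition {
    name = "name"
  }
  attribute_definition {
    name = "team-owner"
  }
  attribute_definition {
    name = "slack-channel"
  }
}
```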
Now, with little effort, I've added more tags and business metadata via Terraform.
This documents the data and helps teams quickly discover what they need.
I have 5 more tags to clearly identify which topics are data products,
which ones have sensitive data and which ones are deprecated.
Speaking of deprecation, I've even added a "deprecation notice" business metadata that indicates when a topic has been or will be retired, and which topic people should use instead.
So, even if folks missed the memo, it's right there for them to see, on the topic itself.
Conclusion
As you can see, the Data Portal brings visibility into what data exists and makes it much faster for teams to build new streaming applications and pipelines.
It's available in Confluent Cloud for users with a Stream
Governance package enabled in their environments.
Alright, if you have any questions about the Data Portal, put them in the comments below.
I hope you enjoyed the video, please like, share and subscribe to support this content.
And if you want a deep dive into data streaming with Apache Kafka or Stream Processing with
Apache Flink, check out our YouTube channel and the courses on developer.confluent.io.
Thank you for watching and see you next time!