Course: Building Data Pipelines with Apache Kafka® and Confluent

Hands On: Introduction and Environment Setup

7 min
Tim BerglundSr. Director, Developer Advocacy (Course Presenter)
Robin MoffattStaff Developer Advocate (Course Author)

Hands On: Introduction and Environment Setup

In this exercise, you will set up the necessary environments for the exercises in the rest of this course. Before we even do that, let’s talk about what we’re even going to be building.

The example streaming data pipeline that we’re going to build during this course is driven by a stream of events arriving with information about product ratings that users submit to our fictional company’s website.

Ratings data flowing into a Kafka topic

In reality, the ratings would be written from the website to Apache Kafka® using the producer API. For the purposes of our tutorial though, this data is going to come instead from a data generator that we’ll set up in a later step.

Each of the ratings that we receive has a user_id field. We can use this to look up information about the user so as to enrich the data that we’ve received and take timely action on it. For example, each user has a name and contact details and is also assigned a VIP status. If we have an important customer leave unhappy reviews, then we probably want to take action so as not to risk them churning. The information about customers is held in a separate system—it’s a fair bet that it’s going to be an RDBMS of some sort. The general pattern to follow with streaming pipelines is to get the data into Kafka and join it to the events there. We’ll get onto why this makes sense later on and talk about exciting things like stream-table duality.

Ratings and Customer Data

Using stream processing, we can join each event that we’ve already received, and every new one that arrives, to its associated customer data. The resulting enriched information is written back into a Kafka topic and then streamed to the target system where it can be used in an operational dashboard.

The finished pipeline

Confluent Cloud

For this course, you will use Confluent Cloud to provide a managed Kafka service, connectors, and stream processing.

  1. Go to Confluent Cloud. If you don’t already have a Confluent Cloud account, you can create one here. Use the promo code PIPELINES101 for an additional $101 of free usage.
  2. Create a new cluster in Confluent Cloud. For the purposes of this exercise, you can use the Basic type. Name the cluster pipelines_quickstart.

    Create a cluster in Confluent Cloud

  3. Once the cluster is created, you need to make sure that it has finished initializing before going to the next step. To do this, go to the "Topics" page and wait for the Create topic button to become available.

    Create topic not available

    Create topic available

  4. Next, you need to create a Confluent Schema Registry for your environment.

    Note

    If you already have a Schema Registry in your Confluent Cloud environment, you can skip this step.

    • From the Schema Registry option on your environment’s homepage (or from the left-hand navigation on the "Cluster" page), click on Set up on my own.

      Creating the Schema Registry

    • Select a cloud platform and region, and click Continue.

      Creating the Schema Registry

  5. Now go to the ksqlDB page within it and click Add application.

    If prompted, select Create application myself.

    Leave the access control set to "Global access." Make the application name pipelines-quickstart-ksqldb, and leave the number of streaming units set to "4."

    Creating a ksqlDB application in Confluent Cloud

    The ksqlDB application will take a few minutes to provision.

Note

Make sure that when you have finished the exercises in this course you use the Confluent Cloud UI or CLI to destroy all the resources you created. Verify they are destroyed to avoid unexpected charges.

Source and Target Systems

This course uses a source MySQL database and target Elasticsearch instance. Both of these should be accessible from the internet.

The provision and configuration of these systems is primarily outside the scope of this exercise. These could be run as managed services (for example, Amazon RDS and Elastic Cloud), or as self-hosted with the appropriate networking configured such that they can be connected to from the internet.

MySQL

You can follow this guide to set up an Amazon RDS managed MySQL instance.

Regardless of how you provision your MySQL database, run this script to create the table and data used in this course. If you’re using the mysql CLI, you can do it like this:

https://raw.githubusercontent.com/confluentinc/learn-kafka-courses/main/data-pipelines/customers.sql | \
  mysql -u admin -h $MYSQL_HOST -p

Elasticsearch

You can run Elasticsearch yourself, or use a managed service such as Elastic Cloud.

Use the promo code PIPELINES101 to receive $101 of free Confluent Cloud usage

Be the first to get updates and new content

We will only share developer content and updates, including notifications when new content is added. We will never send you sales emails. 🙂 By subscribing, you understand we will process your personal information in accordance with our Privacy Statement.