
How to Analyze Data from a REST API with Flink SQL

Join Lucia Cerchie in a coding walkthrough, bridging the gap between REST APIs and data streaming. Together we’ll transform the OpenSky Network's live API into a data stream using Kafka and Flink SQL.

Lucia Cerchie

Not only do we turn the REST API into a data stream in this walkthrough, we also clean up the data along the way! We use Flink SQL to make it more readable, which keeps more of the business logic out of the client code.

How to Analyze Data from a REST API with Flink SQL

Hey there! Today we’re going to look at how to turn a REST-based API into a data stream in Kafka. Now, why would we do that in the first place? Well, REST APIs are cool for a lot of reasons, including their popularity and maturity, but sometimes the data we need to process with Kafka is only available from a REST API and we, as developers, don’t really have a choice. What we can do is take the data from the REST API we’re given, clean it up, and make it replayable with Kafka and Flink SQL. My colleague Dave Troiano recently created a demo that does just that. Grab a cup of coffee, get comfy, and join me as we take data from the OpenSky Network’s REST API and turn it into a data stream.

Step 1: Prerequisites

Let’s get a couple of prerequisites out of the way. You’re gonna need a Confluent Cloud account, as well as an installation of the Confluent CLI and the confluent-flink-quickstart plugin. We also ask that you review the OpenSky Network’s terms of use. All of these resources are linked below.

Step 2: Setup and source connector launch

First, we’ll clone the demo repo. Then I’ll log in to Confluent via the CLI and run the quickstart command to spin up a Kafka cluster and a Flink compute pool in Confluent Cloud.
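If you’re following along in a terminal, those commands look roughly like this. The repo placeholder and the quickstart flags are illustrative rather than exact, so check the demo repo’s README and the plugin’s --help output for the real invocation:

    # Clone the demo repo (use the link below) and log in to Confluent Cloud
    git clone <demo-repo-url>
    cd <demo-repo-directory>
    confluent login

    # Spin up a Kafka cluster and a Flink compute pool via the quickstart plugin;
    # the name and flags here are placeholders
    confluent flink quickstart --name http-streaming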
It’ll run for a few minutes, then, tada, we’ll be in an interactive Flink shell! We’ll run a Flink SQL statement to list the tables, but there will be no tables yet because we haven’t produced any data into the system. Let’s fix that!
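For reference, that statement is just the standard table listing, which comes back empty at this point:

    -- List the tables in the current catalog/database; empty until the connector produces data
    SHOW TABLES;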
Now, we’ll launch the HTTP source connector using the script included in the repo: create-connector.sh. While that’s running, let’s take a look at a couple of the connector configs. We’re polling the OpenSky Network’s All State Vectors URL, and for this demo we’ve bounded the data to an area of Switzerland using the piece of configuration that sets the latitude and longitude bounds. There’s also the polling interval we’re using.
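To make the bounding box concrete: the connector’s property names vary, but the request it polls boils down to something like this (the coordinates roughly cover Switzerland, following the example in the OpenSky documentation; adjust them for your own area of interest):

    # OpenSky "All State Vectors" endpoint, bounded to a box over Switzerland.
    # lamin/lomin = south-west corner, lamax/lomax = north-east corner.
    curl "https://opensky-network.org/api/states/all?lamin=45.8389&lomin=5.9962&lamax=47.8229&lomax=10.5226"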
Step 3: Connector validation

Ok, let’s validate that our connector is up and running. We’ll click on http-streaming_environment in our Environments tab, then click our http-streaming_kafka-cluster tile to view the details. Then, we’ll click Connectors in the left-hand nav to see that our OpenSkyFlights connector is running. Now that we’ve got some data flowing, will our SHOW TABLES command work? Let’s check it out! Let’s take a look at the table structure, as well as what the data looks like.
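In the Flink shell, that amounts to something like the following (all_flights is the table name used in the demo; adjust if yours differs):

    -- With the connector producing, the table now shows up
    SHOW TABLES;

    -- Inspect the schema of the raw table
    DESCRIBE all_flights;

    -- Peek at the raw rows (this is a continuous query; stop it once you've seen enough)
    SELECT * FROM all_flights;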
Now, this is pretty much the same shape as the data from the API, and it’s not ideal. There’s no self-documenting schema for the states field, and there’s some whitespace padding here. Also, for each row in the all_flights table, the states column represents all aircraft in the given bounding box.

Step 4: Cleaning up the tables

Let’s roll up our sleeves and clean this up with some Flink SQL processing. We’ll run this statement to define a new table, or view. Before we run the insert query that populates it, let’s pause to note a couple of things. The states array in each row of the all_flights table gets expanded into new rows, one row per array element, by a cross join against the UNNEST’ing of the states array. The two timestamp fields (one for the poll time and one for the reported event time) are converted from Unix epoch longs to TIMESTAMP_LTZ timestamps. And the string, numeric, and boolean fields are typecast accordingly, with string fields RTRIM’ed to remove any whitespace padding on the right.
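The demo repo has the exact statements; as a rough sketch of the pattern being described, it looks something like this. The target table, the column names, and the array positions below follow the OpenSky state-vector layout and are assumptions, not the demo’s exact schema (note that Flink array indexes are 1-based):

    -- Minimal sketch of the cleanup pattern, not the demo's exact statements.
    -- Assumes all_flights has a BIGINT poll_timestamp column and a states column
    -- of type ARRAY<ARRAY<STRING>>.
    CREATE TABLE flights (
      poll_ts        TIMESTAMP_LTZ(3),
      event_ts       TIMESTAMP_LTZ(3),
      icao24         STRING,
      callsign       STRING,
      origin_country STRING,
      longitude      DOUBLE,
      latitude       DOUBLE,
      on_ground      BOOLEAN
    );

    INSERT INTO flights
    SELECT
      TO_TIMESTAMP_LTZ(poll_timestamp * 1000, 3),            -- poll time: epoch seconds -> TIMESTAMP_LTZ
      TO_TIMESTAMP_LTZ(CAST(state[4] AS BIGINT) * 1000, 3),  -- reported position time
      state[1],                                              -- icao24
      RTRIM(state[2]),                                       -- callsign, right-padded with spaces
      RTRIM(state[3]),                                       -- origin country
      CAST(state[6] AS DOUBLE),                              -- longitude
      CAST(state[7] AS DOUBLE),                              -- latitude
      CAST(state[9] AS BOOLEAN)                              -- on_ground
    FROM all_flights
    CROSS JOIN UNNEST(states) AS s (state);                  -- one output row per aircraft

The key move is the CROSS JOIN UNNEST, which turns each element of the states array into its own row so downstream queries can treat one aircraft as one record.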
Ok, let’s see how our table looks now! That’s much more readable, don’t you think?

Step 5: Teardown

You probably don’t want to leave this running forever. To delete all the resources in your Confluent account that are related to this environment, you can just delete the environment: find the environment ID with this command, then delete it!
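With the Confluent CLI, that’s:

    # Find the ID of the quickstart environment
    confluent environment list

    # Delete it, along with the cluster, compute pool, and connector inside it
    confluent environment delete <environment-id>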
There. Leave nothing but footprints, as I always say.

Outro

If you have any questions, leave a comment below, and Dave and I will do our best to get back to you. Don’t forget to like, share, and subscribe to the Confluent channel. Thanks for watching, and see you next time!
