Hello, you're listening to the Streaming Audio podcast. And let me start with a bit of science fiction. Indulge me for a second. The year is 2059. A soldier checks his battlefield computer and an analysis of seismic activity declares there are no active threats. So he heads back to the Jeep. And all that battlefield data gets synchronized to the Jeep's computing cluster, which then analyzes it and says the patrol has covered the whole area. So the squad leader says it's time to head back to base. And when they head back to the base, all the Jeeps coming back synchronize their data to HQ, which then analyzes the entire platoon's data to get a complete situation report.
In that scenario, you're storing event data at multiple tiers. You're processing it at multiple tiers for different reasons, and you're synchronizing it whenever connectivity allows. Well while the military example is a nice dramatic one for storytelling, there are dozens of situations where you might want to do real-time stream processing out in the field and back at HQ, and maybe several places along the way.
And today's guest, Jeff Needham, has been working on bringing that kind of futuristic story closer to the present. We talk about that and on the way he builds up a picture of why Kafka is uniquely suited as the platform to do that kind of processing. Before we get into it, Streaming Audio is brought to you by compliment developer, which is our educational site for event systems and Apache KFKA. It's got a library of blog entries, code samples, and hands-on courses. So check it out at developer.confluent.io. And if you take one of the courses you'll find there, you'll need a Kafka cluster. The easiest way to spin up one of those is with Confluent Cloud. Sign up and you can be up and running in minutes. And with that, let's take a peek into Confluence Advanced Technologies Group and see what they've been up to.
My guest today is Jeff Needham. Hi, Jeff. Welcome to the show.
Hello there, Kris. How are you?
I'm very well, good to have you here. Let me see if I've got this right. You're a senior solutions engineer at the advanced technology group. It's a great title.
What's it mean? What's an advanced technology group?
So ATG is a very unusual collection of characters and we do kind of crazy hair on fire things. We try and push the envelope of what you can do with Kafka, where Kafka can operate. At least in terms of my role, I sort of do the hair on fire, explorer kind of activities. Just to give you a sampling of the people who work in my group, Jeremy Custenborder, who a lot of people know both within Confluent and within the community. He was the first SE ever in the history of the company. We have the mighty Kai Waehner, the machine who flies around the world and never sleeps. As far as I can tell he never sleeps. Peter Gustafsson and Sudarshan Pathalam are our cloud networking geniuses. So that's a lot of horsepower when we get into complicated, especially, cloud resident deals or hybrid clouds.
My background is I worked in Yahoo Global Engineering about 15 years ago. So I definitely appreciate the cloud is not always much fun. And the networking componentry side, especially in hybrid secure encrypted environments, there's a lot of wiring. And so Sundarshan and Peter sort of handle when people need that kind of horsepower. And then we work for Nick Dearden and Nick ran case ksqlDB engineering for quite a while. And then came out to be part of this group. And the last member, Heinz Schaffner, up in Toronto, he came from Solas as did Hans Jesperson who we started out being part of this kind of SWAT team set of SEs. And now we're in PMAX CSIT organization, which is a innovation group. And so we're kind of within this CSIT innovation group and that's where we live, but we're very closely tied to the field. I started out as an SE in federal as I was at Horton Works. And so I still have-
So just for those who aren't U.S natives, federal is working for the U.S. government.
Mostly the U.S. federal government, which includes public civilian, which would be commerce and interior, and then the DOD, defense department, and then the intelligence community. Those are the three buckets that make up the U.S. Public sector. I also do some work with Canberra, Australia Defense, and I also do some work in the UK space as well. Piece of trivia, I'm a dual citizen. So I actually carry a Commonwealth passport as well as a U.S. passport being a Canadian citizen as well. So that sometimes helps. It sometimes deters me from getting into very classified meetings because I hold two passports and you're not supposed to do that. But an interesting tidbit is that the intelligence community five eyes is a term you might have heard of. It's the five of the English speaking intelligence communities, the primary one being the U.S. national security agency. The other four are all Commonwealth nations, UK, Canada, Australia, New Zealand.
Oh, is it eyes as in ocular eyes, looking at you, right.
I thought this is going to be a really long acronym. So the five spies.
Yes, so to speak. And so they're all signal intelligence organizations that are loosely coupled into what's called the five eyes and four out of the five of them are Commonwealth agencies.
So I [inaudible 00:06:24] public sector, global SE sort of, kind of informally.
Probably not as glamorous as the stuff you get in spy films.
Absolutely non-glamorous because it's all signal intelligence. But it's a really good thing because when we get into some of this stuff in the podcast, when you say signal intelligence, everybody assumes that the word that goes in front of it is military. If you think about Caltrans or Cal FIRE in California, or parts of the UK infrastructure, National Rail, British Gas, signal intelligence is signal intelligence. It's not always the military variety like it is in the five eyes world. And so a lot of what I do in terms of doing platform design for customers actually bleeds into the commercial side because situational or operational awareness is a form of signal intelligence. It's just not the military kind. It's the kind that Cal FIRE or Caltrans needs.
I think I need a definition here. What are you classing as signal intelligence?
That's a really good question. Signal intelligence is being able to essentially improve your operational or situational awareness by understanding the telemetry coming in from the sensors. The sensor array. Now that sensor array could be satellites and fancy pants James Bond, MI6 kind of stuff, or it can be sensor arrays along a water pipeline. Like in California where monitoring water is the most important thing we do aside from trying to put out fires. Signal intelligence is making sense of what's coming in. And when I talk to customers, whether it's cyber or oil pipelines or water pipelines or classic DOD stuff, IC stuff, what it really helps them understand is there really isn't a needle in a haystack. And so there's usually an aggregation of signals, which I call a signature. And so you don't usually find an aha needle going, ah, that's the thing that's really bad. It's a combination of telemetry that produces this signature. And that really plays into the sort of Kafka backbone, Uber, Netflix, LinkedIn, telemetry back plane story is from taking-
So are you saying you've got to convert it into streaming audio terms, right? So signal intelligence is this bringing in of event data from external sources analyzing it?
And you're saying you don't expect to find the key event that changes everything. What you're looking for is a pattern in the series of events coming through.
Absolutely. And that's where streaming analytics. And that can be a very simple piece of KSQL. And I mean just little bit of select list, little bit of predicate. And now what you're doing is being able to pull together disparate signals into a signature. And since Kafka has this unprecedented ability to sort of ingest at horrific Netflix scale telemetry or Uber scale telemetry, what it allows you to do, unlike very specialized hardware. In the cyber world, these are called intrusion detection sensors, IDS. It's a term from that world. When you use one of those tools like correlate Zeke, dedicated hardware, very fast, running at PCI bus speeds crazy fast, we don't do that. We're upstream. But we can see the whole battlefield. We can see the whole board. And so that becomes unique because we work in conjunction with the hardware accelerated sensor arrays.
And that could just be an MQTT server, or it could be something like correlate Zeke where it's looking at a very focused piece of the world, feeds us metadata and telemetry, but then there's a dozen of those spread across the entire sensor array, whatever that is. So I'm sort of using cyber threat intelligence as the current example, but that's because the signatures that you gather across the entire sensor array, hardware accelerated or otherwise, is what constitutes you constructing that signature. And that signature can be constructed in such a way that it's just a simple piece of KSQL that brings those things together. And we haven't even really touched on using user defined functions or more advanced stream processing. We're just cleaning stuff up and making the downstream KSQL topic that stuff like... Oh, what is that term? The create stream as select.
Creating stream as the query on another stream. Right?
Right. And so then that destination topic where the stream writes everything, it's imutable, it's persistent. It's a flight data recorder. That's my favorite metaphor for commit [inaudible 00:12:05]. And then what it's done is just taken a handful of signals with not a lot of effort on the KSQL side and create this destination, what I would call, signature topic of interest. Now you have the signals aggregated-
Sorry. Can you give me a concrete example of that? What specific sensors are we going to look at to get what specific answer?
So, one example from cyber threat intelligence is to look at what's considered the trifecta of traffic. One is from firewall logs or logs in general, a lot of Sims like Splunk and Elastic are forensic batch database aggregators of log data. So logs are important. Net flow is important, which is a protocol developed by Cisco that essentially shows you the egress routing from the switches all the way back to the source. So it's the pathway. So it's graph centric, is a graph centric flow of data. So logs, graph centric net flow, and then something called PCAP headers. These are the TCP IP headers that are in the raw traffic. Now that's where using something like corrolate Zeke, or other intrusion detection sensors, they can handle the tera bits, tens of terabits of raw traffic. And what we're just looking for are just the headers, which can be like 64 to 96 bites.
90% of what you need to see in general, there's a lot of exceptions, is in the headers of the TCP. So you combine these three sources of data from logs, maybe from Splunk Universal Forwarders, which is why I lobbied hard and we eventually got a Splunk Universal Forwarder connector, is you bring in logs, that's one leg. You bring in net flow. And there are dedicated tools that provide that including Juniper and Cisco switches. So the net flow standard is kind of ubiquitous across enterprise switches. You bring that traffic in and then the PCAP header thing.
Now, most Sims can't handle that kind of velocity. But with Kafka, we're, how many brokers do you want? How fast do you want this thing to run? And so being able to design essentially a signal or signature processing back plane on top of brokers and topics allows you to sort of bring those three together and being able to aggregate what you want to look at is, again, maybe a matter of a very modest select list and a modest amount of predicate in the KSQL statement. You don't have to get too deep into it. Again, you probably don't want to get too deep into the SQL processing because the velocities are serious. So then you have to trade off how much filtering and opportunistic analytics you can do on the fly. That's the trade off against the velocity of being able to aggregate the signatures that you really are interested in looking at.
Okay. I can see how if you're looking to pull all those three sources of data into a single real time processing unified place. I can see how Kafka's ideal for that. I'm trying to pin down the specific use case here. So what's a pattern I would be looking for in those three streams to know I've got something worth alerting?
So one of the most common examples is from the firewall logs, who is hammering on your firewall to look at the PCAP headers on the outside of the raw traffic. So then you can use something like Dark Trace or correlate Zeke, and correlate Zeke will live outside the firewall. And depending on how the firewall is configured, it can spew net flow. So the firewall produces logs, net flow, and then the raw PCAP of outside the firewall. You look at the bare naked crap out on the network. The use case is to be able to detect the combination of parameters because they're trying to do a specific attack. If that helps. Do you need more detail than that?
Have I got right we're going to look at the log stream and say, okay, these addresses are hammering our system and they have-
...are hammering our system. And they have these telltale IP headers, which say, "Okay, this is a particular kind of attack." And the graph says it's coming perhaps from these hostile regions around the world.
Right, because not only do you get a source implementation. And then a lot of hackers use ToR switches so it's very difficult to trace the IP, but you can trace the pattern of traffic because then you can at least trace it back to where the ToR originates. Now, that's where it gets a little more complicated but that's a good example where you take three sources of telemetry in the cyber threat world, and then combine that to make sure you see a clear enough picture, and it's not just the log entries.
That's where Kafka has a big advantage over a classic SIM, because NetFlow and PCAP are very high velocity, so high volume, it would be impractical to sync all that into a classic forensic database. Also, it's a forensic database, it's running a little behind, it's not real time. So being able to detect these patterns in addition to curating the topic of interest, the signature topic of interest, now you can do some of this because now you can have streaming analytics run right on top of this destination topic of interest.
Okay. Yeah, I can absolutely see that somewhere in the world there's some system that is batch loading these ETL into a relational database, and trying to join the latest tables overnight to get that data by the morning. And that's not...
My history is I spent about a decade working in the Oracle kernel group focusing on the performance and scalability of the Oracle database. So I understand the hardest thing to do on a relational, on any database, is to do essentially random writing, index update, deletes, right?
Most sims are index driven, so if you have to ingest a crapload of material, used a technical term, what you have to do is hammer on the database cluster and ask it to do the hardest thing imaginable. When it runs forensic queries, it's just read-only, that's good. Queries are easy, because they're fun, they're read-only. Especially in version-based databases like Oracle, but there are other version-based database kernels, right?
When you have to write, it hurts. And so then you're constantly ingesting NetFlow, PCAP, and logs into a database and asking it to ingest ETL, so ingest filter, then write it into the index before you can even run the query, so then if you're doing these three random write hard steps before you even can run the query, it just adds time. And it assumes you've tuned the living daylights out of your database, so it has an enormous amount of random small block write headroom. In my experience of doing this for quite a while, hardly any database is designed to have that kind of write IOPS headroom. And to do periodic ETL filtering at high velocity, you need write ops headroom. And almost invariably the hardware platform isn't set up for that, so not only are you adding a bunch of steps, you're adding the steps that hurt the most. Where on our side, we're clean, we're not 500 microseconds away from the data but we're a hundred milliseconds away, so we're pretty close.
And the amount of... There's no question that we can handle the scale and velocity, because Uber and Netflix and LinkedIn prove that every day. So then you have this capability that is we can do the ingest and the analytics faster, because we're not batch and our write layer is so lean and it's just commit logs, and they're append only. This makes a huge difference, because again, if you're looking at the right overhead... We still have to write persistently to disk, we don't have to write rep three because it's analytic data, it's not transactional data, it's not your payroll check. But that's something you trade off and so that even gives us a little more write ingest headroom, while we're doing the case SQL streams, read from topic, write back to topic loop. Now, if you configure this quickly, this thing is... Again, you could figure out a signature in the 100 to 500 millisecond range. You can't do that with a classic batch forensic ETL pipeline.
Yeah. Yeah. Because you're loading that kind of volume of data into a relational database let's say, and you know that you're writing it once, you know that effectively it's immutable. But that doesn't help you because the database has to cope for the fact that you might be concurrently writing, even if you're not, the whole thing is architected for concurrent writes. It has to give you the guarantee that would work.
This is so cool about Kafka when I first started to really understand it. I had a good conversation with Jun Rao about this. I know how the Oracle transaction layer works, and you look at Kafka and it's like, "I get a hand tunable customized transaction layer for every single topic and you're like, what?" It just blew my head off, because as a designer it means if I need that transactional integrity, I need classic acid mode, sure you can get that, here's your classic acid mode topic. I don't want to do classic acid mode on NetFlow or PCAP headers because I'm going to just shred the brokers, so I'm going to need more of them.
Because again, in order to do that kind of writing and be asset compliant on rep three with ISRs and Axall, that takes time because that's your paycheck. But it's PCAP header, so I can loosen that requirement and what I get back in exchange for loosening the acid semantics is headroom, write headroom. And that means less time spent, and therefore less time spent to understanding whether or not I've actually found the signature of interest.
Yeah. Because you're saying you can tune it so you say, "Every event is precious, you must not lose a single event." And that's fine. But you can also say, "You know what? If it gives me more throughput, I can lose a few events because I don't even care about individual events. I'm looking for a pattern." Yeah.
This is very important because this is the difference between transaction processing with Kafka and analytic processing with Kafka. I'm in the signal processing business, and any data scientist I talk to they're in the signal processing business, they're not looking at an individual column or row because they're not doing the asset semantics transaction thing. They're going to look at 300, 400 million rows looking for a pattern. So if you dropped 10,000 or 30,000 or 140,000 people are like, "Oh my God, you dropped 140,000 rows. What's wrong with you?" You're like, "There's 190 million of them." And what the data scientist is looking for is a pattern.
So you draw a distinction between transaction processing and analytics processing? Take me through that distinction a little bit.
So transaction processing is what we've understood for a million years, since OLTP was invented in the 60s by IBM. It uses the acid property, acid semantics, which is atomic consistent. And I forgot what the I means.
It's not item potent, is it? Oh, we're going to fail the job interview now.
We're going to fail the job interview. And the D is durability, which more often than not is underwritten by the [inaudible 00:25:21]
I is isolation.
Isolation, transaction isolation.
There we go.
So you can't update the same row at the same time. And so something like the Oracle database kernel, which is something I know but they're all often are all very similar. Microsoft SQL server is based on the original Sybase kernel. And then there's DB2, and then there's nine zillion flavors of DB2. That asset semantics is important because that column in a row could be worth $80 billion, so you really don't want to lose the... You want to have acid semantics around that column and that row [inaudible 00:26:04]
It could easily be like a trade in a bank that is for X million or billion dollars.
Absolutely. And Oracle numbers like IBM DB2 [inaudible 00:26:14] numbers, they're byte encoded numbers, so two or three bytes is easily hundreds of millions of dollars. So those few bytes are very, very, very important, and that's the foundation for OLTP. Over time, these databases accumulated transaction histories and they try to start to do OLAP or analytics on their existing transaction profile histories, that were sitting around in an IBM, Microsoft or Oracle database, so that's where data warehousing comes from. But you're running queries to analyze forensic data, but in order to insert more data like we talked about with the earlier SIM ETL loop, there's going to be a traditional OLTP acid centric ETL loop, that's also hard on those databases, so that chews up their bandwidth. So you'll see customers running on Exadata and Teradata and they'll try and run queries.
And I've talked to customers when I was working at Hortonworks where they had running their business on Teradata, which is pretty good OLTP engine, not quite designed for that but pretty good at it. And they would not be able to run their forensic analytics queries, because they couldn't afford to take the bandwidth away from day to day business, so then they stopped doing it. This was a conversation which spills into the Kafka world because when I was at Hortonworks, Hadoop is a non-acid semantic driven database. It has durability, but it doesn't support this classic acid mode because it was designed from the ground up to be an analytics database. So Yahoo engineering, which reversed engineered big table GFS from the Google paper and that's story for another time. But Hadoop is really... It's designed to be a purpose built analytics processing platform.
So it of course doesn't have acid semantics, because it's not doing OLTP. [inaudible 00:28:25] Kafka, and Kafka has inherited some of the rep three semantics and capabilities and distributed model that you see a little bit in GFS, big table Hadoop. There's a straight line between the history of these three products, and a straight line between LinkedIn and Yahoo. Jeff Wiener, who founded LinkedIn, came from Yahoo search engineering when I was there. And LinkedIn was down the street on Matilda from Yahoo, so it's a small family. But that technology, you can see that, and the beauty about Kafka is I get the option to do classic acid semantics, and I can also switch it and just let it run fast and loose as an analytic database. So I can trade off the semantics mostly in latency, for the bandwidth to handle the volume and velocity of something like PCAP, NetFlow and logs coming in the front end to go back to the cyber intelligence use case.
Okay. Yeah, I see that. Do you think... You'll have a much better idea about this than me because you've dealt with the guts of Oracle and Kafka and stuff. Is this somehow fundamentally related to the idea that we are separating storage and compute? Can we draw a line there?
Yeah, that's a good question because one of the things that storage and compute... In distributed computing systems, the reason why storage and compute are isolated onto a node like in the classic HDFS data node or in the classic brokers with drives bolted to the broker, is something that's called bi-sectional bandwidth, and it's an HPC term. And so bi-sectional bandwidth from the HPC, the old world so more history on my side, I started out working at a company called Control Data which is founded by a guy named Seymour Cray, and I worked a lot on Seymour Cray design, super computers. But in the 90s...
Is that the Cray?
Of Cray Supercomputer, yeah. Okay.
Of Cray... Yeah. So I worked on... I did some C-compiler work on basically Seymour Cray designed computers, and that led me to working at the Oracle kernel group because I had this weird history. In the 90s, super computers started to become distributed beowulf clusters on commodity Linux hardware. That's an important lesson or that's an important event in the history of something like Hadoop or Kafka, because it's high performance, distributed, scalable computing on commodity servers. So there's a straight line between classic Cray mainframes, beowulf clusters and Hadoop. And then that line continues into Kafka, because Kafka is a distributed rep three, horizontally scalable data processing architecture, so in my mind anyway, there's a straight line between all these three things. But the reason I mentioned the HPC world is that the beowulf cluster, once you had the working set isolated onto a bunch of drives with a computer then that working set...
Let's take classic HPC workloads like finite element analysis, you crank on your corner of the grid because you're modeling weather forecasting or the surface of an F-35 or what have you. So those used to run on big Cray mainframes and now they're a little slice of the wing that you're modeling, now runs on one server and you have a thousand servers and now you get parallelism. So as you add another server, you add disc bandwidth, network bandwidth, memory and compute bandwidth every time you add a component. So that's bi-sectional bandwidth.
In the classic Kafka sense, where you're running CP on-prem and you have a broker with a bunch of drives, most are NVME SSDs these days so they're pretty quick, then every time you add a broker, you get storage bandwidth, you get memory latency or memory bandwidth, you get computing power and you get network bandwidth because you just add brokers, add brokers, add brokers. And if you architect the app and the partition mapping, it's not magic voodoo out of the box, you still have to do some Kafka architectural design homework, but then the thing essentially scales horizontally. And if you look at what Netflix and Uber do for a living, they've dialed that in correctly, fully exploiting.
Now when you move to cloud computing or elastic computing, in order to be able to have the elastic shrink to work, you have to have the data on shared storage. So you trade off a little bit of bi-sectional bandwidth IO for the ability to certainly elastically expand, but also to elastically shrink. So the key to CC or to anything that runs, that can shrink persistent data access, this is very important because this data is persistent, it's state-full data. Normally, if you have stateless data, you shrink, you toss, so it's really easy to expand the cluster to a hundred nodes, fool around with-
It's really easy to expand a cluster to 100 nodes, fool around with something, and then the result gets written back to a persistent store. And then the intermediate results that are out on the nodes gets tossed, right?
You reduce. You throw away everything that got you there.
Right. In classic MapReduce, MapReduce, once you do the reduction and then you finally have the result, then you can actually... If you're not that interested in keeping the original working set around, you toss it and you shrink your nodes.
Yeah, yeah, yeah.
In transaction asset semantics processing, shrinking data, it's got to be on shared storage or you lose it. We used to say this, and I still kind of make this point, which is data is not elastic, but the fine print is persistent. Stateful data is not elastic. It's got to be around. It's got to still be acid compliant, especially in the transaction shrinking world. In analytics, again, as a trade-off, we get to shrink that. Because if those hundreds of millions of rows aren't interesting to us because that's yesterday's data set or this morning's data set, right? You do the analysis. This is important to understand. Because it's not the data that's important, it's the analytic output, right? What the data science algorithms result.
That result, that tells you that it is a signature of interest and that's the money. It's kind of somewhat more... It's a little beyond sort of the data is important. It's the intelligence gathered from the data. With my customers, whether it's Cal Fire, Caltrans, or the DOD, those customers, they're interested in what the patterns are, right? Whatever those patterns are. The pipeline relay pumps in Northern California are about to fail. Okay, that's bad. We need to... It's not so much the raw data.
So then you have this ability to, again, take the difference between classic asset semantic transactions and signal processing analytics, and the notion of elasticity, where you don't really have a stringent requirement on the elasticity of asset semantic data on the analytic side, you definitely do on this side. But on this side, now you have a little more rope to work with. As a platform designer, that allows me to build a more flexible or cost effective or efficient or at the edge more resilient platform so that the analysts can get their job done.
I'm trying to piece this together in my head with technology. I am saying, okay, I want a system that has transactional processing, but I also need to be able to bring in a vast quantity of data, process it, and then maybe look at something like topic compaction to throw away the not so significant data on the way through. Now I found the signal. Now I've brought it into one place and joined it and analyzed it. I can start to compact and throw away the old stuff.
This is a really good point because one of the most important tools of handling like super high volume velocities is you get two very important levers. One is compaction and one is retention. Because you might want to set the retention just a little outside of the KSQL window, for example.
Now, in the asset world, people would look at you like you're a crazy person. But again, when you're in the signal world using compaction because keeping a flight data recorded history is important forensically, so it's not useless, but it's probably going to go into a batched warehouse to do long haul, large working set forensic analysis, right? So then some topics need to be compacted, because it's like, "I really don't need the old data. I need to know what's going on right now. And what went on 190 milliseconds ago isn't really that interesting." But the windowing is interesting because then you get a little more of a signal tale, right?
And that's important to data analysts. Because if they can see the last 34 samples, then their algorithms can start to understand, especially if they start to do what I call machine doing, right?
Define that one for me.
Well, everybody knows what machine learning is, right? You have big ass working sets sitting out on some S3 set of buckets and you're running data bricks, classic stuff, right? Machine doing is actually deploying the trained algorithms out front right on the live streams. To me, that's machine doing, not just machine learning, because now it's actually doing its day job. I like that term because it helps you understand that we're in the doing business, right? Because we're about real-time stream processing. We don't do machine learning within a topic. But again, with KSQL user defined functions, UDFs, there's your hook into the model. The model can actually incrementally train if it's that sophisticated, but your trained model is sitting there probably in a jar.
That's just called out through the KSQL UDF interface. And now all of a sudden, you're watching, you're using your machine learning algorithm, not only to look for things in a more sophisticated way, but to learn from what it's seeing. And that's a big difference. I think that from my perspective, machine learning and AI, which are kind of buzzwords, but from a data processing point of view, what they really represent from my chair is they're dynamically adaptive algorithms. Where if you write a piece of code and you throw it in jar and off you go and you're like, "Ah, the algorithm is crap. We got to write a new one." Okay, you go to change code, resling the jar, blah, blah, blah.
What ML and AI really represent is a category of dynamically adaptive algorithms. They can dynamically adapt to the data stream that's coming in. Whether you use the hipster terms or not, if you can get those algorithms hung off live traffic right out on the stream's interface, then now that's the best chance that they have to certainly help you understand your awareness, but also learn as fast as possible because you're not going back into the batch forensic working set warehouse to retrain. This is kind of an evolving concept. A lot of data scientists who do machine learning are still doing the classic big working set training model.
Dynamic machine learning and dynamic machine doing are kind of a cutting edge topic, but I've run into a few customers and systems integrators, SI, that are definitely looking at this, but it's kind of state of the art for the data science world. But when they're ready, we're an ideal back plane for that, because it's like you hang your jar, we'll call it. As soon as the traffic hits the topic, you can do your job. Think about the latency in terms of milliseconds that that cuts out of the time you become aware.
Signal processing and adaptive signal processing.
It's adaptive signal processing.
Yeah, cool. To try and to get this more grounded, where else in the world is this being used? I mean, what are the actual use cases. We've got intrusion detection.
Classic cyber stuff. Yep.
Give me some more.
We're looking at being able to do some of this in the smart soldier use case category, which is probably better understood as intelligent disconnected edge. Because if you can run Kafka on something as small as a Raspberry Pi... We've been doing that on and off, or I've been doing that on and off for a few years. Officially, Confluent doesn't support ARM64 yet, but everything I've done seems to work. It runs very fast on a pi. Now you have the capability of everything we've talked about in terms of being able to actually handle a reasonable amount of velocity on a Raspberry Pi, which might sound crazy, but it handles a lot of velocity and it does flight data record.
If you decide not to compact or not to set really short retention windows that are KSQL window based, then what you get is intelligent, disconnected, edge processing. You get everything we were talking about, but now we get it where you're out in a ditch somewhere, where there's no communication channel. No comms is a term you might hear on that side. But there's lots of opportunity where you're nowhere near a cell tower, water pipelines, utilities, infrastructure, anything, there's lots of places where you don't have 4K streaming 5G cell towers. Like in the T-Mobile world, right? Their network predominantly is near... Their towers or near freeways. If you live near a freeway, you can get really awesome 5G 4K streaming.
If you live in the middle of Northern Saskatchewan, probably not. That capability... Or offshore in the North Sea, in the Baltics, anywhere. If you look at those circumstances and you're collecting telemetry, we're working on projects with NOAA, for their ocean buoys. Australia project is weather stations that are spread all over Australia, which is a sizable continent, which is not...
...clad with cell towers, right, like Canada.
Yeah, yeah, yeah.
Especially when you get out away from sort of large urban areas where cell coverage is really good, comms are really good. Of r any kind of operational technology modernization efforts, which is kind of the larger topic, what I'm kind of skirting up against, is now you can help customers collect the data, analyze it, certainly forensically record it for later when you do have some degree of upstream comms, but also, again, you use KSQL and a select list and a little bit of predicate. What you send upstream over your precious 3G five minutes a day kind of connection has a really high signal to noise ratio. There's hardly any noise and it's really high quality.
You're saying something like so I stick Kafka onto a Raspberry Pi with a battery pack and I've got this tiny little thing that's running Kafka in the field, perhaps attached to a soldier's backpack or a weather station out in the back beyond in Australia. That is real time signal doing real time signal analysis and sending up the important stuff over a precious connection, but also keeping the history of all the, like you're saying, flight data recording, a complete history if we need it.
This is really because again, the way I've pitched Kafka's core values is it's a flight data recorder. It is the ability to be resilient around comms in the cluster linking feature we really take full advantage of. Yeah, that's huge, because cluster linking is just a service layer on top of the broker kernel, right? It's very super clean, right? It's built on top of the replica code. All they had to do was add a thin service layer on top of code that's been there for a million years, so you're like, this is going to be pretty resilient out of the box. And then the third thing is the stream processing.
You have intelligent disconnected edge analytics with the flight data recorder forensic capability on a topic by topic basis, and you have the ability to do analytics, including machine doing if you really want to go that far down. Lots of organizations aren't ready to go that far down, but those three things, that's what we do out of the box. That's what Kafka has done out of the box all along. It's just sort of a really amazing combination that gives designers the ability to sort of twist it into the shape they need, whether they're in the classic big data center, or if they're in a ditch with a bag full of Pis slung into the back of a F-150 pickup truck. We don't care. We really don't care.
And then it's really up to sort of the imagination of what the customer wants to do and how aware they want to be, but we're like we're good. We can do this. Smart soldier project that I've been working on for the last year or so is putting Raspberry Pis on squad members, but with a mobile command post as their base of operations using cluster linking. When the squad members come back into the command post, they kind of just walk back in and the data that are set up for mirrored topics, not everything's mirrored, but the ones that are mirrored, nothing. Noninvasive.
You're syncing your mobile field Kafka topics back in whenever you get the chance into the main command hub.
That's really cool.
That's really cool, but the mobile command post and the squad leader and the squad members... Oh, I should mention that mobile command post isn't a Humvee or a Razor or a big thing. It could be the squad leader and her radio man could be a mobile command post and could have two or three Pis and maybe some GPU accelerated versions of hardware. Her radioman could do advanced squad leader mobile command post activities. Now, we're not there yet with our partners, but they brought that up and we're like, "Oh yeah. We totally... This does not break the architecture."
And then you look at anybody who have windmill farms in the North Sea or large infrastructure that's like in Northern Scotland where you want to be able to understand and be intelligent at the disconnected edge, then that's really where we really... I mean, this is a long way away from the origin use case at LinkedIn, obviously, but it's like that's how powerful Kafka is and how flexible the topic transaction layer capabilities are. To me, it's pretty incredible that you can twist this thing into a million shapes and dump it onto a Raspberry Pi. It does things that no other technology currently can do.
I don't know of anybody who's trying to like get Kafka-like technologies to run on a Pi in a ditch somewhere, but we're there. It works. You tell us when you're ready to go and we'll put your stuff in a ditch somewhere.
Do you think one day we could have like a plane running a central Kafka flying over the Australian Outback, flying over those weather stations and just syncing with it as it passes by?
That's a great topological use case. What we would do in those cases... The only drawback of the plane is the stupid thing has to fly at a certain speed so it doesn't fall. Large commercial drones is... Mostly because you can reduce the ground speed. You fly out into Western Australia, middle of nowhere, and there's a bunch of Pis on the ground. You want to fly out into that region of Western Australia. Mining operations in the Pilbara, right? [Inaudible 00:51:03]
... mining operations, right? In the [inaudible 00:51:03]. Right? Way out there where all the banded iron is. You fly out and you hang, right? And then you have a little more time. So you open the comms window a little longer. With fixed wing, they're just moving too damn fast, unfortunately. But with rotating wings, whether it's a old school helicopter, but a lot of people are moving to commercial drones. These aren't like little ... These are like commercial drones, right? They're proper. And they can just stop and hover at any altitude, and so the altitude also affects the signal quality of the comms, so if you [inaudible 00:51:41].
Well, they could just land nearby, right?
You could just land nearby, and then you could have like a ridiculous amount of ... You could essentially have a local wifi that's five gigahertz on the drone, and then whoosh. And then the topics that are mirrored, they just whoosh.
If you want to pull topics, then you'd have to write a microservice in order to do ... if you wanted to pull some of the forensic data. So I see the opportunistic valuable data that is topic mirrored goes at a higher priority versus a more sort of directed or explicit process that would actually pull the flight data recorder bulk, because you landed next to it, right? And so now you have the luxury of comms that aren't as precious.
Yeah, yeah. And does that form a chain when the big drone gets back to the central hub? Does that then cluster link sync its topics to the super central processor?
Yes. And so that can go to CC in a classic case, or it can go to a private cloud depending on how the agency has organized. So we think of squad member, squad leader, command post, battalion, and battalion might be big enough that it's their own data center, or it could be public cloud. And if you just turn that into the oil patch, oil and gas industry, right? You have well site, pipeline, refinery, and anywhere along the pipeline, and especially the refinery, there would be enough equipment to be able to have a proper ... They do anything they want there, but also they're going to have uplink comms that are pretty good, right? So up into CC, for instance.
So then again, we're trying to be flexible to say you go as far down range, including fully disconnected with no comms for weeks on end, as you like, and we can accommodate that topology. And as you come upstream, then you can do more processing because you've got a rack somewhere, or if you want to go all the way into CC, because you want to push it that and then do rep through CC across AZs globally, or you're in a point of presence in Australia and you need to push it up into an AZ, which then reflects across to Virginia. Again, that's kind of more of a government DOD use case, but you could see that this would apply to any global company that has global shipping, global energy, global utilities, global manufacturing. We haven't even really talked about supply chain that much on this session. But one of the things that we're starting to understand is smart soldier and gizmos on squad leaders. It's very sexy, sounds very cool. Could write a screenplay about it, but in those worlds and in most other worlds, it's the supply chain that matters.
So then what you do is you apply everything we've been talking about to make your supply chain more aware. So it's a different form of signature intelligence or signal intelligence, but without a supply chain, especially in DOD terms or MOD terms, no supply chain and you're out of business. It's just that simple. And so then it's not just ... That's where again, situational awareness or situational intelligence is what we've been talking about and what we can do all the way down to the disconnected edge. Operational intelligence, it doesn't just apply to making sure the pipeline pumps and relays are working okay. But it's like, okay, I think the pump or the solenoids or switches are going to fail soon. Okay. Order the parts.
So now the supply chain becomes operationally aware, and it's two sides of the same Kafka coin, right? From our chair, the topic definitions and what you do with those topics might be a little different, but everything we've been talking about once that topology is on the ground, I will often tell agency customers, you get situational awareness for free or operational awareness for free. It just depends on which one is the primary mission, but you're going to get one of them for free. And they're like, "What? Free?" And I'm like, we don't care. Right? We're just helping you be more aware of signals. And some of those signals are you need to order more diesel fuel because the generator's going to run out in seven days and it takes you nine days to get the fuel refreshed from wherever it's coming from, and now you know. And so the supply chain is just as important, and being operationally aware of the supply chain is just as important as being situationally aware of whatever you're trying to be aware of.
Yeah. Yeah. And it's like, I guess the new thing, it's not new that you can gather data at all those points and eventually wire them through different systems to get them back. But to have a single database that works at all those levels and can analyze all those different places and transparently just connect that mental and programming model all the way through the chain.
Because [inaudible 00:57:21].
That's pretty new.
Yeah. And so starting from the ditch, it's a key value store message written to a commit log.
Yeah, with the stream process.
You take that key value message as far up into CC. Like, even if you're doing the CC tiered storage forensic thing in the big three, right? It's a lot more useful to be able to start with the key value message and go end to end, as opposed to syncing it into an S3 bucket and then you have to read it back into topics and fool around. I keep an eye on sort of how we're doing with Azure tiered, which is now available, and it's available in AWS. And so that's an important part of my story, even though it appears that I'm nowhere near CC tiered feature set, but I am, because that could be the end of the forensic line that I initiated in a ditch or on the backpack of the radio man, of the squad leader all the way.
You're on a wind turbine somewhere in Holland or ...
Or if you're on top of a windmill out in the North Sea, right? It's like it starts there and you tell me where you want it to end, but we can keep it in a very compact ... And at any point, because they're just messages in a topic, if you all of a sudden wanted to treat them in an acid semantic way, don't care.
Yeah. Yeah. Right at the moment, I can't get hold of a Raspberry Pi for love or money, but eventually I'll be able to buy one when I can.
Yeah, supply chain. When I can, I'm going to do this. I'm going to wire one up to a battery pack and get Kafka on it and stick it in my backpack and just see what data I can gather out in the field and try some of this stuff.
One of the unique things about the ARM, obviously it's risk and it's fabs down on 10 nanometers, so it's super small and you can get four or eight cores of no power and it's great. The other thing is there's this GPIO bus. I'm not a Pi expert. I only started using them six months ago. I've been using ARM for since 2013, since 64 bit ARM came out. I had Spark and Hadoop running on ARM a long time ago, but that's kind of a weird thing. But the GPIO bus gives you access to like analog signals in a process control sense, right? And then that's really interesting. And I know there's these two weird instruments behind me, but these instruments are like eight bit microprocessor process control architectures. And so-
The people listening on the podcast, that's Oberheim synthesizer you've got behind you? Music synthesizer?
It's an Oberheim ... Yes. It's an Oberheim OBXA synthesizer and a Profit VS Vector synthesizer. They're 80s era analogs.
Classic [inaudible 01:00:19]. Musical instrument genre, yep.
Yep. And they're eight bit, two and a half megahertz. There's all kinds of A to D converters, D to A converters, sample and holds, analog switches. That world is exposed to you on the ARM GPIO interface. So now, you can have your ARM sort of understand the analog world around you. So then that's really important when you ave it on your backpack, because there could be analog signals that you want to read off your body, like your vital signs, for example.
Yeah, absolutely. We had-
It's a really powerful thing because it combines risk processing technology and sort of this classic PDP 11 live in the analog vector interrupt world. And that allows you to sort of have that capability. That's why ARM is kind of unique at the edge is because it's kind of architected, at least in the Pi format, architected to understand that the world is analog. And so being able to process analog and digital signals is an important architectural aspect of the Pi 4 platform from my chair, anyway.
Yeah. And to use the same processing model on a Raspberry Pi that you will then transparently use on something like Confluent Cloud, CC as you say, that's a very nice architecture. It makes me want to go away and build it.
And it's crazy, they're quad processors right now. And with eight gig of RAM, I'd recommend if you get one, get with eight. We could probably squeeze it into four, but it's good just to have a little bit of headroom, but I run a broker. I run a KSQL streams loop together in the working set size when I'm sort of just doing work at velocity line rate. It's about two and a half to three gigabytes of resident set size working set. The rest of it's page cache. And so at eight gig, you don't have to worry about it. You'd have to be a little more surgical if you had a four gig Pi. It's not impossible, but it's nice just to have the headroom, because then essentially you don't have to worry about running out of resident set memory and the rest of it's page cache. And that's like on a little thing this big. That's [inaudible 01:02:44].
If you can buy them, they're like 20 or $30, right?
Right now ,they're about ... Actually, they're a little more than a hundred in the eight gig format.
If you can find them, but like you said, finding an eight gig Pi right now seems to be hard to do due to supply chain. But these, I'm clocking this thing just doing like 1K messages, do 40 or 50,000 1K messages a second. [inaudible 01:03:12].
Perhaps more than businesses need.
Right, yeah. That's amazing.
And I'm saturating ... There's only a GigE interface. So I was saturating the GigE interface on the Pi, but it's ... You can produce and consume. This is where actually the extra six gig of page cache helps, right? Because you produce at like, it's about 60 megabytes, right? So yeah. It's around 50 to 60,000 messages per second. And then if they're hanging in the page cache then you can consume at the same speed on the other side of the duplex of the GigE link. So now you have like 50K in and 50K out. And if you hit the page cache window, then there's really no IO going on anyway. It's all SSD, so for the most part, the network interface is the bottleneck, not the storage of the processing.
Yeah. Yeah, of course.
You could grind it up if you got ambitious with KSQL. I'm not saying that there isn't code path there, but if you just wanted to go in and out really fast, that's an example. If you're not consuming and you're just probably producing at a more sane rate, which is probably more like a thousand a second, like, how many vital sign messages are coming off your body? Well, that's a function of the sample rate, right? And that capability, like a thousand messages a second, that's a lot of telemetry at the disconnected edge. And then now you have all that headroom back in the Pi to run some pretty interesting KSQL scripts.
Yeah, absolutely. Well, I think we should leave it there, but I look forward to seeing where else you take this in the real world, but now I most want to run into my own little private world.
This is a fun project. Jeff, thanks very much for joining us. That was a whirlwind tour of a very, very interesting and large space.
Well, we wanted to make it a good show. Yeah.
Yeah. Thank you very much.
You're quite welcome.
Right. With that, I think I need to go and troll the internet for a new Raspberry Pi. My one's way too old to manage that. But I quite fancy it. Bit of processing in the field, bit of processing in the cloud. Who knows what I might learn? If you know how to get hold of a Pi in the current market, or if you have any other thoughts, now's a great time to get in touch. Leave us a comment, a like, a star, a review, drop me a line on Twitter, whichever way you do it. It's always great to hear from you. For more information on Kafka, whether your use case is advanced like Jeff's or much simpler, head to developer.confluent.io where you'll find plenty of guides and courses to help you build a successful event system. And if you take one of those courses, you're going to need a Kafka cluster. Well, the easiest way to get one of those is to head to confluent.cloud, where you can be up and running in minutes. And with that, it remains for me to thank Jeff Needham for joining us and you for listening. I've been your host, Kris Jenkins, and I will catch you next time.
Imagine you can process and analyze real-time event streams for intelligence to mitigate cyber threats or keep soldiers constantly alerted to risks and precautions they should take based on events. In this episode, Jeffrey Needham (Senior Solutions Engineer, Advanced Technology Group, Confluent) shares use cases on how Apache Kafka® can be used for real-time signal processing to mitigate risk before it arises. He also explains the classic Kafka transactional processing defaults and the distinction between transactional and analytic processing.
Jeffrey is part of the customer solutions and innovations division (CSID), which involves designing event streaming platforms and innovations to improve productivity for organizations by pushing the envelope of Kafka for real-time signal processing.
What is signal intelligence? Jeffrey explains that it's not always affiliated with the military. Signal processing improves your operational or situational awareness by understanding the petabyte datasets of clickstream data, or the telemetry coming in from sensors, which could be the satellite or sensor arrays along a water pipeline. That is, bringing in event data from external sources to analyze, and then finding the pattern in the series of events to make informed decisions.
Conventional On-Line Analytical Processing (OLAP) or data warehouse platforms evolved out of the transaction processing model. However, when analytics or even AI processing is applied to any data set, these algorithms never look at a single column or row, but look for patterns within millions of rows of transactionally derived data. Transaction-centric solutions are designed to update and delete specific rows and columns in an “ACID” compliant manner, which makes them inefficient and usually unaffordable at scale because this capability is less critical when the analytic goal is to look for a pattern within millions or even billions of these rows.
Kafka was designed as a step forward from classic transaction processing technologies, which can also be configured in a way that’s optimized for signal processing high velocities of noisy or jittery data streams, in order to make sense, in real-time, of a dynamic, non-transactional environment.
With its immutable, write-append commit logs, Kafka functions as a flight data recorder, which remains resilient even when network communications, or COMMs, are poor or nonexistent. Jeffrey shares the disconnected edge project he has been working on—smart soldier, which runs Kafka on a Raspberry Pi and x64-based handhelds. These devices are ergonomically integrated on each squad member to provide real-time visibility into the soldiers’ activities or situations. COMMs permitting, the topic data is then mirrored upstream and aggregated at multiple tiers—mobile command post, battalion, HQ—to provide ever-increasing views of the entire battlefield, or whatever the sensor array is monitoring, including the all important supply chain. Jeffrey also shares a couple of other use cases on how Kafka can be used for signal intelligence, including cybersecurity and protecting national critical infrastructure.
If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.Email Us