In this week's Streaming Audio, we head over to Bordeaux to talk about hacking, both kinds of hacking, actually, the television meaning of hacking, where people worry about other people breaking into our networks, and then the kind of hacking that's much closer to my heart, which is the idea of building interesting things with computers just to learn, just for the joy of building stuff and seeing where it leads. Hacking, as in playing with technology to feed your brain and see if it grows into something much larger.
So my guest today is Géraud Dugé de Bernonville, who's been building out a network mapping and intrusion detection system using Apache Kafka, the graph database, Neo4j, and a bit of Google's TensorFlow for machine learning. And they're doing that to see if they can tap into the fire hose of network traffic data and just see what they can learn, see what they can build. Before we get started, Streaming Audio is brought to you by our education site, Confluent Developer. More about that at the end. But for now, I'm your host, Kris Jenkins. This is Streaming Audio. Let's get into it. Joining me today is Géraud Dugé de Bernonville. Géraud, how you doing?
Hey, fine.
Good to have you here. I have to say, as an Englishman, that is an epic name, very stately. I hope I'm not mangling it too badly.
In French, too, it's quite hard sometimes.
It sounds like you ought to own a lot of land.
No.
But you're actually ... you're not a rich French landowner. You're actually a software data wrangler for a company called Zenika, right?
Yeah, that's it.
Tell us a little bit about what you do.
Yeah. So I'm based in the Bordeaux area. At Zenika, we have a bunch of consultants who deliver technological, organizational, and managerial expertise to our customers. So we do a lot of stuff. We do data stuff, data science, data processing and so on, and also development, front to back. We do agile, so it's quite broad. But me, I'm more dedicated to data subjects, so many data projects in the Bordeaux area. So I do Kafka Connect integration and so on, but I also work a lot with Elasticsearch at the moment. And so we deliver expertise to our clients, but when we are not working for customers, we give training. So I give training on Kafka topics, Kafka deployment, architecture, administration, and development, and also on Elasticsearch. And I also give training related to machine learning and TensorFlow. A wide range of [inaudible 00:03:26].
A full suite of services. Funnily enough, I used to work for that kind of software consultancy agency that was based out of Paris. Is it just a coincidence or are there a lot of those sorts of houses in France?
Yeah. So Zenika, we have about 10 agencies in France. So we have one in Paris and we are also present in Canada, Singapore, and Morocco.
Oh, okay. Interesting.
A more worldwide enterprise.
With a slight leaning towards places that speak French, by the sound of it.
Yes. Singapore is in English.
Yeah.
Other places, yes, more [inaudible 00:04:17].
Isn't there some French in Morocco? Let's not get into the geography. That's really not my strongest subject. Do you have a typical kind of client or is it just anyone that needs software?
No, we address different clients from different business cases. So we work with healthcare, ecommerce, and so on. So we have a wide range of customers, different customer types.
Okay, fair enough. Sometimes those agencies specialize and sometimes they're just who needs software, which is just about everybody these days, right?
Yeah. We work with whoever needs software.
Good target market because it's everyone. So getting more specific, so one of the reasons I wanted to have you on is you were one of our finalists in the Confluent Hackathon and you had an interesting project that I'm hoping you'll take me through. So give me the high-level view and then we'll get into some nitty gritty detail.
Yeah. So our project is called Ziem in French. It's a pun on Ziem and Zenika. And it's mainly an internal project, more or less research and development. So we have a few consultants who work on this. And the goal of this project is to provide real-time intrusion detection. So in the process, we will collect data from network traffic, process that data, and try to detect a potential attack from a hacker or bad actor.
That instantly screams real time, large data to me because, network traffic, that's a lot of data to come in and you have to find out if someone's intruded soon rather than at the end of the month, right?
Yeah, real time. So that was the main objective of the project. This is still a work-in-progress project. We still have a lot of work to do on this, but for the moment, we managed to build a simple infrastructure that we can iterate on and improve over time.
So take us through that. What is the structure?
So the structure, in fact, this project comes from an old idea. So three or four years ago, we gave a conference on ksqlDB.
Oh, okay.
Sorry. The context was network traffic analysis. And last year, we decided to go further on this topic. So we decided to create a project around this conference, around this topic, and it started at the end of last year. So we started by reusing what we had done in that first presentation. So the infrastructure is the first part, there is a first big part, which is to create a sandbox environment where we can try to perform some attacks, collect the data, and then create datasets to train machine learning models on, and so on.
Oh, okay.
So creating this sandbox was a big part, and this sandbox is... Can I go into technical detail?
Oh, yeah. Anything that doesn't need a diagram, you go for it.
Okay. So this part, the sandbox, is deployed in GCP, in Google Cloud.
Okay.
We use Terraform to deploy this stuff. And so the infrastructure can be made of several dozen virtual machines that share the same network, and we can then connect to one of those VMs and try to perform attacks.
Right, yeah.
So this is a playground that we, as consultants, can play with and have-
Safely try and break things.
Yeah, yeah. That's it. And on this sandbox we have plugged in an agent that collects the network traffic, and then that data is sent to a Kafka cluster.
Is that like a custom agent or is it just something built into...
For the moment, it's mainly... It's based on TShark, which is Wireshark but in CLI, command-line mode. And-
So just a very standard network sniffer?
Yeah, yeah.
Yeah.
That's it. So it's based on TShark, and we use kcat to send the output to Kafka.
And you're just dumping all that raw data straight into a topic?
Yeah. At the moment, we take all the data, we take everything.
Okay.
So this will be in the next part where we'll try to improve the data to see if we need everything or not.
But that's a very standard approach with Kafka. "I've got this hose of data, I just dump it into a topic and then worry about it," right?
Yeah, yeah. That's it. That's why Kafka was a good solution for us, because it can ingest a large amount of data.
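For a concrete picture of that ingestion step: the project pipes TShark's output into kcat, but a minimal Java producer doing roughly the same job might look like the sketch below. The topic name, broker address, and class name are assumptions for illustration, not details from the project.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Reads one JSON-encoded packet per line from stdin (e.g. piped from TShark)
// and forwards each line, untouched, to a Kafka topic.
public class PacketForwarder {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                      // hypothetical broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isBlank()) continue;                                   // skip empty lines
                producer.send(new ProducerRecord<>("network-packets", line));   // topic name is an assumption
            }
            producer.flush();
        }
    }
}
```

Piping the capture tool's output into a program like this would play the same role as the TShark-into-kcat pipeline Géraud describes: get the raw data into a topic first, decide what to keep later.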
And the other point that made us use Kafka is ksqlDB, which is a way to easily implement processing tasks.
And also with the integration of Kafka Connect, we can easily extract data to other systems. So that was the point of the Hackathon.
Right. So your pipeline is command line sniffing network packets, sending it to a Kafka topic. And then I did have a quick look at your source code, actually. So you are dumping just JSON packets as events into the topic and then you're using KSQL to reach into that JSON and massage it for later processing, right?
Yep, that's it.
Yeah. Okay. That makes sense. And it also makes a lot of sense that you're using that because you're grabbing everything because you don't yet know what patterns you're looking for, so just grab all the data and process it.
So what do you do with it once it's in the topic, once you've... How do you massage it with KSQL and what are you trying to get to?
Yeah, so for the moment, before the Hackathon, there was only one use case for ksqlDB. It was mainly to prepare data and export it for training some machine learning models.
And then we plan to iterate on this stream processing to better format the data for machine learning.
Right.
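The team does this shaping in ksqlDB; purely as a rough Java-flavored equivalent (not what they actually run), a small Kafka Streams topology could pull a couple of fields out of each raw packet event. The topic names, JSON paths, and extracted attributes below are all hypothetical.

```java
import java.util.Properties;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

// Reads raw packet JSON from one topic, keeps only a couple of fields,
// and writes the slimmed-down events to a second topic.
public class PacketShaper {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "packet-shaper");        // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // hypothetical broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);

        ObjectMapper mapper = new ObjectMapper();
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("network-packets");        // topic name is an assumption

        raw.mapValues(value -> {
            try {
                JsonNode pkt = mapper.readTree(value);
                // These paths are illustrative; real TShark output nests its fields differently.
                String src = pkt.path("ip").path("src").asText("unknown");
                String dst = pkt.path("ip").path("dst").asText("unknown");
                return String.format("{\"src\":\"%s\",\"dst\":\"%s\"}", src, dst);
            } catch (Exception e) {
                return null;                                                     // drop unparsable events
            }
        })
        .filter((key, value) -> value != null)
        .to("network-packets-shaped");                                           // topic name is an assumption

        new KafkaStreams(builder.build(), props).start();
    }
}
```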
And then we saw the contest, the Hackathon, and we thought about how we could implement something that looks sexy, quite quickly.
So we had the data, we knew how to collect it, we had data coming into Kafka, and the best way to represent data like this, network traffic, is a graph. So the natural solution for us was to use Neo4j.
Neo4j, yeah. The graph database. Yeah.
Yeah, in the first step. So with ksqlDB, it was very easy to create a sink connector to Neo4j that takes data from a transformed topic to the graph database.
Okay.
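The team wires this up with a sink connector rather than hand-written code, but to illustrate the kind of graph that ends up in Neo4j, a few lines with the Neo4j Java driver could write an equivalent structure: one node per host, one relationship per observed flow. The URI, credentials, labels, and property names here are assumptions.

```java
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

// Records one observed packet flow as two Host nodes joined by a
// CONNECTED_TO relationship whose packet counter grows over time.
public class FlowGraphWriter {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",       // hypothetical instance
                                                  AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            session.run(
                "MERGE (src:Host {ip: $src}) " +
                "MERGE (dst:Host {ip: $dst}) " +
                "MERGE (src)-[r:CONNECTED_TO]->(dst) " +
                "ON CREATE SET r.packets = 1 " +
                "ON MATCH SET r.packets = r.packets + 1",
                Values.parameters("src", "10.0.0.12", "dst", "10.0.0.34"));      // example addresses
        }
    }
}
```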
Then the first idea was to just present the data in the default Neo4j UI. That was our initial idea for participating in the Hackathon.
But after looking a while, the visualization proposed by Neo4j at the time was not quite satisfying, so we decided to implement a client based on an already existing JavaScript library, which is Neovis.js.
Oh, okay. NeoVis? So that's a dedicated JavaScript library for visualizing Neo4j data.
Yeah, that's it.
I've not used that one. Okay.
So the screenshot you see on the GitHub repository shows that, the result of our UI.
I saw that and it's almost an instant network diagram of who's actually on your network and who's internal and external. And I thought before we even get into the hacking and machine learning, that would be useful for a lot of organizations that don't know their own internet and intranet infrastructure, right?
Yeah, sure. That was the point when we implemented this, because we thought for ourselves, this is already a good thing to have, this visualization. And also, yes, it was in fact already in the roadmap of the project to have a better visualization. In fact, we implemented this UI very quickly, it took only a few days. Once we had the whole infrastructure and the data coming into Kafka, it was very easy to implement the step to extract to Neo4j and then plug an application in on it.
Do you get things like a sense, from your visualization, of how much traffic is flowing between which nodes? Is it weighted at this stage?
It's in the V2... Because during the summer, we continued to work on it and we improved the UI. And now, we can display the traffic flow, the volume, the count of packets that go from one point to another.
Nice. Because that's ... I've worked at some banks where their network diagram is a Word document that's, at best, two months out of date. They don't know what they actually have on their network. And if you could just go from a script to that kind of visualization that you knew was live and real, that's actually really cool just in itself.
Yeah, sure. But yeah, there are some limitations.
Oh, tell me.
Yeah. Under clearly data-
Terms and conditions.
Yeah. Because at the moment, currently in our sandbox, there is only one network, so there is no VLAN and so on. So in fact, if you want to scrape the data from different VLANs or different networks, you'll have to deploy at least one agent on each local network.
Oh, right. Yeah. So one agent is enough to collect the local LAN traffic.
Yeah.
That doesn't seem too onerous. That doesn't seem like too high a burden, to make something that's really quite large. Yeah, that would be very cool. So getting beyond that, tell me something about your machine learning plans with this, because that sounds really cool.
Yeah, the really cool part. So for the moment, the first step was to perform unsupervised machine learning, mainly anomaly detection. So it is to train some models and then to see if the predictions are too different from what we collect at the moment, to see if we detect anomalies, if the traffic corresponds to an anomaly. So it's mainly based on packet counts.
Okay.
If you have a very high count of packets during a small period, it's suspicious.
Suspicious. Yeah. So it could be hacking, it could be some new process creating a lot of network load. But it's probably something you want to know about either way.
Yeah. That's the first point of the project, to start with unsupervised. And then the next step, we are not there yet, is to use unsupervised detection to help us better put labels on our dataset for training. And the goal after that is to go to supervised machine learning, to better categorize the different types of attacks. So to detect if an attack is a port scan or if it's an attempted DDoS in [inaudible 00:20:07] service.
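As a toy illustration of that packet-count idea (not the team's actual model), a sketch like the one below flags a time window whose packet count sits far above the recent average. The history size, the threshold, and the sample counts are arbitrary assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy anomaly detector: keeps packet counts for recent windows and flags a
// new window whose count is far above the historical mean.
public class PacketCountAnomaly {
    private final Deque<Long> history = new ArrayDeque<>();
    private final int maxHistory;
    private final double threshold; // number of standard deviations above the mean

    public PacketCountAnomaly(int maxHistory, double threshold) {
        this.maxHistory = maxHistory;
        this.threshold = threshold;
    }

    /** Returns true if this window's packet count looks anomalous. */
    public boolean observe(long packetsInWindow) {
        boolean anomalous = false;
        if (history.size() >= 10) { // need some history before judging
            double mean = history.stream().mapToLong(Long::longValue).average().orElse(0);
            double variance = history.stream()
                    .mapToDouble(c -> (c - mean) * (c - mean))
                    .average().orElse(0);
            double stddev = Math.sqrt(variance);
            anomalous = packetsInWindow > mean + threshold * Math.max(stddev, 1.0);
        }
        history.addLast(packetsInWindow);
        if (history.size() > maxHistory) history.removeFirst();
        return anomalous;
    }

    public static void main(String[] args) {
        PacketCountAnomaly detector = new PacketCountAnomaly(100, 3.0);
        long[] counts = {120, 95, 110, 105, 130, 100, 98, 115, 108, 102, 5000};
        for (long c : counts) {
            System.out.println(c + " packets -> " + (detector.observe(c) ? "SUSPICIOUS" : "ok"));
        }
    }
}
```

A trained model would replace the running-mean rule, but the shape of the question is the same: does this window's traffic look like the traffic we've seen before?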
Right, so let me just check I remember this correctly. Supervised machine learning is where you give it some examples that are labeled with, "This is an example of something that's an attack. This is an example of something that's not."
Yeah.
And unsupervised is where you give it no clues. You just give it data and say what patterns can you spot?
Yeah. That's it.
Yeah. Okay. Cool. So I learned something on that machine learning course I took. So do you have in-house security experts who are performing new attacks or are you not at that stage yet?
So in fact at Zenika, we have a team specializing in security. And for the first steps of the project, in fact, we used a training that is provided by this team, to try to reproduce some attacks by ourselves, on our sandbox.
Oh, okay. I have to ask: You're running this on GCP. Do you risk running afoul of Google detecting your tests as actual attacks and shutting you down, because presumably they're doing that kind of proactive monitoring?
Yeah. So in fact, we are not using directly the network layer provided by GCP. We have implemented a virtual layer on top of the network. In fact, all the VMs share the same virtual network, where we authorize some stuff that GCP doesn't.
Okay. That sounds like you'll be safe.
So for the moment, we haven't had any issues with Google.
If anyone from the GCP team is listening to this, please leave them alone. They're safe.
It's for research and development.
Yes. It's all white hat hacking?
Yeah.
So is this something that's going to lead directly into client work, or is it just purely research and learning?
That is the purpose of the project too. It is, at first internally, to try to build a solution around this problem of network security, and to have consultants who can work on it and gain experience. And when we're more experienced, it is to industrialize this solution. And why not propose the solution to customers?
Yeah, because I can see you could just go straight in with that as network visualization and security detection. But it's also a kind of pipeline that's got to be reusable in other industries? We've got a ton of data coming in and we need some pipeline that can handle visualizing it, learning from it, detecting patterns from it. It must be the word Bordeaux in my head, but I instantly think how this could change the wine making industry. They have a lot of things to monitor and they need to know quickly when something unusual is happening. I guess sometimes you can just look out the window, but not always.
Yeah, sure. I think network security is just one use case. When we are more experienced with all of it, how we implement the pipeline and so on, collecting data, processing, training models and so on, this is a method that can be reused in other use cases.
Yeah. One other thing, one last thing I wanted to ask you was, I've not actually used TensorFlow in [inaudible 00:24:35], and I really want to. So give me a crash course in getting started. If I had some network connection packet traffic data, what would I actually do?
The purpose is to use TensorFlow to train a model on the network data. TensorFlow is a very famous library for training neural networks. A neural network is made of several layers of cells that try to learn from the incoming data. There are a lot of hyperparameters to tune, the number of cells, the number of layers, how each layer connects to the others. That's big work for a data scientist. And the output of a training is a model that we can reuse to perform predictions on fresh data that is coming into Kafka. That's the goal of the project too. How can we call the model? In fact, in ksqlDB, you can use user-defined functions.
Ah, yeah.
They enrich the KSQL language. The idea is to define a user-defined function that will call the model to perform predictions.
Oh, so you're going to inline some kind of predict function into your SQL query?
Yeah.
Okay. Because you define user-defined functions in Java? So are there Java hooks into TensorFlow?
Yeah.
Okay. I thought TensorFlow was mostly Python. Am I wrong there?
Yeah, but you can still load models.
Okay. Okay.
Or maybe it can call an API on a remote TensorFlow server.
Oh, that would be cool. So that's how you get back to the real-time, per-event prediction.
Yeah.
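To make that concrete, a ksqlDB UDF wrapping a TensorFlow SavedModel might look roughly like the sketch below. This is only an illustration: the model path, the tensor names, the single packet-count feature, and the use of the older libtensorflow Java API are assumptions rather than details from the project.

```java
import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;
import io.confluent.ksql.function.udf.UdfParameter;
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Tensor;

// Sketch of a ksqlDB UDF that scores an event with a TensorFlow SavedModel.
// Model path, tensor names, and the single-feature input are assumptions.
@UdfDescription(name = "anomaly_score", description = "Scores a packet count with a TensorFlow model")
public class AnomalyScoreUdf {

    // Load the exported model once; "serve" is the standard SavedModel tag.
    private final SavedModelBundle model =
            SavedModelBundle.load("/opt/models/ziem-anomaly", "serve");  // hypothetical path

    @Udf(description = "Returns the model's anomaly score for a packet count")
    public double anomalyScore(@UdfParameter(value = "packet_count") final double packetCount) {
        try (Tensor<?> input = Tensor.create(new float[][]{{(float) packetCount}});
             Tensor<?> output = model.session().runner()
                     .feed("serving_default_input:0", input)   // input tensor name is an assumption
                     .fetch("StatefulPartitionedCall:0")       // output tensor name is an assumption
                     .run().get(0)) {
            float[][] score = new float[1][1];
            output.copyTo(score);
            return score[0][0];
        }
    }
}
```

Once a UDF like this is registered, a ksqlDB query could call ANOMALY_SCORE(packet_count) directly in its SELECT list (stream and column names hypothetical) to attach a score to each incoming event.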
Nice. Okay. I'm going to have to try that out. I'm going to have to find time soon to play with that kind of thing.
Yeah. But it's still in progress, so there's still a lot to improve on the whole thing.
Well, I don't think we have any rules against you resubmitting to next year's Confluent Hackathon, so maybe we'll see how the project has evolved by then. Right?
Yeah.
Géraud, thank you very much for coming and talking to us. That's a really interesting project.
Yeah. Thank you for your invitation.
Cheers. We'll see you again. Bye.
Yeah, bye.
Thank you, Géraud. I'm going to confess to you here, I'm a little bit jealous of him. That combination of big data and machine learning stuff is something I've never really found time to play with, and I would love to. If my employers are listening, please give me time to play with these things. In return, I'll remind our listeners that Streaming Audio is brought to you by Confluent Developer, which is our technical site that teaches you everything you need to know about Apache Kafka, real-time systems, and event systems in general. We've got tutorials, courses, architectural guides, and of course the back catalog of this very podcast. So take a look at developer.confluent.io. In the meantime, if you want to run your own Apache Kafka cluster, get it up and running easily, and leave all the management to us, take a look at our cloud service, Confluent Cloud. You can sign up in minutes and have Kafka running reliably in no time at all. If you add the code PODCAST100 to your account, you'll get a bit of extra free credit to run with.
And with that, it just remains for me to thank Géraud Dugé de Bernonville for joining us, to make one last apology for mangling his name, as I'm sure I have with my British accent, and to thank you for listening. I've been your host, Kris Jenkins, and I will catch you next time.
Can we use machine learning to detect security threats in real time? As organizations increasingly rely on distributed systems, it is becoming more important to analyze the traffic that passes through those systems quickly. Confluent Hackathon ’22 finalist Géraud Dugé de Bernonville (Data Consultant, Zenika Bordeaux) shares how his team used TensorFlow (machine learning) and Neo4j (graph database) to analyze network traffic data and detect intrusions in real time. What started as a research and development exercise turned into Ziem, a full-blown internal project using ksqlDB to manipulate, export, and visualize data from Apache Kafka®.
Géraud and his team noticed that large amounts of data passed through their network, and they were curious to see if they could detect threats as they happened. As a hackathon project, they built Ziem, a network mapping and intrusion detection platform that quickly generates network diagrams. Using Kafka, the system captures network packets, processes the data in ksqlDB, and uses a Neo4j sink connector to send it to a Neo4j instance. Using the Neo4j browser, users can see instant network diagrams showing who's on the network, allowing them to detect anomalies in real time.
The Ziem project was initially conceived as an experiment to explore the potential of using Kafka for data processing and manipulation. However, it soon became apparent that there was great potential for broader applications (banking, security, etc.). As a result, the focus shifted to developing a tool for exporting data from Kafka, which is helpful in transforming data for deeper analysis, moving it from one database to another, or creating powerful visualizations.
Géraud goes on to talk about how the success of this project has helped them better understand the potential of using Kafka for data processing. Zenika plans to continue working to build a pipeline that can handle more robust visualizations, expose more learning opportunities, and detect patterns.