Joining me today on Streaming Audio is Fred Patton, who's part of SwimOS, which is an interesting project that blurs the line between a distributed computing network, a stream processing layer, and a live user interface framework. It's large and it's ambitious. And when I first heard about it, I could tell it was interesting, but I couldn't quite pin down exactly what it was. So we asked Fred to come in and take us through it. Along the way we go from taking inspiration from Erlang and Scala, to rewrites into Java and Rust for performance, to TypeScript programs talking to a mesh of agents out on the web to build up a user interface. I don't think I can do it justice in one minute, but Fred does a great job of outlining the project in 30. As ever, Streaming Audio is brought to you by our education site, Confluent Developer; more about that at the end. But for now, I'm your host, Kris Jenkins. This is Streaming Audio. Let's get into it. My guest today is Fred Patton. Fred, welcome to Streaming Audio.
Hey, thank you very much. Very happy to be here.
It's good to have you. You're going to clarify something in my mind for me, I think, today. Because you are, let me get this right, you're a developer evangelist at Swim Incorporated.
And I was looking at Swim. My first impression was I was blown away by the ambition of it. My second was I couldn't quite get my head around exactly what it is, because it reminds me of Kafka. It reminds me of Erlang and Scala's Akka Streams. It reminds me of a distributed web scale operating system. What is it, Fred?
Actually, when you mention Erlang and you mentioned Akka, that's a good start because, at its core, it is a distributed actor system. It's not the same. It takes a departure because it's really about real time streaming. So it's not so much about general distributed computing where you're really tying together certain services and you've got all the supervision and those type of things. It's really about the real time context. So what do we mean by streaming applications first of all? We might start there because that can mean different things to different people. And in this case, we're not really talking about video, we're really talking about data and information. And so when you've got stream processing, you've really got an overarching computation that's going across a lot of nodes and workers. You can think of it like laying down railroad tracks. Everything has to be in service of this computation to work at scale and you use certain operators, and you funnel everything together so you can basically solve a computation, which is generally like an aggregation or summation, that type of thing.
For streaming applications on the other hand, it's really got to be general purpose. Each application needs a certain degree of autonomy and you can't prescript everything. And the other factor is, so I guess I'll back up a little bit. If we think about something like the Lambda architecture, where you have the speed layer, you can think of streaming applications in that aspect. You've got all the other state being serialized, going to different places, doing the different batch processing, getting stored at the end of the day. But if you really need some real time insights and streaming throughout, you've got those other things sorted out. So really what you want here is basically nonstop streaming. You don't want to do a stop-the-world GC pause or do anything with disk. You've got to keep things flowing. And one of the differentiating factors of Swim is that it goes all the way to the client. It goes all the way to the UI. And actually, if you imagine connecting a fire hose to a browser, that's a recipe for disaster. And so Swim does a lot of things to be able to handle that, which we can get more into.
Okay. So let's pin that down a bit more, because I know a few ways to do stream processing. Kafka streams is obviously one we're concerned with locally here on Streaming Audio. So I'm getting the sense from you that the first big difference is each process might be running in a very distributed way, at a completely different part of the web. Is that fair to say?
That is fair to say, but they also might be running locally as per normal. You might have a bunch of similar nodes on a rack, a bunch of different processes on a particular machine. We do very lightweight, what we call, stateful microservices. You can think of them as actors. And so one thing that we do: back pressure is typically done on a connection basis, between the consumer and the producer. And for Swim to work at scale and to be able to go to browsers and stuff like that, we have to handle that across the endpoint. Any real time UI is probably going to be consuming many streams. And so we basically handle the [inaudible 00:06:00] so that everything flows together. We don't have any batching, since we can't really afford to delay, but then we amortize that across all the different streams that a particular endpoint might be consuming or pushing out, if that makes sense.
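One way to picture the endpoint-level backpressure Fred describes is a coalescer that keeps only the latest value per stream and flushes them together at whatever pace the endpoint can sustain. This is a minimal sketch of the idea only; the names and API are illustrative, not Swim's actual implementation.

```typescript
// Sketch: endpoint-level backpressure. Rather than queuing every message per
// connection, keep only the latest value per stream and deliver them together
// at the consumer's own pace. Illustrative names, not Swim's API.
type Update = { streamId: string; value: unknown };

class EndpointCoalescer {
  // Latest value per stream; an unsent older value is simply overwritten.
  private pending = new Map<string, unknown>();

  push(update: Update): void {
    this.pending.set(update.streamId, update.value);
  }

  // Called at the endpoint's rate (e.g. once per animation frame in a UI).
  flush(): Update[] {
    const batch: Update[] = [];
    this.pending.forEach((value, streamId) => batch.push({ streamId, value }));
    this.pending.clear();
    return batch;
  }
}
```

However many raw messages arrive between flushes, one flush delivers at most one update per stream, which is the amortization across streams Fred mentions.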
So pin that down for me. Give me a use case that kind of processing might happen and we can talk about the specific data.
If you do typical stream processing, you're basically materializing a certain view. So you want a certain answer. And so you're processing that and you're pushing that stream of data to a particular consumer. And this is all the answers here. For our case, there's going to be lots of different sources pushing the data as soon as possible, whenever things are changing. So we're fine-grained on an entity level. So as things change, it could be a traffic light, it could be an IoT machine, whatever bit of data. As soon as they change, all the interested subscribers are basically getting those streams nonstop from the different aspects. And so what we do is handle the back pressure on the level of all those streams coming into a certain destination, as opposed to individually.
Okay. So let's say, traffic's an interesting one. So I've got cameras detecting the flow of traffic at certain junctions. And presumably that's raw. That's just, I saw a car, I saw another car.
That's streaming in. And then you're trying to aggregate that into a local picture of how much traffic is in that area.
That's a good point, because what we do is exactly that. A lot of that data is really just going to keep telling you, I see a car, I see a car, I see a car, or whatever. And what you care about from a business standpoint is: is it open? Is it busy? When will the light change? And so we're all about handling all the deltas and the information bits at scale. So we don't want to send all of that data like that. We want to basically send the information, the particular deltas, predictions.
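The raw-events-to-deltas idea can be sketched as an agent that derives a business-level state and only emits when that state actually changes. Everything here, the threshold, the statuses, the class name, is an illustrative assumption, not Swim's real model.

```typescript
// Sketch: collapse a raw "I see a car" event stream into business-level
// deltas ("open" / "busy"). The threshold of 5 cars is an arbitrary
// illustrative choice.
type Status = "open" | "busy";

class IntersectionAgent {
  private carsInWindow = 0;
  private lastStatus: Status | null = null;

  // Returns a delta only when the derived status changes, otherwise null.
  onCarSeen(): Status | null {
    this.carsInWindow += 1;
    const status: Status = this.carsInWindow >= 5 ? "busy" : "open";
    if (status === this.lastStatus) return null; // nothing meaningful changed
    this.lastStatus = status;
    return status;
  }
}
```

Seven raw events collapse to two deltas, which is the bandwidth saving Fred is pointing at.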
So would you expect to have some degree of processing running at or near the specific traffic light camera?
Typically, yes. It doesn't have to be the case, but often it is. And in that case, a lot of that data gets thrown out. And so you don't stream all of that up into the edge or the cloud. That's the ideal case. We run things on Raspberry Pis and other types of things for those purposes.
And is some part of the Swim operating system managing the state of all these different, what should I call them, actors around the network?
Yeah. We like to call them stateful microservices these days.
Okay. Fair enough.
But I'll go with either, and yes. So it's managing basically a massive view. Based on all the consumption patterns, you can basically think of certain graphs that have to happen. And then as a meaningful update happens in any of the sources, those get propagated along. And so, based on your processing needs and what has changed, you're just subscribing to all the relevant data.
So I'm building up a processing graph in some programmatic fashion.
Yes. That can be static or ad hoc real time. You don't have to do it ahead of time.
Okay. So what part does Swim do for me? Because I could do that with RPC calls between different Java nodes around the world. What's Swim bringing to that structure?
Yeah. Swim brings endless streaming. So everything is a subscription. And after that there's no request, there's no polling. Data just flows unchecked. There is only really the handling of the back pressure to not overwhelm any of the sources. So there's no querying of a database or internal state. There's no making request. You basically just link to the information that you need and the data constantly flows there at an acceptable rate.
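The "everything is a subscription" idea, link once, get the current state immediately, then receive every change with no polling, can be sketched like this. The `ValueSource`/`link` names are illustrative stand-ins, not Swim's actual API.

```typescript
// Sketch of "link" semantics: a subscriber links once, immediately sees the
// current state, and then receives every subsequent change. No request loop,
// no polling. Illustrative names only.
class ValueSource<T> {
  private listeners = new Set<(value: T) => void>();

  constructor(private current: T) {}

  // Linking delivers the latest state right away, then all future updates.
  link(onEvent: (value: T) => void): () => void {
    this.listeners.add(onEvent);
    onEvent(this.current);
    return () => this.listeners.delete(onEvent); // call to unlink
  }

  set(value: T): void {
    this.current = value;
    this.listeners.forEach(l => l(value));
  }
}
```

The contrast with request/response is that the consumer never asks again; data simply flows until the link is dropped.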
So it's building up a graph of published subscribe connections between different nodes.
Okay. And is there a management section to that? So putting on my Erlang hat, my very tiny little Erlang hat-
Love Erlang. Yes.
Yeah. I haven't spent enough time with it, but I find the ideas in it very interesting. I guess the first thing you're going to want is what happened if one of those nodes crashes. Is there some means to monitor and ensure the health of the graph that you're building up?
There is a management layer for connections dropping, especially since the streaming should happen nonstop, so those will just automatically get reloaded, even rebalanced. It might need to be on a different node and you might need duplicate copies. So there is a layer that basically handles the different processing components and monitors them for availability. And we even have things where we can actually store that stuff on disk, when you really want to, as a speed-layer type approach. Any data that you're feeding to us, you're probably also feeding to your data warehouse or whatever if you don't want to lose anything; you're already doing that anyway. So we just try to focus on the live real time streaming with the idea that you really only care about information within a certain time window where you want to take action. And so we don't worry about all the agent history, because you're going to query that and you can already batch process and do all that type of stuff otherwise.
So for back pressure, in the extreme case, you're dropping data. And just saying, if it's too old, we don't consider it anymore.
Generally there's a certain look back, which makes sense for the application. It's application defined. So basically how much you want. So we don't try to assume how much data will be kept, but just general use cases to date are really about being able to act in real time. So whether it's a user or an automated process.
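The application-defined look-back Fred describes amounts to a time window where anything older than the window simply ages out. A minimal sketch, with the window size passed in by the application as he says (the class and method names are assumptions):

```typescript
// Sketch: an application-defined look-back window. Entries older than the
// window are pruned automatically; nothing is kept "forever" because history
// lives in the warehouse, not here. Illustrative names only.
class LookbackWindow<T> {
  private entries: { at: number; value: T }[] = [];

  constructor(private windowMs: number) {}

  add(value: T, now: number): void {
    this.entries.push({ at: now, value });
    this.prune(now);
  }

  values(now: number): T[] {
    this.prune(now);
    return this.entries.map(e => e.value);
  }

  private prune(now: number): void {
    const cutoff = now - this.windowMs;
    this.entries = this.entries.filter(e => e.at >= cutoff);
  }
}
```

Passing `now` explicitly keeps the sketch deterministic; a real implementation would read a clock.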
There's always this debate over exactly what real time is and hard and soft real time. But I think we can generally all agree that newer is more valuable and we work from there, right?
Yeah. Because once it's been consumed, if you send a message to someone and then they can decide if they want to take action or not, then they probably don't need anything. They don't need it anymore. But if there's certain algorithms running where you do have some back window that you're using, then of course you need a certain amount of data. That recent data is still valuable, but once it's not needed for the decisions that you're making and the processing that you're making, there's really no need to keep it.
Okay, I can see that point. So what's it like to program with when I'm writing code? Is it Java? Is it Java only?
Yes it is. It's TypeScript on the client side and we are well on our way with our Rust rewrite. So that should be coming soon.
Okay. Yeah, absolutely. Don't commit to delivery dates on this podcast. It would be asking too much of anyone.
So with Rust in there, if you want to do Python and other things, that becomes much more palatable. Initially everything was Scala. And we did actually use Akka initially, but scalability was a huge issue since our models are, again, different. This idea of exactly-once processing, a lot of those types of needs are very specific. We're not really concerned about storing things for the long term in optimal patterns. We're really about getting all the relevant context in one place and streaming it, and then just streaming the updates.
So that brings me to another of my questions: bringing all the relevant data to one place. So what do you do about joining disparate streams of data?
So you can think of them as streaming joins. What we call ourselves is entity parallel. So entities are different things to different people. Your group might care about certain fields or attributes of a general entity, while another group might look at it a little bit differently, might need data from other places as part of that join. So these are really two related but different entities. And so it's basically the meaningful object that you want to consume. So if you think of a query where you're doing your joins and everything, that resulting thing that you got is really a sort of entity.
So it might be that I consider my entity to be a specific traffic light ID, or I might be planning for a whole state and it's just like a block, a block level join of traffic lights.
Or it might be okay at this intersection. And then part of this intersection, it might include its nearest neighbors. So that as the previous light is changing and something's happening there, that next light will want to know type of thing. So you might have little neighborhood clusters like that.
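The neighborhood-cluster join Fred sketches, an intersection entity that carries its own state plus a live view of its linked neighbors, might look like this. It is a sketch of the concept only; the types and method names are illustrative assumptions.

```typescript
// Sketch of an "entity parallel" streaming join: an intersection links to its
// nearest neighbors and maintains a joined view of its own state plus theirs.
// Illustrative names, not Swim's API.
type LightState = "red" | "green";

class IntersectionView {
  private neighbors = new Map<string, LightState>();

  constructor(public id: string, public own: LightState) {}

  // Called whenever a linked neighbor's light changes.
  onNeighborUpdate(neighborId: string, state: LightState): void {
    this.neighbors.set(neighborId, state);
  }

  // The joined entity: this intersection in the context of its cluster.
  snapshot(): { id: string; own: LightState; neighbors: Record<string, LightState> } {
    const neighbors: Record<string, LightState> = {};
    this.neighbors.forEach((state, id) => { neighbors[id] = state; });
    return { id: this.id, own: this.own, neighbors };
  }
}
```

The join is incremental: each neighbor update revises the view in place rather than re-running a query.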
And are there restrictions on those kinds of joins? Because I know we think about that a lot in things like Kafka streams, what kind of joins make sense for streaming data? So what's your thinking around that? Is it easier because you're allowed to drop older data?
It's easier from that standpoint. For instance, a lot of times when you're doing stream processing and you have different things, they're really isolated. You're really doing a lot of redundant work setting all of these things up. In this case, because there's basically a grid of web agents, as connections are made you're always optimizing for the routes that are being used. If you stop requesting something, then you don't need to worry about it anymore. So I would say that here it's really ad hoc, so that you're just basically reacting to demand and linking to the appropriate node that you need for consumers. And so if they overlap, you get that reuse. Things might flow through node A, and then B-prime and B might both use that. And if it's the same consumer, it won't be two separate runs of that.
Okay. So you can have multiple nodes subscribing to the same source feed.
With another node.
And they can connect and deconnect.
Is it load balanced inherently, or is that an application level concern?
It is. Because to be able to, again, those realtime UIs really will easily get overwhelmed. And if you go back to the traffic case, if you're looking at the whole, we won't say the whole United States, we'll say all of the UK. If you're looking at all-
You're implying that my home country is small.
You are right.
Then it's level of detail. So we have a lot of operators for geographies and different things. And so as you're consuming data, you don't necessarily need all the data until you drill down, and then you need to see more of it. And so we handle that case where we can basically dynamically manage the level of detail based on how much can be consumed in real time. It's no use showing someone more information than they can handle, or less relevant information, based on where they're drilling down. So we work in that way as well.
I think we should talk more about the UI because, from what I know of Swim, it felt like an operating system first thing. But I'm getting the sense that actually the user interface is a very big part of this for you.
The user interface, implementation-wise in TypeScript, that's really basically... We call them stateful microservices. Technically we call them web agents because they're all web addressable. They're all basically URIs. You could hit them in the simple web way and start consuming them as a stream. And so the UI is basically made up of UI web agents that share a lot of the same similarities. There's really more of a parity on both ends.
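Since every web agent is web addressable, building an agent's address is just URI construction. The sketch below is a guess at the shape of such an address; the `warp://` scheme and path layout are assumptions for illustration, not Swim's exact conventions.

```typescript
// Sketch: address an entity as a URI so a browser, curl, or another agent can
// open a link to it. Scheme and path layout are illustrative assumptions.
function agentUri(host: string, kind: string, id: string): string {
  // Encode the entity id so names with spaces or symbols stay valid in a URI.
  return `warp://${host}/${kind}/${encodeURIComponent(id)}`;
}
```

The point is the parity Fred describes: the same addressing works from a UI, a command line, or another agent.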
But they're written in TypeScript.
Okay. And is that running in the browser or running on node, or how do you set that?
You can do it both ways. We'll often run things in the browser and then we might go to the command line and run things in Node. We can even hit it with curl because, again, they are URIs. And so you might hit it with curl and just see the data flowing through. You can use the [inaudible 00:20:51] and subscribe to the WebSocket and see stuff continue to come through. And it's very similar to what you see in the browser.
Okay. So it's this distributed mesh of processing nodes that I can connect to in multiple ways, including rendering out of UI?
So the UI can really pick multiple sources. I want to consume from maybe all of the cell towers in a certain district of Washington DC, or Paddington Square, wherever. And that again is dynamic.
Okay. Take me through the flow there, because I'm curious how that actually works under the hood. Let's say I am displaying traffic light data for the East Coast states of America. And I decide I'm going to zoom in on the state of New York. So presumably I'm subscribed to some web agent that aggregates at the level of states.
Absolutely. So what happens is you might have a state web agent and then it has what we call a join lane. And this join lane is going to some sub-areas, some regions or counties, or however you're breaking it down for granularity. So basically it's deciding to listen to multiple sources, and each of those sources is probably listening to multiple sources. So you really get like a tree in that respect. All the states will basically aggregate up to a country one. So you might have the US node that's got the 50 states and Washington, DC.
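The counties-into-states-into-country tree can be sketched as nodes whose join lanes re-aggregate on every child update and push the result upward. This is a toy rollup (a simple sum) under assumed names; Swim's actual join lanes are richer than this.

```typescript
// Sketch of the aggregation tree: county agents feed a state agent's join
// lane, and state agents feed a country agent. Each level sums a count and
// propagates the change upward. Illustrative names only.
class JoinNode {
  private children = new Map<string, number>();

  constructor(public id: string, private parent?: JoinNode) {}

  // A downlinked child reported a new value: re-aggregate, then notify up.
  onChildUpdate(childId: string, count: number): void {
    this.children.set(childId, count);
    this.parent?.onChildUpdate(this.id, this.total());
  }

  total(): number {
    let sum = 0;
    this.children.forEach(c => { sum += c; });
    return sum;
  }
}
```

A single county-level change ripples up as one delta per level, rather than a full recomputation of the country view.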
Okay. So there's a bunch of say 50 nodes feeding into a countrywide node. My web UI is subscribed to the countrywide node and then I sort of cancel that subscription and subscribe to the New York node instead?
So the New York node would get privileged. As you zoom in, we have the geo coordinates around things. So as you zoom in, yes, your level of detail will transition to New York once you've started drilling down into New York.
So the game of writing the UI becomes knowing which nodes you can subscribe to for different feeds of data at the level of granularity you want.
Yeah. There's a certain zoom factor. So when you first look at it and you're zoomed out, all of that map is basically in view, and all of those data sources can come in at that level, probably just the higher level data at that point, based on how much throughput that browser has. And then as you zoom in, it basically updates the surrounding regions and figures out which web agents are responsible for that. And then it switches and subscribes to those.
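The zoom-driven switch can be sketched as a pure function from viewport and zoom to the set of agent URIs worth linking; diffing two successive results tells you which links to open and which to drop. The level threshold, agent shapes, and URIs below are all illustrative assumptions.

```typescript
// Sketch: level-of-detail subscription selection. Given the viewport and
// zoom, pick which agents to link. The zoom threshold of 10 is arbitrary.
type Box = { minLon: number; minLat: number; maxLon: number; maxLat: number };
type Agent = { uri: string; box: Box; level: "state" | "intersection" };

function intersects(a: Box, b: Box): boolean {
  return a.minLon <= b.maxLon && a.maxLon >= b.minLon &&
         a.minLat <= b.maxLat && a.maxLat >= b.minLat;
}

function visibleAgents(agents: Agent[], view: Box, zoom: number): string[] {
  const level = zoom >= 10 ? "intersection" : "state"; // drill-down threshold
  return agents
    .filter(a => a.level === level && intersects(a.box, view))
    .map(a => a.uri);
}
```

A real system would use a spatial index rather than a linear scan, but the selection logic is the same shape.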
Okay. Yeah. I can see that. You say it figures out. Is there a discovery mechanism involved there, or does it just have to know?
Basically, there's a notion of like a bounding box. You're certain of a view. So a lot of our projects are dealing with geo coordinates and we have specific data structures for handling that efficiently. And then just the changes. So really, as your coordinate space changes, there's a very efficient lookup that will figure out which web agents are in that area and which ones need to start subscribing and continue to subscribe, and which ones basically can drop off.
So I'm working with some discovery service to know which nodes to change to.
That's okay. I'm starting to get a sense of how this would work in my head. What are people using it out there? Who are your typical kinds of customers?
So big telephony is a big one.
That makes sense.
Often 150 million active connections needing up-to-the-second information coming through. And normally you might say, "Okay, here's a user." This is basically a quality of service type thing. A user has a problem. You can say, "Oh, okay. Yeah, contact that user. They have a problem." And generally, if you wanted to know what cell tower they were connected to in that moment, you would basically have to do a query, do some lookup, to figure that out. But here, from an application standpoint, you would know that automatically. Your queries would have the right context. And so you'd basically be keeping context, whether it's static or dynamic, flowing at any particular time. And so when you see that these users are affected, and this is the cell tower that's causing it, you'll have that all at once, because that's the essential information at that time.
Is that because you are always passing, trying to get a sense of how that works, are you passing some of the graph information along with each piece of data in the graph?
Yeah. Each agent has links to the relevant information that it needs. And so for an individual cell phone that would be represented as a very lightweight web agent, and as it moves around and connects to different cell towers and other type of things, that's already kept in store with that state.
Okay. So you can discover the current connections and you've got a web agent at that level of granularity.
Yes. We try to make every entity a web agent because they're very lightweight.
Okay. Can you quantify very lightweight in Java terms?
Yeah. It's [inaudible 00:27:10] kilobytes. It's really been a long time. I'm recently back to Swim after having worked a lot on the back end previously. But I would say probably a couple of kilobytes, if I don't misremember.
For order of magnitude is fine. I'm not going to grill you for that level of detail.
And similarly, you might have something passivating and activating based on need. Again, with the 150 million web agents that are created, they might not all be active at that point, because if something hasn't been used in a while, just for resources, it might be passive at that moment. But as it comes up, it's pulled from local disk. It's phased out a little bit, and then once it's up again, it's ready and doing everything with memory and CPU.
Okay. So there is some notion of persisting the state of each web agent.
Yeah. At the end of the day, there's basically in memory structures that can, based on requirements, get written locally as needed.
Okay. Interesting. How old is this project?
It's about seven, eight years, I guess.
It goes back a while. The initial applications were different. There was the IoT phase, there was the web monitoring phase. If you think about it, what we're talking about are really general purpose features, so a lot of it depended on which customers were excited or interested at the time. Actually, I did work on pick-and-place machines in factories. And a lot of it was doing anomaly detection, predicting issues with nozzles and things like that, and basically allowing people to get advance notice and take action.
That makes perfect sense to me, like a newish company with a core idea, looking for ways to solve people's problems. Especially when you are new, you'd gravitate towards different specific industries as time went on. But the ideas I'm assuming have remained fairly stable over the past seven, eight years.
Interesting. And there is some degree I believe of integration with Kafka, right?
Yeah. As a distributed log and a source of events. A great majority of our customers use Kafka because it makes sense for an event-driven system, and when that has to maintain a lot of state. So because we know Kafka works very well for those things and handles certain data, when it comes into Swim, you're in the real time context; you've got your application that you're trying to execute. And so we don't try to do the things that Kafka would have to do, with exactly-once processing and all these other types of things, because that information is already there. What we're trying to do is, in real time, we need to know right away. And if you're a user, you need it to hit your screen and you need to be able to react and have it go back. If you're a bit of automation, then you need to get that into your program as soon as possible and take action. So that's what we're really optimized for. We're basically the speed layer again. That's how I look at it.
And you're optimizing both for speed and for visibility?
Yes. Because the general purpose applications and different groups can have very different requirements, and they won't be coordinated ahead of time, you basically have to make decisions based on where the user is and what they're trying to do, and what they can handle at a particular time.
Yeah. Makes sense. It's a really interesting idea. It's one of those things where I've got a much better idea of how it works now. So thank you. But it was also makes me want to write some code to actually get my hands dirty and understand. Because you never really understand something until you've written some code to play with it. So if someone wanted to get started with these ideas and play around with it, where should they start?
Yeah. SwimOS is a great place to start. There's tutorials there.
That's swimos.org, right?
We'll put a link in the show notes. What got you into this world?
I love distributed computing. It's like one of those things. No, I'm not a Raft master or [inaudible 00:32:36], but really involved systems coordinating, having all the negotiations, and having all that complexity having to be managed really attracts me. And at the time I had done some Erlang previously and I had done some Akka previously. And so when I heard distributed actor model, and it was Scala, at the time everything was in Scala, I was very excited about that. The reason things moved from Scala is because the Scala runtime is a lot larger and heavier and we needed to run on some very small, low power devices. And so it made sense, because with Java it could be much smaller; it doesn't have all the weight that Scala-
Sorry, is that also what motivated the move to Rust?
Yes. Rust has all the efficiencies, and not everyone likes to use Java. So just having more flexibility and more performance. We really care a lot about performance.
Makes perfect sense in this domain. Okay. So last question, give me something in the future, where do you think the whole SwimOS project is going to go next?
To the cloud. So you have all of your queries, you have all of your stream processing plumbing in there, and you basically have groups that want, whether it's dashboards or other types of applications, automations. And so everything else is being handled for you. If I want to do things in real time, if I want to see the latest information, make the latest, most efficient queries, and really be as up to date as possible, Swim's a great technology there. And if we can automate that more, so that you don't have to do as much of the work, we can basically look at the things we know we need to look at, the databases, Kafka and these different things, and basically expose automatically real time services for you and let you select. Then I think that's the future.
So will it be a day sometime in the future when I can just give you my web agent code and it will just magically happen?
Absolutely. And in addition to that, no, I won't say more than that, but there are a lot of nice ecosystem things that are going to be enabled.
I almost got you to commit to the feature roadmap, but oh well, you're probably wise to avoid that and wait till it's actually launched. Fred, it's a very interesting project. I still want to write some code to get more of a sense to the scope of it. But I think I know where I would get started now and what kind of things it could do for me. So thank you very much for letting me grill you and explain the project.
Yeah, my pleasure. And on that note, you talked about traffic. If you go to traffic.swim.inc, you can see a traffic demo that way.
Oh, a live demo. We love those. We'll put a link to that in the show notes too.
And also for just fun, you can do ripple.swim.inc. And that will show you-
I've seen that. I'm not going to spoil what it is, but it is a fun demo.
And you can just see hundreds of people, a thousand or whatever. I can't remember what hardware we're putting that on, but you can just see a lot of real time interactivity.
I love a playful demo, especially when it demonstrates some real underlying data flows. So cool.
Yeah. It's been a pleasure.
Thank you very much for joining us, Fred. We'll see you again.
Well, that's SwimOS. I'm going to have to find some time to play with it. I want to go and get my hands dirty. It seems interesting and full of possibility, and I'll never really understand it unless I've written some code with it. So in the meantime, I'm glad they're working hard on solving those kinds of problems. Rumor has it that SwimOS will be speaking at Current, if you want to learn more. Current is the next generation of Kafka Summit. It's Kafka Summit plus more streaming technologies, more real time technologies, more speakers, more tracks. It's happening in Austin, Texas, this October 2022, and tickets are on sale now, I believe. I hope you'll consider joining us. I'm looking forward to seeing some more people from SwimOS, I'm looking forward to meeting some of you, and I'm looking forward to doing a little bit of live coding myself.
Should be a good conference. If all that is too long to wait and you desperately want to see more event streaming code and Apache Kafka code, head to developer.confluent.io for code samples, tutorials, blogs, and lots more. And if you go through one of those tutorials, you are probably going to need a Kafka cluster. So you can try spinning one up at Confluent Cloud, which is our Kafka cloud service. You can sign up and have Kafka running reliably in minutes. And if you add the code PODCAST100 to your account, you'll get some extra free credit to run with. Meanwhile, if you've enjoyed this episode, please do click like and subscribe and the rating buttons and all those things everybody asks for; it helps. It helps us to know what you'd like to hear more about, and it helps like-minded people find us and join the show. And as always, my Twitter handle is in the show notes if you want to get in touch with me directly. And with that, it remains for me to thank Fred Patton for joining us and you for listening. I've been your host, Kris Jenkins, and I will catch you next time.
How do you set data applications in motion by running stateful business logic on streaming data? Capturing key stream processing events and cumulative statistics that necessitate real-time data assessment, migration, and visualization remains a gap for event-driven systems and stream processing frameworks, according to Fred Patton (Developer Evangelist, Swim Inc.). In this episode, Fred explains streaming applications and how they contrast with stream processing applications. Fred and Kris also discuss how you can use Apache Kafka® and Swim for a real-time UI for streaming data.
Swim's technology facilitates relationships between streaming data from distributed sources and complex UIs, managing backpressure cumulatively, so that front ends don't get overwhelmed. They are focused on real-time, actionable insights, as opposed to those derived from historical data. Fred compares Swim's functionality to the speed layer in the Lambda architecture model, which is specifically concerned with serving real-time views. For this reason, when sending your data to Swim, it is common to also send a copy to a data warehouse that you control.
A web agent, a data entity in the Swim ecosystem, can be as small as a single cellphone or as large as a whole cellular network. Web agents communicate with one another as well as with their subscribers, and each one is a URI that can be called by a browser or the command line. Swim has been designed to instantaneously accommodate requests at widely varying levels of granularity, each of which demands a completely different volume of data. Thus, as you drill down, for example, from a city view on a map into a neighborhood view, the Swim system figures out which web agent is responsible for the view you are requesting, as well as the other web agents needed to show it.
Fred also shares an example where they work with a telephony company that requires real-time statuses for a network infrastructure with thousands of cell towers servicing millions of devices, along with a use case for a transportation company needing to transform raw edge data into actionable insights for its connected vehicle customers.
Future plans for Swim include porting more functionality to the cloud, which will enable additional automation, so that, for example, a customer just has to provide database and Kafka cluster connections, and Swim can automatically build out infrastructure.
If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.