Running in the cloud is great. But for a system of any complexity, it's unlikely that all of your services or all of your infrastructure are going to be in the same region. And hey, maybe not even in the same cloud provider. This means we have to understand how networking works in the cloud.
I'm talking to Dan LaMotte today. He's an SRE. He works on Confluent Cloud. To talk about just how this works and what kind of networking solutions Confluent Cloud supports. It's all on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.
Hello, and welcome to another episode of Streaming Audio. I am, once again, your host, Tim Berglund. And we're here, as I've been saying for the last few episodes, in audio version as always, and in video on YouTube. If you prefer to consume your podcasts by looking at the faces of the people who are talking, you can do that now. So in fact, if you're watching the video, it's probably weird for me to be saying that because you already know. Anyway, I'm glad you're here and I'm glad to be joined in the studio today by my colleague, Dan LaMotte. Dan and I are going to talk about a thing called private link.
Dan, welcome to the show.
Hey, thanks for having me.
You got it, man. Before we get into private link and cloud private networks and network engineering, and some fun stuff I want to talk about, how about you? What is it that you do here?
Well, I've been working at Confluent for a little over two and a half years now, doing a lot of SRE work and Confluent infrastructure, or Confluent Cloud infrastructure work. These days I'm on the traffic team and I'm the team lead working on how customers connect to Confluent Cloud.
Nice. So, private link seems relevant.
And I want to get into that, but what are the primary ways? Because it's funny, I'm like, well, there's clients, and there's the command line, and there's funky network things. But how do customers connect? How do people connect to Confluent Cloud?
Yeah. We have a number of options. So we have public internet connectivity. We have your peering-style connectivity, with transit gateways as a kind of flavor of peering. And then we also support private link, which is our latest addition to that kind of portfolio of connectivity options.
Gotcha. So public internet... So for the demos that I would do and stuff like that, if I'm going to show people, here's the Confluent Cloud command line interface, or I have a Java client and I'm doing something. I mean, that's just good old internet. What's peering? Tell us about peering, because I know that that term comes up in kind of internet backbone architecture discussions. So-
Yeah. It's just point to point, in the cloud, it's between two VPCs, sometimes in the same region, sometimes cross region, but it's joining your two networks together. So there's a number of good and bad things that come with that, but it is a private way to connect to Confluent Cloud instead of using the internet. So from a security posture, it's much better.
Much better. And I'm going to make you define absolutely everything, VPC. This is assuming if you don't know a lot about network engineering or cloud connectivity, I'm just going to assume that listener, that is you. So what's a VPC?
VPC is kind of your virtual private cloud. It's just a network space where you launch your instances in the cloud, and that you can kind of connect to other VPCs, for instance.
Right. Got it. So if I could just bring it down to networking terms, there would be a set of IPs that are mine, that are not routable from the public internet. And if I have routes to the public internet, they're explicitly defined, kind of from, to, and ports, and otherwise there's no free-for-all. There's no pinging. There is no... Old memory from 30 years ago, the utility finger: user on another system, tell me about yourself. The trusting days. But so it's just not, it's not... A VPC is really a virtualized private area, where there's no in or out that's not defined.
Yep, it's all private to your company. So private IPs only.
And so peering. And so, let us imagine I've got a VPC, that's got some services, microservices running in it in some way. Maybe I have a self managed Kubernetes cluster, or I just like to manage them on their own, or whatever. They're all in there, and they need to talk to Confluent Cloud. So what does peering do? How does peering make that work?
Yeah, it allows these two VPCs to connect to each other over their private IP address space. So instead of creating a public IP that anything could connect to, a peering would allow two VPCs to talk to each other as private address space directly.
Got it. So is that really routing? I mean, is it ultimately routing and encryption?
I mean, it is routing. Depending on which cloud you're in, you could get exposed to less or more of it. In Azure and GCP, generally when you peer, it just works: the routes are automatically created for you. Everything's great. In AWS, they make you create these routes and yeah, it's a little bit more-
A little bit more like the good old days.
Yeah. The good old days of somebody logging into a router and creating some route for-
Right, right. And for this, and I know private link is kind of our topic du jour, but I want to get there and build some pieces slowly. There again, this is a little bit of a good old days technology, I mean, it's still a totally viable thing, but a site to site VPN where it's traffic that's absolutely routed over the public internet, but there aren't routes in the IP sense, and the traffic is encrypted in some way. Is it analogous to that? Is it built on that kind of technology? What are the pieces underneath that make this work?
For a cloud peering, the high level analogy is definitely correct. It enables the same type of connectivity, but underneath is cloud provider specific details. So it might be a VPN, it might be just some other kind of tunneling protocol, or however they've implemented their underlay networking. For instance, if it's same region, they might shortcut a lot of these things, and there might not be any of those things, but for cross region, maybe they do have to implement some more of these kind of building blocks. But the experience you get is the same. That in a site-to-site VPN everything can kind of talk to each other on both ends, with a peering, everything can talk to each other. So it's bi-directional, that's the key.
And when you're an SRE for a cloud service like say, you are, I assume you do have to get to know some of those details with the different cloud providers. If you're just using one and kind of staying within the box and the rules and everything like that, like I might do, if I were setting something up, I would have the luxury of, as it were, not worrying my pretty little head about that. But you sort of have to dig into things, I guess if you're-
Yeah, some of those details leak up to the top and yeah.
That's the thing about abstractions right. They're good. And then they leak. Okay. So diving into explicitly a thing that is not what we're talking about today, but peering, public internet, and you said... Was private link the only third one? I might be missing something in that list.
Transit gateway's in there, but it's like peering. It's identical to peering other than, it's just easier. It's more of the hub and spoke topology, whereas peering is just point to point. It can be between two VPCs and that's it. Transit gateway, you get this hub where everybody can connect to, and then everybody that's connected to the hub, is connected to each other.
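To put rough numbers on the point-to-point versus hub-and-spoke difference described here, the following is a minimal Python sketch. It assumes every VPC needs to reach every other VPC, which is a simplification, and the function names are made up for illustration rather than taken from any cloud provider's API.

```python
def peering_links(vpc_count: int) -> int:
    """Point-to-point peerings needed so every VPC can reach every other VPC."""
    return vpc_count * (vpc_count - 1) // 2


def hub_attachments(vpc_count: int) -> int:
    """Attachments needed when every VPC connects once to a shared hub."""
    return vpc_count


# Compare how the two topologies grow as more VPCs are added.
for count in (2, 10, 100):
    print(f"{count} VPCs: {peering_links(count)} peerings "
          f"vs {hub_attachments(count)} hub attachments")
```

With two VPCs a single peering is the simpler option; the hub only starts to pay off as the number of networks grows.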
Oh, okay. So if you're connected to the hub, you have connectivity to all the other pieces?
Gotcha. And is transit gateway cloud specific? Because I know some of these things are, I mean, they're general concepts, but the word is a brand name for one of the providers. Is that just an AWS thing?
It is an AWS thing. There's kind of analogous features in Azure and GCP that are built on top of peering. So it's a bit confusing over there, but it is, it's kind of there, so.
Yeah, but that's totally happening with... I see stuff in this part of the stack, and it's usually AWS, their thing kind of wins and everybody calls it that and it's, Xerox, Coke, Kleenex, all of a sudden that brand name sort of dominates, which hey, good move. Yeah. I guess we'd like to be that with fully managed Kafka. Okay. So you got transit gateway, and then private link. So talk to us about that.
So private link kind of changed the game for this-
Also an AWS thing, right?
Also an AWS thing, but also an Azure thing. And GCP has their thing that they just launched. It's not called private link, but it's the same kind of thing. So on all three clouds, it's a very popular connectivity option that has shown up, particularly because peering becomes problematic, since we're actually sharing address space with the company. So there's some amount of address space that the company has to give Confluent in order to peer, because those addresses are now part of their network. And so they can't use it again in their network. And depending, that might be okay, or if they're strapped for IPs, it might not be okay. And so private link gets us out of that, because all it needs is an IP per zone in the customer's VPC. So maybe three zones, three IPs. That's a very small surface area to take away, just to onboard a service like Confluent Cloud.
How many IPs are you generally getting in a VPC?
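The address-space problem with peering can be sketched with Python's standard `ipaddress` module. The CIDR ranges below are invented for illustration, and `conflicts` and `private_link_ips` are hypothetical helper names; the one-IP-per-zone figure comes from the discussion itself.

```python
import ipaddress


def conflicts(proposed_cidr: str, existing_cidrs: list[str]) -> list[str]:
    """Return the existing ranges that a proposed peering range overlaps."""
    proposed = ipaddress.ip_network(proposed_cidr)
    return [
        cidr for cidr in existing_cidrs
        if proposed.overlaps(ipaddress.ip_network(cidr))
    ]


def private_link_ips(zone_count: int) -> int:
    """One endpoint IP per zone in the customer's VPC, as described above."""
    return zone_count


# Hypothetical address ranges already in use on the customer's side.
customer_ranges = ["10.0.0.0/16", "10.1.0.0/16", "172.16.0.0/20"]

# A peered range must not collide with anything the customer already uses.
print(conflicts("10.1.128.0/24", customer_ranges))   # overlaps 10.1.0.0/16
print(conflicts("10.2.0.0/16", customer_ranges))     # no overlap

# Addresses set aside for a peered /16 versus three private link endpoints.
print(ipaddress.ip_network("10.2.0.0/16").num_addresses, "vs", private_link_ips(3))
```

A company with hundreds of VPCs has to run a check like the first one against every range it owns before it can peer, which is the coordination cost that a handful of endpoint IPs avoids.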
I mean, most people provision it with a slash 16 or GCP allows much larger-
These are private after all, so.
Yeah. I mean-
Everybody gets an IP, right no.
You get an IP.
Yeah. I mean, it just depends, some companies prefer, they might give you a slash 24, right. If they're going to actually connect it to their network, you might not get a slash 16 [inaudible 00:10:53]. The cloud provider supports it, no problem. But it's very much company dependent.
Gotcha. Okay. So private link is parsimonious with IPs, you said one IP per zone, but yeah, again, just walk us through how it works and the problems it's solving and how it's a game changer.
Yep. So once we get those three IPs, the customer can connect to those IPs to get to Confluent Cloud. That connection is unidirectional now, it's no longer bi-directional, like it is with peering. So from a security posture, Confluent can't connect to the customer, but the customer can connect to Confluent. So it's much better there. And at that point it's just easier to deal with. There's no routing to create. It's just IPs in your VPC, and away we go. It does pose a number of engineering challenges for us to provide the service behind that, because now Kafka, being that you need to direct traffic to specific brokers as a client, we have to figure out a way for the Kafka client to target a specific broker behind three IPs. We can't have an IP per broker anymore. And so now we have a problem, and that's the problem that we've been solving at Confluent Cloud.
There's kind of two main ways to solve it. There's port mapping, where you might have a port per broker, and the client would just direct traffic at that port. Or you could do what we've decided to do, which is a proxy layer with SNI routing. So as clients connect to brokers, they encode a different TLS name for that broker in the connection, and we route based on that. And so this allows us to scale much farther than any of the limits the cloud providers have for their prevalent solutions. Some of those can be kind of problematic, which would not allow larger clusters, because you run out of ports, or other kinds of interesting issues that show up. Yeah.
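The two approaches can be contrasted in a small Python sketch. Everything named here is hypothetical: the `b<N>.<cluster domain>` naming scheme, the base port, and the backend table are illustrative assumptions, not Confluent Cloud's actual layout.

```python
BASE_PORT = 9092      # hypothetical first broker port for the port-mapping scheme
MAX_PORT = 65535      # hard ceiling of the TCP port range


def port_for_broker(broker_id: int) -> int:
    """Port mapping: each broker gets its own port behind the same IPs."""
    port = BASE_PORT + broker_id
    if port > MAX_PORT:
        raise ValueError(f"no port left for broker {broker_id}")
    return port


def sni_name_for_broker(broker_id: int, cluster_domain: str) -> str:
    """SNI routing: each broker gets its own TLS name on one shared port."""
    return f"b{broker_id}.{cluster_domain}"


def route_by_sni(server_name: str, backends: dict[int, tuple[str, int]]) -> tuple[str, int]:
    """Pick a broker backend from the name the client put in its TLS handshake."""
    label = server_name.split(".", 1)[0]
    if not (label.startswith("b") and label[1:].isdigit()):
        raise LookupError(f"unrecognized broker name: {server_name}")
    return backends[int(label[1:])]


# Hypothetical private addresses of three brokers behind the proxy layer.
backends = {0: ("10.8.0.10", 9092), 1: ("10.8.0.11", 9092), 2: ("10.8.0.12", 9092)}

name = sni_name_for_broker(1, "lkc-demo.example.com")
print(name, "->", route_by_sni(name, backends))
print("port-mapped broker 1 ->", port_for_broker(1))
```

The port range is only the most obvious ceiling for port mapping; the name-based scheme has no comparable numeric limit, since adding a broker only adds a name.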
But those clients are standard clients, so help me out with that. So they are still doing all of their regular connection pooling, and metadata gathering, and they have this idea that there's 60 brokers out there, and they'll have their own IPs. So where do we get to be there? I mean, where... So you described the block diagram of the network chunks, and I have that in my head, but I don't know where we have the opportunity to put software.
Oh, I guess it works right now with the standard client, and then right behind the private link. So, okay. Maybe I'll explain what it looks like on the service provider's side. So private link, there are the endpoints that are put in the VPC, in the customer VPC. On the service provider's side, we'll have a load balancer, or set of load balancers, that we've attached a private link service to, and the customer's private link endpoint connects to that service. And so behind that load balancer is traditional load balancing. So you have a set of targets, and any connection that comes into the load balancer will land on one of those targets. And right now, it's our Envoy proxy that we're running right there. And then we route with SNI straight to Kafka.
SNI, sorry. Server Name Indication, it's-
That's the... In the SSL, the name?
Correct. Yeah. It tells the server what certificate to pass to the client. So in the end, it's enough so that the server can supply a specific certificate per name, but we can also route based on it. And so the client, just by virtue of connecting to some DNS name on the internet, will put that name into the connection so that the server can serve it a certificate, so the client can validate the server's identity, and then we route to the Kafka broker based on that.
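Because the name travels unencrypted in the first message of a standard TLS handshake, a proxy can read it without terminating TLS. The Python sketch below generates a real ClientHello in memory with the standard `ssl` module and then parses the SNI field out of the raw bytes. It assumes the ClientHello fits in a single TLS record, and the host name is a made-up example; the parser is illustrative and does no defensive bounds checking.

```python
import ssl


def client_hello_bytes(server_hostname: str) -> bytes:
    """Produce the first TLS flight a client would send, with no network I/O."""
    ctx = ssl.create_default_context()
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=server_hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: the client is now waiting for the server's reply
    return outgoing.read()


def extract_sni(hello: bytes):
    """Return the SNI host name from a raw ClientHello record, or None."""
    if len(hello) < 6 or hello[0] != 0x16 or hello[5] != 0x01:
        return None                      # not a handshake record with a ClientHello
    pos = 5 + 4                          # record header, then handshake type + length
    pos += 2 + 32                        # legacy version + client random
    pos += 1 + hello[pos]                # session id
    pos += 2 + int.from_bytes(hello[pos:pos + 2], "big")   # cipher suites
    pos += 1 + hello[pos]                # compression methods
    end = pos + 2 + int.from_bytes(hello[pos:pos + 2], "big")
    pos += 2
    while pos + 4 <= end:
        ext_type = int.from_bytes(hello[pos:pos + 2], "big")
        ext_len = int.from_bytes(hello[pos + 2:pos + 4], "big")
        pos += 4
        if ext_type == 0:                # server_name extension
            # layout: list length (2), name type (1), name length (2), name
            name_len = int.from_bytes(hello[pos + 3:pos + 5], "big")
            return hello[pos + 5:pos + 5 + name_len].decode("ascii")
        pos += ext_len
    return None


hello = client_hello_bytes("b3.lkc-demo.example.com")
print(extract_sni(hello))
```

A production proxy such as Envoy does this inspection itself; the point of the sketch is only that the broker name is available before any certificate is chosen or any Kafka bytes are exchanged.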
Got it. That makes sense. That all seems delicate. There's a lot of software to write, to make that work. This is not easy.
Yeah. But what kinds of things can people build now, that would have been difficult without it? I can imagine some examples, but what do you see? What sticks out to you as the cool stuff?
I mean, it's super nice to be able to connect to multiple Confluent Cloud clusters. Some companies have a number of VPCs in the cloud, like hundreds per region. Now it's quite trivial for those types of companies to private link into Confluent Cloud, without having to kind of make sure there's a certain IP range that doesn't conflict with any of those hundred VPCs in the cloud. So now all of a sudden, the connectivity is very simple and it just works. The other aspect is just the security posture. So a lot of banks, they don't want a compromise of Confluent to compromise them. And with a peering solution, it could be that if Confluent were compromised, we'd have an open door into their private network, and no longer do they have to give up that security posture to use our product. Not that we would be compromised, we're doing our due diligence there, but it's just a matter of risk to them. In any case, they want to make sure they minimize it.
Yeah. We're some entity that's not them. And so there's a degree to which we are less trusted, because that's just how their own security audits and analysis have to be.
That makes sense. We would certainly say the same thing about them, right? We trust us, but not going to trust anybody else for the sake of securing our own service.
Or at least we'll audit and be intentional about the kind and level of trust that we extend. All right.
The other interesting aspect of that is, now that we're not sharing IP space, the more that they kind of grow their usage. Before with a peering solution, we could have used less IP space, and it would have been more convenient, but it would also mean that if they were ever to grow their usage of Confluent Cloud, they could run out on our end, and would make it very hard to kind of expand. It would be like migration. Again, with private link, those things are decoupled now. And so expansion, adding more clusters, or moving clusters, all of these things can happen without kind of any network considerations. It just works.
Right. It seems like, just listening to you describe this, it seems like the hard part is the load balancer on our side of the link. It seems like that's where all the routing information needs to live, and cluster sizing is going to change metadata, and that needs to be reflected there in a big hurry and all that. It just seems like that's where the money is, am I... So first question, is that true?
I don't know if it's where the money is. I mean, it's definitely where we've put a lot of work into making sure that it functions correctly.
The money in that sense, people, engineers had to spend time there.
It should be invisible, right? Yeah. It should be the boring old infrastructure that just works. It should be-
Yeah. Which is often super hard to make.
Especially when you have this ridiculous elastically scaling thing happening, this amorphous blob of Kafka brokers, that just look like a cluster, and topics, and everything's completely fine, but it's a very dynamic situation over here, and software upgrades happening, and rolling upgrades all the time. It's just this quantum foam of Kafka on our side, but-
Yeah. You don't know a thing.
You have these amazingly performing topics, and you've even got some [inaudible 00:19:13]. Everything's fine. So, what was interesting about, or what has been interesting about building that? Is that-
Yeah. I mean, it's much different than your standard kind of load balancer solution. I mean, this is not an HTTP server where connections kind of come and go all the time. So that layer has to kind of evolve to be more and more ... Handle these long-lived connections that Kafka likes, or at least treat them as just that. As the bids come in, we very slowly kind of drain those connections off of those nodes, so that we're not impacting Kafka traffic. So that part is delicate. And we're introducing technology at that layer to make that ever better than it is today. So.
Yeah, that really Kafka load balancing, client load balancer layer, is kind of what that is, right?
Yeah. The historical architecture is that, I guess there really isn't load balancing as such. I mean, there's hashing, there's the client library. Things generally work out okay, as long as there's no key that dominates 90% of the traffic in a topic, or a high traffic topic or something. So it's just not a thing, but now with this, it really is a thing. And that building that highly specialized Kafka load balancer kind of feels like the place where a lot of the effort has gone.
Yeah. For sure.
For you, what's fun about this, for an SRE, network engineer background kind of person? What drew you to this?
Yeah. It's a very interesting problem to solve. There's a number of just network technologies at this layer that you probably don't get to really use too often, until you've got problems like this. So we're looking at eBPF as some way to help here, but really-
That stands for?
Oh, I think it's either Extensible Berkeley Packet Filter, or Extended Berkeley Pack-
Definitely some sort of extending happening-
Some E on the normal BPF that unfortunately, I don't know at the moment. But yeah, we're looking at that technology as well to figure out how to sustain these TCP connections and be least interrupted in this connectivity path. So how can we hand off TCP connections between nodes, and those kinds of problems. So it's a pretty interesting layer for the networking space.
Nice. Sounds like it. I just, being naughty here during the podcast, I just Googled eBPF. I actually Googled RBPF first, which is the Royal Bahamas Police Force, that's definitely not what we want. And funny, just looking at the page, it doesn't... Okay. Here's a button that says, what is eBPF, and scanning it. It's just dead, it's not going to say. All right, fine. You know what, link in the show notes, and you get the idea it's BPF, but there's more to it.
In the end, it's like compiled code that gets inserted into the kernel and runs. So it's like this micro VM inside the kernel, the Linux kernel, that kind of runs this stuff. So it's much faster than any kind of user space network handling that you could do, maybe not DPDK or anything, but it's an interesting solution as well.
Nice. Final question. Are you hiring?
We are hiring. We are hiring for everything in the networking space. So folks that are interested in this load balancer layer, as well as the control plane layer for creating peerings, managing peerings, private links, it's a very hot space at Confluent.
There's cool problems to solve. And podcasts are forever, so if you're listening to this a long time... We're recording this early May, 2021. If you're listening to this a long time in the future, my guess is we'll still be hiring in that space, but there'll be a link in the show notes that'll take you to the career page if you're interested. This isn't intended to be a Confluent recruiting commercial, but like you said, this is stuff that... The foundational technologies, if you engineer networks, or work with them at all, the basic ideas are things that you're going to be able to traffic in, but it's kind of cool to deal with them at this scale and to say, well, yeah, I know that's how the service works, but here are the actual places the skeletons are buried in this cloud provider. This is the stuff that you get to learn, and that's cool stuff. So if you're interested, check it out. My guest today has been Dan LaMotte. Dan, thanks for being a part of Streaming Audio.
Thank you, Tim.
And there you have it. Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST (that's 60PDCAST) to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value after the expiration date is forfeit and there are a limited number of codes available. So don't miss out. Anyway, as always, I hope this podcast was useful to you. If you want to discuss it or ask a question, you can always reach out to me on Twitter @tlberglund, that's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out on Community Slack or on the Community Forum. There are sign-up links for those things in the show notes, if you'd like to sign up. And while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover it, especially if it's a five-star review. And we think that's a good thing. So thanks for your support, and we'll see you next time.
Confluent Cloud isn’t just for public access anymore. As the requirement for security across sectors increases, so does the need for virtual private cloud (VPC) connections. It is becoming more common today to come across Apache Kafka® implementations with the latest private link connectivity option.
In the past, most Confluent Cloud users were satisfied with public connectivity paths and VPC peering. However, enabling private links on the cloud is increasingly important for security across networks and even the reliability of stream processing. Dan LaMotte, who since this recording became a staff software engineer II, and his team are focused on making secure connections for customers to utilize Confluent Cloud. This is done by allowing two VPCs to connect without sharing their own private IP address space. There’s no crossover between them, and it lends itself to secure, unidirectional connectivity from customer to service provider without sharing IPs.
But why do clients still want to peer if they have the option to interface privately? Dan explains that peering has been known as the base architecture for this type of connection. Peering at the core is just point-to-point cloud connections that happen between two VPCs. With global connectivity becoming more commonplace and the rise of globally distributed working teams, networks are often not even based in the same region. Regardless of region, however, organizations must take the level of security into account. Peering and transit gateways are the established baseline for these use cases, and this is where private links come in handy.
Private links now allow team members to connect to Confluent Cloud privately, without traversing the public internet. You can directly connect all of your multi-cloud options and microservices within your own secure space that is private to the company and to specific IP addresses. Also, the connection must be initiated on the client side for an increased security measure.
With the option of private links, you can now also build microservices that use new functionality that wasn’t available in the past, such as:
You no longer need to segment your workflow, thanks to completely secure connections between teams that are otherwise disconnected from one another.
If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.