The Changelog: Software Development, Open Source - The power of eBPF (Interview)
Episode Date: August 14, 2022
eBPF is a revolutionary kernel technology that has lit the cloud native world on fire. If you're going to have one person explain the excitement, that person would be Liz Rice. Liz is the Chief Open Source Officer at Isovalent, creators of the open source Cilium project and pioneers of eBPF tech. On this episode Liz tells Jerod all about the power of eBPF, where it came from, what kind of new applications it's enabling, and who is building the next generation of networking, security, and observability tools with it.
Transcript
This week on The Changelog, Jerod Santo is talking to Liz Rice about eBPF.
EBPF is a revolutionary kernel technology that has lit the cloud-native world on fire.
If you're going to have one person explain the excitement, that person would be Liz Rice.
On this episode, Liz tells Jerod all about the power of eBPF,
where it came from, what kind of new applications it's enabling, and who's building the next
generation of networking, security, and observability tools with it. Big thanks to our
friends and our partners at Fastly and Fly.io. Our pods are fast to download globally because
Fastly is fast globally. Learn more at Fastly.com. And Fly helps us deploy our app service closer to our users.
It's like a CDN, but for our entire application stack.
Try free at Fly.io.
This episode is brought to you by Influx Data, the makers of InfluxDB.
Increasingly, time series data is
all around us. It's in the cloud as applications and services scale out. It's in IoT as more and
more devices come online. Sensor data is time series data, and that's exactly where InfluxDB
comes into play. InfluxDB is the open source time series data platform that allows developers to
build and to integrate applications with time as a foundational component. InfluxDB is made for developers to build real-time
applications quickly and at scale, and they keep improving their platform to build those
applications with less time and less code. Recently, they launched their Edge data replication
feature. This new capability is built into the 2.2 open source version. It allows developers to replicate data from local instances into InfluxDB Cloud, enables
users to aggregate and store data for long-term management and analysis, and to
satisfy regulations. It brings the horsepower closer to the sensor and
gives developers and solution builders the ability to leverage their own
Elastic Compute Resources deployed at the edge. Edge data replication lets you decide strategically what data moves from edge to cloud,
how the data should be enriched and formatted.
Add to this, InfluxDB has ongoing efforts to unify APIs across all its database offerings.
They now provide a path to build once and deploy time-series applications anywhere.
Learn more about InfluxDB and this new feature at influxdata.com slash changelog.
Again, influxdata.com slash changelog. All right, today we're joined by Liz Rice, who is the Chief Open Source Officer with
the eBPF pioneers, Isovalent.
Welcome, Liz.
Thanks for having me, Jared. Nice to be here.
Nice to have you for sure. So we've been wanting to talk eBPF for a while,
and now we have you here. So perfect fit. I've heard a lot about eBPF, mostly from Ship It.
Gerhard Lazu has had you on the show, the folks from Parca. A lot of people are excited about eBPF. In fact, in his post, KubeCon EU
roundup, Gerhard said almost half of the people that he talked to are either working on it,
using it, or actively integrating with eBPF. So like lots of buzz, lots of interest. And you've
been working with this technology and talking about it for a couple of years now. Do you want
to catch people up? First of all, what is eBPF? And then we'll go from there.
Yeah, sure. So the letters eBPF stand for Extended Berkeley Packet Filter. And I usually just tell
people to forget that straight away because it's not terribly helpful. It tells us something about the history, but it doesn't tell us about what eBPF is today. What it allows us to do is to run programs within the kernel of the operating
system. We can dynamically load these eBPF programs into the kernel, and we can use that
to change the way the kernel behaves. And originally it was the Linux kernel. There is
now a Windows eBPF implementation happening. So I tend to just think about it from a Linux point
of view, but it is broader than that. And it means we can customize the kernel. We can change the way
that kernel features behave. We can use it to observe what's happening in the kernel.
And the really interesting thing or why it's so powerful is if you're an application programmer,
you probably don't think very much about what's happening in the kernel because you use programming
language abstractions that kind of hide that low level from you on a day-to-day basis. So every time you,
I don't know, open a file or write something to the screen, you've got some function that looks
like open or write or something like that. Underneath the covers, every time you interface
with hardware in any way, the kernel has to be involved. So every time you do any network access,
open any files, access memory, all of these things involve the kernel. And with eBPF,
we can insert programs into the kernel's behavior and we can use that to perhaps observe what you're
doing. Every time you open a file, we could see that happening. We could
see which processes are opening different files. Every time a network packet arrives, we can
manipulate that network packet. We can do all sorts of really powerful things to both
observe what's happening in the kernel and even change what's happening. And that kind of changing what's happening allows us
to build security tooling and it allows us to build network functionality as well. So those
kind of three areas, networking, security and observability are the, I would say, three areas
where eBPF is being used most commonly today. But it's super powerful because of that insight
across everything that's happening on the machine.
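(To make that concrete, here is a minimal sketch in C, not from the episode: an eBPF program that fires every time any process enters the openat() system call and writes a line to the kernel trace pipe. It is the kind of "observe every file open" program Liz describes. It would be built roughly with clang's BPF target, e.g. clang -O2 -g -target bpf -c hello_open.bpf.c, and loaded with a standard loader such as bpftool or libbpf.)

    /* hello_open.bpf.c -- minimal illustrative sketch, not from the episode.
     * Logs a line for every openat() syscall entry on the machine. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("tracepoint/syscalls/sys_enter_openat")
    int trace_openat(void *ctx)
    {
        char msg[] = "openat called\n";

        /* Output shows up in /sys/kernel/debug/tracing/trace_pipe */
        bpf_trace_printk(msg, sizeof(msg));
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";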
So I'm thinking about Docker
because this has been called a revolution.
I think Docker was a revolution in its time.
And I remember when Docker first came out
and Solomon Hykes and the dotCloud team
that popularized that technology.
They're like, these containers have been
in the operating system for a while,
but they just weren't accessible.
Nobody knew about them.
They're hard to use.
And Docker really made that simple.
Is eBPF this long-lived feature of the kernel
that all of a sudden we realized was there
and could do things?
Or is it a brand new thing
that's been built into the kernel recently?
So it's a bit of both.
It's been evolving for
years. I mean, I've mentioned the packet filtering element that's been around since I think it's the
90s. And what we call the extended parts, the kind of modern features that we can now use with
eBPF have been added in over the last few years, really. And the reason why it's
all suddenly taking off kind of also relates to why eBPF is really powerful. So when you
use an operating system, you know, whatever Linux distribution you might be using, it's probably
using a version of the kernel that's four or five years old.
The distributions don't take the latest release of the kernel.
They wait for a while to make sure that it's stable and it's been sort of field hardened.
So when eBPF functionality and features have been added in over the course of several years, we have to sort of go back to a kernel that's maybe
four years old to see what people are really using in production today.
And those versions of the kernel are now new enough to have sufficient kind of eBPF capabilities
that we can do really, really useful things.
There's still innovation happening in the
kernel. There are still new things being added to eBPF, but those kind of core building blocks
are now available in pretty much every production Linux distribution. And that is why over the last,
let's say, 18 months, we've seen this huge explosion in interest because it's not just
niche kind of features for people running
cutting-edge kernels; it can be used by everybody. But the reason why I said it also kind of speaks to the power of eBPF is that now that we have eBPF, we don't necessarily have to wait for a new version of the kernel to change its behavior, because we can use eBPF to do it. Which is kind of mind-bending, but pretty cool.
So one of the, I think, really nice examples of how eBPF can be used
is for dynamically mitigating kernel security vulnerabilities.
So a really nice example of this is something called packet of death. So maybe there's a kernel vulnerability that is susceptible to some particularly formed
network packet.
For example, maybe there's supposed to be a length field.
And perhaps if you don't set that length field or you set it incorrectly, there's a bug in the kernel that
doesn't know how to handle it. There have been some instances of this in the past. It's not
just theoretical. And if the kernel receives a packet that's been formed to perhaps set that
length field incorrectly, the kernel doesn't know how to handle it, the kernel crashes, and that vulnerability is exploited.
And in the traditional world, you would need to install a kernel patch and reboot your machine
to no longer have that vulnerability. But with eBPF, you can load an eBPF program dynamically
that recognizes,
ah, it's that kind of network packet that we know is a bad idea.
We need to just throw that packet away.
And you've mitigated that vulnerability without having to actually update the kernel.
You're just running that eBPF program.
You can load that eBPF program into all of your machines dynamically.
You don't have to affect any of your running applications.
It's really, really nice and really powerful.
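(As a heavily simplified sketch of the mitigation Liz describes, here is what a "drop the packet of death" program could look like in C, attached at the XDP hook. The vulnerable protocol is invented for illustration: assume a hypothetical UDP service on port 7777 whose first payload byte is a length field that must never be zero. This is the shape of such a mitigation, not a real CVE fix; it would be attached to an interface with something like ip link set dev eth0 xdp obj xdp_drop_bad_len.bpf.o sec xdp.)

    /* xdp_drop_bad_len.bpf.c -- illustrative sketch only. Drops packets for a
     * made-up UDP service (port 7777) whose first payload byte, a length
     * field, is zero, standing in for the "packet of death" case. */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <linux/udp.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int drop_packet_of_death(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* The bounds checks below are exactly what the verifier insists on
         * before it lets us dereference packet pointers. */
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
            return XDP_PASS;

        /* Assume no IP options, to keep the sketch simple. */
        struct udphdr *udp = (void *)(ip + 1);
        if ((void *)(udp + 1) > data_end || udp->dest != bpf_htons(7777))
            return XDP_PASS;

        unsigned char *len_field = (void *)(udp + 1);
        if ((void *)(len_field + 1) > data_end)
            return XDP_PASS;

        if (*len_field == 0)      /* the malformed case the kernel can't handle */
            return XDP_DROP;      /* throw the packet away before the stack sees it */

        return XDP_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";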
That's cool.
You got me thinking about old kernels because, well, back when I first graduated from college, back in the early aughts, I ran a network of Linux machines, you know, mail servers and spam filtering, all sorts of stuff.
And it was back in the days
when we treated our servers as pets
and not cattle, you know, that analogy.
So I had them all named and stuff.
You know, I use a MASH theme.
I'm not sure if you're familiar
with the show MASH.
So there was Hawkeye and Trapper
and Hot Lips Houlihan and Radar.
That was kind of actually the fun part.
It was like when we used to call ourselves sysadmins.
That was cool and all, and I would always patch them
and keep them upgraded and everything,
but the kernel itself, I would always let it get outdated,
not because I wanted to, but simply because it required a reboot,
and I wasn't about to reboot my production server.
You were talking about how now this has been in there for a while,
but people are getting to where their kernels are upgraded enough
that they have the features.
And I'm wondering if in the days of cattle,
of Elastic Compute and Kubernetes and stuff,
is the reason why people still run old kernels,
is it still that same old we don't want to reboot?
You'd think that you would just offload the capacity, reboot the thing, upgrade, and then launch a new node, or whatever you're going to do. Or is it more about, I mean, I understand like, well, you want to stay a couple
versions behind because like, this is your kernel, you don't want to be on the latest,
but they're generally stable. What are your thoughts on that? Is it still the old,
we don't want to reboot thing? Or do you think it's about security or stability?
I hope it's not that we don't want to reboot thing because...
I hope so too, because that
was a long time ago. And I used to feel that way. Yeah. That whole principle of, you know,
cattle, not pets is exactly that, that you're supposed to be able to, you know, destroy your
machines and recreate new ones and do it all programmatically so that the state of those new
machines is exactly what you intended
it to be. And there's no sort of human intervention that means you missed something
while you were bringing it up. And I think it's very good practice in this day and age too,
to make sure that you can destroy servers and replace them automatically. There's that
really great, I guess, phrase or saying about how, you know, unless you've restored from a backup, you don't know that you've got a backup.
And I think the same is true for unless you've tried destroying a server, you don't know what your recovery process is going to be.
So I think it's accepted good practice these days that you should be bringing new machines and updated machines into the deployment.
But that said, they're still going to be using, you know, they might be using the latest version
of a Linux distribution like Ubuntu or RHEL or CentOS or whatever.
And the distro itself, like Debian, for instance, stays very conservative on their packages.
Exactly.
And they will use a kernel version that is, yeah, you know, a few years, I would say old.
Yeah.
Just to make sure that it's stable.
Curious about your perspective on this, relatedly. So from your perch, you know, with the CNCF and where you are with your work, and being so involved in the cloud native community, there's this whole switch to this new style of operations.
It's where the excitement is,
it's where a lot of the money is,
it's where the landscape is,
and you can get lost in the landscape, right?
Like, which service do I pick and all this?
The world moves much slower than that.
As changelog person Jerod,
I see all the new shiny, the interesting.
We talk about leading edge technologies.
The rest of the world moves much slower.
And I'm curious, like from your perspective,
are the people still doing it the old school pets way?
Are there still a lot of those organizations
and enterprises?
Or has it kind of been to where like maybe like 80%
have moved over to a more modern
infrastructure? What's your perspective on that? Yeah, I suppose my perspective is colored by the
fact that I'm so involved in the cloud native world that I probably see those people who have
moved over. I certainly, you know, over the years that I've been involved in CNCF and this kind of cloud native world, we've definitely gone from, you know, a few years ago, oh amazing, we can find an end user to talk about a thing, to, well, there are loads of people who are using, you know, feature X, project Y. You know, it's hard to find a sort of big brand name that doesn't have, you know, some kind of modern cloud-based deployment these days, I think.
Well, that's good news.
Certainly.
I'm sure a lot of those people do also have legacy deployments as well.
And a lot of what I'm currently seeing, you know, I'm involved in the Cilium project.
Cilium is a networking solution.
I would say kind of mostly for Kubernetes,
but a lot of the challenges we see now are to do with allowing people to coordinate
between their lovely, shiny, new Kubernetes workloads
and their legacy workloads that are running on,
you know, a BGP network in a data center somewhere. So there's definitely, people haven't thrown away all those data centers yet.
Right, there's kind of like a migration path, but you have to straddle for probably years, because you're not just going to throw everything out and start fresh. That doesn't make any business sense.
It's probably a bit like, so when I very first got into computing professionally, when I was doing my first job, we were doing things that emulated punch cards, because people didn't, you know, the world had invented a lot of things that were a lot more modern than punch cards, but it just took people a very long time to migrate away from those really old systems.
Yeah, well, I'm nostalgic, so I still pine for the days when we could name our servers. You know, I like a good naming scheme. I love to check the uptime on a server and be like, this server has been up for two and a half years.
That always felt good.
That's why I would never upgrade my kernels.
But I understand.
Things push forward.
You can't do it that way forever. And there's definitely way more reasons to do it the new way.
Think of all those security vulnerabilities
that are potentially in that old code.
All right, you convinced me. This episode is brought to you by our friends at Fly.
Fly lets you deploy full-stack apps and databases closer to users,
and they make it too easy.
No ops are required.
And I'm here with Chris McCord,
the creator of Phoenix Framework for Elixir,
and staff engineer at Fly.
Chris, I know you've been working hard for many years
to remove the complexity of running full-stack apps in production. So now that you're at Fly solving these problems at scale,
what's the challenge you're facing? One of the challenges we've had at Fly is getting people to
really understand the benefits of running close to a user. Because I think as developers,
we internalize the CDN, people get it. They're like, oh, yeah, you want to put your JavaScript
close to a user and your CSS. But then for some reason, we have this mental block when it comes to our applications.
And I don't know why that is.
And getting people past that block is really important
because a lot of us are privileged that we live in North America
and we deploy 50 milliseconds a hop away.
So things go fast.
Like when GitHub, maybe they're deploying regionally now,
but for the first 12 years of their existence,
GitHub worked great if you lived in North America.
If you lived in Europe or anywhere else in the world, you had to hop over the ocean and it was actually a pretty slow experience.
So one of the things with Fly is it runs your app code close to users.
So it's the same mental model of like, hey, it's really important to put our images and our CSS close to users.
But like, what if your app could run there as well?
API requests could be super fast.
What if your data was replicated there? Database requests could be super fast. So I think the challenge for Fly is to get people
to understand that the CDN model maps exactly to your application code. And it's even more
important for your app to be running close to a user because it's not just requesting a file.
It's like your data, and saving data to it, that all needs to
live close to the user for the same reason that your JavaScript assets should be close to a user. Very cool. Thank you, Chris. So if you understand why you CDN your CSS
and your JavaScript, then you understand why you should do the same for your full stack app code.
And Fly makes it too easy to launch most apps in about three minutes.
Try it free today at fly.io. Again, fly.io. So I agree that this feature of being able to kind of like hot upgrade or patch, I guess,
your kernel without upgrading your kernel via eBPF, modify the way it works, protect yourself from that security
vulnerability today without major downtime or upgrades. I mean, that does seem like an amazingly
revolutionary feature. Is there anything about that, though, that's scary? It's like, hey,
go ahead and change the way that things work from user space? Like, doesn't that seem a little bit like you could also shoot yourself in the foot?
Yeah, people often, you know, have that concern when they first hear about eBPF. Here's this incredibly powerful platform that can change the way your servers are operating, and security is certainly a huge concern. So a couple of things to be aware of.
First of all, when you load these eBPF programs into the kernel, they go through what's called
the verifier, which checks that the program is safe to run. And this is one of the big advantages
compared to, let's say, a custom kernel module. Kernel modules are just
kernel code that just run. Nothing is checking whether they're buggy or not. With eBPF programs,
the verifier will make sure that it's going to run to completion, so it can't loop forever.
It will check to make sure that all pointer dereferences are safe. It will check to make sure that memory access is safe. And, you know, while nobody who works in security is ever going to say that means it's completely secure, the verifier does a lot of work to make sure that the program is as secure as possible and certainly can't crash your kernel. That's kind of
a guarantee. So that's one side of the security equation. The other is that you do have to treat
eBPF like root privileges. You don't want to allow random people to insert random eBPF programs into your service because
they do have the potential to see literally everything that's happening on that machine.
So treat eBPF like you treat root privileges. Be very careful about who you allow to run
eBPF programs.
So with great power comes great responsibility, as the comics say. That makes sense. So how do you run an eBPF program? Or how would you facilitate not running it, you know, who gets to, who doesn't get to? I assume these are standard Unix user tools, or how does that work?
So eBPF itself is, say, a feature within the kernel, a bit like, I don't know, the TCP/IP stack is a feature within
the kernel. Most people probably won't interact with it directly. They'll probably use tools that
take advantage of eBPF. I love to show people how things work. So I've done talks before that show,
you know, beginner's guide to eBPF programming, because I think it really helps people get a mental model if they can see some actual code.
That's certainly how I learn things.
I kind of have to see the real thing. But when you write eBPF programs, you are interacting with the kernel and the kernel's data structures, and writing eBPF code does quite rapidly go from hello world, which everybody can do, to, okay, how do I safely interact with these data structures, and what am I changing when I change this? So for that reason, I think most people are going to find eBPF accessible through the use of sort of higher-level abstractions, higher-level projects.
A few examples.
So Brendan Gregg, who was at Netflix, he's now at Intel,
he did lots of work to build some eBPF-based observability performance tracing
tools. And there's a whole array of, I think, literally dozens of tools for measuring anything
that you might want to measure about how your system's operating. And then we get into other abstraction projects, like Cilium for networking and observability, like Parca for seeing flame graphs, or some continuous monitoring of how your user space applications are running. There's a tool called Pixie that's also in the CNCF for observing your Kubernetes workloads.
Lots of different projects that are using the power of eBPF to give you really advanced capabilities,
but that are in a much more easy to consume fashion than messing with the kernel directly.
Gotcha. So most of us will benefit from eBPF
kind of transitively through tools and projects
that are using it under the hood
and providing some higher level functionality.
And those of us who are going to write
our own eBPF program as well,
you know who you are, right?
Like there's the self-selecting group
of people who are very interested in kernel-level things, are very good at them, or can at least learn, and have a use case.
So we were talking about the security angle.
The other one that I think of when I think of something that allows you to hook into
low-level primitives or low-level kernel space is performance.
I feel like you could really slow things down if you do it wrong.
Is that the case?
Or are there also things in there that say it has to be performant, similar to the verifier?
How does that work?
Yeah, I mean, it would certainly be true.
It would certainly be possible to write pathological code that would slow things down.
Generally speaking, most eBPF programs tend to be small. There's a historical reason for that.
There used to be a limit of like 4,096 instructions. So a few years ago, you only could write small
eBPF programs. That limit has now been raised and you can, to all intents and purposes, write
pretty much anything you like
in eBPF.
Was that pretty constraining for folks?
Yes, yes, definitely. So everybody rejoiced when this changed.
Yes. It certainly seems like that kind of constraint might actually be a benefit, at least maybe at first, but now that people are starting to do more with it, I can see where they would feel constrained.
Yes, yes. The fact that you're calling these eBPF programs directly in the kernel can often lead to some really good performance improvements, actually, particularly for things like networking. So as an example of this, for Cilium providing networking to Kubernetes pods, I need to just back up a bit and talk a little bit about how container networking works.
All right, let's do it.
When you create a container, you usually create a networking namespace for that container, so the container has its own networking stack, effectively. And you create a virtual Ethernet connection that connects your container to the host that it's running on.
And in Kubernetes, you typically have one of these network namespaces per pod.
What that means for a network packet that arrives, let's say a packet arrives to that machine from the outside world through a
physical network card into that machine. And in traditional networking, that packet's got to go
all the way through the networking stack on the host across that virtual Ethernet connection into
the network namespace for the container, and then go through the networking stack again to reach the application.
What we do in Cilium, using the power of eBPF, we're creating what we call endpoints,
a sort of logical endpoint for each pod. And when that network packet arrives, we can inspect it
before it goes through the kernel's networking stack, and we can say, oh, well, I know where that, you know, the IP address that's associated with that pod, I know where it is, I have its endpoint right here. We can avoid going through the host networking stack and go straight into the pod's networking stack. And while that might not sound like very much, it shortens the networking path dramatically. And when you add up however many millions of packets there are, this is one of the really fun things about infrastructure software, is, you know, these things scale, the impact scales up, and you can see real improvements, significant improvements, in latency by using eBPF to shortcut these networking paths.
There's an old commercial where a guy is running through his office
and he's holding a nickel and he's jumping up and down. I saved a nickel. I saved a nickel.
And he's just telling everybody he saved a nickel. And they're all just like, whatever,
George, or whatever, like they're rolling their eyes or like, you know, perturbed. And he runs past these people who are walking
through the hallway, like who are like C-level execs or VPs or whatever. And he's like, I saved
a nickel every time we do X, whatever X is. And the two guys look at each other and they say,
we do X 75,000 times a day. And you know, it hits you that all of a sudden this micro-optimization
at scale is a huge win.
It sounds like that's what you're describing.
Yeah, exactly.
Exactly, yeah.
Okay, so performance, if you do it right,
you're going to end up better off
with an eBPF-powered program than otherwise.
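(A toy sketch of that shortcut in C, emphatically not Cilium's real datapath: a tc-attached eBPF program on the host's physical interface hands packets destined for one known pod straight to the pod's side of its veth pair with the bpf_redirect_peer helper, instead of letting them walk the whole host stack. The pod IP and veth ifindex are made-up constants here; Cilium discovers and manages these per endpoint.)

    /* tc_to_pod.bpf.c -- toy sketch of the "skip the host stack" idea, not
     * Cilium's actual datapath. Attach roughly with:
     *   tc qdisc add dev eth0 clsact
     *   tc filter add dev eth0 ingress bpf da obj tc_to_pod.bpf.o sec tc */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    #define POD_IP      0x0a00020f /* 10.0.2.15, hypothetical pod address     */
    #define POD_IFINDEX 42         /* hypothetical host-side veth ifindex     */

    SEC("tc")
    int redirect_to_pod(struct __sk_buff *skb)
    {
        void *data     = (void *)(long)skb->data;
        void *data_end = (void *)(long)skb->data_end;

        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
            return TC_ACT_OK;

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return TC_ACT_OK;

        if (ip->daddr == bpf_htonl(POD_IP))
            /* Hand the packet straight to the pod's side of the veth pair,
             * skipping the rest of the host networking stack. */
            return bpf_redirect_peer(POD_IFINDEX, 0);

        return TC_ACT_OK;
    }

    char LICENSE[] SEC("license") = "GPL";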
The other aspect of performance,
so things like observability tooling,
you can hook into these
events that might happen very, very frequently, but run this very small eBPF program to count or, you know, take some information about those events, and store them in, there's a thing called eBPF maps, it's a data structure that's shared between, or it's in the kernel, that the user space programs can access.
So you can store this data very efficiently in the kernel
and then retrieve it, I'm going to say on a leisurely basis,
you know, from user space.
Leisurely.
Because you don't have to kind of do that transition for every event.
You don't have to, perhaps you're collecting that information in user space every hundred events or every thousand events.
So, usually, the transition between kernel and user space is very costly performance-wise, but by not having to transition for every event, it's much more performant.
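(A minimal sketch of that pattern in C, invented for illustration: the kernel side bumps a per-PID counter in an eBPF hash map on every process execution, and user space reads the map whenever it likes, for example with bpftool map dump name exec_counts every few seconds, instead of being woken up for each event.)

    /* count_execs.bpf.c -- kernel side: count process executions per PID. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, __u32);    /* pid */
        __type(value, __u64);  /* number of execs seen */
    } exec_counts SEC(".maps");

    SEC("tracepoint/sched/sched_process_exec")
    int count_exec(void *ctx)
    {
        __u32 pid = bpf_get_current_pid_tgid() >> 32;
        __u64 one = 1, *count;

        count = bpf_map_lookup_elem(&exec_counts, &pid);
        if (count)
            __sync_fetch_and_add(count, 1);   /* updated in place, in the kernel */
        else
            bpf_map_update_elem(&exec_counts, &pid, &one, BPF_ANY);

        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";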
Let me see if I'm understanding you correctly.
So in the context of monitoring or observing a program,
people would generally take like one out of every hundred
or they would sample because it's cost prohibitive.
You don't want to bog down the CPU that you're running the program on, right?
You want to observe it without affecting it.
And you're saying with
eBPF, because of the performance
savings without having to go back and forth between
kernel and user space,
it's so much faster
that you don't have to sample or maybe you sample
way more often without
incurring the performance cost. Is that what you're saying?
Yeah, that's exactly right.
Yes, yes.
Well, that sounds cool.
I can see where that would be great.
Yeah, so you can see some really powerful metrics
and make security checks for every single time
that a particular kind of operation happens.
And you can filter those events potentially in the kernel.
So maybe you want to police which processes
are allowed to access which files, say. And there's
been a kind of evolution in the way that eBPF programs do that kind of check. So it used to be
very much based around system calls. We're going to look at those system calls and see whether or
not we permit that open. People might have even come across this in the form of
seccomp. So seccomp stands for secure computing. It's a pretty old technology. Docker kind of popularized it quite a lot. You had seccomp profiles that you would associate with programs
to just limit a little bit of what system calls applications are allowed to call.
And that is actually based on BPF. It does use BPF to make those checks. But as eBPF has evolved,
we could start looking at things like not just is this application allowed to call open on any file,
but is it allowed to open this particular file? More recently, there's an interface called the Linux Security Module interface that typically has been used for kernel modules that added security checks. But now we can hook eBPF programs to that security module interface, and we can make checks to say, is it okay if this user or that process or whatever opens this file? We've been working on something called Tetragon that takes this another step further, really, and allows us to filter on the path name, so the name of the file that we're going to open. We'll filter those events in the kernel, so we're not making the check in user space for every single file open. We're checking it in the kernel and only filtering out the file opens that match a particular prefix, for example, just as an example kind of event. So you can make these things, this internal filtering, can make these security checks really performant.
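(Tetragon expresses its policies in YAML, but underneath it is this kind of kernel hook. As a toy illustration of the LSM-hook approach, not Tetragon itself, here is a sketch in C that attaches to the file_open LSM hook and refuses every file open made by one hypothetical UID. It assumes a kernel built with BPF LSM support, CONFIG_BPF_LSM with "bpf" in the active LSM list, and a vmlinux.h generated from the running kernel's BTF.)

    /* lsm_block_open.bpf.c -- toy sketch of an eBPF LSM hook, not Tetragon.
     * Denies every file open made by one hypothetical UID (1337). */
    #include "vmlinux.h"              /* kernel types generated from BTF */
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    #define BLOCKED_UID 1337          /* made-up UID, purely for illustration */

    SEC("lsm/file_open")
    int BPF_PROG(block_open, struct file *file)
    {
        __u32 uid = bpf_get_current_uid_gid() & 0xffffffff;

        if (uid == BLOCKED_UID)
            return -1;                /* negative return (EPERM) denies the open */

        return 0;                     /* 0 defers to the kernel's normal decision */
    }

    char LICENSE[] SEC("license") = "GPL";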
So let's speak for a minute to the person who earlier raised their hand when I said, if you're going to be programming eBPF, you know who you are. To that person getting started, or even like language requirements, is it like a C interface? Can you use various programming languages? Maybe just give the lay of the land for that person who would like to actually dive in and go for the hello world and maybe go beyond.
Maybe point to some of your talks or somewhere where they start.
Yeah.
So you typically need to write two pieces of code.
You write the eBPF program itself and that runs in the kernel.
And you're typically going to write some user space code
that can interact with that in some way. Maybe you're collecting metrics in the kernel and you're
going to have some user space code that will retrieve those metrics and show them to the user.
Or maybe the user space program is going to provide some configuration information to the eBPF program.
Some eBPF programs, particularly for networking, there's no user space part involved.
For example, if you wanted to do firewalling, you'd typically just load that into the kernel
and maybe you'd only be reporting a few metrics to user space.
Anyway, so you've got these two parts. The kernel code has to be compiled into BPF bytecode, and at the moment you can compile from C, and you can now also compile from Rust. So you'll need to be proficient, or, you know, willing to at least take a stab at writing some C code or some Rust code.
For the user space part, you've got quite a lot more flexibility. And this is another kind of
area where there's quite a lot of evolution. There are quite a lot of different approaches,
different libraries, different frameworks. A lot of people start with a
framework called BCC, which has been around for a few years. And it does make it really easy to write
both the user space code and to kind of do things like loading that BPF code into the kernel. BCC
will take care of a lot of that for you. But the downside of BCC is that it actually compiles your BPF code kind of in real time.
So maybe you write your program in Python or at least the user space part in Python.
And when you execute that Python program, BCC will go away and compile your C code and load it into the kernel. And that means wherever you want to run it, you would need the C compiler tool chain, which is not necessarily what you really want.
And one of the reasons why they did that is because wherever you compile that code, you need
to have knowledge or the code is going to have to match the kernel data structures on the machine
where you're going to run it. And kernel data structures do change from version to version.
So if I build some eBPF code on machine A, how do I know that it's going to run on machine B?
And one of the big innovations in the last sort of recent years in eBPF is a thing called
compile once, run everywhere, which essentially allows you to compile the code on machine
A and sort of include the knowledge of what the kernel structures are on machine A.
And then when you take that compiled object to machine B, there's essentially some
automatic work that compares, oh, well, the kernel data structures are slightly different here. So I
might need to adjust the code to take account of that automatically. And that makes it much easier
to build the code and distribute it to users without them needing to have like the C compiler installed.
So that's made quite a big difference, made it a lot easier for people who do want to distribute eBPF-based tools.
Which seems like it's most people, because like you said, you have this small group of
people slash teams who are building the tools and a whole bunch of users who are benefiting
from those tools.
Well, those tools have to get onto their machines and they have to work on their machines, and so now you have this cross-platform problem, only the platform is the Linux kernel. And so you have these different versions, different data structures. It seems like a definite real challenge, and that sounds like a boon to eBPF people for getting their stuff out there.
Absolutely. It's a real kind of step change. I think we keep seeing these big improvements in
eBPF that just mean that it's more accessible or the tools based on it are more accessible
to the world at large. And that's fantastic.
What's still painful? Where could the next step change come?
Oh, that's a great question. Some of this is still painful because not everyone is running a modern enough kernel to have, you know, all the latest features.
Especially that instruction limit change, right? The max 4,096, you said that was a recent thing.
Yes, that would be an example. Yeah. So if you have a tool that needs to exceed that limit, then yeah, you might need to do some tricks to make it run on older kernels.
Right.
There are things like the way that you can actually attach programs to different events in the kernel. Some of those have evolved and become more performant. So for example,
you'll see loads of examples of eBPF programs that attach to kprobes. Kernel probe, it's basically
a hook in the kernel. kprobes pre-existed eBPF, but it was for tracing or adding tracing probes
into the kernel. And it's essentially, you can add a kprobe at the entry to pretty much any function in the kernel.
Here's the function name.
I want to add some tracing there.
And you've been able to use eBPF programs,
hook those to kprobes for a long time.
And over time,
there's been some more and more performant ways of doing that.
So the current preferred approach to that is called fentry. It doesn't make that much difference, it certainly doesn't matter to anybody who's just using the tool. It's a pretty easy change for somebody who's writing the code. But, like we were saying before,
all those nickels,
every, you know, tiny improvement
in the speed of running that program
once it will add up
when you've done it a million times.
So we'll see things like
more performant hooks.
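(To make the kprobe-versus-fentry point concrete: from the program author's side the two attachment styles look almost identical, which is why the switch barely matters to anyone just using a tool. A sketch in C, assuming a recent kernel with BTF and the do_sys_openat2 kernel function, plus a generated vmlinux.h:)

    /* openat_hooks.bpf.c -- the same trivial logic attached two ways. */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    /* Older style: a kprobe at the entry of the kernel function. */
    SEC("kprobe/do_sys_openat2")
    int BPF_KPROBE(openat_kprobe)
    {
        bpf_printk("openat (kprobe)");
        return 0;
    }

    /* Newer, more efficient style: an fentry hook on the same function.
     * Needs BTF (CONFIG_DEBUG_INFO_BTF) in the running kernel. */
    SEC("fentry/do_sys_openat2")
    int BPF_PROG(openat_fentry)
    {
        bpf_printk("openat (fentry)");
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";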
There's also, I think,
for eBPF,
for folks who are developing eBPF tools,
there's lots of innovation happening in things like testing and code coverage and sort of instrumenting your code. Getting your code
through the verifier is still something of an art. And there's, I think, probably more improvements to come in sort of making it easier for people to write those eBPF programs kind of without necessarily having to do such a dance with the verifier.
There's a really great quote I read that described the eBPF verifier as a, I think it was a fickle beast.
It's quite a nice phrase.
Sounds like something I'd like to stay away from, if at all possible.
It's a challenge, though.
This episode is brought to you by Square.
Millions of businesses depend on Square partners to build custom solutions
using Square products and APIs. When you become a Square solutions partner, you get to leverage the
entire Square platform to build robust e-commerce websites, smart payment integrations, and custom
solutions for Square sellers. You don't just get access to SDKs and APIs, you get access to the
exact SDKs and the exact APIs that Square uses to build the Square platform and all their applications.
This is a partnership that helps you grow.
Square has partner managers to help you develop your strategy, close deals, and gain customers.
There are literally millions of Square sellers who need custom solutions so they can innovate for their customers and build their businesses.
You get incentives and profit sharing.
You can earn a 25% status revenue share,
seller referrals, product bounties, and more.
You get alpha access to APIs and new products.
You get product, marketing, tech, and sales support.
And you're also able to get Square certified.
You can get training on all things Square
so you can deliver for Square sellers.
The next step is to head to changelog.com slash square
and click become a solutions partner.
Again, changelog.com slash square.
And by our friends at Retool. Retool helps teams focus on product development and customer value, not building and maintaining internal tools. It's a low-code platform built specifically for developers. No more UI libraries, no more hacking together data sources, and no more worrying about access controls.
Start shipping internal apps that move your business forward in minutes with basically
zero uptime, reliability, or maintenance burden on your team.
Some of the best teams out there trust Retool, Brex, Coinbase, Plaid, DoorDash, LegalGenius,
Amazon, Allbirds, Peloton, and so many more.
The developers at these teams trust Retool as their platform to build their internal tools,
and that means you can too.
It's free to try, so head to retool.com slash changelog.
Again, retool.com slash changelog.
Okay, so we've talked a lot about what eBPF is. I'm going to ask you a slightly different question, interpret it however you like. Who is eBPF?
Oh, interesting question. So I'm going to answer that with a bit of a story. A few years back, I saw a talk at DockerCon about how Cilium was using eBPF to create this really cool container networking.
And I thought, that is really cool.
And yet it's so foreign to like, nobody can possibly use this because it needs this cutting edge kernel at the time.
But I thought, that's interesting tech, you know, I'm gonna just keep an eye on that, I want to see how that works. And then a few years later, I was working for a security company, and somebody suggested using eBPF. They'd actually been doing a project outside of work using eBPF on Android for a sort of security-related project. And they were like, could we use eBPF to build security tooling? So we worked on that for a while. And in the meantime, I was seeing more and more of this eBPF community kind of building up, more and more people using eBPF and different projects. And Isovalent, which was the company that Thomas Graf and Dan Wendlandt, Thomas who I'd seen speaking at DockerCon, they founded Isovalent around the Cilium project. And they were facilitating this eBPF community. And, you know, I realized that if I wanted to really immerse myself in eBPF, that was the place to join. And that's why I joined Isovalent. And since I've been there, one of the things that I really hadn't appreciated before I was there was the extent to which Cilium and eBPF have actually been kind of developed in almost lockstep. So there are two maintainers of the eBPF subsystem in the kernel. One of them is Daniel Borkmann, who works for Isovalent, and the other is Alexei Starovoitov, who works for Meta. And they are the people who kind of drive eBPF's future.
And a lot of how eBPF has evolved, certainly on the networking side,
has been in order to allow Cilium to build some cool networking feature,
we need support in eBPF to enable that, you know, maybe different hooks into different parts of the networking stack, as an example. So it was just fascinating to me to see just how much of the development in eBPF had really been done to enable, I mean, to enable the platform as a whole, but particularly with this vision of how eBPF could improve networking and facilitate all these really efficient networking features.
So for me, that was kind of why I was drawn to that team. The expertise is just,
you know, beyond comparison, I think, and a really exciting place to be.
That's cool.
So in terms of open source project
related to a corporate entity,
how does, I guess, where does Cilium stop
and Isovalent start with regards to financial arrangements
and stuff like that?
How does that all work?
So Cilium has always been open source.
And one of the things that we did
not long after I joined
was donate the Cilium project to the CNCF
so that it's got that foundation ownership
so that everyone can have confidence in it
as a community project.
And Isovalent provides an enterprise distribution of that. And the way we approach this is that, you know, Cilium works. Cilium open source works. There are plenty of people who are using Cilium at scale. You know, you can go and take a look at the Cilium website, and there's a list as long as your arm of household names who are using Cilium. And a good number of those are using it open source.
But some of them either need support, that's the kind of classic open source model, or some of them
need features that you only need if you're an enterprise, you know, a large enterprise.
For example, I mentioned before about, you know, integrating with legacy workloads in data centers. You know, if you're operating your own data center, you are the kind of organization that spends money on software, right? You know, you want to license software, you want to have somebody who's going to provide some support around that software.
some really advanced UI, some really advanced security tooling features
that we add on top of the open source project for our enterprise customers.
And there are other people, because it's in the CNCF, there's, you know.
Other offerings.
Other people who can use Cilium
or build products on top of Cilium.
Love that, because now you're competing
on a level playing field.
Of course, as the maintainers of Cilium,
you have that expertise, the street cred, so to speak.
So other people have to establish that.
But the fact that you can have multiple service providers
or licensors or offerings that are competing on
how well they do that and not competing over
the proprietariness of the software that they're running,
I mean, that's spectacular for everybody.
Yeah, absolutely.
I mean, I'm a big believer in the power of open source in general
and specifically for infrastructure
software, just that, you know, the sheer number of people who will use open source code, it creates such field hardening that I think, for that kind of core capability, something like, you know, how your networking is plugged together, it's really an advantage for it to be open source.
who also feel confident about contributing to it as well,
which I absolutely love.
Totally.
Well, if you look at the network stack or the OSI stack,
whichever one you prefer,
you want as much competition at the application layer as possible.
And collaboration at the lower layers.
You know,
if we're all reinventing these low level things,
then we're just,
we're just wasting efforts and you can find competitive advantages by doing
that at,
you know,
but they're going to be just isolated to you and have all those drawbacks or
everybody can collaborate at those levels,
have all the best minds working on the same thing,
pushing everybody forward and then competing at the application layer way more effective that way. I mean, just the way it
should work. 100%. Yeah, absolutely. And we can take lessons from history around this. So back
in the day, if you wanted to use TCP, you had to include a TCP library in user space.
And nowadays, we fully expect that you're going to run TCP.
You're just going to use the kernel services to get that TCP connection going.
And I think it's completely sensible to extrapolate
from that direction of travel and expect that more and more
of the infrastructure software will
not just become that kind of commodity open source software but also that more and more of it will be
handled by the kernel, especially now that the kernel itself, the kernel authors, don't have to handle it.
Right, with eBPF you have more and more kernel-based offerings that are happening by people who are not, you know, Linux kernel, or we can talk about Windows kernel as well, core maintainers. The innovation can happen in a much broader group of folks because of eBPF.
Yes, yes. And it gives us the ability to have, you know, people are using
Linux for, you know, just the broadest range of different purposes. And the Linux kernel
has to work on, you know, IoT devices and desktops and data centers and probably the moon. I don't
know. And in fact, I think Linux does run on Mars. I'm sure it does. Yeah. One of the Mars
landers. Yeah. So Linux, you know, the kernel itself has to be super flexible and very backwards compatible,
but you can do much more sort of innovation
and bespoke things using eBPF,
which is a rich seam of innovation.
There's definitely a parallel between browser tech
and kernel tech in this way.
I know I've heard people compare eBPF
to be like the JavaScript of the Linux kernel.
Yes.
Just because of the JavaScript's relationship
to the browser.
And I can definitely,
when I first heard that,
I was like,
I don't know about that.
But the more I think about it,
the more that it does make sense as an analogy.
And all of the innovation that happens in the browsers
by people writing JavaScript libraries
that eventually those things prove themselves out,
like jQuery, for instance,
the way it does a lot of selecting and stuff.
All of a sudden, that stuff gets brought back in
to the browsers.
And so we could have a similar thing here
where you have the innovation in the eBPF world
and then the best ideas,
the most obvious ones in retrospect,
the ones that everybody needs,
well, that stuff is baked back into the kernel maybe.
That would be cool.
Absolutely.
So on the website, ebpf.io, it gives four kinds of applications, networking, security,
profiling, and observability. You mentioned three. We could probably bike shed a semantic debate
on is observability and profiling, I guess, different things, the same thing. Is tracing
part of observability? I don't know. It doesn't matter to me. But if we think about these three categories that you gave earlier,
networking, security, and observability, can you give examples of people doing cool stuff,
feel free to name names, or open source projects in each of these three, like if you're going to
say, okay, here's cool stuff that's happening. I know you've touched on them throughout the show.
But if we're just gonna say, here's cool networking stuff, here's cool security stuff, here's cool observability
stuff. What would you, what would you mention for those three? Yeah. So for networking, I mean,
obviously I am very involved with Cilium. So that's the first name that comes to mind. But there are other, you know, users and projects. So Facebook, now Meta, have a project called Katran, which is a load balancer, and I'm trying to remember what the date is. I want to say 2016, let's say 2016, and I'll apologize if I'm not quite right there. But basically, every single packet since that date that goes to Facebook has been through eBPF. Every single packet has been processed by an eBPF program.
Wow.
If that doesn't convince people about the scalability of eBPF, I don't know what would, right? Cloudflare also using eBPF to do things like DDoS protection and load balancing.
And yeah, lots of really cool blog posts that Cloudflare have written about their use.
If we turn to observability, I mentioned the work that Brendan Gregg had done
and this whole series of tools.
And he developed that at Netflix, where they were using it for, again, really scalable, you know, performance measurements. And yeah, whether we call it tracing or observability or metrics or whatever, it's all about...
You want to give a hot take? Do you have an opinion on this? Is it worth distinguishing, or no?
I think there is a bit of
And then maybe observability is about how do I take all that information and actually ask questions of it in a sensible way.
An umbrella term, sort of.
Yeah, yeah.
Fair enough.
It's definitely, there's an overlap, definitely. I know it's been the subject of many Go Time unpopular opinions, whether or not observability is a thing or not really a thing. So it's fun for nerds to talk about.
Yeah, I do quite like it as that umbrella term for, I want to know something about what's happening in my system, in my cluster, in my deployment.
Yeah, it's quite a nice term.
Yeah. So observability projects, I think I mentioned Pixie, which is a CNCF sandbox project. Parca is another one that's really interesting for observing your applications' behavior. Cilium has a project called Hubble, or a sub-project called Hubble, that shows you things like your service graph in Kubernetes, so how your services are communicating with each other. It can also show you individual network packets, which is pretty cool if you're trying to debug...
Yeah, debug DNS, because it's always DNS. Yeah, right.
Yeah.
Other networking problems are available.
By request.
Yeah.
And then on security side,
so Falco was probably the first security project,
certainly in the CNCF, that was using eBPF. There's a project from my former colleagues at Aqua called Tracee. And then in Cilium, we have a sort of Cilium family, we have a sub-project called Tetragon, which is allowing you to create low-level security primitives, almost in YAML form, and apply them to your Kubernetes cluster. And you can do really cool things with Tetragon. I get a bit overexcited about this, because if your kernel is modern enough, you can not just detect that something is, you know, potentially malicious behavior, you know, processes opening the wrong file
or connecting to a cryptocurrency miner
or whatever malicious thing that you've detected,
you can kill that process synchronously
from within the kernel.
And what that means is the process gets killed.
It's not like you have to go and tell somebody
and then eventually your process gets killed. It's happening right there and then, and it stops the attack before it happens, which is super fun to demo.
I love it. I bet that makes for a great demo.
Very good. That helps out a lot, especially for people who are interested in cool things being built with Cilium. So I've been ferociously grabbing links as you talk.
So those will all be in the show notes for the listeners.
As we wrap up, Liz, let's talk about the future,
where things are going.
You mentioned Windows kernel.
I assume that's like a burgeoning thing,
or is it available?
And what's coming down the pipeline in the eBPF world?
Yeah, so the Windows eBPF,
I know that they have got as far as being able to demo Cilium running on Windows.
So whether it's in production Windows, I don't know.
But it's certainly some significant progress being made there to implement it on Windows.
I'm sure we will hear more about that, and also more about sort of the future of eBPF more generally, at the community conference that's coming up that I'm part of the team organizing, called eBPF Summit, which happens September 28th and 29th. Put it in your diary. And we are going to have, amongst the speakers, we've got both, I mentioned before, the two kernel maintainers who work on eBPF. Both of them, Alexei and Daniel, are both going to be speaking at eBPF Summit this year.
So we should get a pretty good insight into what the future of eBPF is from a platform perspective.
I think that will be super interesting.
And we're also going to hear from lots of end users,
lots of people working on different projects.
We're in the process of going through the session proposals at the moment,
and there are so many good proposals.
It's going to be really difficult to choose.
But last year, we had a lot of fun.
We had a lot of people on Slack kind of doing things
like capture the flag with us interactively. And it was tons of fun. So hopefully this year's
eBPF Summit will be even bigger and better and more fun. Very cool. And that is fully virtual.
So access from anywhere with an internet connection. Yeah. Excellent. Well, anything we left uncovered?
Anything else you want to talk about here before we call it a day?
Not that I can think of, no.
Yeah, we've pretty much covered everything.
Excellent, excellent.
Well, listener, all the links to all the things are in your show notes.
Liz, thanks so much for joining us on the show.
Thanks for your excitement and your ability to so well explain these difficult concepts
and get other people excited.
It sounds like you're a great advocate for this technology and the power that it's unlocking
for so many of us through people building cool tools and stuff that probably we haven't
even thought about yet.
We have these three major buckets, but I'm guessing there's maybe a fourth bucket out
there, maybe things that we don't even know.
So I'm excited about the future that eBPF is affording us.
Absolutely. Me too. It's an exciting space to watch and be part of.
Absolutely. Well, we may have to have you back, maybe put a marker a year from now,
maybe have you back, have a catch up, see what's going on,
see what people have invented in the meantime. That'll be fun.
That would be awesome.
All right. Thanks, Liz. Thanks, everybody.
Thank you for tuning in. Let us know in the comments what you think about eBPF. The link is in the show notes. Thanks again to our friends at Fastly and Fly. Everywhere around the globe, our pods and our app are fast, and that's because Fastly and Fly are fast everywhere.
Check them out at fastly.com and fly.io.
Thanks also to Breakmaster Cylinder for making our awesome beats.
And last but not least, thanks to you for listening to the show all the way to the very end.
Tell a friend if you love the show.
Send them to changelog.fm and tell them to subscribe.
That's it. We're done.
We'll see you next week.