Materialized View Podcast - Parca, Polar Signals, and FrostDB with Frederic Branczyk

Episode Date: January 8, 2024

I recently talked with Frederic Branczyk. Frederic is the founder of Polar Signals, a new always-on, zero-instrumentation profiler. Before Polar Signals, Frederic spent time at Red Hat and CoreOS, where he worked on Kubernetes and Prometheus.

In this interview, Frederic and I break down Polar Signals' architecture and its main components: Parca and FrostDB. Parca is of particular interest; it achieves its minimally invasive profiling claims by using eBPF to sample the entire OS's stack at 19 Hz. Data is then passed to a server and stored in FrostDB, an embedded storage engine built on DataFusion and Parquet.

During the discussion, Frederic mentions several influential papers:

* Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers
* BOLT: A Practical Binary Optimizer for Data Centers and Beyond
* Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications
* Large-scale Incremental Processing Using Distributed Transactions and Notifications

You can support me by purchasing The Missing README: A Guide for the New Software Engineer for yourself or gifting it to new software engineers that you know.

I occasionally invest in infrastructure startups. Companies that I've invested in are marked with a [$] in this newsletter. See my LinkedIn profile for a complete list.

Transcript
Starting point is 00:00:00 I recently got to sit down with Frederic Branczyk. Frederic is the founder of Polar Signals, a new always-on zero-instrumentation profiler. Before Polar Signals, Frederic spent time at Red Hat and CoreOS, where he worked on Kubernetes and Prometheus. In this interview, Frederic and I break down Polar Signals' architecture and its main components, Parca and FrostDB. Parca is of particular interest to me, as it is able to achieve its minimally invasive profiling claims using an eBPF filter that samples the entire OS's stack at about 19 hertz. Data is then passed to a server and stored in FrostDB, an embedded storage engine built on top of DataFusion and Parquet. And now, Frederic Branczyk. All right. So I thought we would start with Prometheus, actually. We were emailing back
Starting point is 00:00:50 and forth and you mentioned that you were a committer on Prometheus. So how did you end up working on Prometheus? Yeah. So that was basically in 2016, I joined a company called CoreOS. I don't know if people still remember CoreOS. We were kind of one of the early Kubernetes companies. And basically, we were working on automatically updating software. And we were kind of thinking about, you know, if we're automatically updating all of the software all the time, right, we need to understand whether the software is doing what it's supposed to be doing and doing that successfully before, during and after upgrades. And so we quickly identified, you know, monitoring and observability was going to be super key in
Starting point is 00:01:28 CoreOS's business. And so at the time, Prometheus was kind of the up and coming thing. And CoreOS hired one of the major maintainers at the time already, and he hired me. And then that's how I kind of got started and then quickly became a maintainer of the Prometheus project and kind of everything in that intersection of Prometheus and Kubernetes, either built or, you know, at the very least, left my fingerprints on.
Starting point is 00:01:55 And then ultimately, through all of that collaboration, I actually ended up becoming a tech lead on the Kubernetes project as well. Oh, wow. I didn't know that. Okay, so you've worked on Kubernetes as well. Yes. And so how did that experience lead you towards Polar Signals, which is the stuff you're working on now?
Starting point is 00:02:13 Yeah, so just maybe for setting the scene at Polar Signals, we build continuous profiling software. So we basically profile all of your infrastructure all the time and then record this data so that you can analyze it later. And that basically happened because CoreOS was acquired by Red Hat in 2018. And I ended up basically becoming architect for all things observability. So we had kind of the classic metrics team, logging team, distributed tracing team. And we were working on this super cutting edge observability software, but we kind of found ourselves manually profiling our software pretty much every single day
Starting point is 00:02:52 because we were working with these super performance sensitive pieces of software, right? Like Prometheus, Kubernetes, Jaeger, and so on. And so for us, it was second nature to do profiling anyway. And then I read this white paper that Google published a couple of years before that, that's the infamous Google-Wide Profiling paper, where Google basically, they were the first ones to describe this publicly, to my knowledge at least, how they're doing infrastructure-wide profiling. And that's where it really, really came to me that we need to be dealing with profiling just as systematically as we do with metrics, logs, and traces.
Starting point is 00:03:29 And it's really observability just like everything else. You always want the profiling data that you don't have. Yeah. So you mentioned you were doing it manually at that point. And that's, frankly, prior to some of the APMs and stuff that I've worked with, how I've done it in the past as well. How did that look for you guys? Were you connecting to production?
Starting point is 00:03:51 How were you doing the manual sampling? I assume it was sampling or profiling before then. Right. So in the Go ecosystem, profilers are pretty good. And basically in the Go ecosystem, and this came basically also straight out of Google, it's built into the Go runtime that you can have an HTTP endpoint that you hit and it will record a profile over, let's say 10 seconds and return you with that data. So that's basically
Starting point is 00:04:18 what we were doing all day long in Kubernetes, port forwarding and doing that, for example. But the difficulty with that is one, you actually need to connect with production, and that comes with all the potential problems as with SSHing and so on. But maybe more importantly, you don't have, like I said, this data that you really want to have, which was like at this weird time when there was this CPU spike or when an OOM kill has already occurred, right? You always want the data that you don't have. And this is kind of a theme that we keep seeing throughout all of the
Starting point is 00:04:50 observability. And we saw that trend basically also from like Nagios checks to like Prometheus and like Datadog, right? Where we went away from just checking, is this thing up? Is this thing up? Is this thing up? To like recording information as time series and alerting on the time series data. And now that we're doing this with profiling as well, we can do a super interesting analysis.
Starting point is 00:05:10 Not only do we have all this data throughout time, we can also say, hey, I have this performance regression in this new version of our software. Tell me all the differences of CPU time from this previous version to this new version. And boom, down to the line number, we can see exactly where new CPU time is now being spent. And like this kind of analysis wasn't even possible before. And if it was possible, you maybe were able to like grab some profiling data from the previous version because you rolled back
Starting point is 00:05:35 and took some profiling data again, and then took some profiling data again. So just like this super weird dance to get the right data. Now, it's just always available. Gotcha. So to recap, basically you were doing manual sampling against Go and they exposed an HTTP endpoint. That makes sense. Okay. I think that's loosely, I come from, you know, Java world, that's sort of loosely equivalent to what we were doing in JMX land. And there's, you know, at least in the Java side, there's sort of performance penalties you can get if you don't
Starting point is 00:06:04 sample properly, and then you can turn on HProf and sample and stuff. So there's sort of a trade-off between performance and granularity. It sounds like you were doing more point lookups at that time. So there was some issue, and you would go to the thing and say, hey, basically, give me stack traces for a 10-second window, and let me see what it looks like. Exactly. Super reactive, basically. Gotcha. If we were good, maybe we were running some benchmarks for a very specific thing that we wanted to optimize, but also building those benchmarks is really difficult to actually have something that behaves truly like production.
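(For reference: the Go endpoint described here comes from the standard library's net/http/pprof package. Below is a minimal sketch of exposing it; the port is illustrative.)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose Go's built-in profiling endpoints. A 10-second CPU profile can
	// then be pulled on demand with, for example:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=10
	// Any listener serving http.DefaultServeMux works; 6060 is just convention.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```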
Starting point is 00:06:35 Now, we use Polar Signals ourselves all day long, and it's this super fast feedback cycle where we can actually be sure this is exactly how it behaves in production because it is production data. Yeah, that makes sense. Were you guys doing any sampling at that point at the OS level? I think what we've been talking about so far is really at the application level, right? But alongside the application stack traces, there's a whole bunch of information around, you know, disk and all that kind of stuff. The thing I think about at LinkedIn back in the day, we ran SAR, which was this like system activity report thing that came with, I think with Red Hat Linux, that would allow you to sort of sample and record disk stuff.
Starting point is 00:07:15 And that looks a little bit more like traditional observability. I'm assuming you were marrying up what you were doing with the more traditional stuff. Is that true? Yeah, we were doing that kind of stuff more through like traditional, we call it now traditional metrics, right? Like at the time Prometheus was still like, you know, just starting to be implemented within companies.
Starting point is 00:07:34 But yeah, like metrics is what we were using for that. But this is actually a great kind of segue into also what changed with Polar Signals because when we started Polar Signals, we went about it a little bit naively, right? We came from the Go ecosystem, we were like, profilers are amazing, right? Collection is a solved problem. Turns out, they're only great in the Go ecosystem.
Starting point is 00:07:56 And so ultimately what we ended up doing is we ended up building a profiler completely from scratch using eBPF, which allows us to kind of grab this data at the operating system level. And through a lot of really hard work, we now basically have this completely zero instrumentation profiler. You only need to add this profiler to your host, and it starts profiling everything, no matter what language you have on that host. Yeah, so I think let's start to dig into Polar Signals.
Starting point is 00:08:25 But before we do that, I think we've kind of been talking about profilers without giving a really solid definition of what exactly we're talking about. So can you give me sort of, in your view, how you define a profile and why it's useful? Yeah. So profiling tools have basically been available ever since software engineering started because what profiling gives us is down to the line number and the function call stacks that led to some amount of resource usage. So it's essentially a stack trace and a number. That's all that profiling data is. And that resource could be anything. It could be CPU time. It could be memory. It could be file IO. It could be network IO.
Starting point is 00:09:01 Anything really. Most commonly we see CPU profiling data, also because it tends to be the most expensive thing on a cloud bill. And so naturally that's the one thing that people look at when they want to optimize for cost. But at the same time, CPU also tends to be the thing that you look at when you optimize for latency. Because if it's not some other IO call that you're doing, there's basically only CPU time left to optimize. And so that's kind of what we also see our customers doing most of the time. Gotcha. Okay. So yeah, let's dive into Polar Signals now. So this is a new company you started. When did you kick it off?
Starting point is 00:09:38 So actually, we've done quite a bit of R&D. So the company was founded about three years ago and we launched our product publicly about two months ago. Gotcha. And so somewhere in the midst of this is sort of two other projects. One of them is Parca and the other is FrostDB. So what is the relationship of those two things to what Polar Signals is doing? So the profiler is part of the Parca project. So Parca is essentially two components. One is the agent, that's the collection mechanism, everything that we've talked about so far. And then the server side is kind of the Prometheus equivalent. It's a single statically linked Go binary, extremely easy to get started with, extremely easy to deploy.
Starting point is 00:10:25 It kind of ships everything in one binary, a storage, an API, a UI, everything in one box. But similar to Prometheus, we very intentionally chose not to make this a distributed system because distributed storage is a very, very difficult problem to solve. And we kind of also decided that was something that at least for a start, we're going to, if we manage to, we're going to try to figure this out as the business. And for a start, it'll be our kind of competitive advantage. But yeah, that's kind of where we started. And FrostDB is the storage layer within Parca. And it's also what our distributed storage
Starting point is 00:11:01 within Polar Signals is based on. It's kind of our RocksDB for the columnar database. Okay. Okay. So I think the relationship is essentially FrostDB is an embedded time series database. You have a single node system built on that, which is Parca, that also comes with a CPU profiler, which is the eBPF thing you talked about. And then, excuse me, Polar Signals is essentially the cloud hosted version of this. Precisely. Got it. Got it. Okay. So why don't you run me through the eBPF CPU sampling stuff? I think that's really interesting to me. So first off, I think I'll take a shot at
Starting point is 00:11:43 summarizing eBPF because I'm very much a novice in that area, and then you can correct me where I'm wrong. But my understanding of eBPF is essentially, it's an interface that allows you to implement modules that go into the kernel in Linux and allow you to sort of almost like a filter chain or something, inject in between kernel calls from the application space. Is that loosely correct? Yeah, definitely. I think that's pretty accurate. Essentially, it's a virtual machine within the kernel that you can write C code for. And when you load it into the kernel, it goes through this thing called the verifier to make sure that whatever you're going to be executing within the kernel is actually going to be safe to be executed there. So it makes sure that certain areas of memory can be accessed.
Starting point is 00:12:30 It's all basically read only, except for very specific mechanisms. And then the way that eBPF programs are run are through triggers. And that can be exactly like you said, it can be a syscall being called, it can be a network packet being transmitted or received. Or in our case, we register a custom event using the perf events subsystem within Linux, where we're basically saying every X amount of CPU cycles call our program. And what our program does is it figures out when our program starts, we get just a pointer to the top of the operating system stack, right? Like when we, if we go back to like computer science and how kind of programs gets executed, they have a stack, right? And the
Starting point is 00:13:17 operating system does as well. And so when our program is called, we only get basically a pointer to the very top of the stack. And in a simplified form or in the best case, something called frame pointers are present. And what that means is that there's a register that's reserved that tells us where the next lower stack frame is. And so in that case, we can just walk this linked list essentially. And at every point, we collect what address of the instruction is. And that's essentially what we can then use to translate to a function name. That's basically how programs work. And so this is how we then get that function call stack. And if we see that same function call stack multiple times, statistically speaking, we're spending more time in this
Starting point is 00:14:03 function. Gotcha. So I think there's two things that I want to unpack there. One of them is you said when you're lucky, there are the references to the subsequent frames, right? And then the other one is, you know, my instinct is that these frames are essentially just, you know, bags of bytes. And so to make it useful to the developer, you need to somehow attach it to what would look more like a traditional stack trace that you would see as a developer, which has the method name, the return, the parameters, all that kind of stuff, package, all that. So how do those few things happen? So frame pointers are kind of a hot topic, actually. We actually just had launched this collaboration together with Canonical, where frame pointers are now going to be the default configuration for compilers within the Ubuntu packages, unless the package specifically
Starting point is 00:14:53 overrides and says, I want to omit frame pointers. Because basically this is kind of coming from the 32-bit world where we have a certain number of registers. I forget exactly the amount. I think it was eight registers, general purpose registers, and reducing that by one, only having seven, that actually makes a big difference, right? But now we have 15 or 16 general purpose registers, and it actually ends up making quite a bit less of a difference. In most benchmarks, you see absolutely no difference. Of course, you'll find edge cases for these things. For example, the hottest loop within the Python interpreter basically makes use of exactly 16 registers. And therefore, when you enable frame pointers, it has a major performance degradation. Just to make sure I understand, it sounds like the reason you would want to disable
Starting point is 00:15:44 frame pointers is for performance. Exactly. That's basically the only reason. You also get a little bit smaller of a binary because essentially it's instructions that need to set and retrieve the frame pointer. Okay. So in that world, your claim is essentially most of the time, especially with Ubuntu package work you've been doing, you're going to have frame pointers.
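(A conceptual sketch of the frame-pointer walk described a moment ago, written in Go purely for illustration: the real unwinder is an eBPF program running in the kernel, and the memory-reading helper and addresses below are made up.)

```go
package main

import "fmt"

// readWordFunc stands in for reading 8 bytes of the target's stack memory;
// in the real profiler this happens inside the kernel from the eBPF program.
type readWordFunc func(addr uint64) (uint64, bool)

// walkFramePointers follows the chain of saved frame pointers. On x86-64 with
// frame pointers enabled, the saved caller frame pointer lives at [fp] and the
// return address into the caller at [fp+8], so the stack is effectively a
// linked list of frames.
func walkFramePointers(readWord readWordFunc, fp uint64, maxFrames int) []uint64 {
	var addrs []uint64
	for i := 0; i < maxFrames && fp != 0; i++ {
		prevFP, ok1 := readWord(fp)
		retAddr, ok2 := readWord(fp + 8)
		if !ok1 || !ok2 {
			break
		}
		addrs = append(addrs, retAddr) // instruction address, symbolized later
		fp = prevFP
	}
	return addrs
}

func main() {
	// Three fake stack frames laid out in a toy "memory" map for illustration.
	mem := map[uint64]uint64{
		0x1000: 0x2000, 0x1008: 0xaaaa, // frame A: saved fp, return address
		0x2000: 0x3000, 0x2008: 0xbbbb, // frame B
		0x3000: 0x0000, 0x3008: 0xcccc, // frame C: end of the chain
	}
	read := func(addr uint64) (uint64, bool) { v, ok := mem[addr]; return v, ok }
	fmt.Printf("%#x\n", walkFramePointers(read, 0x1000, 128)) // [0xaaaa 0xbbbb 0xcccc]
}
```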
Starting point is 00:16:02 And in the case where you're not, does Polar Signals just... What does it do? Yeah. So we can still profile everything. And this is also part of what actually makes our profiler very innovative. So in the previous world, when you use a profiler like Perf, Linux Perf, what it does is it copies the entire operating system stack into user space, where it can then be unwinded synchronously using something called unwind tables. This is a special section in x86 binaries, and the x86 ABI specifies that this section must be present. Otherwise, Linux basically says, I don't know how to run, or execution of this program is
Starting point is 00:16:42 undefined. And this is the same way as how C++ exceptions work. Maybe you've seen this before. When you don't have debug infos included in a C++ binary and it has an exception stack trace, all you get are memory addresses. And you need to put those memory addresses into a tool like addr2line
Starting point is 00:17:02 to convert those addresses into function names. But basically these unwind tables, we needed to optimize very much so that the eBPF verifier would be happy with us still doing everything that we need to be doing. Because something that we didn't mention earlier with the eBPF verifier, it also limits how many instructions can be run. And it basically, one of the ways how it ensures that what you're running within the kernel is safe to do is it basically solves the halting problem by saying you cannot run basically Turing-complete programs. So you can't have loops that can potentially be endless. Everything has
Starting point is 00:17:43 to be bound, all of these kinds of things. And so this was a really big part of what makes our profiler really, really interesting because it basically works under every circumstance. However, walking a linked list, which is frame pointers, is still way cheaper than doing table lookups, doing some calculations with the offsets, loading a bunch of registers, writing a bunch of things, and then doing each jump. So having frame pointers is still very, very much preferable. And also this entire dance with the unwind tables, not every tool out there is going
Starting point is 00:18:17 to have the kind of resources that we had in order to make that happen. If we look at debugging tools, maybe they want to figure out where some network packet came from. For this one-off thing, people are not going to put that amount of kind of engineering work into making that happen. Gotcha. So the unwind table exists in the binary.
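(A rough sketch of the unwind-table idea: user space precomputes compact, sorted rows from the binary's unwind information, and the unwinder binary-searches them for each frame's program counter. Field names and values are illustrative, not Parca's actual format.)

```go
package main

import (
	"fmt"
	"sort"
)

// unwindRow is an illustrative compact unwind-table entry: for any program
// counter in [startPC, next row's startPC), it says how to recover the
// caller's frame without frame pointers.
type unwindRow struct {
	startPC   uint64 // first instruction address this rule covers
	cfaOffset int64  // stack-pointer offset to the canonical frame address
	raOffset  int64  // offset from the CFA where the return address is stored
}

// lookup finds the rule covering pc in a table sorted by startPC: the
// per-frame binary search that runs against rows loaded into a BPF map.
func lookup(table []unwindRow, pc uint64) (unwindRow, bool) {
	i := sort.Search(len(table), func(i int) bool { return table[i].startPC > pc })
	if i == 0 {
		return unwindRow{}, false
	}
	return table[i-1], true
}

func main() {
	table := []unwindRow{
		{startPC: 0x401000, cfaOffset: 16, raOffset: -8},
		{startPC: 0x401040, cfaOffset: 32, raOffset: -8},
		{startPC: 0x401100, cfaOffset: 8, raOffset: -8},
	}
	row, ok := lookup(table, 0x401052)
	fmt.Println(row, ok) // the 0x401040 rule covers this pc
}
```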
Starting point is 00:18:37 And I'm guessing, are you guys doing some sort of prefetching and caching? Okay, so essentially you read this binary and keep it somewhere. And then your eBPF filter is able to access that in memory versus having to do full disc reads and all that stuff every time a call is made. Exactly. We also heavily optimize it so that searching for this data is very, very fast. And then we could fill the entire podcast up episode just with this, but long story short,
Starting point is 00:19:00 this unwind information itself is Turing complete. And so we need to do our utmost best to basically try to interpret this information as much as possible so that the eBPF program needs to do as little work as possible so that it will predictably halt. Gotcha. And so one of the things I was wondering about eBPF, I've never written a filter or anything. So like I said, I'm very much new to this. The question I have is around state, and I think we're kind of getting at it here with this caching stuff. So my assumption would be the verifier is not going to allow you to keep anything state-wise
Starting point is 00:19:37 beyond memory. How are you managing the caching state? Yeah. So I spoke earlier about kind of writable space, right? I said that eBPF is basically read-only and that there are some exceptions to this. And this is basically one of those exceptions, and they're called BPF maps. And it's exactly the thing that's used, the only thing that you can use to communicate from user space to kernel space and vice versa.
Starting point is 00:20:00 And basically what we do is we populate these maps with these optimized unwind tables, which then the BPF program can use and read while it's doing the unwinding. And this is also how then the resulting data is communicated back to user space. Gotcha. Okay. So continuing to pull on this thread, you've got your unwind table that's loaded in this map. The filter is getting invoked by the Linux kernel. And how is the sampling set up there? I'm assuming that's there's some config file or something that you set, but how does that manifest itself as, say, 1% of the calls or something like that? Yeah. So like I said, what we do is we register something called a perf event, which causes the kernel to call our program every X amount of cycles.
Starting point is 00:20:51 And essentially we calculate, if we want, let's make the calculation simple for this calculation. We want a hundred samples per second, right? That would mean that each sample represents statistically 10 milliseconds. And so if we see the same function call stack 10 times, statistically speaking, we have spent 100 milliseconds within this function. And the longer we end up doing this, the higher the statistical significance gets.
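(The arithmetic being described, as a small illustrative snippet.)

```go
package main

import (
	"fmt"
	"time"
)

// cpuTime turns a raw sample count into estimated CPU time: at a fixed
// sampling frequency, each sample statistically represents 1/frequency seconds.
func cpuTime(sampleCount int, frequencyHz float64) time.Duration {
	secondsPerSample := 1.0 / frequencyHz
	return time.Duration(float64(sampleCount) * secondsPerSample * float64(time.Second))
}

func main() {
	// The example from the conversation: at 100 samples per second, each sample
	// stands for 10ms, so 10 hits on the same stack means roughly 100ms there.
	fmt.Println(cpuTime(10, 100)) // 100ms
	// At the 19 Hz default mentioned later, 10 hits is roughly 526ms.
	fmt.Println(cpuTime(10, 19))
}
```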
Starting point is 00:21:19 Yeah, okay, that makes sense. And these callbacks that you register, I'm trying to get the grasp on the granularity. Is it for the whole OS or is it per process or per thread? How does that? Great question. So essentially we see the entire operating system stack. And so we end up unwinding the kernel stack. We end up unwinding the user space stack. So we get everything. We get exactly the state of the world of the CPU at this point in time. And I would assume developers are generally only interested in their Go binary that's running or whatever.
Starting point is 00:21:49 So is there some way to configure, hey, only pay attention to this subsection of the tree or? So that's something that you would do on the query end because there are lots of cases where you actually want to know what happened here, right? Like, especially if you work on super performance sensitive pieces of software, you want to know there was an L1 cache miss, right? Or some page
Starting point is 00:22:12 fault and we needed to load this memory, right? Yeah, that's fascinating. And that gets toward what I was talking about earlier about the SAR files that we used to have in Red Hat, where you could sort of look into that kind of stuff. We use them heavily for Kafka, for example, to determine when there was a page cache miss, because it turns out that when you're streaming data, everything exists in the page cache in memory. And so the second you miss that, latency goes through the roof. So it's interesting to note that this approach lets you very fluidly mix together the OS level and the application level stuff. That's really fascinating. Okay. So now I think I'd like to dig into, uh, what you, what you do once you get the sample. 'Cause, uh, I would imagine that the amount of data that you're dealing with in that sample
Starting point is 00:22:53 is actually, you know, fairly substantial given, even if you're doing it a hundred times per second, like that's, it could potentially be a lot of data in that, in those frames. Um, so how do you get that stuff, um, you know, into Parca or, well, let's start with Parca and then maybe we can unwind the Polar Signals aspect later. But how do you get the data off the eBPF, you know, in memory thing into Parca? Yeah. At the end of the day, like Polar Signals is not significantly more interesting than Parca. It's just the distributed version. And there are lots of problems that need to be solved in that. But, you know, conceptually speaking, it ends up being quite similar.
Starting point is 00:23:29 So basically the agent every 10 seconds, or basically we wait for this data to be populated within the BPF map for 10 seconds. And then we dump all of that data. And, you know, the agent just keeps on going forever. But every 10 seconds, we take what happened over the last 10 seconds and send that off to a Parca-compatible API. And Polar Signals Cloud just happens to be a Parca-compatible API. Gotcha. So there's, again, the sort of DMZ between your BPF filter and the outside world
Starting point is 00:23:59 is this map and the samples are getting written into the map. And then on the other end, there's an agent reading the map and writing it out to Parca, it's just like a Go, sorry, a protobuf gRPC based thing. That's exactly what it is. Yeah. And I think one thing that's interesting there is from a security perspective, is the only thing that has access to these samples, the Go agent, how does that accessibility stuff get managed? Is that via the kernel? Yeah. So actually it's funny that you mentioned security. So from a security perspective, this is actually, and this is more of a byproduct, wasn't really our intention, but it's actually way more secure than doing profiling with a tool like Perf because we can do all the unwinding in kernel, we don't have to do that thing where we copy the entire operating system stack into user space, because absolute worst
Starting point is 00:24:49 case, you've just copied a private key into readable user space. That can be potentially devastating for security purposes. The only data that we communicate from kernel space to user space are the memory offsets into the binary. I see. I see. And actually making sense of that is what happens on the server side, like the translation of what does this memory address actually mean? Okay. So the agent is sending the stacks over to the Parca server. And I think from there, we touched on this earlier, but there's a path by which that data gets into the disk via FrostDB. So can you run me through what that write path looks like?
Starting point is 00:25:31 Yeah. I mean, the write path between Parca receiving the gRPC call and writing to FrostDB is not complicated. So basically, FrostDB works on Arrow records, so inserts have to be converted into Arrow. One thing I'll also mention is that we basically look at, okay, which binaries were involved in this data. And we basically maintain some metadata about this because what we insert into FrostDB is basically only these memory offsets. We still need to, when a human looks at this data, we still want to translate that memory address to the actual function name, right? Right.
Starting point is 00:26:30 So there's some other stuff that we can talk about later that needs to happen there. But in terms of gRPC API to Arrow, it's really just converting one format to the other. Yeah. So I spent a little bit of time digging into FrostDB. So once this Arrow record gets into FrostDB land, it's kind of an interesting engine that you've set up. Can you walk me through what FrostDB does and how it differentiates between some of the other columnar stores that are out there? Yeah. So I think basically the biggest difference is something that we call dynamic columns. And it's kind of
Starting point is 00:27:06 similar to like wide-column databases, like with Cassandra, where essentially I come from the Prometheus world, right? And I want to be able to have my user defined dimensions that I control. I want to make the system the way that my organization functions and not have the tool force some labels, some dimensions onto me. So that was always a core belief that we had for any observability data. And so we felt for profiling data, this needed to be true as well. I need to be able to slice and dice my data on whatever dimension is useful to me, whether that's data center, whether that's region, whether that's node, whether that's Kubernetes
Starting point is 00:27:44 namespace, or you have this homegrown system and you have totally different words for these things or service names, whatever it is, right? You need to be able to map your organization onto the tool. And so being able to have these kind of dynamic dimensions, but still be able to search by them very quickly is what inspired us to essentially build a database ourselves. Because these are kind of two things that are basically conflicting, right? Very fast aggregations and very fast searching. And very fast aggregations is why we chose a columnar layout. That's kind of the nature of every columnar database, right? Like you want to be able to do some number crunching on a lot of numbers very quickly. That's when you,
Starting point is 00:28:24 you know, in a nutshell, end up choosing a columnar database. But the combination of being able to also search for all these dimensions relatively quickly is why we ended up building this database. And the way you can think of it, and I'm saying this conceptually, in reality it's not always true, but the way you can think of it is that all the data is always globally sorted. And so because of that property, we can basically do a binary search and whatever we're searching for ends up being just a binary search away. Again, data is not always actually sorted, but the engine ensures that enough metadata is around and
Starting point is 00:29:00 that enough is known that we can actually still do this in a very fast way. Gotcha. And so the way that this is implemented, as I understand it, is essentially that you've got an LSM, right? And the first level of that LSM is row-based, right? And so you're writing records down rather than columns, and then subsequent levels of the LSM, it gets compacted into column-based levels? Is that correct? It's actually already columnar, but it's basically only columnar in the sense that an insert is already many stack traces and their values associated with each other, but they all have the same timestamp basically. Gotcha. Okay. Okay. And so that first level is also Arrow records,
Starting point is 00:29:46 is that correct? And then the subsequent levels are Parquet, right? That's correct. Okay. What was the reasoning behind making that leap from Arrow to Parquet as you go down the subsequent levels? So to be 100% transparent, we haven't truly figured out when the right time is to pivot into parquet. Our current theory is maybe we'll actually get away with only ever having arrow as part of the ingestion node, and only when it ends up on object storage is it going to be parquet. But at the moment, the L1 layer, or sorry, the L0 layer is arrow. And when then compaction gets triggered, it gets turned into parquet, but all of this is still in memory.
Starting point is 00:30:33 The reason for that is basically Arrow has this wonderful property that you can do like O of one accesses to anything within that Arrow record. But that also fundamentally prevents you from doing some more sophisticated encodings to save memory and save disk space. And Parquet is that format that's basically Arrow, but allows doing that. A lot of Parquet and Arrow is one to one binary compatible when the same encodings are being used. Gotcha. Yeah, it's an interesting design. I guess I'm curious how you would contrast this if you're familiar at all with InfluxDB's IOx work that they've been doing. Because from a layman's perspective, it looks fairly similar to me. When I was reading
Starting point is 00:31:17 over it, I was like, okay, they're using Parquet, they're using DataFusion. It has a lot of the same components. How would you differentiate between what you're doing and what they're doing? So I think the major difference is, first of all, we actually had lots of calls with Paul Dix and Andrew Lamb when we started working on this, because we saw the same similarity, right? But this was all like two years ago. And so we were all super interested in this space. And so we were just kind of all exchanging knowledge and thoughts.
Starting point is 00:31:46 And so somewhat naturally, we ended up building things that look quite similar to each other. I think the major difference is InfluxDB IOx is still trying to be a general purpose database. We are not trying to be a general purpose database. We are laser focused on observability and observability only. And what I mean by that is essentially that data is always going to be immutable. And so essentially it's the nature of observability data, right? Like a server doesn't, after the fact say, hey, this log line was actually something different, right? So this allows us to do some super interesting kind of optimizations on this data, on the way that the system works
Starting point is 00:32:26 and so on that a general purpose database can't do. But we're not trying to be a general purpose database, right? And so that can go as far as basically our distributed system within Polar Signals looks a little bit like a CRDT where, because everything's just append only, we can just kind of gossip all the changes around and eventually everything's always going to be consistent and always going to be complete. That doesn't really work or it's way more complicated in a world where data is mutable. Or our isolation mechanism, basically we were inspired by this, I forget exactly what the Google paper was called,
Starting point is 00:33:06 but basically these batch transactions where we can release transactions in batches because we're basically just waiting for this next set of transactions all to be complete because nobody needs to actually read their own writes. Because, again, it's machines writing this data and humans accessing this data. The human doesn't know that this data was already written, right? So reading your own writes doesn't really make sense in our system. Gotcha. Gotcha. So I had another question. Hang on a minute here. Oh yeah. Jumping back, we were talking about the, I would call it metadata earlier. You were saying you were only recording the frames into FrostDB, right? And at some point,
Starting point is 00:33:48 the human needs to see like, okay, well, this is actually a function with parameters and stuff. So how does that part work? So basically, this is the other part of binaries. So there's something called debug infos. And this is something that a compiler outputs, basically, to do exactly this matching. And what we do is we basically, during the ingestions, we record which offsets have we seen before, which ones haven't we seen before, because then we asynchronously do these lookups and write them into a separate database, where we can then
Starting point is 00:34:22 do fast lookups, basically, for the symbolization at query time. And so this is kind of the other part of it. Gotcha. Gotcha. One final question that just occurred to me, oftentimes what's most interesting in terms of sampling is when something bad is happening. And when something bad is happening, it is not always the case that durability is all that great and that you get the samples that you need and stuff. Is there anything you guys do to not lose some of the samples, especially under high load or, you know, with flaky disks or flaky network? How do you guys think about, you know, instability in this architecture? So one really awesome thing about this like eBPF architecture is that if there is extremely high CPU pressure and the user space program can't, let's say, grab all this data in time and send it off, it actually just ends up accumulating further and further in these BPF maps. And so eventually we'll just say, okay, all of this data was collected, but it was actually collected over 13 seconds as opposed to 10 seconds.
Starting point is 00:35:29 And so this is how we have this natural mechanism that still ends up working, but not impacting your system too much. It's essentially all dictated by how much load your system is putting onto the entire node. And the agent tends to be very, very lightweight. Gotcha. So there's some built-in buffering essentially in memory. And then I think the argument would be, well, if you lose the machine or something really bad happens, you lose that data, well, we're sampling anyway. And so nothing's going to be a hundred percent. Okay. Yeah. Yeah. Yeah. But typically... Sorry. Do you do anything clever around dynamically adjusting the sampling or when something interesting or anomalous is happening,
Starting point is 00:36:05 doing more samples or anything like that? Or is it pretty much just linear, we're always doing X samples per second? So the default is that we just always do 19 hertz per CPU core, which actually just ends up being relatively little. But the point is, like I said earlier, the longer you do this, the higher the statistical significance gets. And so the base load is actually very, very minimal. As a matter of fact, we have yet to find a customer that can actually distinguish between just general CPU noise and our product being deployed. Sorry, where was I? I was talking about dynamically adjusting the sampling or...
Starting point is 00:36:48 Right. We do have kind of a secondary mechanism where this whole system was built so that we would profile the entire infrastructure, right? And the important part about this, and this is also something that came straight out of this Google paper, Google said you have to profile all of your infrastructure all the time in exactly the same ways, because only then can you actually compare things, right? And you can look at everything in a single report and say, this function is worth optimizing, especially if you're optimizing for cost, right?
Starting point is 00:37:19 This is the function that's worth optimizing at all. That said, Google also acknowledged there are cases where you want to do profiling for one specific process more. And so we have this mechanism also. It's relatively unsophisticated at this point because we were much more focusing on the system-wide and it's difficult enough of a problem to solve.
Starting point is 00:37:38 But we have this mechanism where you can basically do the scraping that we were traditionally doing. And you can say through like a Kubernetes annotation, for example, that, you know, this thing I want to profile right now at a very high frequency, let's say a hundred hertz or something, right? Like whatever high means in that context. And then you can, it will start scraping that like Go application, for example, while that annotation is- Gotcha. And while that annotation is set.
Starting point is 00:38:05 Gotcha. And does that annotation require a restart of the process or is it something you can do dynamically? Okay, fantastic. So if... But it does require that this process is instrumented with this like HTTP endpoint where we can grab the profiling data.
Starting point is 00:38:18 Yeah. But there's nothing stopping us from building this into the agent at one point in the future. What we're currently focusing on is essentially closing the gap on just about any language. So it's not a lie, but something incomplete from what I was saying earlier is we do sometimes need to have specific language support. And this goes especially for interpreted languages like Python or Ruby, where if we do the typical thing
Starting point is 00:38:48 of what we talked about so far, what you would get is the stack traces of the Ruby interpreter or the Python interpreter itself. Probably not super useful for most people writing Python, right? Unless you're actually working on the Python interpreter yourself. What we need to do is essentially build a custom unwinder that realizes, oh, actually, I'm in the Python interpreter loop right now, and then switches to a Python unwinder that ends up reading memory from the Python process and figures out what does the Python
Starting point is 00:39:20 interpreter think right now is the current function call stack. Because at the end of the day, interpreters look like a virtual CPU. They have stacks themselves and so on. And so we need to just figure out what does the interpreter, how does the interpreter do that and essentially re-implement it in eBPF to do the same thing. Yeah. And my guess is that that probably adds significant complexity to the caching you're doing. I don't know if I'm wrong there, but what does this lookup table we were talking about look like in a world where the interpreter is in between and doing stuff? So it's actually, it's just different, but it's more like frame pointers actually, because
Starting point is 00:39:59 it's basically this in-memory structure that has all this information of how to unwind in memory. And so we don't actually have to have any of these unwind tables. All we need to know is I am currently in a Python interpreter. That's the amount of metadata that we have, which is still not insignificant if you're thinking about the entire host scale, right? Like there can easily be tens of thousands of processes on a single Linux machine. So thus far, we have been talking about
Starting point is 00:40:32 sampling and stack traces confined to one system. You know, earlier you mentioned Jaeger. There's a lot of distributed trace stuff. How do you think about the work you're doing with Polar Signals and the CPU sampling and Parca and all that vis-a-vis the distributed trace side of the world?
Starting point is 00:40:52 Yeah, great question. So we can actually already attach arbitrary key value pairs to stack traces and therefore further differentiate them, even if it was the exact same stack trace. And so with distributed tracing, all that means is we'll attach the distributed tracing ID to these stack traces. And therefore, we'll be able to say, and we can say, we can already do this today. We can say this CPU time was actually coming exactly from this request or vice versa, right?
Starting point is 00:41:23 You're coming from a distributed trace and you want to understand, okay, this span was way larger than I expected it to be. What was the CPU activity during this time? We can directly attribute it to that request. Okay. How does that work? Is that something that requires cooperation from a service mesh or the web framework you're using? I guess I'm kind of curious about how you actually tie this stuff
Starting point is 00:41:50 back together with that RPC call that's coming in. Yeah. So there are two ways, the one that already exists today and the one that hypothetically will exist in the future. The one that exists today is the one that requires instrumentation. It essentially requires cooperation from the user space program, so our Go service, let's say, to say, okay, this is currently our distributed tracing ID when profiling occurs, basically. Now, if we go back, we're already reading process memory to do the unwinding for a Python process, for example, it's not significantly more complicated to end up reading some memory within the Go process to figure out, oh, this was actually the distributed tracing ID set onto the context, for example.
Starting point is 00:42:36 Right. So this is absolutely going to happen. We just haven't gotten to it, basically. Yeah, that's amazing. And very, very much aligned with your philosophy, like zero effort just works out of the box, right? That would be pretty cool to just drop it in and automatically you're getting distributed trace and CPU profile stuff in the system. Yeah, it's definitely going to happen. Basically, our entire strategy has been, first,
Starting point is 00:42:57 we want to be able to capture any process on the planet, basically support any language out there, and then we'll continue to increase the features of the profiler and also end up building other profilers. Other things that people are interested in are like, where do memory allocations happen, right? Where does network IO happen? Where does disk IO happen? All these things.
Starting point is 00:43:18 Gotcha. I guess this is a nice segue into future work. We've already talked about one, which is adding the distributed trace support transparently. What other things are you guys thinking about on any of these projects, FrostDB, Parca, or Polar Signals? Yeah. So I think one thing that I'm particularly excited about is something that's called profile-guided optimization. So this is not very much a feature of any of these projects. It's more of a higher level concept. So profile guided optimizations have also kind of been around since the 1970s. That's when we first saw some mentions of this. And basically what it is, is you're passing a compiler profiling data,
Starting point is 00:43:57 and therefore the compiler can make opinionated decisions about how to compile this code and basically apply optimizations that it wouldn't usually apply. But because it has this profiling data, it can now apply them and knows they're definitely going to be good based on this data. And Google and Facebook have written about this pretty extensively and have shown that just doing this can get you anywhere from a 10% to a 30% improvement. No code changes. And so how does that work? Is that something where, uh, it's done dynamically at runtime behind the scenes? Or is that something where when you're compiling, like it's something that goes into LLVMs, you know?
Starting point is 00:44:32 Okay. It's the data. That's exactly what it is. It's basically a flag where you pass in a file that contains profiling data. That's all. GCC, LLVM, they've all been able to do this for ages, thanks to Google, Facebook, and so on.
Starting point is 00:44:44 Some really interesting, like, just-in-time compilers have also been doing this for some time. There's actually a reason why it's called the Java HotSpot VM. It's exactly what it does. It essentially records what are the hotspots of code that are frequently being executed, and it then figures out, okay, this is how I actually should be recompiling it again, because it will be running better. Gotcha. That'll be really interesting. All right. Well, I've got everything I wanted out of you. Where can people find you?
Starting point is 00:45:13 And is there anything you want to call out that I've missed? So it's pretty easy. It's PolarSignals.com. We have both a Discord for Polar Signals. If you have any questions about anything that we talked about today, we also have a separate Discord server for the Parca project. It's P-A-R-C-A. There's also the website, parca.dev. It's a separate brand and separate everything,
Starting point is 00:45:37 completely independent from Polar Signals. So yeah, that's where you can find us. Please try the project. We always love to hear about the magic of when people see profiling data across their entire infrastructure for the very first time, because it turns out
Starting point is 00:45:50 even the most sophisticated organizations out there probably haven't seen this across their entire infrastructure. And so one thing that I love talking about is one of our early customers, Materialize, they're a database company. In case people are not familiar, they're basically a streaming database.
Starting point is 00:46:05 They're already very conscious about performance. The first time they've deployed this on production, within hours, they found a 35% improvement that was fleet-wide. One change that they immediately were able to see because of this that they weren't able to see before, that basically cut their AWS bill by 35%. Fantastic. All right. Well, thanks for taking the time to talk with me. Thanks for having me.
Starting point is 00:46:27 All right. Take care.
