Materialized View Podcast - Parca, Polar Signals, and FrostDB with Frederic Branczyk
Episode Date: January 8, 2024

I recently talked with Frederic Branczyk. Frederic is the founder of Polar Signals, a new always-on, zero-instrumentation profiler. Before Polar Signals, Frederic spent time at Red Hat and CoreOS, where he worked on Kubernetes and Prometheus.

In this interview, Frederic and I break down Polar Signals's architecture and its main components: Parca and FrostDB. Parca is of particular interest; it is able to achieve its minimally invasive profiling claims by using an eBPF filter that samples the entire OS's stack at 19 Hz. Data is then passed to a server and stored in FrostDB, an embedded storage engine built on DataFusion and Parquet.

During the discussion, Frederic mentions several influential papers:

* Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers
* BOLT: A Practical Binary Optimizer for Data Centers and Beyond
* Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications
* Large-scale Incremental Processing Using Distributed Transactions and Notifications

You can support me by purchasing The Missing README: A Guide for the New Software Engineer for yourself or gifting it to new software engineers that you know.

I occasionally invest in infrastructure startups. Companies that I've invested in are marked with a [$] in this newsletter. See my LinkedIn profile for a complete list.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit materializedview.io
Transcript
I recently got to sit down with Frederic Branczyk.
Frederic is the founder of Polar Signals, a new always-on zero-instrumentation profiler.
Before Polar Signals, Frederic spent time at Red Hat and CoreOS, where he worked on Kubernetes and Prometheus.
In this interview, Frederic and I break down Polar Signals' architecture and its main components, Parca and FrostDB.
Parca is of particular interest to me, as it is able to achieve its minimally invasive profiling claims
using an eBPF filter that samples the entire OS's stack at about 19 hertz.
Data is then passed to a server and stored in FrostDB, an embedded storage engine built on top of DataFusion and Parquet.
And now, Frederic Branczyk. All right. So I thought we would start with Prometheus, actually. We were emailing back
and forth and you mentioned that you were a committer on Prometheus. So how did you end up
working on Prometheus? Yeah. So that was basically in 2016, I joined a company called CoreOS. I don't
know if people still remember CoreOS. We were kind of one of the early Kubernetes
companies. And basically, we were working on automatically updating software. And we were
kind of thinking about, you know, if we're automatically updating all of the software
all the time, right, we need to understand whether the software is doing what it's supposed to be
doing and doing that successfully before, during and after upgrades. And so we quickly identified,
you know, monitoring and observability was going to be super key in
CoreOS's business.
And so at the time, Prometheus was kind of the up and coming thing.
And CoreOS hired one of the major maintainers at the time already, and he hired me.
And then that's how I kind of got started and then quickly became a maintainer of the Prometheus project
and kind of everything in that intersection
of Prometheus and Kubernetes,
either built or, you know, at the very least,
left my fingerprints on.
And then ultimately, through all of that collaboration,
I actually ended up becoming a tech lead
on the Kubernetes project as well.
Oh, wow. I didn't know that.
Okay, so you've worked on Kubernetes as well.
Yes.
And so how did
that experience lead you towards Polar Signals, which is the stuff you're working on now?
Yeah, so just maybe for setting the scene at Polar Signals, we build continuous profiling
software. So we basically profile all of your infrastructure all the time and then record this data so
that you can analyze it later.
And that basically happened because CoreOS was acquired by Red Hat in 2018.
And I ended up basically becoming architect for all things observability.
So we had kind of the classic metrics team, logging team, distributed tracing team.
And we were working on this super cutting edge observability software, but we
kind of found ourselves manually profiling our software pretty much every single day
because we were working with these super performance sensitive pieces of software, right?
Like Prometheus, Kubernetes, Jaeger, and so on.
And so for us, it was second nature to do profiling anyway.
And then I read this white paper that Google published a couple of years before that, that's the infamous Google-Wide Profiling paper,
where Google basically, they were the first ones to describe this publicly, to my knowledge at
least, how they're doing infrastructure-wide profiling. And that's where it really, really
came to me that we need to be dealing with profiling just as systematically as we
do with metrics, logs, and traces.
And it's really observability just like everything else.
You always want the profiling data that you don't have.
Yeah.
So you mentioned you were doing it manually at that point.
And that's, frankly, prior to some of the APMs and stuff that I've worked with, how
I've done it in the past as well.
How did that look for you guys?
Were you connecting to production?
How were you doing the manual sampling?
I assume it was sampling or profiling before then.
Right.
So in the Go ecosystem, profilers are pretty good.
And basically in the Go ecosystem,
and this came basically also straight out of Google,
it's built into the Go runtime that you can have an HTTP endpoint that you hit and it will
record a profile over, let's say 10 seconds and return you with that data. So that's basically
what we were doing all day long in Kubernetes, port forwarding and doing that, for example.
But the difficulty with that is one,
you actually need to connect with production, and that comes with all the potential problems
as with SSHing and so on. But maybe more importantly, you don't have, like I said,
this data that you really want to have, which was like at this weird time when there was this
CPU spike or when an OOM kill has already occurred, right?
You always want the data that you don't have.
And this is kind of a theme that we keep seeing throughout all of the
observability.
And we saw that trend basically also from like Nagios checks to like
Prometheus and like Datadog, right?
Where we went away from just checking, is this thing up?
Is this thing up?
Is this thing up?
To like recording information as time series and alerting on the time series data.
And now that we're doing this with profiling as well, we can do a super interesting analysis.
Not only do we have all this data throughout time, we can also say, hey, I have this performance
regression in this new version of our software.
Tell me all the differences of CPU time from this previous version to this new version.
And boom, down to the line number, we can see exactly where new CPU time is now being spent.
And like this kind of analysis wasn't even possible before.
And if it was possible,
you maybe were able to like grab some profiling data
from the previous version because you rolled back
and took some profiling data again,
and then took some profiling data again.
So just like this super weird dance to get the right data.
Now, it's just always available.
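For reference, the endpoint Frederic describes is a couple of lines with Go's standard net/http/pprof package; the port here is illustrative:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Hitting /debug/pprof/profile?seconds=10 on this server records a
	// 10-second CPU profile and returns it in pprof format.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

A profile can then be pulled with go tool pprof http://localhost:6060/debug/pprof/profile?seconds=10, which is roughly the port-forward-and-fetch workflow described above.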
Gotcha. So to recap, basically you were doing manual sampling against Go and they exposed an
HTTP endpoint. That makes sense. Okay. I think that's loosely, I come from, you know, Java world,
that's sort of loosely equivalent to what we were doing in JMX land. And there's, you know,
at least in the Java side, there's sort of performance penalties you can get if you don't
sample properly, and then you can turn on HProf and sample and stuff.
So there's sort of a trade-off between performance and granularity.
It sounds like you were doing more point lookups at that time.
So there was some issue, and you would go to the thing and say, hey, basically, give me stack traces for a 10-second window, and let me see what it looks like.
Exactly. Super reactive, basically.
If we were good, maybe we were running
some benchmarks for a very specific thing that we wanted to optimize, but also building those
benchmarks is really difficult to actually have something that behaves truly like production.
Now, we use Polar Signals ourselves all day long, and it's this super fast feedback cycle
where we can actually be sure this is exactly how it behaves in production because it is production data.
Yeah, that makes sense.
Were you guys doing any sampling at that point at the OS level?
I think what we've been talking about so far is really at the application level, right?
But alongside the application stack traces, there's a whole bunch of information around, you know, disk and all that kind of stuff. The thing I think about is that at LinkedIn, back in the day, we ran SAR, which was this like system
activity report thing that came with, I think with Red Hat Linux, that would allow you to
sort of sample and record disk stuff.
And that looks a little bit more like traditional observability.
I'm assuming you were marrying up what you were doing with the more traditional stuff.
Is that true?
Yeah, we were doing that kind of stuff
more through like traditional,
we call it now traditional metrics, right?
Like at the time Prometheus was still like,
you know, just starting to be implemented within companies.
But yeah, like metrics is what we were using for that.
But this is actually a great kind of segue
into also what changed with Polar Signals,
because when we started Polar Signals,
we went about it a little bit naively, right?
We came from the Go ecosystem, we were like, profilers are amazing, right?
Collection is a solved problem.
Turns out, they're only great in the Go ecosystem.
And so ultimately what we ended up doing is we ended up building a profiler completely
from scratch using eBPF, which allows us to kind of grab this data at the
operating system level.
And through a lot of really hard work, we now basically have this completely zero instrumentation
profiler.
You only need to add this profiler to your host, and it starts profiling everything,
no matter what language you have on that host.
Yeah, so I think let's start to dig into polar signals.
But before we do that, I think we've kind of been talking about profilers
without giving a really solid definition of what exactly we're talking about.
So can you give me sort of, in your view, how you define a profile and why it's useful?
Yeah. So profiling tools have basically been available ever since software engineering started
because what profiling gives us is down to the
line number and the function call stacks that led to some amount of resource usage. So it's
essentially a stack trace and a number. That's all that profiling data is. And that resource could be
anything. It could be CPU time. It could be memory. It could be file IO. It could be network IO.
Anything really. Most commonly we see CPU profiling data,
also because it tends to be the most expensive thing on a cloud bill. And so naturally that's
the one thing that people look at when they want to optimize for cost. But at the same time, CPU
also tends to be the thing that you look at when you optimize for latency. Because if it's not some
other IO call that you're doing, there's basically only CPU time left to optimize.
And so that's kind of what we also see our customers doing most of the time.
Gotcha. Okay. So yeah, let's dive into Polar Signals now.
So this is a new company you started. When did you kick it off?
So actually, we've done quite a bit of R&D.
So the company was founded about three years ago and we launched our product
publicly about two months ago. Gotcha. And so somewhere in the midst of this is sort of two
other projects. One of them is Parca and the other is FrostDB. So what is the relationship of those
two things to what Polar Signals is doing? So the profiler is part of the Parca project. So Parca is essentially two
components. One is the agent, that's the collection mechanism, everything that we've talked about so
far. And then the server side is kind of the Prometheus equivalent. It's a single statically
linked Go binary, extremely easy to get started with, extremely easy to deploy.
It kind of ships everything in one binary, a storage, an API, a UI, everything in one
box.
But similar to Prometheus, we very intentionally chose not to make this a distributed system
because distributed storage is a very, very difficult problem to solve.
And we kind of also decided that was something that at least for a
start, we're going to, if we manage to, we're going to try to figure this out as the business.
And for a start, it'll be our kind of competitive advantage. But yeah, that's kind of where we
started. And FrostDB is the storage layer within Parca. And it's also what our distributed storage
within Polar Signals is based on. It's kind of our RocksDB, but for a columnar database.
Okay.
Okay.
So I think the relationship is essentially FrostDB is an embedded time series database.
You have a single-node system built on that, which is Parca, that also comes with a CPU profiler, which is the eBPF thing
you talked about. And then, excuse me, Polar Signals is essentially the cloud hosted version
of this. Precisely. Got it. Got it. Okay. So why don't you run me through the eBPF CPU sampling
stuff? I think that's really interesting to me. So first off, I think I'll take a shot at
summarizing eBPF
because I'm very much a novice in that area, and then you can correct me where I'm wrong.
But my understanding of eBPF is essentially, it's an interface that allows you to implement
modules that go into the kernel in Linux and allow you to sort of almost like a filter chain
or something, inject in between kernel calls from the application space. Is that loosely correct? Yeah, definitely. I think that's pretty accurate. Essentially, it's a virtual
machine within the kernel that you can write C code for. And when you load it into the kernel,
it goes through this thing called the verifier to make sure that whatever you're going to be
executing within the kernel is actually going to be safe to be executed there. So it makes sure that certain areas of memory can be accessed.
It's all basically read only, except for very specific mechanisms. And then the way that
eBPF programs are run are through triggers. And that can be exactly like you said, it can be a
syscall being called, it can be a network packet being transmitted or received.
Or in our case, we register a custom event using the perf events subsystem within Linux,
where we're basically saying every X amount of CPU cycles call our program.
And what our program does is this: when our program starts,
we get just a pointer to the top of the operating system stack, right? Like when we, if we go back
to like computer science and how kind of programs gets executed, they have a stack, right? And the
operating system does as well. And so when our program is called, we only get basically a pointer
to the very top of the stack. And in a simplified form or in the best case, something called frame pointers are present.
And what that means is that there's a register that's reserved that tells us where the next
lower stack frame is.
And so in that case, we can just walk this linked list essentially.
And at every point, we collect what the address of the instruction
is. And that's essentially what we can then use to translate to a function name. That's basically how programs work. And so this is how we then get that function call stack. And if we see that same
function call stack multiple times, statistically speaking, we're spending more time in this
function. Gotcha. So I think there's two things that I want to unpack there.
One of them is you said when you're lucky, there are the references to the subsequent frames, right?
And then the other one is, you know, my instinct is that these frames are essentially just, you know, bags of bytes.
And so to make it useful to the developer, you need to somehow attach it to what would look more like a traditional stack trace that you would see as a developer, which has the method name, the return, the parameters,
all that kind of stuff, package, all that. So how do those few things happen?
So frame pointers are kind of a hot topic, actually. We actually just had launched this
collaboration together with Canonical, where frame pointers are now going to be the default
configuration for compilers within the Ubuntu packages, unless the package specifically
overrides and says, I want to omit frame pointers. Because basically this is kind of coming from the
32-bit world where we have a certain number of registers. I forget exactly the amount. I think
it was eight registers, general purpose registers, and reducing that by one, only having seven, that actually
makes a big difference, right? But now we have 15 or 16 general purpose registers, and it actually
ends up making quite a bit less of a difference. In most benchmarks, you see absolutely no difference. Of course, you'll find edge cases for these things.
For example, the hottest loop within the Python interpreter basically makes use of exactly 16
registers. And therefore, when you enable frame pointers, it has a major performance degradation.
Just to make sure I understand, it sounds like the reason you would want to disable
frame pointers is for performance.
Exactly.
That's basically the only reason.
You also get a little bit smaller of a binary because essentially it's instructions that
need to set and retrieve the frame pointer.
Okay.
So in that world, your claim is essentially most of the time, especially with Ubuntu package
work you've been doing, you're going to have frame pointers.
And in the case where you're not, does Polar Signals just... What does it do? Yeah. So we can still profile everything.
And this is also part of what actually makes our profiler very innovative. So in the previous world,
when you use the profiler like Perf, Linux Perf, what it does is it copies the entire operating
system stack into user space, where
it can then be unwinded synchronously using something called unwind tables.
This is a special section in x86 binaries, and the x86 ABI specifies that this section
must be present.
Otherwise, Linux basically says, I don't know how to run, or execution of this program is
undefined.
And this is the same way as how C++ exceptions work.
Maybe you've seen this before.
When you don't have debug infos included in a C++ binary
and it has an exception stack trace,
all you get are memory addresses.
And you need to put those memory addresses
into a tool like addr2line
to convert those addresses into function names.
But basically these unwind tables, we needed to optimize very much so that the eBPF verifier
would be happy with us still doing everything that we need to be doing.
Because something that we didn't mention earlier with the eBPF verifier, it also limits how
many instructions can be run.
And it basically, one of the ways how it ensures that what you're running within the kernel is
safe to do is it basically solves the halting problem by saying you cannot run basically
Turing-complete programs. So you can't have loops that can potentially be endless. Everything has
to be bound, all of these kinds of things.
And so this was a really big part of what makes our profile really, really interesting
because it basically works under every circumstance.
However, walking a linked list, which is frame pointers, is still way cheaper than doing
table lookups, doing some calculations with the offsets, loading a bunch of registers,
writing a bunch of things, and then doing each jump.
So having frame pointers is still very, very much preferable.
And also this entire dance with the unwind tables, not every tool out there is going
to have the kind of resources that we had in order to make that happen.
If we look at debugging tools, maybe they want to figure out
where some network packet came from.
For this one-off thing,
people are not going to put that amount
of kind of engineering work into making that happen.
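As a rough illustration of the frame-pointer case (the cheap walk-the-linked-list path, not the unwind-table machinery), a user-space sketch in Go might look like this; the memory-reader callback is a hypothetical stand-in for what the eBPF program actually has access to:

```go
package unwind

// walkFramePointers is a simplified illustration of the cheap path described
// above: with frame pointers present, the stack is a linked list. On x86-64,
// [fp] holds the caller's frame pointer and [fp+8] holds the return address.
func walkFramePointers(readWord func(addr uint64) (uint64, error), fp uint64, maxFrames int) []uint64 {
	var pcs []uint64
	for i := 0; i < maxFrames && fp != 0; i++ {
		retAddr, err := readWord(fp + 8)
		if err != nil || retAddr == 0 {
			break
		}
		pcs = append(pcs, retAddr)

		next, err := readWord(fp)
		if err != nil || next <= fp { // stacks grow down, so the caller's fp must be higher
			break
		}
		fp = next
	}
	return pcs
}
```

Each collected value is just an instruction address; attaching a function name to it is the symbolization step discussed later.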
Gotcha.
So the unwind table exists in the binary.
And I'm guessing,
are you guys doing some sort of prefetching and caching?
Okay, so essentially you read this binary
and keep it somewhere.
And then your eBPF filter is able to access that in memory versus having to do full
disk reads and all that stuff every time a call is made.
Exactly. We also heavily optimize it so that searching for this data is very, very fast.
And then we could fill an entire podcast episode just with this, but long story short,
this unwind information itself is Turing complete. And so we need to do our
utmost best to basically try to interpret this information as much as possible so that the eBPF
program needs to do as little work as possible so that it will predictably halt. Gotcha. And so one
of the things I was wondering about eBPF, I've never written a filter or anything. So like I
said, I'm very much new to this.
The question I have is around state, and I think we're kind of getting at it here with
this caching stuff.
So my assumption would be the verifier is not going to allow you to keep anything state-wise
beyond memory.
How are you managing the caching state?
Yeah.
So I spoke earlier about kind of writable space, right?
I said that eBPF is basically read-only and that there are some exceptions to this.
And this is basically one of those exceptions, and they're called BPF maps.
And it's exactly the thing that's used, the only thing that you can use to communicate
from user space to kernel space and vice versa.
And basically what we do is we populate these maps with these optimized unwind tables,
which then the BPF program can use and read while it's doing the unwinding. And this is also how
then the resulting data is communicated back to user space. Gotcha. Okay. So continuing to pull
on this thread, you've got your unwind table that's loaded in this map. The filter is getting invoked
by the Linux kernel. And how is the sampling set up there? I'm assuming there's some config
file or something that you set, but how does that manifest itself as, say, 1% of the calls
or something like that? Yeah. So like I said, what we do is we register something called a perf event, which causes
the kernel to call our program every X amount of cycles.
And essentially we calculate, if we want, let's make the calculation simple for this
calculation.
We want a hundred samples per second, right?
That would mean that each sample represents statistically 10 milliseconds.
And so if we see the same function call stack 10 times, statistically speaking,
we have spent 100 milliseconds within this function.
And the longer we end up doing this,
the higher the statistical significance gets.
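The arithmetic is worth spelling out with the numbers from this example (100 Hz here, versus the 19 Hz per core default mentioned later):

```go
package sampling

// At 100 samples per second, each sample statistically represents
// 1000ms / 100 = 10ms of CPU time.
const sampleRateHz = 100

// estimatedCPUMillis converts how often a given stack was seen into an
// estimate of CPU time spent in it.
func estimatedCPUMillis(timesSeen int) float64 {
	perSampleMillis := 1000.0 / float64(sampleRateHz)
	return float64(timesSeen) * perSampleMillis // e.g. 10 sightings is about 100ms
}
```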
Yeah, okay, that makes sense.
And these callbacks that you register,
I'm trying to get a grasp on the granularity. Is it for the whole OS, or is it per process or per thread? How does that work?
Great question. So essentially we see the entire operating system stack. And so we end up
unwinding the kernel stack. We end up unwinding the user space stack. So we get everything.
We get exactly the state of the world of the CPU at this point in time.
And I would assume developers are generally only interested
in their Go binary that's running or whatever.
So is there some way to configure,
hey, only pay attention to this subsection of the tree or?
So that's something that you would do on the query end
because there are lots of cases where you actually want
to know what happened here, right?
Like, especially if you work
on super performance
sensitive pieces of software, you want to know there was an L1 cache miss, right? Or some page
fault and we needed to load this memory, right? Yeah, that's fascinating. And that gets toward
what I was talking about earlier about the SAR files that we used to have in Red Hat, where you
could sort of look into that kind of stuff. We use them heavily for Kafka, for example, to determine when there was a page cache miss, because it turns out
that when you're streaming data, everything exists in the page cache in memory. And so second you
miss that, latency goes through the roof. So it's interesting to note that this approach lets you
very fluidly mix together the OS level and the application level stuff. That's really fascinating.
Okay. So now I think I'd like to dig into, uh, what you, what you do once you get the
sample. Cause a, I would imagine that the amount of data that you're dealing with in that sample
is actually, you know, fairly substantial given, even if you're doing it a hundred times per
second, like that's, it could potentially be a lot of data in that, in those frames. Um, so how do
you get that stuff, um, you know, into Parca or, well, let's start with
Parca and then maybe we can unwind the Polar Signals aspect later. But how do you get the
data off the eBPF, you know, in-memory thing into Parca? Yeah. At the end of the day, like Polar
Signals is not significantly more interesting than Parca. It's just the distributed version.
And there are lots of problems that need to be solved in that.
But, you know, conceptually speaking, it ends up being quite similar.
So basically the agent every 10 seconds, or basically we wait for this data to be populated
within the BPF map for 10 seconds.
And then we dump all of that data.
And, you know, the agent just keeps on going forever.
But every 10 seconds, we take what
happened over the last 10 seconds and send that off to a Parca-compatible API. And Polar
Signals Cloud just happens to be a Parca-compatible API.
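Conceptually, that collect-and-ship loop looks something like the sketch below; the types, the drain function, and the send function are hypothetical placeholders, since the real agent reads BPF maps and speaks Parca's gRPC API:

```go
package agent

import (
	"log"
	"time"
)

// stackSample is a hypothetical aggregate for one unique stack observed
// during the current interval; the real agent reads this out of BPF maps.
type stackSample struct {
	Addrs  []uint64          // raw instruction addresses, leaf first
	Count  uint64            // how many times this stack was sampled
	Labels map[string]string // e.g. node, pod, container
}

// run sketches the loop: every 10 seconds, drain whatever accumulated and
// send it to a Parca-compatible API.
func run(drain func() []stackSample, send func([]stackSample) error) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		batch := drain()
		if len(batch) == 0 {
			continue
		}
		if err := send(batch); err != nil {
			log.Printf("sending %d stacks failed: %v", len(batch), err)
		}
	}
}
```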
Gotcha. So there's, again, the sort of DMZ between your BPF filter and the outside world
is this map and the samples are getting written into the map. And then on the other end, there's an agent reading the map and writing it out to Parca. It's just like a Go, sorry, a protobuf gRPC
based thing. That's exactly what it is. Yeah. And I think one thing that's interesting there is
from a security perspective, is the only thing that has access to these samples the Go agent?
how does that accessibility stuff
get managed? Is that via the kernel? Yeah. So actually it's funny that you
mentioned security. So from a security perspective, this is actually, and this is more of a byproduct,
wasn't really our intention, but it's actually way more secure than doing profiling with a tool like
Perf because we can do all the unwinding in kernel, we don't have to do that thing where we copy the entire operating system stack into user space, because absolute worst
case, you've just copied a private key into readable user space. That can be potentially
devastating for security purposes. The only data that we communicate from kernel space to user
space are the memory offsets into the binary. I see. I see.
And actually making sense of that is what happens on the server side,
like the translation of what does this memory address actually mean?
Okay. So the agent is sending the stacks over to the Parca server. And I think from there,
we touched on this earlier, but there's a path by which that data
gets into the disk via FrostDB. So can you run me through what that write path looks like?
Yeah. I mean, the write path between Parca receiving the gRPC call and
writing to FrostDB is not complicated. So basically, FrostDB works on Arrow records, so inserts have to be converted into Arrow. One thing I'll also mention is that we basically look at, okay, which binaries
were involved in this data.
And we basically maintain some metadata about this because what we insert into FrostDB is
basically only these memory offsets.
We still need to, when a human looks at this data, we still want to translate that
memory address to the actual function name, right?
Right.
So there's some other stuff that we can talk about later that needs to happen there.
But in terms of gRPC API to Arrow, it's really just converting one format to the other.
Yeah.
So I spent a little bit of time digging into FrostDB.
So once this Arrow record gets into FrostDB land, it's kind of an interesting engine that you've
set up. Can you walk me through what FrostDB does and how it differentiates between some of the
other columnar stores that are out there? Yeah. So I think basically the biggest difference
is something that we call dynamic columns. And it's kind of
similar to like wide-column databases, like with Cassandra, where essentially
I come from the Prometheus world, right? And I want to be able to have my user defined dimensions
that I control. I want to make the system the way that my organization functions and not have the tool
force some labels, some dimensions onto me.
So that was always a core belief that we had for any observability data.
And so we felt for profiling data, this needed to be true as well.
I need to be able to slice and dice my data on whatever dimension is useful to me, whether
that's data center, whether that's region, whether that's node, whether that's Kubernetes
namespace, or you have this homegrown system and you have totally
different words for these things or service names, whatever it is, right? You need to be able to map
your organization onto the tool. And so being able to have these kind of dynamic dimensions,
but still be able to search by them very quickly is what inspired us to essentially build a database
ourselves. Because these are kind of two things that are basically conflicting, right? Very fast
aggregations and very fast searching. And very fast aggregations is why we chose a columnar layout.
That's kind of the nature of every columnar database, right? Like you want to be able
to do some number crunching on a lot of numbers very quickly. That's when you,
you know, in a nutshell, end up choosing a columnar database.
But the combination of being able to also search for all these dimensions relatively
quickly is why we ended up building this database.
And the way you can think of it, and I'm conceptually saying this in reality, it's not always true,
but the way you can think of it is that all the data is always globally sorted.
And so because of that property, we can basically do a binary search, and whatever
we're searching for ends up just a binary search away. Again, data is
not always actually sorted, but the engine ensures that enough metadata is around and
that enough is known that we can actually still do this in a very fast way.
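To make the dynamic-columns idea concrete, here is a rough sketch of the data shape rather than FrostDB's actual API; the labels.&lt;key&gt; column naming is illustrative:

```go
package storage

// sampleRow is an illustrative row shape. The key idea of dynamic columns:
// every user-defined label key becomes its own column at write time
// (e.g. labels.region, labels.namespace), so arbitrary organizational
// dimensions can be filtered and aggregated quickly without a fixed schema.
type sampleRow struct {
	Timestamp int64
	StackID   uint64            // reference to the deduplicated stack trace
	Value     int64             // e.g. sampled CPU time attributed to the stack
	Labels    map[string]string // flattened into one column per key on insert
}

// dynamicColumnName shows the flattening convention assumed above.
func dynamicColumnName(labelKey string) string {
	return "labels." + labelKey
}
```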
Gotcha. And so the way that this is implemented, as I understand it, is essentially that you've got
an LSM, right? And the first level of that LSM is row-based, right? And so you're writing records
down rather than columns, and then subsequent levels of the LSM, it gets compacted into
column-based levels? Is that correct? It's actually already columnar, but it's
basically only columnar in the sense that an insert is already many stack traces and their
values associated with each other, but they all have the same timestamp basically.
Gotcha. Okay. Okay. And so that first level is also Arrow records,
is that correct? And then the subsequent levels are Parquet, right?
That's correct. Okay. What was the reasoning behind making that leap from Arrow to Parquet
as you go down the subsequent levels? So to be 100% transparent, we haven't truly figured out
when the right time is to pivot into Parquet.
Our current theory is maybe we'll actually get away with only ever having
Arrow as part of the ingestion node, and only when it ends up on object storage is it going to be
Parquet. But at the moment, the L1 layer, or sorry, the L0 layer is Arrow. And when
then compaction gets triggered, it gets turned into Parquet, but all of this is still in memory.
The reason for that is basically Arrow has this wonderful property that you can do like O(1)
accesses to anything within that Arrow record. But that also fundamentally prevents
you from doing some more sophisticated encodings to save memory and save disk space. And Parquet
is that format that's basically Arrow, but allows doing that. A lot of Parquet and Arrow is one to
one binary compatible when the same encodings are being used.
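As a rough mental model of the levels just described (not FrostDB's actual type names):

```go
package storage

// Levels as described: L0 keeps raw Arrow records in memory (O(1) access,
// few encodings), compaction rewrites them as Parquet while still in memory
// (richer encodings, smaller), and object storage ultimately holds Parquet.
type level int

const (
	l0ArrowInMemory level = iota
	l1ParquetInMemory
	parquetOnObjectStorage
)
```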
Gotcha. Yeah, it's an interesting design.
I guess I'm curious how you would contrast this if you're familiar at all with InfluxDB's IOx work
that they've been doing. Because from a layman, it looks fairly similar to me. When I was reading
over it, I was like, okay, they're using Parquet, they're using Data Fusion. It has a lot of the
same components. How would you differentiate between what you're doing and what they're doing?
So I think the major difference is, first of all, we actually had lots of calls with
Paul Dix and Andrew Lamb when we started working on this, because we saw the same similarity,
right?
But this was all like two years ago.
And so we were all super interested in this space.
And so we were just kind of all exchanging knowledge and thoughts.
And so somewhat naturally, we ended up building things that look quite similar to each other.
I think the major difference is InfluxDB IOx is still trying to be a general purpose database.
We are not trying to be a general purpose database.
We are laser focused on observability and observability only.
And what I mean by that is essentially that data is always going to be immutable. And so essentially it's the nature
of observability data, right? Like a server doesn't, after the fact say, hey, this log line
was actually something different, right? So this allows us to do some super interesting
kind of optimizations on this data, on the way that the system works
and so on that a general purpose database can't do.
But we're not trying to be a general purpose database, right?
And so that can go as far as basically our distributed system within PolarSignals looks
a little bit like a CRDT where, because everything's just append-only, we can just kind of gossip
all the changes around
and eventually everything's always going to be consistent and always going to be complete.
That doesn't really work or it's way more complicated in a world where data is mutable.
Or our isolation mechanism, basically we were inspired by this, I forget exactly what the Google paper was called,
but basically these batch transactions where we can release transactions in batches because
we're basically just waiting for this next set of transactions all to be complete because
nobody needs to actually read their own writes.
Because, again, it's machines writing this data and humans accessing this data.
The human doesn't know that this data was already written, right? So reading your own writes doesn't really make sense in our system.
Gotcha. Gotcha. So I had another question. Hang on a minute here. Oh yeah. Jumping back,
we were talking about the, I would call it metadata earlier. You were saying you were
only recording the frames into FrostDB, right? And at some point,
the human needs to see like, okay, well, this is actually a function with parameters and stuff.
So how does that part work? So basically, this is the other part of binaries. So there's something called debug infos. And this is something that a compiler outputs, basically,
to do exactly this matching.
And what we do is we basically, during the ingestions,
we record which offsets have we seen before,
which ones haven't we seen before,
because then we asynchronously do these lookups
and write them into a separate database, where we can then
do fast lookups, basically, for the symbolization at query time.
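As a simplified stand-in for that symbolization step, resolving an address against a binary's ELF symbol table in Go could look like the following; the real pipeline uses the richer debug info described above to recover line numbers as well:

```go
package symbols

import (
	"debug/elf"
	"fmt"
)

// resolve maps a raw instruction address (an offset into the binary) to a
// function name via the ELF symbol table. This is a simplification: full
// symbolizers also use debug info for file names, lines, and inlining.
func resolve(binaryPath string, addr uint64) (string, error) {
	f, err := elf.Open(binaryPath)
	if err != nil {
		return "", err
	}
	defer f.Close()

	syms, err := f.Symbols()
	if err != nil {
		return "", err
	}
	for _, s := range syms {
		if addr >= s.Value && addr < s.Value+s.Size {
			return s.Name, nil
		}
	}
	return "", fmt.Errorf("no symbol covers address %#x", addr)
}
```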
And so this is kind of the other part of it. Gotcha. Gotcha. One final question that just
occurred to me, oftentimes what's most interesting in terms of sampling is when something bad is
happening. And when something bad is happening, it is not always the case that durability is all that great and that you get the samples that you need and stuff.
Is there anything you guys do to not lose some of the samples, especially under high load or,
you know, with flaky disks or flaky network? How do you guys think about, you know, instability
in this architecture? So one really awesome thing about this like eBPF architecture is that if there is extremely high CPU pressure and the user space program can't, let's say, grab all this data in time and send it off, it actually just ends up accumulating further and further in these BPF maps.
And so eventually we'll just say, okay, all of this data was collected, but it was actually collected over 13 seconds as opposed to 10 seconds.
And so this is how we have this natural mechanism that still ends up working, but not impacting your system too much.
It's essentially all dictated by how much load your system is putting onto the entire node.
And the agent tends to be very, very lightweight.
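The transcript only says the longer window is recorded alongside the data; one way a consumer might account for it when turning sampled values into rates (an assumption, not a description of Parca):

```go
package agent

import "time"

// perSecondRate scales a sampled value by the window the batch actually
// covered (e.g. 13s) rather than the nominal one (10s), so rates stay
// comparable even when draining was delayed.
func perSecondRate(sampledValue float64, actualWindow time.Duration) float64 {
	return sampledValue / actualWindow.Seconds()
}
```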
Gotcha. So there's some built-in buffering essentially in memory. And then I think the argument would be, well, if you lose the machine or something really bad happens,
you lose that data, well, we're sampling anyway. And so nothing's going to be a hundred percent.
Okay. Yeah. Yeah. Yeah. But typically...
Sorry. Do you do anything clever around dynamically adjusting the sampling
or when something interesting or anomalous is happening,
doing more samples or anything like that? Or is it pretty much just linear, we're always doing
X samples per second? So the default is that we just always do 19 hertz per CPU core, which
actually just ends up being relatively little. But the point is, like I said earlier, the longer you
do this, the higher the statistical significance gets.
And so the base load is actually very, very minimal.
As a matter of fact, we have yet to find a customer that can actually distinguish between just general CPU noise and our product being deployed.
Sorry, where was I?
I was talking about dynamically adjusting the sampling or...
Right.
We do have kind of a secondary mechanism where this whole system was built so that we would
profile the entire infrastructure, right?
And the important part about this, and this is also something that came straight out of
this Google paper, Google said you have to profile all of your infrastructure all the time in exactly the
same ways, because only then can you actually compare things, right?
And you can look at everything in a single report and say, this function is worth optimizing,
especially if you're optimizing for cost, right?
This is the function that's worth optimizing at all.
That said, Google also acknowledged
there are cases where you want to do profiling
for one specific process more.
And so we have this mechanism also.
It's relatively unsophisticated at this point
because we were much more focusing on the system-wide
and it's difficult enough of a problem to solve.
But we have this mechanism where you can basically
do the scraping that we were traditionally doing.
And you can say through like a Kubernetes annotation, for example, that, you know, this
thing I want to profile right now at a very high frequency, let's say a hundred hertz
or something, right?
Like whatever high means in that context.
And then you can, it will start scraping that, like a Go application, for example, while that
annotation is set.
Gotcha.
And does that annotation require a restart of the process
or is it something you can do dynamically?
Okay, fantastic.
So if...
But it does require that this process is instrumented
with this like HTTP endpoint
where we can grab the profiling data.
Yeah.
But there's nothing stopping us
from building this into the agent at one point in the future.
What we're currently focusing on is essentially closing the gap on just about any language.
So it's not a lie, but something incomplete from what I was saying earlier is we do sometimes
need to have specific language support.
And this goes especially for interpreted languages
like Python or Ruby, where if we do the typical thing
of what we talked about so far, what you would get
is the stack traces of the Ruby interpreter
or the Python interpreter itself.
Probably not super useful for most people writing Python, right?
Unless you're actually working on the Python interpreter yourself.
What we need to do is essentially build a custom unwinder that realizes, oh, actually,
I'm in the Python interpreter loop right now, and then switch to the Python interpreter and say,
okay, that ends up reading memory from the Python process and figures out what does the Python
interpreter think right now is the current function call stack. Because at the end of the day,
interpreters look like a virtual CPU. They have stacks themselves and so on. And so we need
to just figure out what does the interpreter, how does the interpreter do that and essentially
re-implement it in eBPF to do the same thing. Yeah. And my guess is that that probably adds
significant complexity to the caching you're doing. I don't know if I'm wrong there, but what
does this lookup table we were talking about
look like in a world where the interpreter is in between and doing stuff?
So it's actually, it's just different, but it's more like frame pointers actually, because
it's basically this in-memory structure that has all this information of how to unwind in memory.
And so we don't actually have to have any of these unwind tables. All we need to know is I am
currently in a Python interpreter. That's the amount of metadata that we have, which is still
not insignificant if you're thinking about the entire host scale, right? Like there can easily be
tens of thousands of processes
on a single Linux machine.
So thus far,
we have been talking about
sampling and stack traces
confined to one system.
You know, earlier you mentioned Jaeger.
There's a lot of distributed trace stuff.
How do you think about
the work you're doing with PolarSignals
and the CPU sampling and Parca and all that
vis-a-vis the distributed trace side of the world?
Yeah, great question.
So we can actually already attach arbitrary key value pairs to stack traces
and therefore further differentiate them,
even if it was the exact same stack trace.
And so with distributed tracing, all that means is we'll attach the distributed tracing
ID to these stack traces.
And therefore, we'll be able to say, and we can say, we can already do this today.
We can say this CPU time was actually coming exactly from this request or vice versa, right?
You're coming from a distributed trace and you want to understand, okay, this span was
way larger than I expected it to be.
What was the CPU activity during this time?
We can directly attribute it to that request.
Okay.
How does that work?
Is that something that requires cooperation from a service mesh or the web
framework you're using? I guess I'm kind of curious about how you actually tie this stuff
back together with that RPC call that's coming in. Yeah. So there are two ways, the one that
already exists today and the one that hypothetically will exist in the future. The one that exists
today is the one that requires instrumentation. It essentially requires cooperation from the user space program, so
our Go service, let's say, to say, okay, this is currently our distributed tracing ID when
profiling occurs, basically. Now, if we go back, we're already reading process memory
to do the unwinding for a Python process, for example, it's not
significantly more complicated to end up reading some memory within the Go process to figure
out, oh, this was actually the distributed tracing ID set onto the context, for example.
Right.
So this is absolutely going to happen.
We just haven't gotten to it, basically.
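The instrumented path already has a standard shape in Go: runtime/pprof labels let a service tag the samples taken while it handles a request, for example with a trace ID (the label key here is illustrative):

```go
package tracing

import (
	"context"
	"runtime/pprof"
)

// handleWithTraceID tags all CPU samples taken while fn runs with the
// request's trace ID, so profiles can later be sliced by that dimension.
func handleWithTraceID(ctx context.Context, traceID string, fn func(context.Context)) {
	pprof.Do(ctx, pprof.Labels("trace_id", traceID), fn)
}
```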
Yeah, that's amazing.
And very, very much aligned with your philosophy, like zero effort just works
out of the box, right? That would be pretty cool to just drop it in and automatically
you're getting distributed trace and CPU profile stuff in the system.
Yeah, it's definitely going to happen. Basically, our entire strategy has been, first,
we want to be able to capture any process on the planet, basically support any language
out there, and then we'll continue to increase
the features of the profiler and also end up building other profilers.
Other things that people are interested in are like, where do memory allocations happen,
right?
Where does network IO happen?
Where does disk IO happen?
All these things.
Gotcha.
I guess this is a nice segue into future work.
We've already talked about one, which is adding the distributed trace support transparently. What other things are you guys thinking about on any of these projects,
FrostDB, Parca, or Polar Signals? Yeah. So I think one thing that I'm
particularly excited about is something that's called profile-guided optimization. So this is
not so much a feature of any of these projects. It's more of a higher-level concept. So profile
guided optimizations have also kind of been around since the 1970s. That's when we first saw some
mentions of this. And basically what it is, is you're passing a compiler profiling data,
and therefore the compiler can make opinionated decisions about how to compile this code and
basically apply optimizations that it wouldn't
usually apply. But because it has this profiling data, it can now apply them and knows they're
definitely going to be good based on this data. And Google and Facebook have written about this
pretty extensively and have shown that just doing this can get you anywhere from a 10% to a 30%
improvement. No code changes. And so how does that work? Is that something where, uh, it's done dynamically at runtime behind the scenes?
Or is that something where when you're compiling, like it's something
that goes into LLVMs, you know?
Okay.
It's the data.
Um, that's exactly what it is.
It's basically, it's a flag where you pass it in a file
that contains profiling data.
That's all. GCC, LLVM,
they've all been able to do this for ages.
Thanks to Google, Facebook, and so on.
Some really interesting, like, just-in-time compilers have also been doing this for some time. There's actually a reason why it's called the Java Hotspot VM. It's exactly what it does.
It essentially records what are the hotspots of code that are frequently being executed,
and it then figures out, okay, this is how I actually should be recompiling it again,
because it will be running better.
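For reference, GCC and Clang take this data via -fprofile-generate and -fprofile-use, and Go 1.21+ will pick up a pprof-format CPU profile named default.pgo in the main package directory (go build defaults to -pgo=auto). A minimal sketch of the collection side in Go:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Record a CPU profile in pprof format. Saved as default.pgo next to the
	// main package, `go build` (Go 1.21+) will use it to guide inlining and
	// other optimizations.
	f, err := os.Create("default.pgo")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	representativeWorkload() // placeholder for the code paths you care about
}

func representativeWorkload() { /* ... */ }
```

In a continuous-profiling setup, that profile would instead come straight from production data, which is the loop being described here.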
Gotcha. That'll be really interesting.
All right. Well, I've got everything I wanted out of you.
Where can people find you?
And is there anything you want to call out that I've missed?
So it's pretty easy. It's PolarSignals.com.
We have both a Discord for Polar Signals.
If you have any questions about anything that we talked about today,
we also have a separate Discord server for the Parca project.
It's P-A-R-C-A.
There's also the website, parca.dev.
It's a separate brand and separate everything,
completely independent from Polar Signals.
So yeah, that's where you can find us.
Please try the project.
We always love to hear about the magic
of when people see profiling data
across their entire infrastructure
for the very first time,
because it turns out
even the most sophisticated organizations out there
probably haven't seen this
across their entire infrastructure.
And so one thing that I love talking about
is one of our early customers, Materialize,
they're a database company.
In case people are not familiar,
they're basically a streaming database.
They're already very conscious about performance.
The first time they've deployed this on production, within hours, they found a 35% improvement
that was fleet-wide.
One change that they immediately were able to see because of this that they weren't able
to see before, that basically cut their AWS bill by 35%.
Fantastic.
All right. Well, thanks for taking the time to talk with me.
Thanks for having me.
All right. Take care.