Postgres FM - turbopuffer
Episode Date: September 12, 2025
Nik and Michael are joined by Simon Eskildsen from turbopuffer — among other things, they discuss ANN index types, tradeoffs that can make sense for search workloads, and when it can make sense to move search out of Postgres.
Here are some links to things they mentioned:
Simon Eskildsen https://postgres.fm/people/simon-eskildsen
turbopuffer https://turbopuffer.com
Use ULID Idempotency Keys (tip 6 in this blog post from Shopify) https://shopify.engineering/building-resilient-payment-systems
PostgreSQL 18 Release Candidate 1 https://www.postgresql.org/about/news/postgresql-18-rc-1-released-3130
Understanding DiskANN (blog post by Junaid Ahmed) https://www.tigerdata.com/blog/understanding-diskann
SPFresh: Incremental In-Place Update for Billion-Scale Vector Search (paper) https://arxiv.org/abs/2410.14452
Amazon S3 adds new functionality for conditional writes https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes
Amazon S3 Vectors https://aws.amazon.com/s3/features/vectors
~~~
What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!
~~~
Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai
With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello, hello. This is Postgres FM. As usual, my name is Nik, Postgres.AI, and my co-host is Michael, pgMustard. Hi, Michael.
Hi, Nick.
And we have an unexpected guest today, Simon Eskildsen, CEO and co-founder of turbopuffer. Hi, Simon.
Thank you for having me.
Yeah, thank you for coming. And it was very unexpected, because we mentioned turbopuffer last time and you messaged me on Twitter.
And I think it's a great idea sometimes to look outside of the traditional Postgres ecosystem.
I think it's beneficial for everyone.
So yeah, thank you for coming, for sure.
It's a great idea, I think.
Yeah, the origin story is kind of funny, and it was only a couple days ago.
I have, there's a script where if turbopuffer is mentioned anywhere, then I'll get an email or a summary.
And you guys were discussing, last time, different ANN indexes, both in Postgres and outside, and turbopuffer was mentioned.
And so I just DM'd you and asked, hey, I'd come on, chat about Postgres, chat about MySQL,
chat about databases and when you choose one over the other, and when Postgres breaks
for some of these workloads that we've seen, and when it's great.
And now it's what?
Yeah, two or three days later and we're on.
Including the weekend, actually.
So the podcast was out on Friday and we're recording this on Monday.
That's velocity everyone should try to achieve.
I like it.
Yeah, and you've had chickens hatch in the meantime.
Oh, yeah.
That's why my camera is overexposed, because I used this camera to broadcast the whole night.
I didn't have time to tune it properly.
But again, thank you for coming.
This is about databases.
This time, probably not so much about Postgres, but definitely we should talk about vectors, right, and ANN.
And maybe we should start from distance and talk about your background.
And I've heard some MySQL is involved.
Can you discuss it a little bit?
For sure. Yeah, my background is, I spent almost a decade at Shopify, scaling mainly the database layer there, but pretty much anything that would break as the scale increased and increased through the 2010s. Shopify, like most companies started in the early 2000s, was on MySQL, so a lot of the work that I did was with MySQL, but also every other database that Shopify employed, like Redis, Memcached, Elasticsearch, and
a ton of others, Kafka, and so on. So it's been a long time there. When I joined, it was a couple hundred
requests per second, and when I left, it was in the millions, and I was very, very intimate with the
data layer there. I was on the last-resort pager for many, many years, and that has informed
a lot about how I write software today. And a couple years ago, I started a company called
turbopuffer because I thought...
Sorry for interrupting. I remember Shopify, actually. We always
recommend the article.
Shopify has a couple of blog posts about UUIDs
and how UUID version 4 is not good
and version 7 is much better.
And it doesn't matter, MySQL or Postgres,
the mechanics behind the scenes are the same:
how the B-tree behaves.
So I remember.
And I think, Michael, we mentioned that article
on podcast a few times as well.
We have mentioned it, Nick, for sure.
And I think Shopify has come up before.
I thought, is it a Vitess shop?
I thought it might have been one of the
fairly early Vitess adopters.
Shopify is using a little bit of Vitess, but not very much.
Vitess was not around when we did all the sharding back in the early 2010s.
And so we did it all at the application layer.
I wasn't part of the actual sharding decision, but I was part of a lot of the sharding work over the years.
And it's funny because at the time I know that they looked at a bunch of proxies and all of the
businesses they later looked at had gone out of business.
It's not a great business, unfortunately.
But everything was done in Ruby land through a single sharding module.
And it did a lot of things and a lot of monkey patches into Rails.
But let's talk about this like UUIDV4 thing,
because I think if we wanted to do a pros and cons,
MySQL versus Postgres,
I spent quite a bit of time with both.
And this one, this one actually, to my knowledge,
only really matters for MySQL.
Well, it actually matters for Postgres as well,
but in a different way.
So in MySQL, right, the primary key dictates how the B-tree is laid out
for the primary key, right?
So for the entire row.
So if you have UUIDv4,
it's completely randomly scattered along the B-tree.
So whenever you're doing an insert, right,
it will just kind of fall somewhere random in the B-tree,
which makes the updates very expensive,
because if you're adding 10 rows at a time,
so not updates, adds,
you're doing 10 insertions that land in 10 different leaves,
and you're just doing a lot more disk I/O
and your write amplification is high.
In Postgres, of course, you're just appending to the heap
with all of its drawbacks and benefits,
and so it doesn't matter as much,
other than on the indexes.
That is my knowledge. But on the indexes it will matter a lot, because the index, of course, is sorted by that, and if you have some temporal locality then it's not going to matter as much. So that's my understanding: this matters a lot in MySQL. Now, that article I think was after
I left, and Shopify doesn't use UUIDs as primary keys for anything, so I don't really know where
this mattered; it must be something tangential, because MySQL really just does auto-increment,
and every shard does an auto-increment with basically a 32k step,
and then every shard has an offset into that, to allow it to grow to
32k shards. Given how much of a pain it would be to change that, that's probably still
the case, but I always really liked that scheme. So some tables over time at Shopify ended up
having a primary key on (shop ID, ID), because that would give locality for a shop,
because otherwise you have a lot of read amplification if you're trying to dump out a bunch of
products for a shop; the chance that there are going to be multiple products for a shop
in a leaf, unless you do that, is just a lot lower. So that ended up working really well.
And this is a pain to do in Postgres, because if you want to rewrite the primary key or the heap
by an ID, you have to rewrite the entire thing. That was one of my surprises, having worked a bit more with Postgres in later years.
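A rough sketch of that interleaved per-shard auto-increment scheme, as I understood the description above (MySQL exposes this kind of scheme via its auto_increment_increment and auto_increment_offset settings; the numbers here are illustrative, not Shopify's actual configuration):

# Each shard hands out IDs with a step of 32,768 and its own offset, so IDs never
# collide across shards and the fleet can grow to 32k shards without renumbering.
STEP = 32_768  # maximum number of shards this scheme supports

def next_id(shard_number: int, local_counter: int) -> int:
    """ID for the local_counter-th row inserted on shard_number (1-based shard)."""
    return shard_number + local_counter * STEP

def shard_of(global_id: int) -> int:
    """Recover which shard generated a given ID."""
    return global_id % STEP or STEP

# shard 1 produces 32769, 65537, ...; shard 2 produces 32770, 65538, ...
assert shard_of(next_id(2, 5)) == 2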
Yeah, yeah, yeah, and I agree with you.
And I agree that in Postgres it also matters, but only for the B-tree itself, the primary key B-tree,
if it's auto-increment, or in Postgres it's called bigserial,
or serial, for example, or generated always as identity now, there's another method.
But behind the scenes, it's also a sequence and it inserts...
Oh, no, in this case it's not a sequence.
There should be a function which generates UUID version 4, and version 4 is random,
while version 7 is closer to regular numbers, basically, right, because it's monotonically growing, lexicographically ordered, right?
So in this case, you insert only on the right side of the B-tree, and dirty pages, if you think about how the checkpointer is working.
Also, in Postgres there is additional overhead: after each checkpoint there are full-page writes, which involve indexes as well.
So if you touch random pages all the time, there's disk I/O overhead, and replication and backups, actually, everything receives additional overhead, while with version 7 we write on the right side all the time. It's much better. But the heap, yes, the heap is different. So I agree with this. And anyway, I just wanted to say we use the MySQL article because it's written very well. And in Postgres, we didn't have version 7 for quite some time.
Last Thursday, the version 18 release candidate was out, which will include a full implementation
of UUID version 7, which was vibe coded on Postgres TV with a couple of friends, just
in Cursor, I think.
Oh no, it was not in Cursor, it was before that, but anyway, it was just,
it was created right online, we did it live.
It took a couple of years to reach maturity because Postgres always waits until
RFCs are finalized and so on.
Anyway, soon version 18 will be out and UUID version 7 is inside.
But I think everyone is already using it, generating it on the client.
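For anyone generating it on the client today, a minimal sketch of a UUIDv7 generator in Python following the RFC 9562 layout (48-bit millisecond timestamp, then version and variant bits, then random bits); Postgres 18 also ships a server-side uuidv7() function:

import os, time, uuid

def uuidv7() -> uuid.UUID:
    # 48-bit Unix timestamp in milliseconds goes in the most significant bits,
    # so values generated later sort lexicographically after earlier ones.
    ts_ms = int(time.time() * 1000) & ((1 << 48) - 1)
    value = (ts_ms << 80) | int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (value & ~(0xF << 76)) | (0x7 << 76)  # set version 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)  # set RFC variant
    return uuid.UUID(int=value)

# Inserts with these keys land on the right edge of a B-tree, unlike random UUIDv4.
print(uuidv7() < uuidv7())  # True, barring same-millisecond ties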
Okay, great.
So you had this great career and then decided to create another, can we call it a database system,
a database management system?
It's certainly a full-blown database.
You know, underneath turbopuffer is an LSM.
An LSM works really well for object storage.
And, you know, every successful database ends up implementing every query eventually, right, in the limit.
And TurboPuffer will end up doing the same thing.
But every good database also starts with some specialization, right?
And our specialization has been on search workloads.
I would say that it's by no means a replacement for Postgres.
There always comes a time when it starts to make sense to move parts of data into more specialized algorithms,
more specialized data structures, and more specialized storage.
In general, my hypothesis on when it's time to create a database is that you need two things in the air
in order to create a generational database company.
The first thing that you need is that you need a new storage architecture,
because if you don't have a new storage architecture, some new way to store the data,
ideally both data structure-wise and also the actual medium that you're persisting on,
there's no reason why in the limit the other databases want to do the workload better.
They already have the existing momentum. They already have the workloads.
In the Postgres case, of course, you know, it's the classic relational architecture
where you replicate every byte onto three disks, and this works phenomenally well, right?
We've had this in production for probably more than 40 years, and it works great.
It has high performance. It has a very predictable performance profile, and it works really, really well with the page cache or the buffer pool, whatever database you're using.
The problem with this model is that if the data is not very valuable, this model is expensive.
Every gigabyte of disk, a network-attached disk, is about 10 cents per gigabyte.
And unless you are a really, really risky DBA, you're going to run those disks at 50% utilization on all the replicas and on the primary.
So you're paying for this disk kind of three times, which ends up with a whole all-in cost of 60 cents per gigabyte.
And that's not even accounting for all the CPUs that you also need on the replicas, because you need the replicas to process the writes as fast as the primary.
So you're kind of paying for the same thing three times.
So the all-in sort of per-terabyte cost, when you also take into consideration that these systems can be a little bit more memory-hungry, is probably around 60 cents to $2 per gigabyte per month, right?
Per month. Yeah, per month, USD. On object storage, the base cost is two cents per
gigabyte, right? And when we need that data in RAM or on disk, we only have to pay for one copy.
And we only have to pay for some percentage of the data to be in cache at all times, right?
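Roughly the arithmetic being described, using the episode's ballpark figures (a sketch, not a quote from any price list; the cache numbers are assumed for illustration):

# Triple-replicated block storage vs. object storage plus a partial cache, per GB-month.
disk_per_gb = 0.10          # network-attached disk, $/GB/month
utilization = 0.5           # run disks only half full to stay safe
replicas = 3                # primary plus two replicas store every byte
replicated_cost = disk_per_gb / utilization * replicas        # = $0.60 per logical GB

s3_per_gb = 0.02            # object storage base cost, $/GB/month
cache_fraction = 0.10       # assume ~10% of the data needs to sit on NVMe or in RAM
cache_per_gb = 0.10         # assumed cost of that single cached copy
tiered_cost = s3_per_gb + cache_fraction * cache_per_gb       # = $0.03 per logical GB

print(f"replicated: ${replicated_cost:.2f}/GB/mo, tiered: ${tiered_cost:.2f}/GB/mo")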
You mentioned Cursor earlier. Cursor is a turbopuffer customer.
And they don't need every single code base on SSDs or in memory at all times.
They need some percentage in memory, the ones that are queried a lot right now, and some percentage
on disk, the ones that are going to be queried again in a few minutes or maybe in a few hours.
And so you end up paying a lot less, because we only have to keep one copy of a subset of the
data rather than three copies of all of the data. Now, that comes with a fundamental set of
trade-offs, right? We want to be as upfront as possible that you can't use turbopuffer for everything.
If you want to do a write to turbopuffer, we commit that to S3, right? So by the time it's
committed to turbopuffer, the guarantee is actually stronger than most relational systems, because
we've committed it into S3, which takes a couple hundred milliseconds, but the durability guarantee
is very strong. But if you're building a system like Shopify, well, you can't live with a commit
time in the hundreds of milliseconds. It's just not acceptable. So that's a trade-off that means that this is
not a system that will ever replace a relational database store. The other downside is that because
not all the data is on a disk or in memory at all times, it means that you can have tail latency. And that
can be really catastrophic in very large systems that are doing millions of queries per second, if you can't
rely on a very predictable query profile. You can have massive outages while you hydrate the caches.
I've seen these outages on disks. I've seen them. And just even the workload changing slightly
can mess with the buffer pool in a way where you have a massive outage. So these two things
may sound like small tradeoffs, but they're massive tradeoffs for very large production systems.
But it might mean that if you have, let's say, a billion vectors and you're trying to store them
in Postgres, the economics just don't make sense. You're paying thousands, if not
tens of thousands, of dollars in hardware costs for a workload that might cost tens or hundreds of dollars on turbopuffer.
How much is it really, one billion vectors? If one vector is, what, a few kilobytes?
768 dimensions? Yeah, it's three terabytes.
Three terabytes to store one billion vectors. Yeah, and we also don't have a good index for
one-billion scale. Yeah, I mean, for you in Postgres,
with pgvector, HNSW won't work with one billion.
I mean, we could just run the math a couple different ways, right?
I'm not saying this is how it works in pgvector.
I'm less familiar with it now, but even if you have three terabytes of data raw,
you're probably going to need to store more than that.
You might be able to do some tricks to make the vectors smaller,
so you only have to store maybe, I don't know, a terabyte or something along those lines, right?
But remember that a gigabyte of DRAM is $5 per month.
And you need that three times on all your replicas.
So you're paying $15 per gigabyte per month.
So if you're doing that, if you have to store that three times,
you put everything in memory and you're somehow able to get it down to a terabyte,
then you're talking about $15,000 per month, right, across the three replicas,
just for the RAM alone.
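The back-of-the-envelope math behind those numbers (assuming 4-byte float32 dimensions; quantization tricks would shrink this):

# One billion 768-dimensional float32 vectors, and what keeping them in RAM would cost.
n_vectors = 1_000_000_000
dims = 768
bytes_per_dim = 4                       # float32
raw_bytes = n_vectors * dims * bytes_per_dim
print(raw_bytes / 1e12)                 # ~3.07 TB of raw vector data

dram_per_gb_month = 5                   # USD, the episode's ballpark
replicas = 3
compressed_tb = 1                       # optimistic "tricks" bring it down to ~1 TB
print(compressed_tb * 1000 * dram_per_gb_month * replicas)   # ~$15,000 per month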
Yeah, I agree with you about HNSW.
It's all about memory. Memory is the key, and creation of the index takes a lot of time.
And forget a billion, you will already have issues at a few-million-vectors scale.
I know Timescale, which has now renamed to TigerData, developed another index based
on DiskANN from Microsoft, I think, which is more for disk, right?
But I agree with you, for this scale it's not convenient.
But also, I think it's not only about vectors, this argument that we need to save on storage costs.
It's insane to pay for storage, and memory as well, when we have replicas and we scale.
In Postgres, if it's a physical replica, it's everything.
So you replicate everything, all indexes, everything.
You cannot replicate even...
There is no ability to replicate only one logical database.
You need to take the whole cluster, all logical databases.
And it means that you multiply costs for storage
and for memory, and it would be so great to have some tiered storage, maybe with partitioning,
as automated as possible, and offload data to S3, as basically you consider
S3 is great for this. And I remember I actually explored turbopuffer through Cursor.
It's great to see the documentation.
I knew Cursor was using Postgres, that was a few months ago, but then I heard they were considering
moving to PlanetScale. That was before PlanetScale announced support for Postgres. So I was thinking,
are they switching to MySQL? And then I saw vectors are stored in turbopuffer. Great. Then I learned
that several of our clients who get consulting from us and use Postgres also store vectors
in turbopuffer. It was, I think, Photoroom, and a couple more companies. It was a surprise for me.
And I started to think, oh, that's interesting. And then, yeah, and then I
checked your talks. I think you also mentioned there that there is this approach with the
SPFresh algorithm, right, which is different. It's not HNSW, a different type of index, but also some
additional interesting ideas about the economics you mentioned, right? Not only the cost. Can you elaborate
a little bit? I think it might be helpful to just talk at a very high level about
the different algorithms used for vector indexing.
I'll try to simplify it as much as possible, and we can dig into it further if you want.
Fundamentally, the simplest way to do vector search is that you just store all of the vectors
in a flat array on disk, right?
And then on every search, you just compare the query vector to all of those.
The problem with that is that you very quickly run up against bandwidth limits, right?
If you have a gigabyte of vectors and you're searching that at maybe 10 gigabytes per second,
if you can exhaust the memory bandwidth, which is unlikely in a big
production system, you're only doing, you know, maybe five queries per second, and the query
latency is in the hundreds of milliseconds. So if you have very, very few queries and you don't care
about latency, this can be feasible at small scale. And lots of people are doing that in production.
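A minimal sketch of that flat, brute-force search in Python with NumPy (the sizes and cosine scoring are illustrative); every query scans every vector, so throughput is bounded by memory or disk bandwidth:

import numpy as np

rng = np.random.default_rng(0)
docs = rng.standard_normal((1_000_000, 768)).astype(np.float32)  # ~3 GB of vectors
docs /= np.linalg.norm(docs, axis=1, keepdims=True)              # normalize for cosine similarity

def flat_search(query: np.ndarray, k: int = 10) -> np.ndarray:
    q = query / np.linalg.norm(query)
    scores = docs @ q                        # one pass over every vector: bandwidth-bound
    return np.argpartition(-scores, k)[:k]   # ids of an (unordered) top-k

hits = flat_search(rng.standard_normal(768).astype(np.float32))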
But if you want to search a million vectors in, you know, less than a couple hundred milliseconds,
and maybe 10 milliseconds as part of a bigger pipeline, you need some kind of index. The problem
with indexing vectors is that there is no known way to do it in a perfect way, right? If I search
with a query vector about fruit or whatever,
I know that if I'm searching for banana,
I get the closest fruit.
Maybe, I don't know, maybe there's a plantain,
I don't know, right, in the cluster.
But you have to build an approximate index
in order to make this faster.
Because of too many dimensions.
For a small number of dimensions, there are approaches,
and Postgres has had them for years,
but for hundreds or thousands of dimensions, yeah.
That's right.
Yeah, there's KD trees and so on for the smaller
dimensional space, which we can use for geo-coordinates
and things like that and simpler geometry.
So for very high dimensional spaces,
these things fall apart.
Curse of dimensionality, it's called.
The fact that they're very large is also important about the vectors.
If you have a kilobyte of text,
it can easily turn into tens and tens
of kilobytes of vectors,
which is why separating them out into cheaper storage
makes a lot of sense.
So there's two fundamental ways
that you can index the data.
There's the graph-based approaches,
HNSW and DiskANN,
which are the two you mentioned before,
and there's the clustering-based approach.
The graph-based approach is phenomenal
if you can store all of the data in memory
and you have very high QPS and very low latency requirements.
So if you have, let's say, 100 million vectors
and you're searching that at 100,000 queries per second
and you need very low latency,
you're not gonna beat HNSW.
It's gonna create a very, very good graph
to navigate it with.
The problem is that it's very expensive.
And the other problem is that almost no workloads
in the real world actually look like this.
So HNSW got very popular
because it's fairly simple to implement correctly,
and it's very simple to maintain.
So when you create the graph,
which is essentially just, you know,
points that are close in vector space
are close in the graph,
it's very simple to incrementally maintain.
You put one thing in, you search the graph,
and then you add it.
There's very simple rules.
You could implement something like HNSW
in tens of lines of code
if you did a very, very simple implementation of it.
The problem with HNSW is that every time you do a write,
you have to update a lot of data, right?
In database land, we call this write amplification,
where for every byte or every page you update,
you have to update
a lot of others. The reason for that is that when you add something, you add a
node in the graph, and then you have to update all the other
nodes to have connections to that node in the graph. So this works great in memory
because memory is very fast at updating and very fast at random writes. But the
problem is also on the read path. So memory is very fast at doing random reads.
You can do a random read in 100 nanoseconds, but a random read on S3 or on a
disk is much slower, into the hundreds of microseconds, to hundreds of milliseconds on S3.
And on a graph, right, there's no speculation that helps, right?
You start at the middle of the graph and then greedily navigate the graph from the query vector to the closest matching vectors,
and every single time you do that, there's a round trip.
And on S3, that's a round trip that takes hundreds of milliseconds.
So you're sort of navigating from the root.
It's like 200 milliseconds.
You go out one, 200 milliseconds.
You go out another one, 200 milliseconds.
And in general, for HNSW on a million vectors, this might be in the tens of round trips.
So this gets very slow, right,
into seconds to do this on S3, and even on a disk this very quickly adds up to tens of milliseconds.
That's the fundamental problem with graphs. Now, DiskANN is essentially using a lot of the same
ideas as other graph-based indexes, but instead of having, you know, the tens of round
trips that HNSW has, which is fine for memory, DiskANN basically tries to shrink the graph
so there are fewer jumps, right? So instead of 200 milliseconds 30 times, it tries to get it to
maybe six or seven times, or 10 times, by shrinking the graph as much as possible. That's essentially
the insight in DiskANN. The problem with DiskANN is that after you have added more than 10 to
20 percent or so of the size of the data, you have to rebuild the entire graph, which is an incredibly
expensive operation. That is absolutely terrifying to me, as someone that's been on call for large databases:
you could max out 128 cores rebuilding this graph in prod, and it could take you
down at 3 a.m., because you don't know when you hit that threshold. And if you don't do it,
then the approximations start getting bad and you start getting bad search results. The nice thing
about the graphs is that they have very low latency, but they're just very expensive to maintain.
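To make the round-trip point concrete, here is a toy, single-layer nearest-neighbour graph with greedy search, a much simplified cousin of HNSW (no hierarchy, no neighbour-pruning heuristics), purely illustrative:

import numpy as np

class ToyGraphIndex:
    def __init__(self, max_neighbors: int = 8):
        self.vectors: list[np.ndarray] = []
        self.neighbors: list[list[int]] = []    # adjacency lists
        self.max_neighbors = max_neighbors

    def add(self, v: np.ndarray) -> None:
        new_id = len(self.vectors)
        # link the new node to its current nearest neighbours...
        links = self.search(v, k=self.max_neighbors) if self.vectors else []
        self.vectors.append(v)
        self.neighbors.append(list(links))
        for n in links:                          # ...and update their edge lists too:
            self.neighbors[n].append(new_id)     # this is the write amplification

    def search(self, q: np.ndarray, k: int = 10) -> list[int]:
        current = 0                              # greedy walk from an entry point
        while True:
            # every hop is one round trip: ~100 ns in RAM, hundreds of ms on S3
            frontier = [current] + self.neighbors[current]
            best = min(frontier, key=lambda i: float(np.linalg.norm(self.vectors[i] - q)))
            if best == current:
                break
            current = best
        ranked = sorted(set([current] + self.neighbors[current]),
                        key=lambda i: float(np.linalg.norm(self.vectors[i] - q)))
        return ranked[:k]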
Now, the other type of index is the clustered indexes. Clustered indexes
are very simple to picture in your head, right? Let's say you took
every song on Spotify and you clustered them in a coordinate system, and you can visualize this
in two dimensions.
If you, then you plotted all the songs, and the songs that are adjacent, of course,
also adjacent in vector space, and genres will emerge, right?
There'll be a rap cluster.
There'll be a rock cluster.
You zoom in and you get, you know, like little subclusters, I don't know, death metal,
black metal, I don't know what all the different rock genres are, right?
Somewhere in these clusters.
And you can generate great recommendations based on that because you can look at, okay,
what did Michael listen to and what are some songs that are close by that he hasn't listened
to and same for Nick.
So it's very simple.
You create a clustering algorithm that basically just tries to divide the data set into clusters.
And when you do a query, instead of looking at all the vectors, you look at, well, clearly the user is asking about a rock song.
So we're only going to look in the rock cluster.
And that way you cut down the number of vectors that you have to scan.
Now the problem with this is that if you have everything in memory, it's not necessarily as optimal,
because you might have to look at more data than you do in a graph-based approach.
And because RAM has such good random read latency,
the penalty is not necessarily worth it if everything is in memory at all times.
But this is great for disks and it's great for S3 because I can go to S3 and in 200 milliseconds get
gigabytes of data back, right? It doesn't matter if I'm getting like, you know, a megabyte or a
gigabyte. I can often get that in the same round trip time if I exhaust the network.
So if you then go to S3, basically you have to download the list of clusters. So, let's say,
clusters as JSON files, to really simplify this, and then you just look at the closest N clusters to
your query vector, and then you download, you know, cluster1.json,
cluster2.json, whichever ones are closest, in two round trips.
And now, instead of the graph-based ones, where you're doing 200 milliseconds,
200 milliseconds, 200 milliseconds to navigate the graph, you just have 200 milliseconds to get
the cluster list, and then 200 milliseconds to get all the clusters that were adjacent.
The nice thing about these clustered indexes is that with algorithms like SPFresh, and lots of
modifications to them, we can incrementally maintain these clusters, because you can imagine
that when you add a vector, you just have to add it to the cluster, and it's just one write.
The write amplification is very low.
Once in a while, that cluster will grow beyond a size,
let's say it's 1,000 elements long, and you have to split the cluster,
and then you have to do some modifications.
That's essentially what SPFresh is.
And then there's a little bit higher write amplification.
But it's stable, in the way that you never reach this threshold where, okay, I've added 20% of the data set,
I have to rebuild the entire thing, as you do in DiskANN.
HNSW doesn't have to do that, which is why it's very, very nice, but it's just slow still.
So SPFresh, we think, right, is a really, really nice
set of trade-offs, where it's going to be a little bit slower, but slower in terms of: okay, instead of a
search returning in a millisecond, it might take five milliseconds, and just no one cares in production for
search. This matters for a point lookup into a relational database, but for search it's a perfect
set of trade-offs.
Question on the cluster splitting: does that mean we don't ever need to rebuild the
whole index? Because I think that was a limitation of the first cluster-based index, IVFFlat, I think, that we
started with in pgvector, and that didn't have the splitting as far as I'm aware.
And therefore we had to rebuild the whole index from time to time if the data changed significantly.
That's right.
And it was also not only split, there is also merge in SPFresh, as I remember.
Yes, there's merge as well.
Makes sense.
And because you might do deletes, which are also a pain.
To my knowledge, pgvector does not implement SPFresh.
It's...
It is very difficult to implement correctly.
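Here is a toy clustered index with incremental inserts and a crude split when a cluster grows too large. It is loosely inspired by the ideas discussed above, not SPFresh itself (no merges, no rebalancing, no recall-aware probing):

import numpy as np

class ToyClusteredIndex:
    def __init__(self, max_cluster_size: int = 1000):
        self.centroids: list[np.ndarray] = []
        self.clusters: list[list[np.ndarray]] = []
        self.max_cluster_size = max_cluster_size

    def _nearest(self, v: np.ndarray) -> int:
        return int(np.argmin([np.linalg.norm(c - v) for c in self.centroids]))

    def add(self, v: np.ndarray) -> None:
        if not self.centroids:
            self.centroids.append(v.copy())
            self.clusters.append([v])
            return
        i = self._nearest(v)
        self.clusters[i].append(v)            # a single small write: low write amplification
        if len(self.clusters[i]) > self.max_cluster_size:
            self._split(i)                    # occasional local reorganization, never a full rebuild

    def _split(self, i: int) -> None:
        members = self.clusters[i]
        a = members[0]                        # crude 2-way split: pick two far-apart seeds
        b = max(members, key=lambda m: float(np.linalg.norm(m - a)))
        left = [m for m in members if np.linalg.norm(m - a) <= np.linalg.norm(m - b)]
        right = [m for m in members if np.linalg.norm(m - a) > np.linalg.norm(m - b)]
        if not right:                         # degenerate case: all members identical
            right = [left.pop()]
        self.centroids[i], self.clusters[i] = np.mean(left, axis=0), left
        self.centroids.append(np.mean(right, axis=0))
        self.clusters.append(right)

    def search(self, q: np.ndarray, k: int = 10, nprobe: int = 2) -> list[np.ndarray]:
        # only scan the nprobe closest clusters instead of every vector
        order = np.argsort([np.linalg.norm(c - q) for c in self.centroids])[:nprobe]
        candidates = [v for i in order for v in self.clusters[i]]
        return sorted(candidates, key=lambda v: float(np.linalg.norm(v - q)))[:k]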
But also, funny, I did a little bit of research in May, I think, when I discovered
turbopuffer, and I started reading about this.
I saw the original implementation of SPFresh was actually on a forked Postgres.
You need to dig into it, in some Microsoft repositories; also, I think
some Chinese engineers were involved, something like that.
I saw some repository which was, basically, a forked Postgres, and the original SPFresh
implementation was on top of that.
Maybe I'm hugely mistaken, but I saw something like this.
But it's hard.
I agree.
And I also recalled what I wanted to ask.
I was lost in my previous question because it was too many things.
I recalled, in your talks, you discussed that S3 is ready to store data.
S3 is ready because over the last few years they added important features.
Can you recall what you talked about in the talks?
Yeah. I mentioned that there were two things that you needed to build a new database. And there's a reason why a database like turbopuffer hasn't already been built. It's a new storage architecture; it's only really possible now. The three things that have enabled a database like turbopuffer to exist with this sort of pufferfish architecture, right, where, you know, when the pufferfish is deflated, it's in S3, and when it's somewhere in between, it's on SSD, and then it's in memory when it's all the way inflated. The reason that's possible is because of three things. The first one is
NVMe SSDs. NVMe SSDs have a new set of trade-offs. They act completely differently than other
disks, right? The old SSDs were very fast, but NVMe SSDs have
just phenomenal performance potential, where basically on an NVMe SSD the cost per gigabyte is 100
times lower than memory, but the performance, if you use it correctly, is only about five times
worse. And using an NVMe SSD correctly really means that you have to put a lot
of concurrency on the disk. But again, similar to S3, every single round trip takes into the hundreds
of microseconds, but you can drive a lot of bandwidth. Old storage engines have not been
designed for that. You have to design for that from the day that you write the first line of code;
otherwise, it takes a very long time to retrofit. It happens to be that that exact usage pattern
is also what's very good for S3. NVMe SSDs were not available in the cloud
until 2017, 2018.
So in database language, this is relatively new.
The second thing that needed to happen: S3 was not consistent until December of 2020.
I think this is the most counterintuitive to most, because most of us just think that it always
has been, but it hasn't.
What that means is that if you put an object on S3 and then you read it immediately after,
you are, as of December of 2020, guaranteed to get it back: read-after-write consistency.
The third thing that needed to happen, and this is very informed by the fact that I was on call for Shopify for so long,
when you're on call for a long time, you gravitate towards very simple systems,
and ideally systems that are constantly tested on their resiliency,
so you don't get paged when something abnormal happens.
And what was very important for us, for Justin and me to be on call for a database,
was that it only had one dependency.
And that dependency could only be one of the most reliable systems on Earth,
which is S3, Google Cloud Storage, right?
and the other derivatives like Azure Blob Storage and so on.
They're very, very, very reliable systems.
But you could not build the metadata layer on top, right?
So Snowflake and Databricks and others that have built on top of this
in the last generation of new databases needed another metadata layer,
some consensus layer like FoundationDB, or their own Paxos or Raft protocol, to essentially
enforce the read-after-write consistency, but also to do various metadata operations
atomically. But in late 2024, S3 finally announced, at re:Invent, compare-and-swap. And what
compare-and-swap allows you to do is to put a file, let's say a metadata file: you download the file,
you do some modifications to it, and then you upload it again with a version number, and you only
upload it if the version number is the same, right? Basically guaranteeing that you did it atomically
and nothing has changed it in the interim. Very important when you're building distributed
systems, right? You can really implement anything on top of that, as long as you're willing
to take the performance hit of going back and forth to S3. And of course, they have a whole
metadata Paxos, whatever, thing to implement that. In GCS, it's Spanner, but I don't have
to worry about that, right? That's for them to formally verify and do whatever they need to do to uphold
those constraints. Those were the three things that needed to happen, right? And that is
requirement number one to build a new database. And so that's what was in the air for turbopuffer
to grab, right?
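A hedged sketch of that compare-and-swap pattern with S3 conditional writes, assuming a recent boto3 that exposes the IfMatch parameter on put_object; the bucket, key, and metadata layout are made up for illustration:

import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET, KEY = "example-bucket", "namespace/metadata.json"   # hypothetical names

def update_metadata(mutate):
    """Read-modify-write a small metadata object via optimistic concurrency."""
    while True:
        obj = s3.get_object(Bucket=BUCKET, Key=KEY)
        etag = obj["ETag"]
        meta = json.loads(obj["Body"].read())
        mutate(meta)                                         # e.g. bump a version counter
        try:
            # the PUT only succeeds if nobody replaced the object since we read it
            s3.put_object(Bucket=BUCKET, Key=KEY,
                          Body=json.dumps(meta).encode(),
                          IfMatch=etag)
            return meta
        except ClientError as e:
            if e.response["Error"]["Code"] in ("PreconditionFailed", "ConditionalRequestConflict"):
                continue                                     # lost the race: re-read and retry
            raise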
The second thing that you need for a new database is that you need a new workload that's begging for that particular storage architecture, right?
So for Snowflake and Databricks, we saw that in: well, we want big-scale analytics,
and it doesn't make sense for the dollars per gigabyte that we have to pay in operational
databases, adding indexes on everything.
So there was a new OLAP workload, or at least an acceleration of the OLAP
workload with all the successful web applications, and then the new storage architecture
on top of S3, but they had to do some more work, because these APIs that I just mentioned
weren't available. And the new workload now is connecting lots of data to AI, and this storage
architecture is a very good fit for it.
Yeah, so that's great, thank you for elaborating about these
three changes at S3. But they also made another change recently, and they announced S3 Vectors,
right? What do you think about this, compared to your system?
Yeah, I think that S3 Vectors, if you're
writing the vectors once and you don't query them that much, and you don't care that much about
the query latency, it can be a useful product, in the same way that you might be using
S3 today. But S3 Vectors doesn't do full-text search, right? It doesn't do lots of these
features that you need for a serious system. And even S3 Vectors recommends that you load into OpenSearch.
And so for archival vectors, this can make sense; they're still operational,
but there are lots of limitations that would make it very difficult for it to go into production
systems, right? If you do a query to S3 Vectors, it takes hundreds and hundreds of milliseconds
to get the result back, whereas with turbopuffer you can get the result back in less than 10 milliseconds.
Thanks to cache, right? Thanks to cache. Yeah. Yeah, that totally makes sense. I still have skeptical
questions, more of them, if you don't mind. Let's go. Yeah, one of them is: your cache layer is on
local NVMe, right? Yeah. But why? Like, we could store data right there. PlanetScale
recently came to the Postgres ecosystem and they said, let's stop having fears about using local
NVMe. Yes, it's ephemeral storage and so on, but there is reliable failover and so on, so
let's do it, it's super fast. Yes, I agree, it's super fast. And four terabytes, or three
terabytes, for a billion vectors probably won't cost too much, because this price, usually in
AWS, for example, on the instances which have local NVMe, is basically included.
And the limit is many, many dozens of terabytes of local NVMe storage these days on larger
instances. So we are fine to store not only vectors, but everything else. So my question is, going
back from S3 to MySQL, for example: MySQL has supported storage engines for many years.
Have you considered building a storage engine for
MySQL, for example, and using local NVMe?
You can't outrun the economics, right?
The economics are still the same.
You have to replicate, whether it's local NVMe or not, and local NVMe is not necessarily cheaper,
maybe only marginally, than a network volume.
You still have to replicate it three times, right?
You still have to put it on three machines.
You still need all the cores and memory and so on that you would need on the primary to keep up,
unless you start to have a heterogeneous replica stack, which for a variety of reasons would be a really bad idea.
So you're still paying for all of that.
Now, up to a certain point, that makes a lot of sense, right?
If I have a customer who gets on a call with our sales team and they have a couple million vectors in PG vector,
there is no reason to move off of it.
That is perfect.
You should not be taking on the complexity of ETLs and so on.
But if you have, like, tens of terabytes of vector data, it is not economical
for a lot of businesses. Now, for some businesses, it is, right? But the art of a business is to earn a return on the underlying costs. And for some businesses it's very, very challenging to earn a return on storing this vector data on three replicas. It's generally not valuable enough to the business. So turbopuffer doesn't make sense as a storage engine in a MySQL or in a Postgres. It's just fundamentally not compatible with the way that we do things outside of replica chains, right? You could maybe come up with a storage engine where
you page everything into S3 and all of that, but you're now trying to build a
new database inside of an extremely old database, right? The storage layer, especially more
so in Postgres, I think, than in MySQL, is very, very weaved into how the
query planner works and all of that. So at that point, you're rebuilding the query planner to be
very round-trip sensitive, you're rebuilding a storage layer; it doesn't make sense anymore.
It's a completely different database.
Okay, another skeptical question.
I go to the turbopuffer website, and by the way, great design, very popular these days, with the monospace
font and these types of graphics.
And you advertise one-billion scale, one billion vectors, one billion documents.
But there, if I check one billion, the concept of namespaces pops up.
And namespaces: if I choose 100 million, it's okay,
I can have one namespace,
but if it's one billion, there is a warning that probably you should split it into 10 namespaces,
and this means 10 different indexes, right? That's right. Yeah. So it's not actually one-billion scale,
or something is off here in my mind. One billion should be a single index,
a single index. If we talk about one billion but divided into 10 collections and
indexes, it's already a different story. So can you explain this to me?
I saw this in the beginning, and I'm happy that in this position I can ask the founder himself about this.
Yeah, we try to be extremely transparent in our limits, right? And I think your mental
model is correct. Before this, we just had a limit that said that we could do, in a single
namespace, around 250 million vectors, right? But I mean, even that is a simplification,
because how big are the vectors? If there are 128 dimensions, they take a lot less space,
which is ultimately what matters here, which is why we also put the gigabytes. So when, in the
past, we had just 250 million vectors on there as a limit, people came to us and said, or we knew,
that people weren't testing with us because they wanted to search a billion
at once. And they didn't realize that you could just do ID modulus N, and then you could do a
billion, and it would be very economical for them. So we sort of had to, you know, put in the docs
like, yes, you can do a billion at once, but you have to shard it. Now, I would love to handle that
sharding for you, right? I mean, that's what Elasticsearch does, and it's what a lot of databases
do, because the only way to scale any database is sharding, right? You don't get around it. The question
is where the complexity lives, right? Does the complexity live inside of the database, to handle it for
you? Some of the most sophisticated sharding you will find lives inside of CockroachDB and
Spanner and these kinds of systems. And the simplest type of sharding is what we're
exposing, where every single shard is just a directory on S3, and you can put as many of them
as you want, and you can query as many of them as you want. Now, over time, of course, we need
to create an orchestration layer on top of that, so that a logical namespace to the user is
actually multiple namespaces underneath. But we're challenging ourselves to make every
individual namespace as large as it possibly can be. When I ran an Elasticsearch cluster, or was involved
in scaling Elasticsearch clusters, every shard was around 50 gigabytes. That was roughly what was
recommended for Elasticsearch. Quite small, right? It's quite small, right? On Postgres,
it's a lot larger than that, it's basically the size of the machine. But the problem with a small
shard size is that for something like a B-tree, right, a search is obviously log n, right,
in the number of elements.
And if you have log n and n is very high,
well, that's still a great number of operations.
But across shards, it's more like m times log n,
and now m is very high
if the shard size is small and n is small.
So you're doing a lot more operations.
So we want the shards to be as large as possible.
Because, of course, you can get to a billion by just doing, you know,
a thousand one-million-vector shards.
But that's incredibly computationally ineffective.
So we have shards now, for some
users that we're testing with, that do almost a billion documents at once, right?
But it requires some careful tuning on our end. So we want to push that number as
high as possible. The other thing with namespaces is that turbopuffer is a multi-tenant
system, and we have to bin pack these namespaces. So if a namespace is five
terabytes large, it's much harder for us to bin pack it onto node sizes that make sense.
So we have to walk some threshold there, where we try to make the shards as large as
possible for the logical data, we have to bin pack them, and then we have to index
without slowing the indexing down, because the larger an ANN index gets, the harder it is to
maintain the index. So those are the constraints we walk. So over time we update these limits continuously;
you will see that shard size increase. But you're not going to find anyone who is doing a hundred
billion vectors in a single index on a single machine, a.k.a. m is one and n is a hundred billion.
But we want to get as high as possible, because that is the most computationally efficient;
that's how we get our users the best prices, and we see the most ambitious use cases.
And the shards, are these actual shards or partitions? So, shards meaning that different compute
nodes behind the scenes are used: if it's 10 namespaces, it's 10 shards, then different compute
nodes? So, a namespace for us, and this is also why we've chosen a different name for
it, right, a namespace to us is a directory on S3, and which compute node that goes to is essentially
just a consistent hash of the name of the prefix and the user
ID, right? So compute node one could hash to maybe, you know, 100,000 different namespaces,
and they share the compute. And then we scale the compute, right, with that. And that's why the bin
packing problem comes in. It's flexible, I see. Yes. And that's also part of how we can get really
good economics.
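A hedged sketch of the two routing ideas that came up: splitting one logical collection across N namespaces with ID modulus N, and mapping namespaces to compute nodes with a consistent hash. The names and numbers are illustrative, not turbopuffer's actual implementation:

import bisect
import hashlib

NUM_NAMESPACES = 10
NODES = ["compute-node-1", "compute-node-2", "compute-node-3"]

def namespace_for(doc_id: int) -> str:
    # shard a billion-document collection across 10 namespaces (S3 prefixes)
    return f"my-collection-{doc_id % NUM_NAMESPACES}"

def _h(s: str) -> int:
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

# consistent hash ring with virtual nodes: adding or removing a node only moves a few namespaces
RING = sorted((_h(f"{node}#{i}"), node) for node in NODES for i in range(64))
KEYS = [k for k, _ in RING]

def node_for(namespace: str) -> str:
    i = bisect.bisect(KEYS, _h(namespace)) % len(RING)
    return RING[i][1]

ns = namespace_for(123_456_789)
print(ns, "->", node_for(ns))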
Yeah, I wanted to shout out again: the front page is beautiful.
It explains the basics, architecture and pricing right there, latencies right there, and case
studies. This is how it should be for engineers, you know, so you quickly see all the numbers
and understand. That's great, thank you for the transparency. I think it just comes from
the fact that I was on, you know, your side of the table my whole career, right? I've been buying
databases, evaluating databases, and sometimes, I swear, I end up on a database website and I don't
know if they're marketing a sneaker or a database. Yeah, go figure out
storage or compute prices in AWS or GCP.
It's like always a task.
So we really wanted to just put the foot forward of like the diagram is right there.
Okay, this is my mental model.
This is what it does.
This is what it costs.
This is how fast it is.
These are the limits.
These are the customers and how they use it.
You go to the documentation.
We talk about guarantees.
We talk about tradeoffs.
The kinds of things that I always look for immediately to slot it into my model.
Because I don't want you to use our database at all costs.
I want you to use it if it makes sense for you.
I want it to be a competitive advantage to you.
I want to save you like millions of dollars a year.
And in order for that to make sense,
you have to just put all the bullshit aside
and make it very clear what you're good at and what you're not good at.
What about, for example... I know there is vector search and full-text search supported, right?
But what about additional filtering, when we have some dimensions
we want to filter on?
Is it possible?
Yes.
For example, some categories or, like,
time ranges or something. It's all possible, right, in both types of search in
turbopuffer? I'll go back to my line from before, that every successful database eventually supports
every query, right? And we're, I don't know, like maybe 5% of the way there. So we support
filters. And filtering on vector search is actually very difficult, right? Because if you do
something like, let's say I'm searching for banana, and it's an e-commerce site, and I have to
filter for ships-to-Canada, well, that might cut off a fruit cluster, which is actually where I
wanted to be. And then, like, you know, I get to the banana-on-a-t-shirt cluster, but it's really
far away. So you have to do very sophisticated filtering to get good results. I don't know what
pgvector does, but I know that our query planner is recall-aware. It does a lot of statistics
on the distribution of where the vectors are to ensure that we have high recall. And for a percentage
of queries in production, we check against an exhaustive search what the recall is, the accuracy
of the index. I don't know if pgvector does this, but I know it's been a monumental effort
for us, a big headache there. It's a big headache.
Actually, it's a good point that additional
filtering can change the semantics of the search. If it's searching for bananas with size equals
nano, it's very different. One thing they've implemented relatively recently is iterative index
scans. So I think they had a problem where, for example, you'd do a similarity
search with a limit and a filter, and you'd ask for 100 results, and you'd get back fewer,
even though there were more; so now they go back to the index and request more. So yeah, it's like a
workaround, I'd say, more than a solution, but it's pretty effective for a lot of use cases. I think the
problem there is that I would be very skeptical of the recall in a case like that, right? Because, okay,
you got 100, but you probably should have looked at a lot more data to know what
the true top-k was. So, you know, that solution was not acceptable to us; we had to go much, much
further to ensure high recall.
And I think the problem with a lot of these solutions
is that people are not evaluating the recall,
because it's actually very difficult to do,
and it's very computationally expensive.
You're not going to have your production Postgres
run an exhaustive search on millions of vectors.
It's too dangerous to do that
where you're doing your transactional workloads.
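For reference, a small sketch of what measuring recall@k against an exhaustive search looks like, the kind of check that would be run offline or on a sampled fraction of queries rather than on a production Postgres:

import numpy as np

def exact_topk(docs: np.ndarray, q: np.ndarray, k: int) -> set[int]:
    # the ground truth: brute-force scores against every vector
    scores = docs @ q
    return set(np.argsort(-scores)[:k].tolist())

def recall_at_k(docs: np.ndarray, queries: np.ndarray, ann_search, k: int = 10) -> float:
    # ann_search(q, k) -> iterable of candidate ids from the approximate index under test
    hits = 0
    for q in queries:
        truth = exact_topk(docs, q, k)
        hits += len(truth & set(ann_search(q, k)))
    return hits / (k * len(queries))   # 1.0 means the ANN index returned every true neighbour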
But again, if you're hitting these kinds of problems,
then it may be time to consider, maybe for a subset
of your workloads, whether something more specialized
makes sense, whether the tradeoffs make sense for you.
But I think, with the pre-filtering and post-filtering,
it can be very challenging to create a query planner that can do that well.
So we support all the queries, right?
You can do range queries.
You can do exact queries.
You can operate with arrays.
You can do all types of queries, set intersections that people use for permissions.
We can do full-text search queries.
We can do group-bys.
We can do simple aggregates.
So we can do more and more queries.
And we're constantly expanding that with, with what our users demand from the system that we built.
One more.
Another question.
I know it's not open source.
There is no free version
at all, even.
I'm pretty sure it was
a deliberate decision not to go with an open-core
or open-source model at all,
and can you explain why?
Yeah, I think there's never been
any particular resistance
to open source.
I mean, I love open source.
The reason we're not open source
is because open source is if you want to do it well,
it's a lot of effort.
And it's also a lot of effort
to build a company.
It's a lot of effort to find product market fit.
And we decided to pour all of our energy
into that.
And it's a similar argument for the minimum.
It's really the only thing that a startup has over, you know, the big incumbents is focus.
And so our focus has been on the customers that are willing to pay.
And one day I would love to give a free tier.
I would love for people's blogs to run on TurboPuffer.
But for now, I'm afraid that we have to prioritize the customers that are willing to put some money behind it.
It's not because we have provisioned hardware or anything like that.
It's really just that we need to know that you're willing to put some money behind it, so we can give you
amazing support, right? A lot of the time, there'll be engineers working on the database who are supporting your tickets,
and we can't do that with a free tier. Yeah, makes sense. Yeah, and Michael and I have different
points of view on this area. I think that was a good answer. I'm on your side, Simon. I had a question
on the full-text search and semantic search. We had a discussion a while back about hybrid, we called it...
I think it's generally called hybrid search.
Are you seeing much of that where people are wanting to kind of, yeah, mix the results and order between them?
Yeah, we do.
And what we see is that the embedding models are pretty phenomenal at finding good results for subjects that they know about, right?
And terms that they know about.
You know, you run a project called pgMustard, right?
And maybe the embedding model doesn't know.
I think it's popular enough that it will know what it is, but let's say it didn't know.
And then it puts it in a cluster close to ketchup, and the results are just horrible. The full-text search tends to be really good at this, like recall for things that the embedding model can't know about. If you're searching for, like, an LG, you know, these TV SKUs, it's indecipherable. Actually, the embedding models are quite good at these, because they appear enough on the web. But imagine one that isn't; those are the kinds of searches it's good for. The other thing is search-as-you-type. So if I'm searching for "si", the embedding
model might find that, well, he's searching for something agreeable in Spanish, right,
but really what I wanted was the document that starts with Simon. So sometimes
you just need to complement these two, and I think that's why we've doubled
down on the full-text search implementation in turbopuffer to do this hybrid. But
people can get a lot of mileage out of embedding models alone.
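A common way to combine the two result lists is reciprocal rank fusion; this is a generic hybrid-search recipe, not necessarily what turbopuffer does internally:

from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    # rankings: each inner list is document ids ordered best-first by one retriever
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fulltext_hits = ["doc7", "doc2", "doc9"]        # e.g. from BM25 full-text search
vector_hits = ["doc2", "doc5", "doc7"]          # e.g. from ANN vector search
print(rrf_merge([fulltext_hits, vector_hits]))  # documents found by both float to the top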
Wonderful. Yeah, thank you so much.
Yeah, it was super interesting. Good luck with your system. And yeah, I think it's an
interesting idea, I guess, for Postgres users who need, as you said, to store more
and have all the characteristics turbopuffer has; maybe it's worth considering, right,
to keep all OLTP data, all OLTP workloads and data, in regular Postgres while moving
vectors to turbopuffer, right?
I mean, it's just similar to, you know, how people have taken workloads out of
Postgres before. You know, it's like updating a posting list:
it's a very expensive operation in a transactional store.
And it's similar with ANN indexes: updating those with the same kind of ACID semantics that Postgres upholds
is very expensive.
And I've ripped full-text search out of Postgres many times, because it's very, very expensive to do.
So we do that because we don't want to shard Postgres, because it's a lot of work.
And so we start by moving some of these workloads out first.
And search is one of the early ones to go.
So your point is that it probably postpones the moment when you need to shut.
that's interesting same same reason same reason from mcash and redis right you you you you
you separate out these workloads as soon as possible to avoid sharding mm-hmm
Mm-hmm. Interesting idea. Okay, again, thank you so much for coming. It was a super
interesting discussion. I enjoyed it a lot. Thank you so much. Yeah, really nice to meet you.
Thanks for joining us. Take care.