Postgres FM - turbopuffer

Episode Date: September 12, 2025

Nik and Michael are joined by Simon Eskildsen from turbopuffer — among other things, they discuss ANN index types, tradeoffs that can make sense for search workloads, and when it can make sense to move search out of Postgres.

Here are some links to things they mentioned:

Simon Eskildsen https://postgres.fm/people/simon-eskildsen
turbopuffer https://turbopuffer.com
Use ULID Idempotency Keys (tip 6 in this blog post from Shopify) https://shopify.engineering/building-resilient-payment-systems
PostgreSQL 18 Release Candidate 1 https://www.postgresql.org/about/news/postgresql-18-rc-1-released-3130
Understanding DiskANN (blog post by Junaid Ahmed) https://www.tigerdata.com/blog/understanding-diskann
SPFresh: Incremental In-Place Update for Billion-Scale Vector Search (paper) https://arxiv.org/abs/2410.14452
Amazon S3 adds new functionality for conditional writes https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes
Amazon S3 Vectors https://aws.amazon.com/s3/features/vectors

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello, hello. This is Postgres FM. As usual, my name is Nik, Postgres.AI, and my co-host is Michael, pgMustard. Hi, Michael. Hi, Nik. And we have an unexpected guest today, Simon Eskildsen, CEO and co-founder of turbopuffer. Hi, Simon. Thank you for having me. Yeah, thank you for coming. And it was very unexpected, because we mentioned turbopuffer last time and you messaged me on Twitter. And I think it's a great idea sometimes to look outside of the traditional Postgres-like systems. I think it's beneficial for everyone. So yeah, thank you for coming, for sure.
Starting point is 00:00:39 It's a great idea, I think. Yeah, the origin story is kind of funny, and it was only a couple of days ago. There's a script where if turbopuffer is mentioned anywhere, then I'll get an email or a summary. And you guys were discussing last time different ANN indexes, both in Postgres and outside, and turbopuffer was mentioned. And so I just DM'd you and asked, hey, I'd come on, chat about Postgres, chat about MySQL, chat about databases and when you choose one over the other, and when Postgres breaks for some of these workloads that we've seen, and when it's great. And now it's what?
Starting point is 00:01:11 Yeah, two or three days later and we're on. Including the weekend, actually. So the podcast was out on Friday and we're recording this on Monday. That's velocity everyone should try to achieve. I like it. Yeah, and you've had chickens hatch in the meantime. Oh, yeah. That's why my camera is overexposed, because for a whole night I used this camera to broadcast.
Starting point is 00:01:34 I didn't have time to tune it properly. But again, thank you for coming. This is about databases. This time, probably not so much about Postgres, but definitely we should talk about vectors, right, and ANN. And maybe we should start from a distance and talk about your background. And I've heard some MySQL is involved. Can you discuss it a little bit? For sure, yeah. My background is, I spent almost a decade at Shopify, scaling mainly the database layer there, but pretty much anything that would break as
Starting point is 00:02:07 as the scale increased and increased through the 2010s shopify like most companies started in the early 2000s was on my sequel so a lot of the work that i did was was with my sequel but also every other database that shopify employed like redis memcash elastic search and a ton of others, Kafka, so on. So it's been a long time there. When I joined, it was a couple hundred requests per second. And when I left, it was in the millions and very, very intimate with the data layer there. I was on the last resort pager for many, many years and that it has informed a lot about how I write software today. And a couple years ago, I started a company called turbopuffer because I thought, I, sorry for interrupting. I remember Shopify, actually, we always
Starting point is 00:02:52 recommend the article. Should I have a couple of blog posts about UVID and how UID version 4 is not good and version 7 is much better. And it doesn't matter my SQL or Postbis, the mechanics behind the scene is the same. How B3 behaves. So I remember.
Starting point is 00:03:09 And I think, Michael, we mentioned that article on podcast a few times as well. We have mentioned it, Nick, for sure. And I think shopfire has come up before. I thought, is it a Vitesse shop? I thought it might have been one of the fairly the test adopters. Shopify is using a little bit of a test, but not very much.
Starting point is 00:03:26 The test was not around when we did all the sharding back in the early 2010s. And so we did it all at the application layer. I wasn't part of the actual sharding decision, but I was part of a lot of the sharding work over the years. And it's funny because at the time I know that they looked at a bunch of proxies and all of the businesses they later looked at had gone out of business. It's not a great business, unfortunately. But everything was done in Ruby Land through a just a module called shirons. And it did a lot of things and a lot of monkey patches into rails.
Starting point is 00:03:55 But let's talk about this like UUIDV4 thing, because I think if we wanted to do a pros and cons, MySQL versus Postgres, I spent quite a bit of time with both. And this one, this one actually, to my knowledge, only really matters for MySQL. Well, it actually matters for Postgres as well, but in a different way.
Starting point is 00:04:14 So on MySQL, right, the primary key dictates how the V3 is laid out for the primary key, right? So for the entire row. So if you have UUIDV4, it's completely randomly scattered along the B tree. So whenever you're doing an insert, right, it will just kind of fall somewhere random in the B tree, which makes the updates very expensive
Starting point is 00:04:31 because if you're updating 10, if you're adding 10 rows at a time, so you're not update to ads, doing 10 insertions that are in 10 different leaves, while you're just doing a lot more disk I-o and your right infigation is high. In Postgres, of course, you're just appending to the heap with all of its drawbacks and benefits,
Starting point is 00:04:49 and so it doesn't matter as much, other than on the indexes. right is my knowledge but on the indexes it will matter a lot because on the indexes of course it is sorted by that and if you have some temporal locality then it's not going to matter as much so that's my understanding so this matters a lot in my sequel now that article i think was after i left and shoplify doesn't use uIDs as primary keys for anything so i don't really know where this mattered it must be something tangential because my sequel really just does auto increment and every shard does an auto increment with basically like with a 32k
Starting point is 00:05:23 auto increment number and then every shard has a plus offset into that to allow it to grow to 32k shards given how much of a pain in the answer would be to change that that's probably still the case but I always really like that scheme so some tables over time at shopify ended up having a primary key on the shop ID comma the ID because that would give locality for a shop because otherwise you have a lot of redamification if you're trying to dump out a bunch of products for a shop because the chance that there's going to be multiple products for a shop like on on in a leaf unless you do that is just a lot lower um so that ended up working really well and this is a pain to do in postgres because if you want to rewrite the primary key or the heap
Starting point is 00:06:05 buy an id you have to rewrite the entire thing um that was one of my surprises having having worth a bit more of postgres in in later years yeah yeah yeah and i agree with you And I agree that in post-guise it also matters, but only for B3 itself, primary key B3, if it's after increment or in post-gris it's called big serial, a serial, for example, or auto-generated right now, there's another method. But behind the scene, it's also like sequence and it inserts. Oh, no, in this case it's not sequence. There should be a function which generates UD version 4, and if it's random, like version 4 is random.
Starting point is 00:06:45 version 7 is closer to regular numbers, basically, right, because it's monotonically growing, right, lexic graphically ordered, right? So in this case, you insert only on the right side of B3 and dirty pages, if you think about how checkpointer is working. Also, in postgast there is also overhead after each checkpointer is full page right, which involves indexes as well. So if you touch random pages all the time, a disk I.O. overhead and replication and backups, actually, everything receives additional overhead while in version 7 we write on the right side all the time. It's much better. But hip, yes, hip is different. So I agree with this. And anyway, like I just wanted to say we use my SQL article because it's written very well. And in Postgreas, we didn't have version 7 for quite some time. Last Thursday, version 18 release candidate was out, which will include full implementation of UID version 7, which was vibe coded on PostGGTV with like a couple of friends, just in Cursor, I think. We just, oh no, it was not on Coursen, it was before that, but anyway, it was just like,
Starting point is 00:08:02 it was created right online with, we did it and right now. It took a couple of years to reach maturity because Postgres always waits until RFCs are finalized and so on. Anyway, soon version 8 will be 18 will be out and UID version 7 is inside. But I think everyone is already using it generating it on client. Okay, great. So you had this great career and then decided to create another, can we call it database system database management system.
Starting point is 00:08:37 It's certainly a full-blown database. You know, underneath TurboPuffer is an LSM. An LSM works really well for OBIC storage. And, you know, every successful database ends up implementing every query eventually, right, in the limit. And TurboPuffer will end up doing the same thing. But every good database also starts with some specialization, right? And our specialization has been on search workloads. I would say that it's by no means a replacement for Postgres.
Starting point is 00:09:08 There always becomes a time where it starts to make sense to move parts of data into more specialized algorithms, more specialized data structures, and more specialized storage. In general, my hypothesis on when it's time to create a database is that you need two things in the air in order to create a generational database company. The first thing that you need is that you need a new storage architecture, because if you don't have a new storage architecture, some new way to store the data, ideally both data structure-wise and also the actual medium that you're persisting on, there's no reason why in the limit the other databases want to do the workload better.
Starting point is 00:09:42 They already have the existing momentum. They already have the workloads. In the Postgres case, of course, you know, it's the classic relational architecture where you replicate every byte onto three disks, and this works phenomenally well, right? We've had this in production for probably more than 40 years, and it works great. It has high performance. It has a very predictable performance profile, and it works really, really well with the page cache or the buffer pool, whatever database you're using. The problem with this model is that if the data is not very valuable, this model is expensive. Every gigabyte of disks, a network-bound disk is about 10 cents per gigabyte. And unless you are really like really risky DBA, you're going to run those disks at 50% utilization on all the replicas and on the primary.
Starting point is 00:10:26 So you're paying for this disk kind of three times, which ends up with a whole all-in cost for 60 cents per gigabyte. And that's not even accounting for all the CPUs that you also need under replicas, because you need the replicas to process the rights as fast as the primary. So you're kind of paying for the same thing three times. So the all-in sort of per terabyte cost, when you also take into consideration the D-Systems here can be a little bit more memory-hungry, is probably around 60 cents to $2 per gigabyte per month, right? Per month. Yeah, per month, USD. On OBIC storage, the base cost is two cents per gigabyte, right? And when we need that data in RAM or on disk, we only have to pay for one. And we only have to pay for some percentage of the data to be in cash at all times, right? You mentioned cursor earlier. Curser is a turbo profit customer.
Starting point is 00:11:15 And they don't need every single code base on SSDs or in memory at every all times. They need some percentage in memory, the ones that are queried a lot right now, and some percentage on disk, the ones that are going to be corried again in a few minutes or maybe in a few hours. And so you end up paying a lot less because we only have to keep one copy of a subset of the data rather than three copies of all of the data. Now, that comes with a fundamental set of trade-offs, right? We want to be as public as that you can't use turbo-puffer everything. If you want to do a right to turbo puffer, we commit that to S3, right? So by the time it's committed to turbopuffer, the guarantee is actually stronger than most relational systems because
Starting point is 00:11:52 we've committed it into S3, which takes a couple hundred milliseconds, but the durability guarantee is very strong. But if you're building a system like Shopify, well, you can't live with a commit time in hundreds of milliseconds. It's just not acceptable. So that's a trade-off that means that this is not a system that will ever replace a relational database store. The other downside is that because not all the data is on a disk or in memory at all times, it means that you can have tail agency. And that can be really catastrophic in very large systems that are doing millions of queries per second. If you can't rely on a very predictable query profile. You can have massive outages to hydrate the caches. I've seen these outages on disks. I've seen them. And just even the workload changing slightly
Starting point is 00:12:33 can mess with the buffer pool in a way where you have a massive outage. So these two things may sound like small tradeoffs, but they're massive tradeoffs for very large production systems. But it might mean that if you have, let's say, a billion vectors and you're trying to store them into Postgres, the economics just don't make sense. You're paying thousands of thousands, if not 10 to 1000 of dollars in hardware costs for workload that might cost tens or hundreds of dollars on turbo cover how much is it really one billion vectors if one vector is what's like kilobytes 700 and uh 768 dimensions yeah it's um it's it's three terabytes three terabytes to store one billion vectors yeah and also don't have a good index for
Starting point is 00:13:16 one billion scale yeah i mean for for you in post with PG vector HNSW won't work with one billion. I mean, we could just run the math a couple different ways, right? I'm not saying this is how it works in PG vector. I'm less familiar with it now, but even if you have three terabytes of data raw, you're probably going to need to store more than that. You might be able to do some tricks to make the, make the vector smaller, so you only have to store maybe, I don't know, a terabyte or something along those lines, right?
Starting point is 00:13:48 But remember that a gigabyte of DRAM is $5 per month. And you need that three times on all your replicas. So you're paying $15 per gigabyte per month. So if you're doing that, if you have to store that three times, you put everything in memory and you're somehow able to get it down to a terabyte, then you're talking about $15,000 per month, right, across the three replicas, just for the, for the RAM alone. Yeah, I agree with you, HNSW.
Starting point is 00:14:16 you, it's like memory. Memory is the key, and creation of index takes a lot of time. And for billion, I will already have issues with a few million vectors scale. I know time scale, which I now renamed to Tiger Data, they developed another index based on disk N from Microsoft, I think, which is more like for disk, right? But I agree with you, like for this scale, it's not convenient. But also, I think it's not only about vectors, this argument that we need to save on storage costs. And it's insane to pay for storage and memory as well when we have replicas and we scale. In PostGos, if it's physical replica, it's everything.
Starting point is 00:15:00 So you replicate everything, all indexes, everything. You cannot replicate even. There is no ability to replicate only one logical database. You need to get all logical, whole cluster. And it means that you multiply costs for storage. and for memory and it would be so great to have some tiered storage maybe with partitioning as much automated as possible and offload all data to S3 as basically you like you consider S3 is great for this and I remember I was actually I explored Tobafer through Cursor.
Starting point is 00:15:35 It's great like to see the documentation. I knew Cursor is using Posgis it was a few months ago but then I heard they considered moving to Planet Scale. That was before Planet Scale announced support of PostGGS. So I was thinking are they switching to my SQL. And then I saw vectors are stored in TurboPyfer. Great. Then I learned that several our clients who get consulting from us and use PostGGGGES, they also store vectors in TurboPathel. It was, I think, photo room, a couple of more companies. It was a surprise for me. And I started to think, oh, that's interesting. And then I, yeah, and then I, and then I, checked your talks. I think you also mentioned there that if we, so there is this approach with
Starting point is 00:16:25 SPFresh algorithm, right, different. It's not HNSW, different types of index, but also some additional interesting ideas about economics you mentioned, right? Not only these sense. Can you elaborate a little bit? I think it might be, it might be helpful to just talk at a very high level about the different algorithms to do vector indexing into. I'll try to simplify it as much as possible, and we can dig into it further if you want. Fundamentally, the simplest way to do vector search is that you just store all of the vectors in a flat array on disk, right? And then on every search, you just compare the query vector to all of those.
Starting point is 00:17:03 The problem with that is that you very quickly run up against bandwidth limits, right? If you have a gigabyte of vectors and you're searching that at maybe 10 gigabytes per second, if you can exhaust the memory bandwidth, which is unlikely in a big point. production system, you're only doing, you know, maybe five quarries per second and the quarry latency into hundreds of milliseconds. So if you have very, very few quarries and you don't care about latency, this can be feasible on small scale. And lots of people are doing that in production. But if you want to search a million vectors in, you know, less than a couple hundred milliseconds and maybe 10 milliseconds and as part of a bigger pipeline, you need some kind of index. The problem
Starting point is 00:17:38 with indexing vectors is that there is no known way to do it in a perfect way, right? If I search for a query vector about fruit or whatever. I know that if I'm searching for banana, I get the closest fruit. Maybe I don't know, maybe there's a plantain. I don't know, right, in the cluster. But you have to build an approximate index in order to make this faster.
Starting point is 00:17:57 Because of too many dimensions. For small number of dimensions, there are approaches and the poses have them for years, but for hundreds of thousands of dimensions, yeah. That's right. Yeah, there's KD trees and so on for the smaller dimensional space, which we can use for geo-coordinates and things like that and simpler geometry.
Starting point is 00:18:16 So for very high dimensional spaces, these things fall apart. Curse of dimensionality, it's called. So they're very large is also important about the vectors. If you have a kilobite of text, it can easily turn into tens and tens of kilobytes of vectors, which is why separating into cheaper storage
Starting point is 00:18:30 makes a lot of sense. So there's two fundamental ways that you can index the data. There's the graph-based approaches, HNSW and Diski and N, where the two you mentioned before, and there's the clustering-based approach. The graph-based approach is phenomenal
Starting point is 00:18:42 if you can store all of the data in memory and you have very high QPS and very low latency requirements. So if you have, let's say, 100 million vectors and you're searching that at 100,000 queries per second and you need very low latency, you're not gonna beat HNSW. It's gonna create a very, very good graph to navigate it with.
Starting point is 00:19:02 The problem is that it's very expensive. And the other problem is that almost no workloads in the real world actually look like this. So HNSW got very popular because it's fairly simple to implement correct. and it's very simple to maintain. So when you create the graph, which is essentially just, you know,
Starting point is 00:19:18 points that are close in vector space are close in the graph. It's very simple to incrementally maintain. You put one thing in, you search the graph, and then you add it. There's very simple rules. You can implement something like HNSW and tens of lines of code
Starting point is 00:19:30 if you did a very, very simple implementation of it. The problem at H&SW is that every time you do a write, you have to update a lot of data, right? In database land, we call this right implication, where every byte or every page you update, you have to have to have to update. a lot of others. The reason for that is that you add something, you add a graph, you add a node in the graph, and then you have to update all the other
Starting point is 00:19:48 thing to do connections to that node in the graph. So this works great in memory because memory is very fast at updating and very fast at random rights. But the problem is also on the read path. So memory is very fast at doing random reads. You can do a random read at 100 nanoseconds, but a random read on S3 or on a disk is much slower into hundreds of microseconds to hundreds of milliseconds on S3. And on a graph, right, it's, you don't really, there's no speculation that helps, right? If you start at the middle of the graph and then greedily navigate the graph from the query vector to the closest matching vectors, well, every single time you do that, there's a round trip.
Starting point is 00:20:24 And on S3, that's a round trip that takes hundreds of milliseconds. So you're sort of navigating from the route. It's like 200 milliseconds. You go out one, 200 milliseconds. You go out another one, 200 milliseconds. And in general, for HNSW on a million, this might be in the tens of round trips. So this gets very slow, right? into seconds to do this on S3, and even on a disk, this very quickly adds up to tens of milliseconds.
Starting point is 00:20:46 That's the fundamental problem with graphs. Now, disk A&N is essentially using a lot of the same ideas as other graph-based indexes, but instead of trying to have, you know, the tens of round trips that H&HW has, that's very good for memory, disc A&N basically tries to shrink the graph so there are fewer jumps, right? So instead of 200 milliseconds 30 times, it tries to get it to maybe six or seven times or 10 times by shrinking the graph as much as possible. That's essentially the insight in disk ANN. The problem with disk ANN is that after you have added more than 10 or so 10 to 20 percent of the size of the data, you have to rebuild the entire graph, which is an incredibly expensive operation. That is absolutely terrifying to me to someone that's been on call for large databases
Starting point is 00:21:32 to just have like you could max out like 128 cores rebuilding this graph in prod and it could take you down at 3 a.m. because you don't know when you hit that threshold. And if you don't do it, then the approximation start getting bad and you start getting bad search results. The nice thing about the graphs is just they have very low latency, but they're just very expensive to maintain. Now the first way, then there's the other type of index which are the clustered indexes. Clustered indexes are very simple to picture in your head, right? If you have a cluster of, let's say you took every song in Spotify and you cluster them in a coordinate system and you can visualize this and through two dimensions.
Starting point is 00:22:08 If you, then you plotted all the songs, and the songs that are adjacent, of course, also adjacent in vector space, and genres will emerge, right? There'll be a rap cluster. There'll be a rock cluster. You zoom in and you get, you know, like little subclusters, I don't know, death metal, black metal, I don't know what all the different rock genres are, right? Somewhere in these clusters. And you can generate great recommendations based on that because you can look at, okay,
Starting point is 00:22:31 what did Michael listen to and what are some songs that are close by that he hasn't listened to and same for Nick. So it's very simple. You create a clustering algorithm that basically just try to divide the data set into clusters. And when you do a query, instead of looking at all the vectors, you look at, well, clearly the user is asking about a rock song. So we're only going to look in the rock cluster. And that way you divide down the number of vectors that you have to seek. Now the problem with this is that if you have everything in memory, it's not necessarily as optimal,
Starting point is 00:22:59 because you might have to look at more data than you do in a graph-based approach. And because RAM has such good random read latency, the penalty is not necessarily worth it if everything is in memory at all times. But this is great for disks and it's great for S3 because I can go to S3 and in 200 milliseconds get gigabytes of data back, right? It doesn't matter if I'm getting like, you know, a megabyte or a gigabyte. I can often get that in the same round trip time if I exhaust the network. So if you then go into S3, basically you have to download all the clusters. So like let's say clusters of JSON to really simplify this and then you just look at the closest end clusters to
Starting point is 00:23:35 your query vector, and then you download, you know, cluster 1. JSON, cluster 2.jsons, whichever ones are closest, is in two round trips. And now instead of on the graph-based ones, where you're doing 200 milliseconds, 200 milliseconds, 200 milliseconds to navigate the graph, you just have to 200 milliseconds to get the clusters, and then 200 milliseconds to get all the clusters that were adjacent. The nice thing about these clustered indexes is that with algorithms like SPFresh and lots of modifications to them, we can incrementally maintain these clusters, because you can imagine that when you add a vector, you just have to add it to the cluster, and it's just one right.
Starting point is 00:24:05 The write invocation is very low. Once in a while, that cluster will grow beyond a size. Let's say it's 1,000 elements long and you have to split the cluster, and then you have to do some modifications. That's essentially what SPFresh is. And then there's a little bit higher right identification. But it's stable in the way that you never reach this threshold where, okay, I've added 20% of the data set. I have to rebuild the entire thing as you do in disk Gn.
Starting point is 00:24:27 HNSW don't have to do that, which is why it's very, very nice, but it's just slow still. So SPFresh, we think, right, is a really, really nice. set of trade-offs where it's going to be a little bit slower but slower in terms of okay instead of a search returning in a millisecond it might take five milliseconds and just no one cares in production for search this matters for a point look up into a relational database but for search it's a perfect set of trade-offs question on the cluster splitting does that mean we don't ever need to rebuild the hold index because i think that was a limitation of the first cluster-based i vf flat i think we're started with in PGVector and that didn't have the splitting as far as I'm aware.
Starting point is 00:25:09 And therefore we had to rebuild the whole index from time to time if the data changed significantly. That's that's right. And it was also not only split, there is also merge in SPFresh as I remember. Yes, there's merge as well. Makes sense. And because you might do deletes, which are also a pain. To my knowledge, PGVector does not implement SPFresh. It's a...
Starting point is 00:25:33 It is very difficult to implement. implement correctly. But also funny, I did a little bit research in May, I think, when I discovered TurboPuffer, I started reading about this. I saw original implementation was actually on forked PostGos, SPFresh. I saw some, you need to dig into it, like in some Microsoft repositories, also I think some Chinese engineers were involved, something like there. Like I saw some repository which was like, basically, forked PostGus and original SPF
Starting point is 00:26:05 refresh implementation was on top of that. Maybe I'm hugely mistaken, but I saw something like this. But it's hard. I agree. And I also recalled what I wanted to ask. I was lost in my previous question because it was too many things. I recalled in your talks, you discussed that S3 is ready to store data. S3 is ready because over the last few years, they added important features.
Starting point is 00:26:30 Can you recall what you talk about at the talks? Yeah. I mentioned that there were two things that you needed to build a new database. And there's a reason why a database like Toppo Buffer hasn't already been built. It's a new storage architecture. It's only really possible now. The three things that have enabled a database like Topopuffer to exist with this sort of pufferfish architecture, right, where it's, you know, when the pufferfish is deflated, it's in S3. And when it's, you know, somewhere in between, it's in SSD. And then it's, it's in memory when it's all the way inflated. The reason that's possible is because of three things. The first one is our NVME SSDs. NVMEES have a new set of trade-offs. They act completely differently than other disks, right? SSDs are just sort of like the old SSDs were very fast, but NVME SSDs have just phenomenal performance potential where basically on an NVMESD, the cost per gigabyte is 100 times lower than memory, but the performance, if you use it correctly, is only about five times worse. And you have to, by using NVMESD correctly is really that you have to put a lot of concurrency on the disk. But again, similar to S3, every single round trip takes into hundreds
Starting point is 00:27:42 of microseconds, but you can drive a lot of bandwidth. Old storage engines have not been designed for that. You have to design from that from the day that you write the first line of code. Otherwise, it takes a very long time to retrofit. It happens to be that that exact usage pattern is also what's very good for S3. NVMESDs were not available in the cloud. until 2017, 2018. So this is, in database language, this is relatively new. Second thing that needs to happen is that S3 was not consistent until December of 2020. I think this is the most counterintuitive to most, because most of us just think that it always
Starting point is 00:28:17 has been, but it hasn't. What that means is that if you put an object on S3 and then you read it immediately after, you were, as of December of 2020, guaranteed that it was going to be read after right consistency. The third thing that needed to happen, and this is very informed by the fact that I was on call for Shopify for so long, when you're on call for a long time, you gravitate towards very simple systems, and ideally systems that are constantly tested on their resiliency, so you don't get page when something abnormal happens. And for us, what was very important for us to be on call for a database for Justin and I
Starting point is 00:28:50 was that it only had one dependency. And that dependency could only be one of the most reliable systems on Earth, which is S3, Google Cloud Storage, right? and the other derivatives like Azure Blubstores and so on. They're very, very, very reliable systems. But you could not build the metadata layer on top, right? So snowflake and Databricks and others that have built on top of this in the last generation of new databases needed another metadata layer.
Starting point is 00:29:12 Some consensus layer like FoundationDB or their own Paxes or RAF protocol to essentially enforce the read-after-write consistency, but also to do various metadata operations atomically. But in late 2024, S3 finally announced that reinvent, compare and swap. And what compare and swap allows you to do is to put a file, let's say metadata.com, you download the file, you do some modifications to it, and then you upload it again with a version number, and you only upload it if the version number is the same, right? Basically guaranteeing that you did an atomic and nothing has changed it in the interim. Very important when you're building a distributed systems, right? You can really implement anything on top of that, as long as you're willing to
Starting point is 00:29:54 to take the performance hit of going back and forth to S3. And of course, they have a whole metadata packs us whatever thing to implement that. In GCS, it's Spanner, but I don't have to worry about that, right? That's for them to formally verify and whatever they need to do to uphold those constraints. Those were the three things that needed to happen, right? And that is requirement number one to build a new database. And so that's what was in the air for Turbo Puffer to grab, right? The second thing that you need for a new database is that you need a new workload that's begging for that particular storage architecture, right? So for Snowflake and Databricks, we saw that in, well, we want big-scale analytics.
Starting point is 00:30:29 And it doesn't make sense also for the dollar per gigabyte that we have to pay in operational databases and adding indexes on everything. So there was a new OLAB workload and there was, or at least an acceleration of the OLAB workload with all the successful web applications, and then the new storage architecture on top of S3, but they had to do some more work because these APIs that I just mentioned weren't available and a new workload now is connecting lots of data to to AI and this storage architecture is a very good fit for it yeah so that's great thank you for elaborating about this three changes at s3 but they also made another change recently and they announced as three vectors
Starting point is 00:31:09 right what do you think about this comparing to your system yeah i think that s3 vectors is if you if you writing the vectors once and you don't query them that much and you don't care that much about the query latency, it can be a useful, useful product in the same way that you might be using S3 today. But S3 vectors is not, doesn't do full-text search, right? It doesn't do lots of these features that you need for a serious system. And even S3 vectors recommends that you load into open search. And so for archivalal vectors, this can make sense. So they're still operational, but there's lots of limitations that would make it very difficult for it to go into production systems, right? If you do a quarry to S3 vectors, it takes hundreds and hundreds of milliseconds
Starting point is 00:31:52 to get the result back, whereas the turbopuffer, you can get the result back in less than 10 milliseconds. Thanks to cash, right? Thanks to cash. Yeah. Yeah, that's totally makes sense. I still, I have skeptical questions more of them, if you don't mind. Let's go. Yeah, one of them is your cash layer is on local in the MEs, right? Yeah. But why, like, we could store right there. Plain scale, recently came to post-guid ecosystem and they said let's stop having fears about using local ms and yes it's like ephemeral storage and so on and there is a reliable fellow and and so on like it let's do it super fast yes i agree it's super fast and uh four terabytes for billion or three terabytes for billion vectors probably won't cost too much because this price is usually in
Starting point is 00:32:44 AWS, for example, in the instances which have local in the emis, it's basically included. And the limit is many, many dozens of terabytes of local and emits storage these days on larger instances. So we are fine to store not only vectors, but everything else. So my question is like back from S3 to my SQL, for example, my SQL supports storage engines for many years. Have you considered building storage engine for, for, for, for, for, my SQL, for example, and using local Ndeme. You can't outrun the economics, right? The economics are still the same.
Starting point is 00:33:24 You have to replicate, whether it's local NVME or not, which is not necessarily cheaper, maybe only marginally than a network volume. You still have to replicate it three times, right? You still have to put it on three machines. You still need all the cores and memory and so on that you would need on the primary to keep up, unless you start to have a heterogeneous replica stack, which for a variety of reasons would be a really bad idea. So you're still paying for all of that. Now, up to a certain point, that makes a lot of sense, right?
Starting point is 00:33:53 If I have a customer who gets on a call with our sales team and they have a couple million vectors in PG vector, there is no reason to move off of it. That is perfect. You should not be taking on the complexity of ETLs and so on. But if you have like tens of terabytes of vector data, it is not a cannot. for a lot of businesses. Now, for some businesses, it are, right? But the art of a business is to earn a return on the underlying costs. And for some business, it's very, very challenging to earn a return on storing this on three replicas, this vector data. It's generally not valuable enough to the business. So turbopfer doesn't make sense to, as a storage engine into into a MySQL or into a Postgres. It's just fundamentally not compatible with the way that we do things outside of replica chains, right? You could maybe come up with a storage engine where, you page everything into s3 and all of that but you're you're you're now trying to build a new database inside of an extremely old database right the storage layer is it's too especially um more
Starting point is 00:34:52 so in postquest i think than in my sequel is very very weaved into into how the how the query planner works and all of that so at that point you're rebuilding the query planner to be very round-trip sensitive you're rebuilding a storage layer it it doesn't make sense anymore it's a completely different database okay another skeptical question i i i'll I go to TurboPuffer website, and by the way, great design, very popular these days with monospace font and like these types of graphic. And you're advertised for one billion scale, one billion vectors for one billion documents. But there, if I check one billion, the concept of namespaces pops up.
Starting point is 00:35:32 And namespaces, if I choose one billion, if I choose 100 million, it's okay. I can't have one namespace. but if it's one billion there is a warning that probably you should split it to 10 namespaces and this means 10 different indexes right that's right yeah so it's not actual one billion scale or like something is off here in my mind like one billion it's a single index should be be index, single index. If we talk about one billion but divided by 10 collections and indexes also, like it's already a different story. And so can you can explain this to me? I saw this in the beginning and I'm happy in this position I can ask the founder himself about
Starting point is 00:36:26 this. Yeah, we try to be extremely transparent in our limits, right? And I think your your mental model is correct. Before this, we just had a limit that said that we could do in a single name space around 250 million vectors, right? But I mean, even that's, even that is a simplification because how big of the vectors? If there are 128 dimensions, they, they are a lot less space, which is ultimately what matters here, which is why we also put the gigabytes. So when we in the past, we had just 250 million vectors on there as a limit, people came to us and said, or we knew that people weren't testing with us because they wanted to sort of search a billion. at once. And they didn't realize that you could just do ID modulus N, and then you could do a
Starting point is 00:37:08 billion. And it would be very economical for them. So we sort of had to, you know, put in the docks like, yes, you can do a billion at once, but you have to shard it. Now, I would love to handle that sharding for you, right? I mean, that's what Elasticsearch does. And it's what a lot of databases do, because the only way to scale any database is sharding, right? You don't get around it. The question is where the complexity lives, right? Does the complexity live inside of the database to handle it for you, like some of the most sophisticated sharding you will find lives inside of cockroach and spanner in these and these kinds of systems. And the simplest type of sharding is what we're exposing, where every single shard is just a directory on S3, and you can put as many of them
Starting point is 00:37:48 as you want, and you can query as many of them as you want. Now, over time, of course, we need to create an orchestration layer on top of that so that a logical namespace to the user is actually multiple namespaces underneath. But we're challenging ourselves to make every individual namespace as large as it possibly can. When I ran an Elasticsearch cluster or was involved in scaling Elasticsearch clusters, every shard was around 50 gigabytes. That was roughly what was recommended for Elasticsearch. Quite small, right? It's quite small, right? Like on Postgres, it's a lot larger than that. It's basically the size of the machine. But the problem with a small shard size is that for something like a B-tree, right, it's obviously like it's log-end, right,
Starting point is 00:38:29 to the number of searches you have to do. And if you have log n and n is very high, well, that's a great number of operations that you have to do. But for every shard, there's sort of like an M log n. And now M is very high. If the shard size is small and n is small. So you're doing a lot more operations. So we want the shards to be as large as possible.
Starting point is 00:38:48 Because, of course, you can get to a billion by just doing, you know, a thousand, one million charge. But like, that's incredibly competitionally and effective. So we have shards now for some. users that we're testing that do almost a billion documents at once, right? But it requires some careful tuning on our end. So we want to push that number as high as possible. The other thing with namespaces is that TurboPuffer is a multi-tenant system and we have to bin pack these namespaces. So if a namespace is five
Starting point is 00:39:16 terabytes large, it's much harder for us to bin pack bin pack on node sizes that makes sense. So we have to walk some threshold there where we try to make the charge as small as possible from the logical data. We have the bin pack and then we have to index by without slowing the indexing down because the larger an a and n index gets the harder it is to maintain the index so those are the constraints we walk so over time we update these limits continuously you will see that shark size increase but you're not going to find anyone who is doing a hundred billion vectors in a single index on a single machine aka m is one and n is a hundred billion but we want to get as high as possible because that is the most competitionally ineffective
Starting point is 00:39:52 that's how we get our users the best prices and we see the most ambitious use cases and and the shards are these actual shards or partitions so it's like shards meaning that a different compute nodes behind the scenes are used if it's 10 namespaces it's 10 shards then different compute nodes so um a namespace for us and this is also why we've chosen a different a different name for it right a namespace to us is a directory on s3 and which compute node that goes to is essentially just a compute consistent hash of the of the of the of the name of the prefix and the and the user or ID, right? So compute node one could hash to maybe, you know, 100,000 different namespaces, and they share the compute. And then we scale the compute, right, with that. And that's why the bin
Starting point is 00:40:38 packing problem comes in. It's flexible, I see. Yes. And that's also part of I we can do get really good economics. Yeah, I wanted to shout out again with the like front page is beautiful. It explains like basics, architecture and pricing right here, latencies right here and case studies this is like how it should be for for engineers you know so you quickly see all the numbers and understand that's great thank you for transparency that's great i think it's it just comes from the fact that i was on that you know your side at the table my whole career right i've been buying databases evaluating databases and sometimes i swear i end up on a database website and i don't know if they're marketing a sneaker or a database yeah go figure out
Starting point is 00:41:26 storage or compute price in 8able, yes or GCP. It's like always a task. So we really wanted to just put the foot forward of like the diagram is right there. Okay, this is my mental model. This is what it does. This is what it costs. This is how fast it is. These are the limits.
Starting point is 00:41:43 These are the customers and how they use it. You go to the documentation. We talk about guarantees. We talk about tradeoffs. The kinds of things that I always look for immediately to slot it into my model. Because I don't want you to use our database at all costs. I want you to use it if it makes sense for you. I want it to be a competitive advantage to you.
Starting point is 00:42:00 I want to save you like millions of dollars a year. And in order for that to make sense, you have to just put all the bullshit aside and make it very clear what you're good at and what you're not good at. What about, for example, if I, I know there is vector search and full tech search supported, right? But what about additional filtering when we have some dimensions when we want to filter on them?
Starting point is 00:42:20 Is it possible? Yes. For example, some categories or like, time ranges or something. It's all possible, right, in both types of search in double five. I'll go back to my line before of every successful database eventually supports every query, right? And we're, I don't know, I like maybe 5% of the way there. So we support filters. And filtering on vector search is actually very difficult, right? Because if you do something like, let's say I'm searching for banana and it's an e-commerce site and I have to
Starting point is 00:42:50 filter for ships to Canada, well, that might cut off a fruit cluster, which is actually where I want it to be. And then like, you know, I get to the banana on a t-shirt cluster, but it's really far away. So you have to do very sophisticated filtering to get good results. I don't know what PG vector does, but I know that our query planner is recall aware. It does a lot of statistics and distribution of where the vectors are to ensure that we have high recall. And for a percentage of quarries in production, we check against an exhausted search what the recall is, the accuracy of the index. I don't know if PG vector does this, but I know it's been a monumental effort for us a big headache there. It's a big headache. Actually, it's a good point that additional
Starting point is 00:43:29 filtering can change semantics of the search. If it's searching for bananas, size equals nano, it's very different. One thing they've implemented relatively recently is iterative index scans. So I think they had a problem where, for example, you'd put a limit, you'd do a similarity search with a limit and a filter, and you'd ask for 100 results, and you'd get back fewer, even though they had a problem so they go back to the index and request more so yeah it's it's like a workaround i'd say more than a solution but it's pretty effective for a lot of use cases i think the problem there is that i would be very skeptical of the recall in a case like that right because okay so you're just like you got 100 but you probably should have looked at a lot more data to know what
Starting point is 00:44:15 the true top k was so you know we that solution was not acceptable to us like we we had to go much much further to ensure high recall. And I think the problem with a lot of these solutions is that people are not evaluating the recall because it's actually very difficult to do. And it's very competitionally expensive. You're not going to have your production Postgres run an exhaustive search on millions of vectors.
Starting point is 00:44:36 Like it's too dangerous to do that where you're doing your transactional workloads. But again, if you're hitting these kinds of problems, then it may be time to consider maybe for a subset of your workloads, whether something more specialized makes sense, whether the tradeoffs makes sense to you. But I think with the pre-filtering and post-filtering, It can be, it's very challenging to create a query planner that can do that well.
Starting point is 00:44:54 So we support all the queries, right? You can do and you can do range queries. You can do exact queries. You can operate with arrays. You can do all types of queries, set intersections that people use for the permissions. We can do on we can do full text search queries. We can do group buys. We can do simple aggregates.
Starting point is 00:45:11 So we can do more and more queries. And we're constantly expanding that with, with what our users demand from the system that we built. One more. Another question. I know it's not open source. There is no free version. at all, even. I'm pretty sure it was
Starting point is 00:45:24 the decision not to go with open core or open source model at all and can you explain why? Yeah, I think there's never been any particular resistance to open source. I mean, I love open source. The reason we're not open source
Starting point is 00:45:38 is because open source is if you want to do it well, it's a lot of effort. And it's also a lot of effort to build a company. It's a lot of effort to find product market fit. And we decided to pour all of our energy into that. And it's similar argument for the minimum.
Starting point is 00:45:54 It's really the only thing that a startup has over, you know, the big incumbents is focus. And so our focus has been on the customers that are willing to pay. And one day I would love to give a free tier. I would love for people's blogs to run on TurboPuffer. But for now, I'm afraid that we have to prioritize the customers that are willing to put some money behind it. It's not because we have provisioned hardware, anything like that. it's really just that we need to know that you're willing to put some money behind it so we can give you amazing support right a lot of the times there'll be engineers working on the database who are supporting your tickets
Starting point is 00:46:28 and we can't do that with a free tier yeah makes sense yeah and i we michael and i have different points of you on this area i think that was a good answer i'm on your side simon i had a question on the full text search and semantic search we had a discussion a while back about hybrid we called I think it's generally called hybrid search. Are you seeing much of that where people are wanting to kind of, yeah, mix the results and order between them? Yeah, we do. And what we see is that the embedding models are pretty phenomenal at finding good results for subjects that they know about, right? And terms that they know about.
Starting point is 00:47:09 You know, you run a project called PG Mustard, right? And maybe the embedding model doesn't know. I think it's popular enough that it will know what it is, but let's say it didn't know. And then it puts it in the cluster close to ketchup and the results are just horrible and the text the full text search tends to be really good at this like recall for things that the embedding model can't know about if you're searching for an algae, you know these like TV skews. It's like you know it's indecipherable. Actually the embedding models are actually quite good at these because they happen enough on the web. But imagine that it wasn't these kinds of searches it's good for. The other thing is searches you type. So if I'm searching for SI, the embedding model might find that well he's searching for something agreeable in Spanish right but really what I wanted was the document that started with Simon so sometimes you just need to complement these twos and I think that that's why we've doubled down on the full-text search implementation in TurboPuffer to do this hybrid but
Starting point is 00:48:04 people can get a lot of mileage of embedding models alone wonderful yeah thank you so much yeah it was super interesting good luck with your system and yeah I think it's an interesting idea, I guess, for PostGus users who need, as you said, like to store more and have all characteristics, TurboPaffer have, maybe it's like worth considering, right, to keep all LTP data, all TPP workloads and data in regular Postgres while moving vectors to TurboPaffer, right? I mean, it's just similar to a lot, you know, people have taken workloads out of Postgres, you know, it's like updating a posting list.
Starting point is 00:48:46 It's a very expensive operational and the transactional store. And it's similar in A&N decks, updating that with the same kind of acid semantics that Postgres upholds. It's very expensive. And I've ripped full-text search out of Postgres many times because it's very, very expensive to do. So we do that because we don't want to shard Postgres because it's like a lot of work. And so we start by moving some of these workloads out first. And search is one of the early ones to go. So your point is that it probably postpones the moment when you need to shut.
Starting point is 00:49:15 that's interesting same same reason same reason from mcash and redis right you you you you you separate out these workloads as soon as possible to avoid sharding mm-hmm mm-hmm interesting idea okay again thank you so much for coming it was a super interesting discussion i enjoyed the lot thank you so much yeah really nice to meet you thanks for joining us take care
