The Infra Pod - Building a serverless full-text + vector search on S3! Chat with Simon from Turbopuffer
Episode Date: April 7, 2025

In this episode of Infra Pod, Ian and Tim chat with Simon, the CEO of Turbopuffer, about his journey from working at Shopify to building the fastest vector store known to man. They delve into the intricacies of scaling infrastructure, the challenges with traditional databases like Postgres and MySQL, and why Simon believes in building databases on top of object storage. They also explore the trade-offs Turbopuffer has made, including its unique approach to consistency and high write latency.

00:36 Scaling Shopify's Infrastructure
01:56 Challenges with Postgres
02:19 Building Turbopuffer
16:30 Trade-offs and Performance of Turbopuffer
22:32 Importance of Consistency in Databases
26:29 BYOC (Bring Your Own Cloud) Option
32:03 Customer Use Cases and Integration
35:11 Spicy Future!
Transcript
Welcome back to InfraPod.
This is Tim from Essence and let's go.
This is Ian, lover and builder of things that are secure and developer friendly.
I'm super excited Tim.
We've got Simon, CEO of TurboPuffer, the fastest vector store known to man, apparently, on the
podcast today.
Simon, why don't you introduce yourself and tell us a little bit about how you got
into building TurboPuffer and why?
Yeah, absolutely. It goes way back. Ian, you
and I were just reminiscing about being two Canadians. I
came to Canada back in 2013 to work at what was at the time a
little e-commerce startup called Shopify. And I spent almost a
decade there working on infrastructure, even before
there was much of an infrastructure team at all.
So basically just played bottleneck whack-a-mole for almost a decade, scaling
that platform as the Kardashians rolled through and sold merchandise, challenging
the platform on trial accounts and what have you.
So I spent a long time doing that.
When I joined, it was probably a couple hundred,
maybe a thousand requests per second.
And when I left, we had events in excess
of a million requests per second
and worked on more or less every single part of the stack.
Of course, the fundamental bottleneck
of most SaaS platforms like that is the database.
So I spent most of my time working somewhere
between the Rails level and MySQL,
building abstractions, sharding, moving shops around, balancing shards, like multi data
center, all of this kind of stuff. At some point we rewrote the entire storefront for performance,
for geographical distribution of all the shops and all this kind of stuff. So just used NGINX,
Redis and MySQL as my weapon for almost 10 years.
I left in 2021 and spent a couple of years kind of bouncing around, helping my
friends at their startups with various infrastructure scalability problems or
anything that they found interesting.
Turns out the biggest infrastructure challenge in 2021, 2022 was tuning Postgres
autovacuum. And I was led to believe from the Orange site
that Postgres was this holy grail
after spending a decade with MySQL.
But it turns out it's extremely difficult to tune.
And so, yeah, I spent a bunch of time doing that
and working with a bunch of other companies
on various scalability issues.
And that's where I discovered, I guess,
rediscovered that search is still difficult.
And there were sort of three separate events happening
around 2020 to 2022 that all made it seem like a good time
to maybe sit down and rethink how you might do
a search engine.
So one company I was working with,
actually also a Canadian founder,
is this company called Readwise.
And essentially what they allow you to do
is when you're reading on your Kindle,
then it will import the highlights,
you can do daily reviews,
and now they have a reader product as well
that allows you to save articles, PDFs, read them,
interact with AI with them, and all these kinds of things.
And of course, naturally for something like that,
you might want to have recommendations
for the type of content that you wanna see.
And so this company was spending maybe five grand a month
on their Postgres, storing hundreds of millions of articles
and all of this data for this application.
And so we started doing some experiments
when I was doing a little stint there
on doing recommendations.
And so we did some vector search, some embedding,
some chunking, and I was like, okay,
this actually, this is looking pretty cool
and it's pretty interesting. Let's run the back
of the envelope on doing this on some of the incumbent solutions
at the time. And it would have cost 25 to 30 grand a month. At
a big company, this is nothing. But for a company that's
spending five grand a month on Postgres, it's just, I feel
sort of whack that you're spending half an order of
magnitude more on the search, right, rather than the
canonical storage.
So we kind of ditched the idea and put it in the bucket of,
well, we'll return to this later when it gets cheaper.
Token costs are coming down, LLM costs are coming down.
Surely someone's gonna bring the vector search cost down.
Then I was traveling with my wife for a bit,
and I just couldn't stop thinking about this.
So I just started, I was just reading papers
on how does vector indexing work?
How does all of this work?
Because what seems so exciting to me about vector indexing
was that at Shopify and elsewhere
that I've worked on search,
it's always been this problem of how do we take these strings
and turn them into things?
I'm searching for red shoe
and they have a burgundy sneaker, right?
That used to be kind of a PhD level search problem
to do ergonomically across lots of different queries.
And now any off the shelf embedding model
would just do incredibly well at that.
So we were kind of moving from this world of
you're gonna write a thousand lines
of like very well massaged JSON to express
to Elasticsearch or OpenSearch or Solr
some of these Lucene based solutions,
how we're gonna turn all these strings into actual things.
But with vectors, it was just so simple. It was like, oh my God, these things that used to be
incredibly difficult, where we had to train thesauruses
to map burgundy and red,
and maybe you would even skip this part.
Suddenly it was just very easy.
So that was the first thing that seemed worth paying
attention to was that search maybe seemed due for a little
bit of a revamp because it was now easier
to do semantic search.
The second thing that happened, right,
is that lots of companies around this time
wanted to start connecting their LLMs to their data.
And that's always been interesting
because semantic search in itself
has never really had a super large scale outcome
other than Elasticsearch.
There's never been a ton of players,
but suddenly lots of people wanted to search
a lot more data.
And more importantly, now machines could do it, right?
A good database needs to be something
that's being queried by machines as well,
otherwise you're limited by the number of humans
that do searches.
And even Google is only doing tens of thousands
of searches per second.
Very few companies are doing much more than that,
including Shopify.
And then the third thing that had happened
is that the way that we can build databases today
is different.
There's like a series of events leading up to the architecture
that's working really well for TurboPuffer
and for a bunch of other types of databases.
But NVMe SSDs have gotten incredibly fast.
They came out around 2018 in AWS.
Of course, they've been around for longer,
but just absolutely phenomenal price performance and storage density.
The second thing that happened was that in 2020, S3 finally became consistent, which
was launched at re:Invent back then.
It's hard to even think about that now, but that actually had to happen, and it happened in
2020.
The other thing that needed to happen, which happened at the end of 2024 at that year's
re:Invent, was that S3 finally got compare-and-swap. And with those three dependencies, it means that you can build online databases with
object storage as the only dependency, just a bunch of binaries running and using object storage.
And so those three things, one, like being easier to do search, LLMs being connected to a lot of data
and then fundamentally being able to build databases with only
object storage as a dependency, seemed like maybe this is a
good time to write a search engine of the future.
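To make the compare-and-swap point concrete, here is a minimal sketch of how a conditional PUT on object storage can act as a database's commit point. The `ObjectStore` methods (`get`, `put`, `put_if_version`) are assumed interfaces for illustration, not a real client library and not TurboPuffer's actual code; S3's real mechanism is conditional writes keyed on ETags.

```python
import json
import time
import uuid

def append_write(store, namespace: str, batch: bytes) -> None:
    """Commit a batch of writes using only object storage as the dependency."""
    manifest_key = f"{namespace}/manifest.json"
    while True:
        raw, version = store.get(manifest_key)            # read current manifest + its version
        manifest = json.loads(raw)

        # Write the new log segment first. It is unreferenced until the swap
        # below succeeds, so losing the race leaves garbage, never corruption.
        segment_key = f"{namespace}/wal/{uuid.uuid4()}.bin"
        store.put(segment_key, batch)

        manifest["wal"].append(segment_key)
        new_raw = json.dumps(manifest).encode()
        if store.put_if_version(manifest_key, new_raw, version):  # compare-and-swap
            return                                         # committed: durable in object storage
        time.sleep(0.05)                                   # another writer won; retry on fresh state
```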
I mean, that's pretty compelling. And there's quite the story.
One of the things you sort of started at is you started at
Readwise, you're thinking about like the order of magnitude
around, like how much more expensive it was, the Postgres
vacuum compaction.
What is it about things like Elastic that make that the case? And Postgres, Postgres has things
like pgvector, it has like full-text search.
Like Postgres has this ecosystem and the whole idea or what the PG ecosystem has been
selling people for a very long time is there's a wire protocol and then there's like a storage
thing and then it's all extensible and you'll have like one database to rule the world. And so it sounds like you have a very specific
take on why that is that's not the case and what the problems with Postgres specifically are.
I have my own view on Postgres when it comes to like vertical scaling, when you get to like large
data sizes and when things fall apart specifically with the way that the write ahead log works and
like if you have high-velocity edited rows, it turns into like this massive vacuum
compaction problem. I'm curious, from your perspective,
what was it about Postgres that you were looking at
that made it cost so much and why it was like,
hey, this is actually the wrong tool
for this semantic search problem set in the first place?
Because I think everyone sits back,
and if you were to ask,
the common answer from the average developer
who isn't deep in the space like you are,
is probably like, oh, I can just use Postgres for vectors or semantic search.
What are the things you basically run up against, and when does it start to break?
Yeah, to be clear, the 25k to 30 grand a month was not PG Vector.
Back in 2022, when we were looking at that,
PG Vector just would not be able to handle that 100 million plus scale that we were looking at.
So that number is quoted from the cost calculator from what is now some of my colleagues in the space, right?
So let's talk a little bit about the fundamentals
and kind of the first principle observation that leads you
to build something like TurboPuffer.
I think my general belief, if I was on the SaaS side,
not building a database, but building a company,
is that I would also, pardon my French,
but abuse the shit out of MySQL slash Postgres, the canonical OLTP store, for as long as possible.
In fact, we're still doing that at TurboPuffer.
We just have a big like MySQL that we abuse for billing and analytics until TurboPuffer
can do that kind of thing.
And so I'm very much in favor of that.
But there comes a tipping point where, despite the complexity and overhead of adopting another database and ETLing into it, it ends up being the
right choice.
One of the first things that people rip out of Postgres as your company grows
is full-text search.
Um, the reason for that, and Ian, you're nodding and probably you've seen this
as well, is that fundamentally updating an inverted index, right,
The index that you build
for full-text search, is a pretty expensive operation to do under the tight transactional
guarantees that Postgres provides. So you can get pretty far with that, but you can't get into,
you know, storing many terabytes of full-text search data in Postgres without the complexity
of dealing with all of that, and the performance starts tipping the scales
towards another solution.
And that's why people adopt OpenSearch or Elasticsearch
or these other Lucene based solutions traditionally.
That and the concurrency of implementing a queue
on Postgres are typically the first two things
that you extract out to avoid having to shard
the Postgres instance.
So I think that's the first one.
And there are just various workloads like that.
Of course, the third thing that you would adopt out, other than queues and search,
is various types of caching, right?
So that's why you have a memcache or Redis or something like that.
But you kind of start with the big monolithic thing
and then you rip it out as the economics and the complexity justifies it.
So the other thing is cost.
Fundamentally, with something like Postgres,
you just need to store every bit on three disks, right?
A replication chain of two is a little gnarly.
I think very few companies have good enough backup mechanisms
that they're comfortable running with that.
So you're going to run on three replicas.
One disk, whether it's an EBS or a PD,
depending on your cloud provider,
costs somewhere between 10 and 12 cents per gigabyte.
You're not going to run these provisioned disks
at 100% capacity.
You're going to run them at closer to 50%.
So in reality, you're probably paying
somewhere between 20 and 24 cents per gigabyte
of data per disk.
But you don't have one disk.
You have three disks, right?
So it's three
times 20 cents or 24 cents. And so you just get close to 66 cents per gigabyte of data stored.
And of course, when you put in one byte, more than one byte is actually stored on the disk. You have
to build all these indexes and various other derivative data structures. So there's some
space amplification involved as well.
So every logical gigabyte,
it might turn into maybe 1.5 or something like that,
depending on how many indexes that you're building
and how compressible the content is.
So fundamentally, this just means that
if you're using any SaaS vendor,
they're gonna charge you close to $1 per gigabyte
to make a little bit of money
and to have a margin for the operational overhead
and all of that.
And that's on disk, right?
But generally, if you're churning really hard on this
in Postgres, which you need for something like vectors
and a lot of search, you're storing a lot of that in memory.
And memory is closer to one to $2 per gigabyte.
So that's the legacy cost structure.
And that's what the traditional search engines also do,
replicating this around.
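As a quick back-of-the-envelope in code, using the rough numbers above (illustrative assumptions, not exact cloud prices):

```python
# Triple-replicated provisioned disks at ~50% utilization, plus index/space
# amplification, per the reasoning above.
disk_cost_per_gb = 0.11      # $/GB-month, roughly EBS/PD territory (10-12 cents)
utilization = 0.50           # provisioned disks rarely run full
replicas = 3
space_amplification = 1.5    # indexes and other derived structures

effective = disk_cost_per_gb / utilization * replicas * space_amplification
print(f"~${effective:.2f} per logical GB-month on disk")  # ~$0.99
# ...which is why SaaS vendors end up charging around $1/GB on disk,
# and memory-heavy deployments land closer to $1-2/GB.
```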
So I approach it from the angle of,
well, what's the fundamentally cheapest way
that you can store data in a responsible way in the cloud,
given the primitives we have today?
Well, it's just to put all that data on object storage.
Object storage is two cents per gigabyte,
lower than that at lower storage classes.
Of course, that's quite slow.
Like a cold query on something like TurboPuffer to a vector index with a million vectors is maybe
around half a second when you've really optimized it. But at the end of the day every round trip is
going to have a p90 of around 200 milliseconds. You have to design around that. But then you can
pull the data that's actually warm onto an NVMe SSD. And most people don't even use NVMe SSDs for their Postgres
because if you restart the instance, often that's gone and it's operationally
just a little too scary unless you have a very mature operational
environment. Those NVMe SSDs are somewhere around 10 cents per gigabyte.
If you run them on a spot instance or commit to some usage, it can get all the way
down to 3 to 4 cents per gigabyte.
And you can utilize these disks at 100% utilization
with a multi-tenancy cache, right?
So you're not paying anything extra.
And then, of course, you can leverage your own buffer
pools in memory to also have 100% utilization
on the caching here.
And now you have this nice pufferfish architecture, right?
Where you can inflate the pufferfish as much as you want to get the performance you need.
But if you're not querying the data, which most companies have some Pareto distribution
of what data is actually being cached, it just stays in object storage.
And even if you do query it once a month, it's half a second.
And that's often acceptable.
So for searching a lot of data, this is a phenomenal architecture because search can
often tolerate a tail latency of half a second,
and then once it's in cache, after loading into the cache
at a gigabyte or whatever per second,
there's no reason why we can't be as fast as any other solution.
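A minimal sketch of that "pufferfish" read path, with memory, NVMe, and object storage as assumed cache/storage interfaces (this is illustrative, not TurboPuffer's actual internals):

```python
def read_block(key: str, memory_cache, nvme_cache, object_store) -> bytes:
    """Serve hot data from memory/NVMe; only cold data pays the object-storage trip."""
    block = memory_cache.get(key)           # buffer pool hit: microseconds
    if block is not None:
        return block

    block = nvme_cache.get(key)             # SSD cache hit: sub-millisecond
    if block is None:
        block = object_store.get(key)       # cold: p90 ~200 ms per round trip
        nvme_cache.put(key, block)          # hydrate the NVMe cache

    memory_cache.put(key, block)            # hydrate the in-memory cache
    return block
```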
Fascinating.
So we talked to the WarpStream team on our pod before,
and probably one of the most memorable things for me
was really about the trade-offs of actually building a system on top of S3
Because I'm sure you're not the very first one,
but it's probably the most recent memorable one that has been talked about a lot, right, Kafka with S3.
There were a bunch of things to consider, a bunch of things people hadn't really
thought of, like why building on S3 is even possible, and also why it's hard. And
I'm reading through sort of the limitations and trade-offs
sections of your docs.
And it's just super fascinating to me that
I think it's actually intuitively understandable now,
given the S3 nature, you have to choose certain things.
But I want to hear from you, right?
What are the trade-offs you guys are willing to take
or choosing to take?
Because I feel like this is not just an S3 trade-off here.
There are other trade-offs you're willing to kind of choose,
like configurable performances, open source or free tier,
like things you don't want to do at all.
So give us sort of like the belief here.
Why does TurboPuffer want to be known for these sorts of performance characteristics and these sorts of limitations?
And how does that fit finding the right customers for you as well?
Yeah, I wish every database had a trade-off section like we do in our docs.
I've tried to make our website feel more akin to a specification sheet than a marketing website.
Because I think it resonates with
the type of engineer.
Every time I go to a database website, there are just things I want to know, like, what
are you giving up?
What are you gaining?
Everyone's making fundamental trade-offs.
And as you sit in that five-dimensional set of trade-offs, every database sits a little
bit differently, right?
Our biggest trade-off is high write latency.
Every time you do a write and we return back
a successful response from the API,
that's committed to object storage.
Guarantees don't really get much stronger than that, right?
That's the fsync of 2025: committed to object storage.
But the latency of that can be a couple hundred milliseconds
depending on how much data you're writing
and what mood that S3 partition is in.
And that's the biggest trade-off with TurboPuffer.
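As a rough illustration of designing around that write latency, applications typically batch documents so one object-storage commit is amortized over many rows. The `client.upsert` call here is a hypothetical client interface for the sketch, not necessarily the actual API:

```python
def ingest(client, namespace: str, docs, batch_size: int = 10_000) -> None:
    """Amortize the few-hundred-ms commit latency over large batches."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            client.upsert(namespace, batch)   # returns only once durable in object storage
            batch.clear()
    if batch:
        client.upsert(namespace, batch)       # flush the final partial batch
```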
The other trade-off is that we do have tail latency, right?
Because not all the data is sitting on disks
and replicated disks at all times.
It would be very unusual to have a tail latency of 500
milliseconds on an actual persistent disk or EBS volume.
But with object storage, of course, you will have that.
So if you can tolerate that once in a while or you have some heuristic you can use in your application,
like pre-firing a query, which is, for example, what Notion will do, pre-fire a query to warm up the cache
and then the user is not waiting for as long because the cache is hydrated.
Those are the biggest trade-offs.
A third trade-off, also on the latency front,
is that by default, TurboPuffer is consistent.
It's a very unusual choice for a search engine.
I don't know of any other search engine that
is consistent out of the box.
And by consistent, what we mean is that if you do a write,
you insert Tim and Ian into the database,
and you do a query immediately after that write has returned,
it will be visible.
Most search engines don't do that.
They will refresh the inverted index on some periodic interval,
like a second or 30 seconds or whatever.
But we always return it immediately.
And in order to do that for a database
that only has
object storage as the dependency,
with no other dependency, no consensus plane,
no control plane, such as what WarpStream has,
is that we have to go to object storage
and make sure that what we have in the local cache
is the most up-to-date version of the index.
And that round trip on GCS, the P50 is about 16 milliseconds.
I know that P50 by heart because that is our P50.
And if you look at our traces, it's
maybe a millisecond of vector search time
and 16 milliseconds of waiting to make sure
that the cache was consistent so we return
a consistent response.
On S3, this latency is more like 7 to 8 milliseconds,
a little bit better.
I think that this will come down over time.
It's essentially that Spanner floor latency that sits
in front of Google Cloud Storage.
You can turn that off, and we can give you still very, very
strong guarantees.
Probably 99.999% of the time, you're
going to get a consistent read.
But I feel very strongly that you
should design a database for consistency
first by nature, because it's very hard to walk that back.
Those are the trade-offs as I see them.
The biggest ones.
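A small sketch of why a consistent read costs one object-storage round trip even on a warm cache; `store`, `cache`, and `build_index` are assumed helpers for illustration, not TurboPuffer internals:

```python
def consistent_query(store, cache, namespace: str, query):
    """Confirm the cached index is the newest committed version before answering."""
    manifest_key = f"{namespace}/manifest.json"
    latest = store.head(manifest_key)            # ~7-16 ms p50: the consistency round trip
    entry = cache.get(namespace)

    if entry is None or entry["version"] != latest:
        raw, latest = store.get(manifest_key)    # refresh from object storage
        entry = {"version": latest, "index": build_index(raw)}  # build_index: assumed helper
        cache.put(namespace, entry)

    return entry["index"].search(query)          # the actual search: ~1 ms
```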
I'm really curious, like to dig into the consistency thing for a second,
because, you know, broadly in our market, like we go to the top level
of the database space, you kind of have like operational data stores,
you know, ACID-compliant transactional stores, and you've got sort of like
OLAP analytical stores, and these have been in these different worlds.
And you have the like magical HTAP thing
that people have talked about, like the folks from CedarDB.
I'm like very curious.
It makes sense to me that oftentimes
people have thought of like full-text search.
So things you put in Elasticsearch
is like an analytics workload,
but vector stores, like in the use case of an LLM,
I'm building a chatbot, I'm doing something.
And especially if you're building like an app where you kind of have your ACID-
compliant, you know, store, then you have your specific thing like a TurboPuffer, and it's like,
well, actually when I talk to the agent and I update the agent with some state, that agent,
however it's getting that state, they ultimately want to like return to the user a consistent
experience from my perspective.
I would love to hear your perspective.
A lot of use cases on full-text search have been like long-term analytics workloads where it's like, it's okay if
it's out of date by a minute, 10 minutes, an hour, a day, 30 days, depending upon where you were
applying it. And so do you think this is like a net new thing that's unique to the application
of building like basically AI apps, where you're doing a lot of RAG that needs to be up to date,
needs to have very up-to-date data, so that you're actually returning
a consistent experience to the user.
What's driving your view on why consistency is really important?
Because the way you describe it sounds more like your view is
we're actually building a semantic search database
for a world that requires consistency for a world where
actually search is core to the operational experience of the user. Is that true? Because that's what I'm reading into.
I'm kind of curious to get your perspective.
Yeah, there's two reasons why we do the consistency
by default.
The first is that it's just too easy to imagine use cases
where it's very useful to have consistency.
An example in the LLM case, right,
is that someone is uploading a document or something
like that,
maybe a huge document or indexing an entire Google Drive or whatever into TurboPuffer.
How is the user supposed to know that that is searchable, other than "I've inserted it into TurboPuffer"?
Otherwise you have to have this whole separate API to do that.
And that's a very common use case. When you open up Cursor,
it will index the entire codebase into TurboPuffer.
And it's very useful for them to know that when all the responses have come back from TurboPuffer,
it is indexed and it is retrievable.
So if you're uploading a PDF and you want to chat with it,
well, you know, when you've uploaded the PDF,
you expect it to be searchable, right?
In the data set of what you're ragging over.
So that's one.
I think also it expands use cases.
So to give you an example from the Shopify days, of course, there's two big search use
cases in Shopify.
There's the shop app, and then there's searching on a storefront.
Searching on a storefront is probably the most common one.
And that's okay to be a little bit delayed.
A user or buyer doesn't really know if the merchant has added a product a minute ago
or 10 seconds ago, but it decreases the
utility of the solution if it's not consistent.
Because imagine, for example, that you have a collection and you define that collection
based on a search query, and you're using those collections as you're managing your
inventory and doing merchandising.
Well, if you can't really rely on, I've inserted the document and it's visible in this collection
as a core primitive of the database,
It limits what you're building and the experience of it.
So it really just boils down to from a product perspective,
it's just too easy to imagine use cases for consistency is really, really useful.
The second comes down to recall.
What we see ourselves as is a broader search engine doing both full-text and vector search.
For vector search in particular, a core metric is recall.
Recall is essentially a measure
of how accurate your index is.
The only way to know which vector is closest
in vector space to another vector for sure
is to exhaustively search through every single vector
in the corpus.
If you have a million vectors of maybe 768 dimensionality, that's around 3
gigabytes, right? You could search 3 gigabytes in probably around 100-200 milliseconds depending on
how fast your machine is, but that's maxing out a machine. So you'll be able to do just a couple
of queries per second there. So what you do instead, instead of exhaustively searching through all of
it, is that you build an approximate index. And that index essentially has a score between zero and one
for how accurate it is.
That accuracy is defined as, okay,
we know that these are the 10 closest vectors
in vector space to this query vector
by exhaustively searching through it
in those 200 milliseconds.
And here is what our faster index got back in one millisecond.
What is the percentage overlap?
And that percentage overlap is what we refer to as recall.
Recall is a very important metric for a vector database.
If your recall is 10%,
it means that you're doing a very poor job
and products built on top of you
are not going to show relevant results.
We generally see among our users
that they like something above 90%.
That's a good balance between performance and accuracy.
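For reference, a small sketch of how recall@k can be measured along these lines; `ann_search` stands in for whatever approximate index is being evaluated:

```python
import numpy as np

def recall_at_k(queries: np.ndarray, corpus: np.ndarray, ann_search, k: int = 10) -> float:
    """Overlap between exhaustive nearest neighbors and the approximate index's answer."""
    hits = 0
    for q in queries:
        # Ground truth by brute force: 1M x 768-dim float32 is ~3 GB, roughly
        # 100-200 ms per query on a fast machine.
        dists = np.linalg.norm(corpus - q, axis=1)
        truth = set(np.argsort(dists)[:k])

        approx = set(ann_search(q, k))           # the fast (~1 ms) approximate answer
        hits += len(truth & approx)
    return hits / (k * len(queries))             # e.g. aim for > 0.90
```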
And so to return back to the consistency point,
recall is a pretty difficult thing to tune for,
especially in the types of approximate near neighbor index
that we've chosen.
And we were all very scared, essentially,
that we would, if we saw recall in production that was below 90%,
just sort of sweep it under the rug as, oh yeah, it's inconsistent.
So probably what just happened was that it was as of that LSN in the WAL on one
node and the other node was evaluating it, it was probably out of date.
It's fine. But when it's consistent, there's a simple model and you know that
that is the accurate number. So you can't explain it away.
The third reason to be consistent is operational.
If you are consistent, it means that every time someone does a write, before you merge
it into the indexes, both the full-text search and the vector index, you have to exhaustively
search and apply these writes onto whatever is in the index.
This is how every database works.
If the indexing falls behind, you start getting errors, right?
Someone probably gets woken up
because the system is not working.
If you're eventually consistent,
and we've seen that with some of the other
newer entrants to this space,
you might have search results
that are hours and hours out of date
because no one's getting woken up
and it becomes a bit of an operational crutch, right?
Things get fixed if people get paged.
That's a lot of pain to induce to ourselves,
but we think we have a real responsibility
to our customers.
So to recap, one, it's easy to imagine use cases
where consistency is useful.
Two, you can't explain away recall
because it's very simple to reason about
that the results are consistent.
And three, operationally, it induces pain directly on us and people get
woken up if the results are not consistent, right?
And so coming from our previous episode with Chris, we're all debating around
BYOC, you know, and seeing that you have a bring-your-own-cloud
option, I was looking at other vector databases.
It seems like most people are offering sort of BYOC options.
I still think not everybody truly believes
in we have to offer BYOC,
or at least have to offer BYOC early, right?
And you're relatively so early in the database journey,
right, but you're choosing to kind of offer BYOC
pretty early right now.
Can you talk about like,
are you seeing that
from customer pull as a major demand?
And also are there any other trade-offs
you're willing to take here?
Because I think folks, you know,
sometimes think, I want to delay it as much as possible
because I want to have the fastest way to do managed,
multi-tenant, so I can have cost sharing and stuff like that.
What's your take on BYOC?
Like, is it something you were forced to do,
or do you actually believe this is probably the right choice
for your customers too?
Yeah, so I'd say that BYOC, and having to do it so early,
comes out of talking to our customers.
And I think that early on, vector databases had that success
because the vectors were sufficiently obfuscated
that even large companies were comfortable putting this data
into relatively small and newer companies
because it's very hard to go back to the origin data.
But as you expand into things like full-text search
and having real customer data and the actual text of it
there, it starts to become a more dicey customer
conversation for our customers to have with their customers
about what are you doing with our data exactly?
In plain text, that starts to get more tricky
before you've built up the type of trust of a GCP,
AWS or Snowflake or others.
I think that for analytical databases,
it can be a little bit easier to sort of separate
the data out, like what are we ETLing out
into Snowflake or something else. With text fundamentally, you're storing the data out, like what are we ETLing out into Snowflake or something else?
With text fundamentally, you're storing the plain text, right?
You're storing like the company secrets right there,
depending on the use case, right?
So that seemed like a fundamental
from first principle reasoning that resonated with me.
But I think BYOC is also a little bit
of a chicken and egg problem, right?
I come from a background of operating
very large scale multi-tenancy for Shopify,
which of course doesn't have a BYOC offering.
And that's what we're really good at.
And operating a big SaaS is something we live and breathe
and BYOC operations is quite different.
So we prefer to have everyone on the SaaS solution, sure.
But there's a chicken and egg here on the trust, right?
You know, Databricks also built up their trust with BYOC and then now I think are really pushing
people towards the serverless platform and they can do that, right? No one's getting scorned in
the public markets because you're trusting your customer data with Databricks anymore. They've
just been around long enough. So I think that's another component to it. A third, of course, is
that you can use your negotiated discounts and things like that
alongside it. That's a pro that we see as well. The fourth reason why we're doing it early is
because we can. And the reason we can is that I was on the last resort pager of Shopify for
six years or so. And that changes how you write software forever. Because you've been paged by
every conceivable system, every conceivable interaction
and you just write software differently. One of the things that we do differently is that
object storage is our only dependency. We're just a bunch of Rust binaries on stateless nodes
talking to object storage. With that model, it's been very easy for us to operationalize
TurboPuffer, right? If a node is about to run out of memory,
we can just put it onto a bigger node very, very quickly,
because there's no state on that box and it seamlessly moves.
We can auto scale very quickly. We can just double the number of nodes and
there's no rebalancing of partitions or tablets as in a traditional database.
It just starts hydrating from object storage. And so,
operationally,
this has just been one of the best systems to run
that I've ever run in my career.
And that gives us the confidence
to also let other people run it.
We do our BYOC in a way that is what I would have bought
when I was on the other side of the fence,
where we're also on call for our customers' clusters.
And we help them operate it as much as we possibly can.
I think that BYOC is still very much in its infancy.
I think most people, if you ask them, what is BYOC?
If you ask 10 people that,
you're going to get 10 different answers.
There are various solutions out there
that will try to package up the billing
and the control plane components of it.
We decided to do that completely on our own
for a variety of reasons.
And I think every flavor is gonna look slightly differently.
But fundamentally, what I do believe is that
because TurboPuffer has these stateless workers
and object storage is reliable and ubiquitous,
we're in a very, very good position for it.
I think complexity is the enemy of a good BYOC solution.
And that's why Ritchie and WarpStream did so well with that model
is because they didn't even use disks as far as I'm concerned, right?
So they're even less stateful than we are. I think it's more difficult once you start having to operate consensus and things like that in your customers' accounts.
That's super interesting.
I mean, let me pose another question I really want to ask you, going back to Postgres as
an example, right?
Because like if you typically think as a developer, it's like, I'm going to like have my giant
Postgres database, I'm going to have my pgvector, and I'm going to have all these add-ons
and like, I'll build all these triggers and all these things.
Like it's all consistent.
And like when I insert a document, when I make an insert, everything's all up to date
and it's all consistent.
Obviously you're outside of that.
You're building on top of an object store.
You've built this incredibly fast, consistent system
for the state that is in the system.
And you provide all these incredible guarantees.
What's the architecture that you see customers using
to deploy you to get data into TurboPuffer?
And what's the relationship between that
and all the different types of data sources?
I'm sure some data sources are cold storage,
they don't change very often.
Then you have other data sources that are high velocity
in terms of the high rate of change.
And how does that impact how TurboPuffer kind of sits
in the broader customer's data ecosystem?
I know you like Cursor as an example as a customer,
and you don't need to tell us anything
about Cursor specifically,
but some of these companies have very different use cases.
I'm kind of curious how TurboPuffer sits
into these different use cases
and how the parameters of those use cases kind of change.
Yeah, I think TurboPuffer right now
is still extremely opinionated in the simplicity
that we provide as a product.
The way we talk about this internally is that
I'm always reminding the team that we're not
in the business of ergonomics yet.
We're still too much in the infancy of our product
to start doing that.
And it's also far too early for us to start bundling
integrations and others.
It moves and dilutes the focus of me and Justin,
my co-founder and CTO,
and the rest of the engineering team.
Our customers are extremely smart, extremely capable, and one of the things that most companies
have an idiosyncratic version of is some kind of ETL pipeline.
That's not a business I'm particularly interested in getting into.
I just want to make sure that we plug in really nicely with that and they can pump in hundreds
of thousands of vectors per second, which is what's possible with TurboPuffer, right?
We store more than 100 billion vectors and we do almost half a million vector writes per second.
How you do that, we don't really get into the business of. You have to create your own embeddings.
You have to do all of that.
We're completely focused at this point in time on just creating a phenomenal
first stage retrieval search engine.
We are not in the business of running rerankers, running on GPUs, creating embeddings,
like helping you ingest the data and all of that.
We think that the right people adopting it right now are okay with just a sharp tool, not a framework that's trying to do everything.
I think that AI has had a lot of that and like a land grab of like,
we got to own the workflow.
No, we just want to do a really good job of building a dumb, cheap,
very simple solution that can take hundreds of billions of documents and full-text search
and vector search them. And I think TurboPuffer is one of the best solutions for that type of scale
for exactly that. But we don't have a simple language thing you just drop into Spark and it
just pumps things in. But generally, if you talk to the teams and our customers, that's not where
their pain is, right? Their pain is that they want to search across way more data than they can right
now, but they can't 10x their bill on OpenSearch.
They want to search through like just enormous amounts of data and connect it to
LLMs and connect them with their customers, and just nothing else works for the
economics of that scale.
There are very few companies in the world that can earn a return on storing full
text search and vector indexes in memory. Very, very few. But there's lots of companies that
can ship useful product, searching billions and hundreds of billions of documents from
their customers at a reasonable price.
Awesome. Well, here's our favorite section of this podcast; we call it the spicy future.
I think you already kind of talked at some level about your beliefs around S3 systems and stuff, but
we'll leave it open for you.
What is a spicy hot take, or just any take that you believe in, that you think the world
may not have fully believed in yet?
I think there's a lot of excitement right now
about building databases on top of object storage,
but I don't think that there's a ton more databases
to be built on top of object storage
as simply as you can do with WarpStream and TurboPuffer.
I think that search and streaming are useful.
Time series and OLAP, we've done it for a very long time.
And I think the nascent fifth category that we see a lot of are these hyper specific databases within companies that they write for a very singular workload.
And I have friends inside of companies where they just, OK, I just need to store this exactly thing.
I know exactly how to compact it.
It's going to be way simpler, way cheaper for us to just do this ourselves.
And I think that's a phenomenal thing that's unlocked from it.
I think there is a lot of hype around building databases on object storage, but I
don't think there's a ton more categories here. I would love to be proven wrong.
But that's my general take on this category.
That's super interesting.
I feel like I don't think the world even knows what is possible,
not possible, right or wrong. It just seems like, hey, everything should be S3 backed,
right? Everything should be sort of serverless, like all these terms are just being thrown
out there. And is there a particular type of example, like a database you saw, you don't
have to name any names, but like a type or nature of database being built with S3 right now,
that you feel is not going to go well?
I think I'm not calling any names and it's not something I've spent an enormous amount
of time thinking about.
But again, if we go back to the scar tissue of being on call for databases for as long
as I have, I think that there are certain databases where it's too hard for me to reason
about what's going on at 3 a.m. that I feel good about operating it.
That may be a bit of Stockholm syndrome to that database because maybe these big distributed
databases that are more complicated actually make up for it in that complexity.
But I think there's kind of like three types of relational databases.
There are the big distributed ones and some of them are so mature now
that they may actually be a really, really good fit.
But for me, as someone that has all the scar tissue
of operations, they can be a bit scary
because they're hard to reason about.
But I think they may have crossed the threshold now
when maybe they're so good that it doesn't matter anymore.
But it's not something I'm super educated on.
The second one for relational databases
that we see now is more relational databases that
are backed by object storage.
The performance profile of that is kind of scary to me,
again, in the 3 AM scenario.
That's why I like InnoDB, MySQL, just so much.
It's very simple.
I can reason about it.
I know exactly what's going on.
Yes, I have to do all this sharding and stuff,
but it's a predictable quantity.
So I think that relational storage is an area where S3 is being applied
to where it's not particularly appealing to me as an operator.
Awesome. Well, we have so much more we could ask. But as you know, we always want to go
over time. Where can people find you or TurboPuffer?
What is the best place for folks who are actually interested to try it out?
Yeah, turbopuffer.com.
If you have a good use case, you just hit the website and just apply and describe your use case.
Usually, if it's a good fit, we'll get back to you. We're better at Rust than React. But it's not in beta.
It's running some of the biggest production workloads in the world.
We just want to be very close to our customers.
The second is you can find me on X or Blue Sky or whatever.
@sirupsen is my handle, S-I-R-U-P-S-E-N, on those platforms.
And then my website is sirupsen.com.
You'll find lots of napkin math blog posts and things like that on there.
Amazing. It's been so interesting talking to you, Simon.
Super informative. And it's always great to meet another Canadian infrastructure founder.
Thank you so much for joining us.
Thank you so much for having me.