Postgres FM - Multi-tenant options
Episode Date: June 20, 2025

Nikolay and Michael are joined by Gwen Shapira to discuss multi-tenant architectures — the high level options, the pros and cons of each, and how they're trying to help with Nile.

Here are some links to things they mentioned:

- Gwen Shapira https://postgres.fm/people/gwen-shapira
- Nile https://www.thenile.dev
- SaaS Tenant Isolation Strategies (AWS whitepaper) https://docs.aws.amazon.com/whitepapers/latest/saas-tenant-isolation-strategies/saas-tenant-isolation-strategies.html
- Row Level Security https://www.postgresql.org/docs/current/ddl-rowsecurity.html
- Citus https://github.com/citusdata/citus
- Postgres.AI Bot https://postgres.ai/blog/20240127-postgres-ai-bot
- RLS Performance and Best Practices https://supabase.com/docs/guides/troubleshooting/rls-performance-and-best-practices-Z5Jjwv
- Case Gwen mentioned about the planner thinking an optimisation was unsafe
- Re-engineering Postgres for Millions of Tenants (Gwen's recent talk at PGConf.dev) https://www.youtube.com/watch?v=EfAStGb4s88
- Multi-tenant database: the good, the bad, the ugly (talk by Pierre Ducroquet at PgDay Paris) https://www.youtube.com/watch?v=4uxuPfSvTGU

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:

- Michael Christofides, founder of pgMustard
- Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:

- Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and I'm joined as usual by my co-host, Nikolay,
founder of Postgres AI. Hey, Nikolay.
Hi, Michael. Hi.
And today we are joined by special guest, Gwen Shapira, co-founder and chief product
officer at Nile to talk all things multi-tenancy.
Hello, Gwen. Thank you for joining us.
Thank you for having me. It's very exciting for me to be on your show.
Oh, we're excited to have you. Thank you. So to start, perhaps you could give us a little bit of background
and what got you interested in the topic of multi-tenancy in general.

As many things, it started with an incident, but this time not actually one of mine. So when my co-founder and I started Nile, we actually started with a very different idea. After about nine months, we were like, this idea is not working very well.
We developed some things.
We're not finding the market we hoped to find. We're sitting in a hacker space over in Mountain View.
And we're like, what did we learn?
We talked to 200 companies, all doing SaaS.
What have we found out as a result?
And the thing that called to us is that very early on,
basically when the first lines of code are getting written,
you have to choose a multi-tenancy model.
And then about two, three or four years later,
depending on how fast you're growing,
you have to change it.
And we heard a lot of stories
on what caused people to change it
and whether they regret earlier choices or they're like
we didn't know better and how things went for them and then we started
looking at different blogs, and we found so many by very famous companies, with
either incidents where something that was done to a single tenant caused a whole chain
of events that took down their entire system, sometimes for days.
And also a lot of slightly better stories about how we sharded our highly multi-tenant database.
And we found story after story after story on how people had to re-architect their entire database,
which is, as you guys know, extremely painful to do after you're a successful company three years in.
And we're like, this is a good problem. So many people have it. It is so common.
My past was in databases, not as much Postgres, more Oracle and MySQL, but I've seen this problem again and again in all kinds of companies. My co-founder has seen this problem again and again in all kinds of companies. This is such a good problem.
Everyone has it and nobody's working on it. Why is nobody working on it? And that's how we got into it.
Yeah, nice. Should we go back then in terms of how you like to describe the different models, or the different options that people have in the early stages?
Yeah. So everyone basically starts with one out of two. And I'm using AWS terminology, even though there are other terminologies that people apply to it. AWS calls it the pooled model versus the isolated model. And in a pooled model, you basically create your tables as normal, and then add a tenant ID column to each and every
one of your tables, pretty much. Some, maybe not everyone, some have like shared data,
but most of your tables are going to end up with a tenant ID column that tells you which
tenant this row belongs to. Very easy when you start out, and all you have to do is sprinkle some WHERE clauses every now and then, and you're pretty much good to go. How hard can it possibly be?
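The pooled model being described here can be sketched in a few lines of SQL (table and column names are purely illustrative, not anything from Nile):

```sql
-- Pooled model: one set of tables, every row tagged with its tenant.
CREATE TABLE tenants (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE invoices (
    id        bigint GENERATED ALWAYS AS IDENTITY,
    tenant_id bigint NOT NULL REFERENCES tenants (id),
    total     numeric NOT NULL,
    PRIMARY KEY (tenant_id, id)
);

-- Every application query has to remember the extra WHERE clause:
SELECT sum(total) FROM invoices WHERE tenant_id = 42;
```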
The places where it gets you is that you have no isolation. Everyone is in one big pool. And if you have a problem tenant, if someone grows really, really large, suddenly queries start getting slow for them.
If you need to do an upgrade and one customer absolutely refuses to accept
changes or needs their own time window.
There are a lot of different ways you may discover that. By putting all your customers
in one big pool, you save a lot of effort,
you save a lot of money.
This is by far the cheapest option you're going to have.
It's shared resources in one database,
but you are not allowing yourself to do anything specific
for any one customer, should they need it.
The other approach is basically the reverse.
It gives every customer its own database, or sometimes its own schema. This still counts
as isolated, even though it's not all that isolated. You share quite a lot in that scenario,
but the schema is separate, and it's quite a bit easier to move them out if needed, if the schema is separated.
And in this scenario, first of all,
there is a nice benefit that you get help
from a lot of popular frameworks.
Ruby has, I think it's called the Apartment gem,
for having this kind of multi-tenancy model.
Django has something. So a lot of very popular frameworks
have something that helps you in that model.
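A schema-per-tenant flavour of the isolated model might look roughly like this (names are illustrative; this is the pattern that frameworks like Apartment automate):

```sql
-- Isolated model, schema-per-tenant: same table shape, one schema per tenant.
CREATE SCHEMA tenant_acme;

CREATE TABLE tenant_acme.invoices (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    total numeric NOT NULL
);

-- Switching tenants is just a matter of changing the search path:
SET search_path TO tenant_acme;
SELECT sum(total) FROM invoices;  -- no tenant_id column needed
```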
But if you accidentally grow to a large number of databases,
it starts being very painful.
Obviously, a database with a large number of objects
is no longer as easy to work with.
Suddenly you start learning how much space in memory
the catalog can really take when you have connections.
Suddenly you start learning that pg_dump
can take a very long time.
If you actually have a database for each tenant,
then doing any kind of maintenance on 100 databases
is already not fun.
If you end up going into a thousand databases,
it's really not fun.
And if you think about it,
a lot of SaaS companies have customers in the hundreds of thousands,
not just thousands.
So it becomes very painful exactly when you grow.
Yeah, for example, upgrade includes dump and restore of the schema, and we had cases with 250,000 tables and a million indexes. Well, indexes are not dumped there, but tables... exactly, this is worse than not being dumped, right? If only we could dump indexes. Yeah, and then you need to update statistics after upgrade
for all of those tables.
It's a nightmare, honestly.
Yeah, on the other hand, at least you get to do it to a customer at a time.
Imagine that everyone is in one really big database and now you have to upgrade all of
them together.
Yeah.
Yeah, it's painful, all this.
Yeah, and then there is the mixed model, which I think pretty much everyone ends up with, where you start with the pool model and, as you grow, you shard it. But you're actually pretty smart, and you realize that not all customers are of the same size, and you can have some dedicated shards for your biggest or most sensitive or most demanding customers. And this is, I think, if you look five years into the life of the company, I would say that this is the dominant model. Some variation of: we have shared databases with a pool, and then some dedicated databases for specific customers.
What is the problem we're trying to solve? Is it security or performance, or both? Because if we go back... some customers share this pool, and that affects the security goal, right?
So we don't achieve it.
Absolutely.
So this is, first of all, one of the drivers
that make people actually start with the isolated model.
They know that they're going into a sensitive area.
They're focusing on SaaS for healthcare, SaaS for finance.
Those companies definitely start with isolated model
and try to figure out how to manage
large number of databases.
A lot of times those companies don't become huge.
There are only so many hospitals in the United States,
but they still have to build all the tooling to manage a large number of isolated databases.
For other companies, it's more complicated. I would say maybe 70% of the time, the reason for eventually moving customers out and sharding and isolating would be performance.
It's amazing how many performance problems can be solved by just having less data in
each database.
On the other side, there are the stories where, two years in, suddenly a very sensitive customer shows up, or you want to sell into... like, you thought you were building a normal CRM or some kind of a RAG database, but then a healthcare company shows up, a bank shows up, or even worse, a government shows up, and they show up with a list of demands.
And since they usually have good amounts of money to back those demands, there
is a lot of incentive to figure out a solution for them.
Nice.
We have one kind of ugly duckling in the Postgres world that I'm not sure
quite fits either of these models.
I wonder if it's worth discussing row-level security briefly, because if I was to
bucket it based on those definitions, it's kind of the pooled model in a way,
because all of the data is together, but there is some isolation between tenants.
Absolutely.
Yeah.
Yeah.
I have a love-hate relationship with RLS.
I think a lot of people do, because you're right.
On one hand, it's absolutely a lifesaver in the pooled model.
Developers make mistakes as joins and conditions get more complicated.
It's very easy to misplace a WHERE clause and actually leak data that you don't want to.
So RLS will prevent you from doing it if you do it right.
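As a minimal sketch of what that looks like (illustrative names, assuming a pooled `invoices` table with a `tenant_id` column):

```sql
-- Enable RLS so the tenant filter is enforced by the database itself:
ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;

-- Only rows belonging to the current tenant are visible:
CREATE POLICY tenant_isolation ON invoices
    USING (tenant_id = current_setting('app.tenant_id')::bigint);

-- The application sets the tenant once per connection or transaction:
SET app.tenant_id = '42';
SELECT sum(total) FROM invoices;  -- other tenants' rows are invisible
```

Note that table owners bypass policies unless you also run `ALTER TABLE invoices FORCE ROW LEVEL SECURITY`, and superusers and roles with `BYPASSRLS` bypass them regardless.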
It turns out that a lot of times the rules get complicated and then it leads to bugs.
It also turns out that a lot of times the rules get complicated and it leads to terrible
performance.
And one thing that developers really don't realize, I'd say almost no developer realizes it until they run into it:
the WHERE conditions that RLS introduces are not optimized like the
WHERE conditions that you introduce.
Because Postgres, thankfully, is very good about security,
it treats the WHERE conditions in RLS differently.
I think they call them security quals or something like that.
And they are very, very conservative
on how they optimize it and how they plan for it.
This has benefits.
You get very strong security guarantees,
very few bugs as a result.
This is fantastic.
On the other hand, the plan will sometimes be
significantly worse than what you would come up with
if you were to look at it really hard and do it yourself.
And with RLS, there is basically no way to force the plan you want.
Like, you cannot use the set enable_... options to force different plans, because again, the
main overriding rule is that the planner is very conservative in how it optimizes those
RLS conditions.
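One commonly suggested mitigation, which the RLS performance guide linked in the show notes also covers, is wrapping function calls in a scalar sub-select so the planner can evaluate them once up front rather than per row. The two policies below are alternatives shown side by side for comparison, not meant to coexist, and the names are illustrative:

```sql
-- Naive policy: the function call may be evaluated for every row.
CREATE POLICY tenant_isolation_per_row ON invoices
    USING (tenant_id = current_setting('app.tenant_id')::bigint);

-- Wrapping it in (SELECT ...) lets the planner compute it once
-- as an init plan, which often restores index usage:
CREATE POLICY tenant_isolation_cached ON invoices
    USING (tenant_id = (SELECT current_setting('app.tenant_id')::bigint));
```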
So some people call RLS a performance killer.
I wouldn't necessarily go this far, but you can definitely run into gotchas, and you need to be aware that it's not a normal WHERE condition that you're looking at.

Yeah, nice. So what does Nile offer today, and what is the ideal solution to all this, in the Postgres context of course?
Yeah, so basically we wanted to do maybe three things.
First of all, give isolation while not degrading the developer experience.
So for example, we partition data by tenant out of the box for you, completely
transparently, because we know that a bit later on you're going to want it, and
it's going to be a pain in the ass to add it.
We shard it transparently. Basically, your database
may be spread across multiple different shards. We will route the queries for you and make sure
that they work as issued. So in a way, you get the model you will have anyway in four to five years, but you're getting it from the get-go.
And without doing a lot of the work because we are doing a lot of the management for you.
The other thing is that we have done some work to basically bypass RLS while still giving you isolation. So in the queries, you kind of do the same as with RLS: you set the tenant ID equal to something.
We use that to actually direct queries.
We rewrite the queries immediately to the partitions that we know
have the data for that tenant.
So we have a small extension that kind of replaces table names with
partition names in the query itself.
And this vastly improves performance in the majority of cases.
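Nile's extension isn't public, so this is only a conceptual illustration of the rewrite being described: a tenant-scoped query against the parent table becomes a query against one leaf partition, sidestepping both RLS checks and run-time partition pruning (names are invented):

```sql
-- Parent table, list-partitioned by tenant:
CREATE TABLE invoices (
    tenant_id bigint  NOT NULL,
    id        bigint  NOT NULL,
    total     numeric NOT NULL
) PARTITION BY LIST (tenant_id);

CREATE TABLE invoices_t42 PARTITION OF invoices FOR VALUES IN (42);

-- What the application sends:
SELECT sum(total) FROM invoices WHERE tenant_id = 42;

-- What a rewriting extension could execute instead:
SELECT sum(total) FROM invoices_t42;
```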
I mean, we've seen it in a bunch of cases, especially if you have slightly weird indexes
that RLS may conservatively not use.
The improvement is quite stark. Depending obviously on table sizes, you can get a benchmark that proves anything, so I don't want to throw
numbers out there. But obviously, if you break down a table with a million rows into 1,000 tenants with 1,000 rows each... you can see where I'm going with that. The other thing we did, and that was
probably the most work and this is still work in progress, is allow moving tenants around,
because one of the biggest problems is that the tenant gets large or noisy and you want to give it
its own machine. Moving it is usually a long downtime. If you catch it after the tenant
is already large by doing the compute storage separation, we can basically make it transparent.
It's a latency spike while we're holding off some queries, while we're moving things like sequence IDs, and pointing a different compute at the same part of the storage.
But it's essentially a no downtime operation.
So we think it's a huge deal.
Because again, it's just a problem that we keep seeing again and again.
Yeah, I'm curious, is it all open source what you build or only parts of it?
Right now it's mostly hidden. We have started releasing parts of it under an Apache v2 license. So yeah, the goal is to open
source it, and we already publicly declared that it's going to be open source. We have
not a date, but a point of completion where we plan to open it.
Yeah. So I've heard several interesting things here. One is an extension for this... I guess in your documentation it's called virtualization, like RLS virtualization, or how is it called?

We call it tenant virtualization. The extension itself, I think we called it Karnak. We name everything after stuff in Egypt, and Karnak is a famous temple.
So data is stored in separate tables in the same database, but the extension rewrites queries to basically route each query to the proper table, right? Is this based on...

It's in separate partitions, and we rewrite queries to go to the correct partition. We basically bypass RLS; we bypass the planner trying to make those calls. We found out that, with a large number of partitions, this is significantly more efficient.

I see. So it's a Postgres partition.
I see. And another thing you mentioned here is sharding, right?
Yeah. So we use foreign data wrappers to allow that. We have two things in the architecture. First of all, we have a proxy, a routing proxy. It keeps track, for every connection, of which tenant is the current tenant, and it routes it to the shard that has the correct tenant in it.
And then we also have some cases where some developers want to write queries
that touch multiple tenants.
Those are not going to be as fast,
but we do allow them by use of foreign data wrappers,
mixing partitions with our own partitioning rules
with foreign data wrappers, and still keeping things efficient.
We didn't want the planner on any machine to be aware of all
the partitions in all the other shards, because it just explodes the planning
time in ways that we saw as unacceptable.
So what we did is represent each shard with a table and then put hierarchical table inheritance on top of it.
And the end result is basically a union all between the table with the table inheritance
that points to all those other shards and the table with the partitions. Now this gives us basically predicate pushdown,
because the planner will push the predicates down to all those different shards.
Only the other shards know that they have partitions,
which the source planner doesn't know about.
They will plan correctly with all the partitions,
but each one only knows about its own subset of the partitions.
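A rough sketch of that layout, assuming `postgres_fdw` and invented host and table names; the real implementation is Nile's own:

```sql
-- Remote shards are reachable via postgres_fdw:
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER shard2 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard2.internal', dbname 'app');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard2
    OPTIONS (user 'app', password 'secret');

-- Plain parent table; inheritance (not declarative partitioning)
-- is what allows foreign tables as children.
CREATE TABLE invoices_all (
    tenant_id bigint  NOT NULL,
    total     numeric NOT NULL
);

-- One foreign table per remote shard; the local planner sees a single
-- remote table, never the remote shard's partition list:
CREATE FOREIGN TABLE invoices_shard2 ()
    INHERITS (invoices_all)
    SERVER shard2 OPTIONS (table_name 'invoices');

-- A cross-tenant query becomes an append over local and foreign
-- children, with quals pushed down to each shard:
SELECT tenant_id, sum(total)
FROM invoices_all
GROUP BY tenant_id;
```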
So we see it's a bit hacky and it's a bit
weird to explain. And we think we can do better with
some modifications to Postgres, which we have not
done yet. But this does give us predicate push down,
fairly fast planning and the ability to do queries that
cross tenants in situations where this is
required.

Yeah, if you just rely on foreign data wrappers without 2PC, no two-phase commit, I'm curious what kind of anomalies can happen there.

Yes, and we prevent a lot of things that could cause anomalies.
So we do have a transaction coordinator,
but in order to not overload
and also not overcomplicate our architecture,
we limit some things.
So DDL has to be done on a single tenant, and you cannot mix cross-tenant queries... sorry, DML, inserts and updates, have to be done on a single tenant, and you cannot mix cross-tenant queries in a transaction. So the moment you start a transaction, you have to know what tenant you're working on. And then we route it to the correct shard, which has the correct table, and everything has the absolutely correct guarantees. If you need to do something cross-tenant, you do not involve it with any kind of update. You could still be exposed to some anomalies, I agree, because
there could be ongoing transactions from other people in other places. So you get the basic
read committed guarantees that Postgres gives you, I believe, but not anything more than
that. But yeah, again, we believe that cross-tenant queries are rare and mostly done in analytical
cases where you do reporting where it's slightly less critical to have those.
So you forbid writing to two shards in one transaction.

If you want to write to a shard, it's fantastic. You tell us what tenant you are writing data into, and we will direct you to the correct shard.

I mean, if there is a transaction which needs to write to two different shards, this is a big problem without 2PC.

Yes, exactly. We don't let you do that, essentially, in order to avoid anomalies.
I see. Another question here: have you considered the approach used in Vitess? As I understand it, maybe I'm wrong, for most of the analytical queries, maybe to avoid distributed transactions, data is brought asynchronously from one shard to another, and you have it locally, basically a kind of materialized view on top of logical replication, for example, or something. And it has an eventual consistency approach, of course, but you can just join it in one Postgres, in one shard, right? Have you considered this approach?

We have considered it. I think maybe CitusDB has something similar.
If I remember correctly, I'm not 100% sure.
But yeah, it's something that we are like, yeah, this is a good idea that we may examine
in the future.
It's definitely we are trying to build something useful gradually.
And we understand that early on, it's almost safer to have a bunch of limitations that over time we resolve rather than allow people to do something unsafe.
And also, yeah, you know, build a kitchen sink.
Sounds to me like the Postgres versus MySQL approaches. Because the MySQL approach... you remember MyISAM? Maybe you don't remember, but it was quite bad. You needed to run REPAIR TABLE all the time, because... yeah. It stayed that way for so long... yeah, it allowed too much.

Exactly. Yeah, and MyISAM had a lot of issues. I mean, there is a reason why InnoDB became extremely popular.
Right, right.
But also with a multi-tenancy use case, I think you're quite right that, well, we,
I mean, you'll find out soon enough, right, if lots of people want these cross-tenant
or cross-shard queries, which are by definition cross-tenant queries. And if they don't, if
you don't need to worry about it, you save a bunch of effort having to even implement that.
So yeah, I like that a lot.
Yeah, last comment here: I'm excited to see
that finally the Postgres ecosystem receives some attention
in the area of sharding. I guess the time has just come,
and more and more databases have become too large to be handled.
I'm almost surprised... I'm honestly surprised it took that long. I mean, again, if you look at MySQL...

It's just unfortunate. The first time I touched this topic, it was 2006, immediately when we started working with Postgres, honestly. And there was PL/Proxy from Skype at that time already, but it required you to write everything in functions.
Partitioning didn't exist back then.
It existed. It was based on inheritance. It required much more manual work...
It was fun, actually. You understood it better, you know.
But yeah, but it was not convenient, not super convenient.
It's just I see that just unfortunate how it turned out in Postgres ecosystem.
And now definitely there is a huge pressure.
Many companies need partitioning.
How would history have worked?
Imagine that YouTube picked Postgres and not MySQL as their first database.
Yeah, Google or Facebook, they both chose MySQL somehow, yeah.
Vitess would have been for Postgres first, right?
Exactly.
It could have turned out so differently.
Yeah.
It's funny, isn't it?
Changing topics slightly: LLMs are quite top of mind at the moment.
Are you seeing people's initial choices change as a result of asking for advice earlier from our robot friends?
Oh my God. We're seeing so many weird things.
Like it's just unbelievable how much things are changing. First of all, we're having people show
up on our Discord and say things like, I'm using Nile because my LLM thought it's a good idea.
And I don't really know Postgres, so I need some help, but my LLM assured me that this is still a good idea. A lot of people are very much beginners. Like, maybe I can say: when I started developing, the first time I had to use a database, my company sent me to a three-week database class.
I think it was Oracle about 20 years back.
And I came back a lot more confident that I know how to use Oracle
and not to leave transactions open too long because people will yell at me.
These days, people don't do the three week class before they try using a database.
So you see a lot more people start using a database earlier on, and they do expect more hand-holding from the vendors. The LLMs give them advice up to a certain point, but eventually, if things are slow, they will come to you and say, hey, why is my query slow? I'm sure you guys have seen your share of that. I'm also seeing people use Postgres for their LLMs in different ways. And this is really exciting to me. People use Postgres via MCPs, people use Postgres with vectors,
people building AI applications on top of Postgres. We're seeing a lot of that. And I mean, personally, I'm really excited that people can program with LLMs, not knowing
a lot about Postgres, not knowing even a lot about software engineering at all, and
still get reasonable security guarantees.
You don't need to know to ask your LLM about RLS,
or about: are you sure this query
actually properly isolates tenants?
And it's also interesting how much the results differ
when people use different LLMs.
Like, I would say, thinking models do fairly well,
iterating in order to get good code
and checking their own results,
again, given access to a database via MCP.
I would say that if you use ChatGPT with GPT-4o,
you will get a lot of random hallucinatory stuff still in your code.
Yeah, it's funny. I already told Michael we had these cases doing consulting. Maybe almost a year ago already, I started noticing that people send us something like: we are building this part of the database, can you review it?
We review it; we use different LLMs supporting this review.
And then we have a call, and I'm curious: the code looks great, I mean the schema looks great, but
something is off.
And then we have a call, and I see they have open tabs with ChatGPT and Claude there as well.
So I realized they used an LLM to create the schema
and then sent it to us for review.
And we use LLMs to review it.
And then it's like a four-party process.
It leads to a good place, but someone
needs to jump in with proper expertise
and say, this is not a good approach.
This is hilarious.
Do you think at some point you and your customer can just step out and let the LLMs figure
it out between themselves?
Well, there's a problem here, because it's great, but it does 80% of the job with 1%, not
even 20%, of the effort, very quickly.
But there is the 20% of problems which, again, I don't know, maybe in the next few years
it will change.
I think it will change.
But right now I feel my internal LLM is trained much better than ChatGPT.
You trained your own LLM, right?
Yeah.
No, I mean my own.
Oh yeah.
No, but I think you actually trained your own.
Yeah, we experiment. We do some stuff, we experiment, we have some things: we started with fine-tuning, moving toward our own LLM, but we're not there yet. So yeah, there is a lot of stuff that can be done there properly. It's hard to compete with Claude, and they have a very high pace. With the Claude 4 release, I see, wow, it's really great.
But it's still missing many things you learn from practice, which were not discussed anywhere. That's
why they don't bring it. Many problems were not discussed yet. And you explore them if you have a lot of data and heavy workloads.
So you're saying that even training on the Postgres mailing list doesn't have all the information in it essentially?
Well, yes. For example, a random problem we recently touched:
there is the buffer pool in Postgres, and it basically has 128 lock partitions.
So you can have 128 locks, and if you have a huge buffer pool, this becomes a bottleneck.
And some people say, let's maybe make it tunable, configurable.
But when you start researching this topic,
you end up finding a recent, well,
recent as of August and September last year,
conversation on the hackers mailing list, which is open-ended.
It's not complete, because somebody
needs to run benchmarks and prove that this is
worth having a new setting.
And that's it.
There's a patch proposed, but that's it.
There are no experiments yet.
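For context, the 128 being discussed here is `NUM_BUFFER_PARTITIONS`, a compile-time constant that splits the buffer-mapping lock into partitions. One way to look for evidence of this bottleneck (a sketch; it only shows instantaneous waits, so you'd sample it repeatedly under load) is to check for sessions waiting on that LWLock:

```sql
-- Sessions currently waiting on a buffer-mapping lock partition:
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event = 'BufferMapping';
```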
This is, by the way, one of the reasons I was so excited about your LLM approach, because
if you think about what is the bottleneck for doing a lot of the database improvement
things, and I am feeling it very personally.
Running benchmarks is hard.
Properly planning a benchmark is hard.
LLMs can help with that.
But again, they sometimes just go off the rails.
And even if they help plan that,
I don't know LLMs that actually run benchmarks
to the point where they provision the machines
in AWS and know that you have to provision a separate machine as a database and a separate
machine, maybe a few of them, to drive the workload and they both need to have appropriate
resources.
All those and then don't get me started on analyzing the results, which is kind of 99.9% of the work.
And so the fact that you actually kind of started your LLM for... I have an LLM that can actually do benchmarks... I think this will be the biggest breakthrough, in both people tuning their own Postgres and also Postgres as a community being able to advance the state of the art.

Let me share sad news. Like, I'm not giving up, but it's a roller coaster. We spent, like a team of maybe five engineers, more than one year trying to achieve that. We achieved many things.
But first of all, we chose Gemini because they gave us credits.
And I think it was a huge mistake.
Gemini has a lot of problems. They have a lot of... like, suddenly you have a 500 error. So many problems. It's just not a mature product, Gemini, and it has hallucinations all the time. It's good, for example, for working with JSON, because when you need to run an experiment... we decided to choose JSON as a config format, and it writes it much better than GPT-4o and so on. But for many things, it just hallucinates. It invents things all the time. It just makes up results all the time.
And we have a system to control it,
but it bypasses all the time.
It's really hard.
So then, yeah, we additionally experimented with DeepSeek, Llama,
and we fine-tuned a lot.
All versions of GPT, all modern fresh versions;
we also bring them in all the time.
And Claude is much better.
We just added it to this system we have.
But after one year, I decided, you know what?
Benchmarks are an extremely hard topic, and we cannot trust it anymore. I mean, we cannot trust an LLM to create precise configuration and process results fully. So we decided that the LLM is just, you know, more of a connecting thing. Like, when you engineer a benchmark, an expert needs to engineer it. I don't trust any LLM for now, because for any experiment... we planned to publish maybe 15 to 20 experiments in our blog last year, and if you open our blog, you see just one experiment. And even there we screwed up, and someone on Twitter said this is not right, and we quickly corrected it, which is good. And we had achievements: interesting things, bottlenecks popped up here and there. It's really fun to iterate with an LLM, but once you allow it to think, to design the experiment, and to treat the results, 99% of the time we have wrong results, wrong conclusions, and so on. So for now we are thinking, okay, this is just an accelerator for performing experiments, but design and understanding of results should be in a human brain for now.

I love the fact that benchmarks are also hard for robots, to be honest.
It's so hard for humans, right? Yeah and we collect so many, but somehow it's super hard still.
So you always think where is the bottleneck?
And simple question, right?
But for now, it's extremely hard to let LLM find bottleneck and draw proper conclusions.
And honestly, if you just had an LLM that always said
it's the network, it would be correct about 80% of the time.
What if it's local?
There is no network.
Everything through Unix sockets, and we have these cases
as well.
So I agree with you in production.
Like in production, yes.
But in experiments when we learn Postgres behavior
on single machine running PgBench locally, we don't care sometimes. So there's no network
there sometimes. So it's hard. So I had many moments of frustration, but it's so good.
I still believe that iterations are great. So if you say this is great benchmark, just check it on new version, just changing one thing.
This is good.
We have automation, we have interface,
and it repeats the process of analysis again.
This is where LLM helps a lot.
Because without an LLM, you could just have some form, and, oh, we don't have this parameter in the program, it's not exposed in the interface, that's bad.
With LLM you have freedom to change things and iterate based on existing good benchmarks.
So yeah, I cannot say we are there yet. With this project, right now we are thinking about the next level of it, where I think we will give the LLM less freedom, you know, that's the key, and control more by a human brain.

Human in the loop kind of thing?

Yeah, exactly. So design and first analysis.
Only human should be there.
But once you have confidence that you're moving
in the right direction, and you just need to iterate
and expand, for example, to different versions,
platforms, everything, this is where you can relax.
You already verified results.
You can say just repeat, but on different something.
This is where LLM already can bring you.
It can just speed
up everything. You can throw it at this benchmarking process and have, like, 10 experiments running on 10
different versions or something. And I think this is also almost the general direction that
agents are taking. I mean, people started saying 2025 was supposed to be the year of the agent.
I think it's almost becoming the year of the human
in the loop with the agent.
Like all the successful products I see are,
you tell the agent to do some stuff,
you ask it, please plan something, you give it feedback.
You then say, okay, now that we have a good plan,
go and execute on it.
You come back an hour later, you had your coffee.
Okay, let's see what you've got.
Here's some feedback.
Go fix some stuff.
Like, I think it's always this way.
Every successful product is a bit like this.
Right, but sometimes humans start using a different LLM when reviewing things, right?
Being lazy, right?
Yes. That's interesting. So I'm not sure how successful it will be, but my gut tells me that we need to move, move, move in this direction anyway and have some, I don't know, more
experiments and so on. I hope we will have more to publish soon and start iterating.
But yeah, it was early days last year, so now we are rebuilding stuff. We will see how
it works. And for example, one of the experiments we must do, I think, is to conduct various
benchmarks for RLS because it's obviously...
That would be a fantastic example.
Yes.
I mean, this is something that humans with experience are pretty good at: finding cases
where you're like, is RLS actually going to be an issue, and having the stories that
the LLM can then go implement and test.
Yeah, we had several cases, and also Supabase has public materials, a blog post about this.
Obviously, there is already a quite well-known case when you have the current_setting function
inside an RLS expression.
And yeah, if you select a count over one million rows, it's terrible.
And it's actually quite easy to fix.
But yeah, so the idea is to collect these kinds of experiments and see. Actually, my goal
with this experiment, I think, will be to prove that they are not a problem if you do it
right.
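For context, the well-known pattern looks roughly like this. The table, column, and setting names here are illustrative assumptions, not from the episode; the point is that a bare current_setting() call in an RLS expression can be re-evaluated for every row, while wrapping it in a scalar subquery lets the planner evaluate it once per query as an InitPlan:

```sql
-- Illustrative sketch; "documents" and 'app.tenant_id' are assumed names.
-- In practice you would define only one of these two policies.
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

-- Slow variant: current_setting() may run once per row scanned.
CREATE POLICY tenant_isolation_slow ON documents
  USING (tenant_id = current_setting('app.tenant_id')::uuid);

-- Known fix: the (SELECT ...) wrapper becomes an InitPlan,
-- evaluated once per query instead of once per row.
CREATE POLICY tenant_isolation_fast ON documents
  USING (tenant_id = (SELECT current_setting('app.tenant_id')::uuid));
```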
Interesting. I would contribute:
I think there was a recent post on the bug tracker
where basically the optimizer,
the planner, refused to use, I think it was a GiST index or a GIN index,
due to the belief that the RLS optimization
is unsafe. I can look it up and send it to you. But yeah,
that can also be interesting. Like, if you have a fix for Postgres,
it will obviously be nice to showcase. Right. And also, so you don't
use RLS; like you said, you bypass it, right?
We bypass it, yeah.
Yeah, that's interesting.
I'm curious, in this mixed scheme, when we have partitions or shards and RLS, does it
make sense at all to additionally involve RLS locally if some of the shards have a mixed pool of customers?
One of the questions that I have in mind, and this is something that we are trying to help
figure out for our users: often, on top of the tenant, you still have permissions for
specific users in the tenant.
Like you have an admin that can do anything and then you may have someone who is not allowed
to see some rows at all because they're too sensitive,
all this kind of stuff.
And our users ask us whether they can do it with RLS
inside those partitions,
or whether they can do it in their application.
There are a lot of application-level tools,
or middleware kind of tools, that will do it for
them. Whether it's better to do it in the app layer or in Postgres with RLS is a good question
that I don't have an immediate answer for.
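As a hedged sketch of what the in-Postgres option could look like, one approach layers the two rules using policy kinds: permissive policies on a table are ORed together, while RESTRICTIVE policies are ANDed on top of them. All table, column, and setting names below are illustrative assumptions:

```sql
-- Illustrative sketch only; "documents", "sensitivity", and the
-- app.* settings are assumed names, not anyone's actual schema.
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

-- Tenant boundary: a permissive policy scoping every query to one tenant.
CREATE POLICY tenant_boundary ON documents
  USING (tenant_id = (SELECT current_setting('app.tenant_id')::uuid));

-- In-tenant rule: RESTRICTIVE policies are ANDed with permissive ones,
-- so this further hides sensitive rows from non-admin users.
CREATE POLICY hide_sensitive ON documents AS RESTRICTIVE
  USING (
    (SELECT current_setting('app.user_role', true)) = 'admin'
    OR sensitivity <> 'restricted'
  );
```

The alternative discussed, enforcing the per-user rule in the application or middleware layer, keeps this logic out of the database entirely; the trade-off is that every access path then has to go through that layer.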
Right. So there are several layers of multi-tenancy, basically. Well, not multi-tenancy as such, but different layers. Your customer, being a tenant for you, might have tenants of their own, but also
inside them there might be additional clusters or segments of users inside each tenant.
So it's kind of apartments and rooms inside apartments.
Nice.
I like that.
This is one of the things that make multi-tenancy confusing, right?
Because it's almost like a Matryoshka kind of scenario.
Great.
Yes.
This term, by the way, has been used in the Postgres ecosystem multiple times. Starting with, you mentioned GiST, the
original paper by Hellerstein mentions the RD-tree, the Russian doll tree, so basically a Matryoshka
tree.
That's true.
And also in embeddings, they have the Matryoshka type embeddings where you can make them of
any size.
Yeah, I read about this as well.
Yeah, it's funny. So yeah, great. And what are
your plans? I saw MCP servers, and there is already some integration.
We basically have three directions. One is the MCP server: making it public,
giving it authentication. Right now it's open source. You can run it
on your own, but we are not hosting it. So we should start hosting it at some point.
So, things that just make it easier for people who use LLMs to use it. The other thing
we've done that we think is very useful for LLMs, and this we already have, is making it zero
time to create new databases, because for an LLM it's just zero time and zero cost.
Because they love creating a lot of them every time something goes wrong.
Okay.
Let's try from scratch with a new database kind of situation.
And so we're making it fast and cheap.
The other things that we're still working on is really to make our
documentation more LLM friendly.
llms.txt is absolutely not enough. It actually creates a very large file. The LLMs tend to get
lost in it. We need to figure out how to make it better. And then also, you know, everyone is kind
of thinking, can we have our own agents? Can we do something around that?
We're kind of thinking about that.
Something we already have that is useful is just that with the multi-tenant model, the
embedding, the vector indexes are much smaller.
And this is a huge deal for people building those agents and LLMs.
Yeah. And actually you mentioned, so this zero startup goal, you mentioned separation
of compute and data. How is that achieved? Using what approach?
Oh, yeah. Sorry. This is more than a two-minute answer, but the short one is that you
patch Postgres, and then you find a better way to do your storage and basically wrap every
Postgres storage function with an equivalent for your storage. And then you also need to
apply the WAL continuously to the storage layer.
Great job describing this, like, in actually 20 seconds.
I understood very well. Yeah.
So, do you have plans to make this open source as well?
Yes, absolutely.
I mean, first of all, as you know, the WAL reader has to be registered,
and ours already is, with an open source license.
And then, yeah, we are planning to open source the storage layer, our extension. I think at this
point it's about seven different patches we've made on Postgres. I don't think we'll want
to open source it as a Postgres fork, because the number of patches is quite small
and we are maintaining it
for, I think, everything from Postgres 12 to 18 at this point,
or maybe 13 to 18, something along those lines.
All supported versions, I guess.
Yeah.
That's great.
Yeah, well, looking forward to checking that out
and actually one more question for me, maybe last one,
are you open to some benchmarks? We plan to do
some benchmarks with various platforms.
It all started after the acquisition of Neon; there was a discussion on LinkedIn,
so I thought about running some benchmarks for different platforms.
What do you think? What are your thoughts about it?
Oh my god, this is very scary. We're benchmarking ourselves all the time. So
I'm keenly aware of exactly what benchmarks make us look good and what
benchmarks make us look bad. And I think that's also how we react to benchmarks
on social media. Unless it's a benchmark that shows something in
Postgres itself, and is clearly attempting to educate and help people, you can design a benchmark to
make anyone in the world look bad.
Benchmarketing, it's called.
To make everyone look good.
Exactly.
So, at any given time I can publish my benchmarks that make
Nile look fantastic, and you can run benchmarks
that you decided how to build, which may be very realistic and may even expose problems
that we have that I didn't know about. But yeah, in general, I love benchmarks.
I just have opinions on how benchmarks are used by marketing people.
Very good answer. For anybody interested in a bit more about Nile's architecture,
Gwen gave a really good talk at PGConf.dev recently and the video just
went up on YouTube so I will put that in the show notes for anybody that wants a
deeper dive there. There was also a good talk I saw at a PgDay Paris
event by Pierre Ducroquet, I'm not sure
if I'm pronouncing that anywhere near correctly, all about multi-tenant database design, especially
focusing on something we didn't focus much on today, which was the downsides of schema-per-tenant
design, including things like observability and monitoring, which I thought was really
fascinating.
So anybody considering going down that route, definitely check out that
video and I'll put that in the show notes as well. So yeah, thank you so much, Gwen.
I think we're out of time. It's been a real pleasure.
It's been a pleasure. Thank you for having me on.
Thank you.