Postgres FM - Sharding

Episode Date: August 11, 2023

Nikolay and Michael discuss sharding Postgres — what it means, why and when it's needed, and the available options right now.

Here are some links to some things they mentioned:

PGSQL Phriday monthly blogging event: https://www.pgsqlphriday.com/
Did “sharding” come from Ultima Online? https://news.ycombinator.com/item?id=23438399
Our episode on partitioning: https://postgres.fm/episodes/partitioning
Vitess: https://vitess.io/
Citus: https://www.citusdata.com/
Lessons learned from sharding Postgres (Notion 2021): https://www.notion.so/blog/sharding-postgres-at-notion
The Great Re-shard (Notion 2023): https://www.notion.so/blog/the-great-re-shard
The growing pains of database architecture (Figma 2023): https://www.figma.com/blog/how-figma-scaled-to-multiple-databases/
Timescale multi-node: https://docs.timescale.com/self-hosted/latest/multinode-timescaledb/about-multinode/
PgCat: https://github.com/postgresml/pgcat
SPQR: https://github.com/pg-sharding/spqr
PL/Proxy: https://plproxy.github.io/
Sharding GitLab by top-level namespace: https://about.gitlab.com/handbook/engineering/development/enablement/data_stores/database/doc/root-namespace-sharding.html
Loose foreign keys (GitLab): https://docs.gitlab.com/ee/development/database/loose_foreign_keys.html

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello, hello, this is Postgres FM, episode number 58. My name is Nikolay, and together with me is Michael. Hi, Michael. Hello, Nikolay. And we are going to talk about sharding today. Sharding, sharding. Two big experts on sharding are here and going to discuss sharding. Yeah, definitely not an expert here. Actually, this came up last week: somebody suggested partitioning and sharding as a topic for a monthly blogging event that I'll link up in the show notes. But yeah, we've done an episode on partitioning; I thought it was a really good one, I really enjoyed it and learned stuff, and I think we got some good feedback on it. This, though, is sharding, and I think maybe that is a good place to start, like what the
Starting point is 00:00:48 difference is. I tried defining it, you gave me some feedback. How do you define them? Well, partitioning is when we split tables but remain on a single node. I mean, we might have read-only standby nodes, but if we count only primary nodes, which have read-write access, it's only a single node. But once we do a similar split
Starting point is 00:01:15 involving multiple primary nodes, we should call it sharding instead of partitioning. And the name, I guess, came from some game in the past. Oh really? Yeah, it's from some online game. I should not lie, I was not prepared to explain etymology here, but part of my brain says it's from gaming, and they called parts of the gaming world shards. If we move this concept to databases, we have database sharding.
Starting point is 00:01:57 I really like your definition. It's the one I tried to go with in my blog post. So partitioning is at the table level, splitting what's logically one table into multiple tables. And sharding is the same, but at the database level. So it's splitting what can logically be thought of as one database, but is actually behind the scenes, multiple databases. So that makes sense to me, except, and I think it's worth warning listeners, lots of blog posts you read out there, definitions, even on Wikipedia and other places that are normally quite accurate, they do often use the word partitioning in places that I think
Starting point is 00:02:38 would more accurately be described as sharding. So it is confusing out there. The CAP theorem, it's official terminology network partitioning right so and network partitioning yeah i mean in negative connotation it's uh like in our case we should call it network sharding but it's like doesn't make sense also there's some confusion around vertical versus horizontal. And, well, I don't understand why confusion but I see people try to say vertical sharding, which actually just vertical split of, like, maybe functional split of databases to multiple databases. For example, you see weak
Starting point is 00:03:20 connection between two groups of tables. We could say clustering here as well and actually we can apply some machine learning approaches like basic k-means to try to automatically detect good options for vertical split of databases into sets of tables with weak connections weak means like almost no foreign keys between them and also dynamic relation when we have two tables involved in the query, it's also a connection. Though it's not static as foreign key, it's dynamic. So I used it. I used machine learning to try to help people understand how to split databases vertically.
Starting point is 00:04:03 Vertically means, okay, users are here, products are there, right? And a connection between them exists, but we don't use it often. Well, maybe it's not a good example. But I think it's clear when we have different data in different parts, on different primary nodes. But horizontal means we basically take our table and split it along a horizontal line, right? I think the partitioning case is actually simpler to explain. So horizontal partitioning is at the row level: we're taking some rows from the table and putting them in a different node, sorry, in a different table, and vertical would be column based.
Starting point is 00:04:47 So we're splitting the table based on the columns into two different tables. Now that's, it's the same, but for sharding, right? Like we don't have the same column. It is, it's kind of still columns and rows,
Starting point is 00:04:56 right? We're just taking columns as a whole table and putting them somewhere else. And in the horizontal case, we're taking row-level data. So we have the same tables across different nodes, but different rows in each one. Vertical column level, okay. It means like if you studied database theory,
Starting point is 00:05:18 it should be called like projection. When you have only limited set of columns and the other columns go to different table, you probably have one-to-one connection between them, relationship, sorry. I agree. And if you do it at server level, it's just functional split, which is called functional split. And this is a normal approach to scale your system. If you prefer micro services or just midsize or big size services approach when you split some functionalities in one database and other functionalities in another database, they have some connection, of course, maybe
Starting point is 00:06:01 they have some common part, and you need to take care of replication. But majority of data is very different in nature, versus okay, row level means like, we have same sets of tables, but different data because some part of data goes to one node, some different part of data goes to another node. So how do we do that? Let's talk. Well, do you want to do how or I was thinking maybe we could start with why? Like what? Why? Why is simple? Why is simple? Well, I think there are a few reasons to be honest. I think there's one big difference. Performance. Well, scale, right scale right like when you say performance it's um they are related we're talking about that i like it i think it's the main reason
Starting point is 00:06:53 you would ever want to shard which is hitting a max like let's say you're super keen to stay on amazon rds postgres and there's a max size that you can provision. And on some basis, you're getting close to maxing that out. Maybe your CPU's going up, maybe your ingest rate. There's some level in which you're scared, well, we can't continue to grow this vertically. We can't scale up anymore. Is that what you're talking about? You just used the word vertically in different
Starting point is 00:07:25 sense? I mean, vertical scaling when we just add the resources, more CPU, more memory, better disks. So yeah, that's what I meant. What I meant is, if we get to a point where we can't do that anymore, we need to think about sharding. No, not necessarily like, again, like you just used again i will repeat you used the word vertical scaling where vertical word is in different sense compared to what we just discussed when we talked about columns you and RAM, vertical scaling, we probably still can split our system vertically in terms of tables and columns and have different parts
Starting point is 00:08:15 of our schema and different primary nodes. Yeah, I agree. But I thought we were counting that as a type of sharding. I don't like to call it sharding at all but i see people do it vertical sharding okay for me it's like just functional split or like go in microservices architecture or something and it has some of problems uh which this approach has uh are similar to normal sharding. For me, normal sharding is horizontal sharding. When we don't specify vertical or horizontal, we mean, usually we mean horizontal sharding. Same scheme everywhere, but different data. I completely agree. Same with partitioning. If you don't hear somebody say horizontal
Starting point is 00:09:03 or vertical, they mean horizontal. Right, right. And vertical partitioning also, like, it's not common. Some people use it again. Yeah. It's just, it's an attempt to have some unified terminology for everything. But okay, so we cannot live with one primary node, we are saturated, we tried everything like offloading read-only queries to standby nodes, reduced frequency probably with some caching, with some optimization and so on, we still see that we are growing and soon one node is not enough, it's very painful situation, very scary, especially for CTO and so on,
Starting point is 00:09:45 to suddenly to hit the ceiling. And of course, in this case, usually people choose one of two directions as the main one, like again, vertical split or sharding right away. But sometimes they need to mix. If it's a really large project, you start with like you bet on one approach, but still need to apply the different one as well. Yeah. Right.
Starting point is 00:10:12 So if you like microservices, and microservices is a bigger than technological topic. It's not just technological microservices is organizational topic, you need to change management and how teams are organized yeah how they choose technology they probably choose not progress some people some teams who might choose not but something else and so on so it's a bigger than just technical discussion but if you choose microservices probably you don't need sharding right it's they they microservices approach either solves it this problem of saturation and inability to scale or it just postpones it so much that you have like five years or so right maybe like the thing the thing i don't understand about microservices is
Starting point is 00:11:00 like what if you've got one thing that's very difficult to split logically and that is the most heavy ingest rate of everything? So you could easily have quite a small team looking after one huge node in the microservice infrastructure. So I suspect you could still hit this. If you've got a load of IoT sensors or something, you could get a lot of data very quickly. So yeah, I think it's possible even in the microservices. I cannot agree here.
Starting point is 00:11:27 If you, for example, e-commerce and a huge one, like a leader, a continental leader, for example, and I have a few examples, Postgres-based, which I worked with directly or I just learned from them based on discussions with the people involved. So if you choose microservices, and e-commerce, most of e-commerce systems somehow tend to choose microservices approach. They love it.
Starting point is 00:11:52 I mean, engineers, backend engineers and managers, they usually choose microservices. In this case, it's very hard to imagine that one of services, which usually you have something related to registration inventory orders like and so on many many services in typical e-commerce it's hard to imagine that one of them will require sharding right away you need to grow a lot to see the need in sharding of one of those services it should be really. I agree for e-commerce. I think there are some sharding
Starting point is 00:12:27 or things that get called sharding that are horizontal but have a different primary driver, I think. And I think that's analytics query performance. Ah, okay, okay. I silently reduced the topic to OLTP as usual for me. Yeah, okay. But I agree, OLAP, analytical systems, they can have a lot of data,
Starting point is 00:12:53 and usually sharding there is definitely a good thing to have. So I agree with you here, definitely. Cool. So a couple of reasons, at least uh for wanting this but yeah i think you yeah how is probably a good thing to move on to like what are our options right how so unlike my sql world where vtess exists we don't have vtess in postgres and attempts to migrate vtess to postgres failed i know a few and the developers of vtess announced that they don't pursue this goal anymore but we have a few options situs first of all and again like i i had
Starting point is 00:13:36 a joke about we are big experts because i consider my myself not an expert in sharding at all because i reviewed situs and played with it a couple of times in the past, but it was before Microsoft has decided to open source everything. So I always considered only the open source part because I didn't want to be to have vendor lock in to Azure. But right now we have interesting situation, they open sourced everything. So it's worth reviewing once again, especially the feature, the very important feature for large and growing projects, rebalancing without downtime. It's super
Starting point is 00:14:16 important for sharding, because you never know, which node will grow, it's hard to predict. So you need some tools to change distribution of data in the future, but without huge downtime. And this feature was originally only in paid version of Citus. But now it's open source as well. But I like up to date knowledge. For me, if we talk about LTP, I usually asked, I talked to them to Cytos team couple of times and asked please provide cases of pure
Starting point is 00:14:50 OLTP good like heavily loaded systems with OLTP workloads but everything they provided that time it was like 2, 3, 4 years ago it was looking as HTAB to me not OLTP you understand what so hybrid hybrid transactional
Starting point is 00:15:08 analytical for example some search an analytical engine for videos or something where it's okay like only limited number of users and they are motivated to wait a couple of seconds in this case it's okay i mean to have some latencies. But in OLTP, we have only usually dozens of milliseconds, or just hundreds of milliseconds, but not a second. A second kills our traffic and people go away. So we cannot wait, we cannot allow waiting so long. And when I benchmarked myself, Cytos was not behaving very well for oil tp but then i've got some response on twitter from developers that i do some things wrong so very likely i did some things wrong i mean i i was trying to measure latency overhead this is my
Starting point is 00:15:59 favorite test for such systems because the biggest problem for performance when like splitting is not that difficult but when you have something that decides which shard to go router right it adds overhead because it needs to parse the query and parsing query and so on requires time so this adds overhead and for me in all tpks it's unacceptable to have overhead more than a millisecond. Millisecond is quite big already. What were you seeing? I don't remember details. Okay. It was a few years ago. Imagine if it was, and I'm not saying it is, imagine if it was tens of milliseconds, you would deem that completely unacceptable. Yeah, definitely, because it's already very close to human perception threshold, which we know 200 milliseconds.
Starting point is 00:16:49 And we need probably multiple queries to serve one HTTP request. So I cannot accept that overhead at all because I know my backend engineers will add milliseconds on their own. They know how to do, how to add more milliseconds. So I cannot allow this proxy
Starting point is 00:17:06 middleware, right? I cannot allow it to have significant overhead. But again, my benchmarks very likely were not ideal because of this feedback. And I never tried one more time. Probably it's time to benchmark again and see the overhead of sito's proxy and you're a few years out of date right if they open sourced it a little while ago so you're at least that much out of date and i think the latest version included schema level sharding which seems quite interesting for some like OLTP type split. Yeah, exactly. So that it's quite vertical, right? If you if you were considering if someone's considering sharding, and they're on Postgres, they're gonna look at Citus, they should look at Citus. But there are other options as well,
Starting point is 00:18:00 right? Like that's the I will I would revisit this decision right now for ltp heavy loaded ltp system but again benchmarks are needed once again and probably some tuning and so on and i still don't know very good examples of ltp systems which use cytos but i i'm not closely watching i looked at their customer look look at some of their customers, and some of them are very analytics-heavy, so like Algolia for search, Heap. I believe Microsoft used them internally for quite a few things. But Freshworks stood out to me as potentially, like that's a help desk ticketing system, right?
Starting point is 00:18:40 Sure, they need to search as well, but a large part of that is OLTP. So that seemed interesting to me as well, but a large part of that is OLTP. So that seemed interesting to me. Yeah, that's interesting. But analytical area, let's look at Hydra, which recently released 1.0. Congrats to the Hydra team. Interesting, column-based. And they inherited this code from Cytos for column store.
Starting point is 00:19:06 So it's interesting i looked at the hydra website in preparation expecting them to have something around sharding but i think they've fought situs to do the other parts of what situs does well like the analytical processing but no no mention of sharding anywhere on the hydra website maybe at some at some point at some level you don't really need it if you have column store and vectorized processing. So I'm not an analytical guy, I'm not a sharding guy. I honestly don't understand what we are talking here about. Let's move on because we do have some really interesting cases and write-ups and blog posts from a recent OLTP company. Notion and Figma both blogged relatively recently
Starting point is 00:19:46 in the last couple of years about sharding Postgres. In a way which I call application server sharding, A-S-S, which I also implemented myself a few times. And this is what you usually choose because you don't have proper tooling. And we probably cite this as proper tooling for LTP. Again, maybe someone has good experience or knows about some good experience. I would like to know.
Starting point is 00:20:11 Please comment anywhere, like on Twitter or on YouTube or anywhere. But application server sharding or application level sharding, I like application server sharding because it's about this side. Application side sh side sharing. Sorry. This is challenging and requires an effort. Definitely. And usually it's quite easy to split and so on. But you need to think about several things as usual. First is this rebalancing in future you will need it definitely. And how to do it without downtime second is how to avoid distributed transactions for example it's absolutely bad idea to have multiple connections to different primary nodes and work with them like in some messy way you start transaction you
Starting point is 00:21:00 work with different connection it also has transaction. If you do that, if you absolutely need it, you need two-phase commit, 2PC. But it's slow. So you cannot have thousands of TPS on 2PC at all. It's impossible. So it's very slow because it has its own huge overhead. So usually in all TP context, we try to avoid to pc unless absolutely needed right and finally you also need to take care of this router and should have small overhead one millisecond is good probably that's it high level yeah i i think the notion blog post in
Starting point is 00:21:43 particular there's two there's two blog posts actually from notion one from 2021 where they initially did this and they shared a huge amount of detail in the preparation how they chose things how they were preparing for resharding late today rebalancing sorry and then there's a follow-up blog post from this year from them talking about resharding without doubt to all with i think less than a second of noticeable impact on users which is incredibly impressive but they they um they might have even used the word partition key or something they chose to shard based on what's it called um like workspace because people aren't ever people aren't ever looking for information from two workspaces at the same time.
Starting point is 00:22:27 So you don't have that same problem. Yeah, that's actually the same as with partitioning. It's very difficult. And I saw cases when people spend a lot of time trying to find the ideal partitioning key or sharding key in this case, or a set of keys. For example, unlike partitioning, where wearding key in this case or a set of keys for example unlike partitioning we where we can for example choose one key and it's enough we also need to think about how right workload will be distributed among multiple nodes here and for example we know
Starting point is 00:22:59 timescale cloud they they unfortunately not an open source they have basically sharding as well and there you need to have two keys one is time based timestamp but it's not enough why because if you just use only timestamp you will have hotspot one shard will be receiving all most of the rights all the time so you need additional for balancing you need the second level of Second key basically is a part of complex key So for example a workspace ID could could work here as well time scale multi nodes a really good point actually I was looking up and it looks like you can self host it. So even if it's not open source by different It's interesting news to me. Okay According to the yeah, I was reading some I'd only I also thought that wasn't true until I checked the docs.
Starting point is 00:23:48 What is the license? I didn't check, sorry. They have two licenses, Apache and Timescale, which is not open source. I very much doubt it's not. I don't think they're doing anything on the Apache one anymore. I think that's their old stuff. But anyway, let's not guess.
Starting point is 00:24:04 I'll link up with the docs and people that are interested can look into it themselves this episode is full of guessing but let's let's return to this main topic you need to take care of these things and you need to think about as usually when you're architecting something you need to think about how much of data you will have in five years and how will you approach rebalancing with minimal cutover time so you need probably involve logical decoding logical replication it's improving in latest postgres versions but it was a surprise for me in one case when it was not about sharing it was about vertical split of a quite big system. And I was considering
Starting point is 00:24:47 logical replication right away to perform a split. But split was like in two parts, like 50-50. And in this case, it's easier to use just physical replication and then drop tables on both sides, which are not needed. Because it's easier to install. It has less bottlenecks, fewer bottlenecks, and so on. So in some cases, just physical replication, and then you drop tables you don't need. And also, a lot of balancer and lag detection is interesting there. So let me add one more item, which is quite a huge item. If you go, it doesn't matter, microservices or sharding, you're going to have many more nodes.
Starting point is 00:25:34 And in this case, operational side, you should be better. Autofailover, backups, provisioning, balancing, everything should work much better more reliably and in this case it means that you need to simplify if you rely on managed postgres probably it's also okay but you need to trust it 100 and so on but if you manage postgres yourself before increasing the number of primaries you need need to unify, for example, naming convention. It takes a lot of time if you have different schemas of naming of hosts, for example, for different parts of your system. And then you design some very good tool, which, for example, performs minor or
Starting point is 00:26:17 major upgrade or something. And then you bump into issue of deviations. So you need to simplify, unify everything, because you are going to have many more nodes now yeah it's a great point it's not cheap to add sharding adds a lot of complexity so it's a good idea that's a nice point to simplify it's not about sharding it's about just growing fleet in terms of clusters you have more postgres clusters, so you need unification and simplification and so on. So I actually just thought of thinking about the other options out there. I do remember hearing and reading up on EDB's product, it kind of in this area called Postgres Distributed.
Starting point is 00:27:00 Now that kind of raises the point of a different use case for sharding, which is, well, one of the big advertised features of that is being able to keep data local to a region, for example so like if you want to keep your data bi-directional replication involved so it's the new let's detach this topic because it's very different and specific it's not sharding
Starting point is 00:27:21 well, but by our definition of one logical database split into multiple physical databases it kind of is but then you need to replicate in both directions and so this is based on the claim that replaying logical replay is easier than initial apply of changes. And so far, I haven't tested myself. I saw it only in BDR documentation, but I don't think it's so. I think it's more marketing claim. I didn't see benchmarks.
Starting point is 00:28:00 So let's keep aside bidirectional replication completely and return to this topic in a separate episode. And because we also don't have time for it right now, I also wanted to mention two different tools. First is PGCAT. If you want to shard yourself, PGCAT already offers
Starting point is 00:28:17 a simplified approach because it provides sharding in originally provided sharding in explicit form. Application needs to say, okay, this needs to be executed. And I know on which shard in the comment. So like basically just some helper tool. But you need to take care of a lot of things yourself. But I saw this, they improved, improved it and some kind of automatic routing already there.
Starting point is 00:28:43 And also overhead is quite low. I tested that long ago again, like a year ago maybe. And overhead was not bad at all for LTP. It's written in Rust, so quite performant. Interesting. I would look at it. And it's been developed quite at good pace. And another is SPQR. Also under development, I watch changes in both repositories
Starting point is 00:29:07 and I see a lot of development happening in both projects. And this project was developed with the idea of having more automated sharding tooling, similar to VTGate and VTES. Yeah, would you call that... Do we need another name for it? Is it almost like pooler-level sharding? If we've got application-level sharding, is would you call that it would do we need another name for it? Is it almost like pooler level shot? If we've got application level sharding? Is this pooler level? Well, well, yes. Well, if in this case, we can distinguish
Starting point is 00:29:35 like application level sharding application side sharding, it's when backend engineers is responsible for everything, basically, or almost everything additional like software can be can be put like in transparent fashion yes we need to take care of rewriting some queries because we don't want to deal with multiple shards at the same time yeah yeah often but we can distinguish at least two two types of this middle, which helps us with sharding. First is like, right, like VTGate style or this SPQR style or PGCAT when something lightweight is placed, and it doesn't include Postgres code, or at least majority of Postgres code. In this case, this tool needs to parse queries, Postgres code. In this case, this tool needs to parse queries,
Starting point is 00:30:26 Postgres queries. Postgres syntax is very, very advanced. So it's a challenging task. Grammar is huge. And another approach is placing a whole Postgres node in between. And in this case, I think latency overhead
Starting point is 00:30:39 is quite significant. And this is what PL Proxy in Skype 15 years ago, developed more than 15 years ago was doing and it's quite interesting approach I'm not sure about overhead but it has limitations that you need to rewrite all your queries in the form of server functions because it's kind of language PL proxy it's a language which similar to map reduce approach but overhead is interesting but at the same time skype was altp definitely and overhead requirements were quite strict
Starting point is 00:31:13 it's again it's similar to pgq a lost art or lost knowledge well if you like i was thinking we could end on like where do we think the future's going a little bit and i think vites is a really interesting case i do think we're maybe at a point where the postgres based startups that started in the last 10 years or so are really at a scale that youtube were at when they started needing this and they started building Vitesse. And I wonder if maybe that is what we're starting to see with the likes of PGCAT, that we're starting to see some of these companies that have been built on Postgres really needing some better tooling here for their own use cases.
Starting point is 00:31:58 Or, you know, as you said before, the bazaar of different options, lots of people will build their own tooling and maybe one of them will emerge like vites did for mysql and we'll have that in 10 years time it'll be the same place maybe i don't consider pgcat as a sharding solution it's more like puller but i know with very lean approach to development when they decided why not having this, for example. And unlike PgPool, they don't aim to solve every problem completely. For example, this explicit approach when you, as a backend engineer, is putting a comment on which shard to execute, so you're responsible for routing.
Starting point is 00:32:41 It's manual routing, right? It works quite well well it's a simple feature why not right so i think with posgas somehow very good sharding solution never existed and vtes has many features which they usually are overlooked for example asynchronous flow of data between different parts of system. And if we consider a huge project, we shard different parts differently. So basically, we're already splitting two services, right? So we need to shard users in one way and products or orders in a different way. In this case, we have basically two vertically split parts and already then we split them
Starting point is 00:33:27 horizontally. But in this case, we want to avoid dealing with multiple nodes. So why not having some kind of materialized views, which will would bring data asynchronously with some delay, but quite reliably. So we could work with local so i mean if you imagine some sharding system with with ability to have different sharding keys for different parts of schema plus you add materialized views with ability to be incrementally updated not not fully. And the ability to be incrementally updated asynchronously between nodes, plus also global dictionary,
Starting point is 00:34:09 because sometimes you need some global dictionary and also ability to synchronize everything. This is something like Vitesse has already in my SQL world. But Postgres somehow don't have it. And I think it's possible to build it from bricks it will take time, all these things but somehow
Starting point is 00:34:29 full-fledged solution doesn't exist again, I'm lagging in terms of Cytos understanding because for me there is a requirement on the overhead if it's not passed, I'm already looking at different direction right now now I would
Starting point is 00:34:46 revisit sitos revisit pgcat and spqr and if tested in my for my case and then go with application side sharding once again unfortunately last question from me I think I saw for to have complete picture there's also a third type of sharding when we maybe fourth type of sharding when we don't shard in ourselves when we don't have middleware but we decide to split our system to multiple systems if you talk about sharding you talk usually about database split but what if we split whole system? For example, we had only one website, but now we have 10 websites. But they have, for example, unified authentication subsystem. So you have one login, it works on each one of those 10 sites.
Starting point is 00:35:40 And this can probably solve your you mentioned this problem, to have data closer to your customers. For example, you could have one website for one country, another website for a different country. Maybe you have hundreds of websites. They have a single login system, and they have different databases. In this case, you split horizontally also, right? But you split not at database level, but at whole system level. So they have everything separately. What about this approach?
Starting point is 00:36:10 Yeah. And by the way, it's not just about latency of having data close to users. It's also about residency, privacy, legals. Yeah, exactly. But this approach is interesting
Starting point is 00:36:23 for SaaS systems systems maybe like notion yeah well but interestingly they didn't like they've uh there are a lot of benefits of keeping it as a single application right simplicity of management or like having let's say i'm part of one company in the u.s and part of a different company in Europe, I can, I can have my work so I can log into the workspaces within the same UI very easily. So I suspect you can do it even if it's two different applications, you just need a single authentication authentication system, it's possible to do. So you use one login, it's for Google do so you use one login it's for google services you use one login for many many different services right it's it works similar here we can have
Starting point is 00:37:12 different systems they are the same if we forget about the legal details they are all the same but you'll log in to all systems using single login it works everywhere like seamlessly. Yeah, I don't know if it would be quite as easy. It's not easy. For example, okay. It's not easy, but you can grow a lot using this approach. Like scalability is infinite. I think I saw a talk by GitLab talking about splitting systems out.
Starting point is 00:37:44 I think there might be a recorded version of it, but I suspect you know more about it than me. Which route did they go? Well, disclaimer, they are still my customers. Yeah. So I would recommend checking their documentation. It's very transparent. Most of the things are public.
Starting point is 00:38:03 They split vertically first. And it's obvious for systems like GitLab because if you think they have a lot of functionality, but some kind of functionality is quite automatic and it's related to CI-CD pipelines. And this is exactly what was moved to different database, CI-CD data. It's coupled very well inside CI part, but it's not that connected to different database. CICD data. It's coupled very well inside CI part,
Starting point is 00:38:26 but it's not that connected to different parts. Oh, by the way, check out what they needed to do with foreign keys when performed this split. Because they needed to create a new concept. I forgot the name, but basically how it works, it's like a synchronous foreign key check
Starting point is 00:38:46 between two databases. So it's quite interesting concept which helps you preserve at least something when you move data to different nodes. And this is useful. This experience is useful probably for other systems as well. And right now I think they go to direction
Starting point is 00:39:06 probably I don't need to discuss because it's still under development. But you can check their website documentation and open issues, a lot of information is open. Yeah. I mean, foreign keys are notoriously difficult. I think that's something Vitesse still is working on. It might be soon even.
Starting point is 00:39:25 I forgot the name, unfortunately, but there is some name GitLab invented, introduced some new concept. We can find it and link it up, right? Yep. Awesome. Thanks so much, Nikolai.
Starting point is 00:39:40 Thanks, everybody, and catch you next week. Bye-bye.
