The Changelog: Software Development, Open Source - The story of Vitess (Interview)

Episode Date: April 12, 2022

This week we're joined by Deepthi Sigireddi, Vitess Maintainer and engineer at PlanetScale — of course we're talking about all things Vitess. We talk about its origin inside YouTube, how Vitess handles sharding, Deepthi's journey to Vitess maintainer, when you should begin using it, and how it fits into cloud native infra.

Transcript
Starting point is 00:00:00 What's up? Welcome back. This is the Changelog. Thank you for tuning in. If you're new to the pod, head to changelog.fm for all the ways to subscribe. If you're a longtime listener, check out our membership at changelog.com slash plus plus. Directly support us, make the ads disappear, and get access to bonus content on our shows. By the way, on this episode, our Plus Plus subscribers are getting an extra 12 minutes. Jared saved the elephant in the room question just for our Plus Plus subscribers. And today we're talking with Deepthi Sigireddi, Vitess maintainer and engineer at PlanetScale. Of course, we're talking about all things Vitess. We cover its origin inside YouTube, how Vitess handles sharding, how it scales,
Starting point is 00:00:45 when you should begin using it, how it fits into cloud native infra, and of course, that extra 12 minutes of runtime for our Plus Plus subscribers. Big thanks to our friends and partners at Fastly for keeping our pod super fast all around the world. Check them out at fastly.com. This episode is brought to you by our friends at InfluxData, the makers of InfluxDB. In addition to their belief in building their business around permissive-license open source and meeting developers where they are, they believe easy things should be easy. And that extends to how you add monitoring to your application. I'm here with Wojciech Kocjan, the lead maintainer of Telegraf Operator for InfluxData. Wojciech, help me understand what you mean by making monitoring applications easy.
Starting point is 00:01:35 Our goal at InfluxData is to make it easy to gather data and metrics around your application. Specifically for Kubernetes workloads, where the standard is Prometheus, we've created Telegraf Operator, which is an open source project around Telegraf, which is another open source project that makes it easy to gather both Prometheus metrics as well as other metrics such as Redis, PostgreSQL, MySQL, any other commonly used applications, and send them wherever you want to. So it could obviously be InfluxDB Cloud, which we would be happy to handle for you, but it could be sent to any other location like a Prometheus server, Kafka, any other of the supported plugins that we have. And Telegraf itself provides around 300 different plugins. So there's a lot of different inputs that we can handle. So data that we could scrape out of the box, different outputs, meaning that you can
Starting point is 00:02:22 send it to multiple different tools. There's also processing plugins such as aggregating data on the edge so you don't send as much data. There's a lot of possibilities that Telegraph could be used to get your data where you are today. So we've permitted metrics, but you can also use it for different types of data. You can also do more processing at the edge and you can send your data wherever you want. Wojciech, I love it. Thank you so much.
Starting point is 00:02:46 Easy things should be easy. Listeners, Influx Data is the time series data platform where you can build IoT, analytics, and cloud applications. Anything you want on top of open source. They're built on open source. They love us. You should check them out. Check them out at InfluxData.com slash changelog. Again, InfluxData.com slash changelog. Again, influxdata.com slash changelog.
Starting point is 00:03:34 DT, thank you for joining the show. Been a fan of Vitess, mainly by way of PlanetScale. Like, you know, I'm not that deep and steeped in the horizontal scaling of MySQL myself personally, but having had a conversation with Sam Lambert on Founders Talk, really gained a lot of respect for his leadership, his new role as CEO there at PlanetScale, and really just this story of Vitess and how it's really just like doing tons of things for like YouTube and all that stuff to really make scaling the MySQL database truly possible. So welcome to the show. Thank you for coming. Thank you. What's the best place to begin when talking about Vitess? Should we explain the tech itself? Like where should we begin the story of Vitess? I know it began in YouTube back in 2010, but what's a good sweet
Starting point is 00:04:14 spot for you? We should start with the history, why Vitesse came into being, and then how people outside of YouTube saw Vitesse and why it got adoption outside of YouTube. So if we go back to the beginning, Wittes was started at YouTube, as you already know, and that was back in 2010. By the way, as an aside, YouTube has migrated off Wittes. So they are no longer the largest user of Wittes or even a user of Wittes anymore. But Wittes did start at YouTube. And the reason Wittes even came into being was because YouTube was growing really fast and they were running their video metadata store, not the actual video content, but just the titles, descriptions, comments, that sort of thing
Starting point is 00:05:05 associated with videos on multiple MySQL instances already. So they weren't on a single MySQL instance. They were managing about eight MySQL instances and they were having regular outages. So the people who started Wittes, which includes Sugu, who's still working on WITES at PlanetScale, and some others who are still at Google slash YouTube or have left Google and moved on to other things. So the team that started WITES, the problem they were trying to solve was this system keeps going down. It keeps having outages. How do we not have these outages? So they sat down and made a big list of all the outages they had experienced in some previous time horizon, whether it was a year or two years or whatever it was. And when they looked at that
Starting point is 00:06:01 list and thought about how do we solve this, the only answer they could come up with was, we need to write a new piece of software to solve this because there is no existing way to solve this. So that's how WIT has started. And it started with solving that specific problem of we are having database outages. How do we solve that? And they did it by putting a layer in front of the MySQLs, which will do connection pooling, which will put limits on the number of results returned by a query. A query might return 100,000 rows. No one's ever going to look at 100,000 rows. So why? And consolidating queries. So you may have thousands of clients trying to read the same row. You don't actually have to fetch that row thousands of times from the database. You can fetch it once and serve it to thousands of clients. So these were the sorts of things that were initially built as WITS at YouTube. And then over time, sharding an existing database became functionality that was needed and built into Wittes.
Starting point is 00:07:09 But for about the first five years of the life of Wittes, YouTube was the only user. Even though Wittes was open sourced in 2012, it was open sourced very early on in the journey to today. But for the first five years, YouTube was the only user. But at that point, there were other companies that faced similar scaling challenges and discovered that there was something called Wittes, which was open source. And the way they found out about it is because Sugu and a couple of other people used to go and give talks. They would give interviews to journalists who would write about it. So this knowledge was out in the public domain. And around 2015, some other companies, which also had scaling problems, started looking at WITS, trying it out, and then started going into production.
Starting point is 00:08:04 So when it comes to horizontal scaling, it seems like there's the first step that many companies take, which is like, let's separate our reads from our writes. And we can have a bunch of read databases and then a primary or some sort of write or a set of writers. And then that's like one way to get going at a small scale. You said they're already on maybe eight databases at the time. They had scaled some already. And then sharding is another strategy, which seems to come maybe at the same time or later.
Starting point is 00:08:34 Can you describe sharding just for so we're all on the same page? And then we'll see how it fits into Vitesse's story as well. But just like the concept of sharding, how it works generally, and maybe some of the challenges that it adds. So the idea of sharding is that you have a whole set of data and then you want to break it up into pieces and store them and handle them separately. So a very simple example would be, let's say you go to a conference and you are trying to check in and they have these different counters or booths with,
Starting point is 00:09:07 if your last name starts from A to M, you go here. If it starts with N to Z, you go here. So that's a very simple example of sharding your data set. So in terms of what it means for WITS, if we take any application that is storing data, let's say you have a set of people, a user stable, right? What you are trying to do is to break it up and store it in different individual databases, but present an illusion to any application that is trying to access the data as if it is a single database. So that's sort of the essence of how you would horizontally scale something like MySQL without leaking the details of how it's actually being stored at the backend. Right. So a naive implementation of sharding would be at your application layer. So as you go to your users table, every time in your application that you're going to
Starting point is 00:10:05 access some users, it would have to maybe like do a lookup on the first letter of their last name and say, okay, this one starts with S. And so I'm going to go to this database that has the S's in it. And that, that gets, that works, right? But at the application layer, it gets very complicated to be doing that all the time. Sometimes you forget how it works or a new engineer, etc. And so you can do that without Vitesse. People do it all the time, build it into your application to shard. But what Vitesse provides is this middle layer that hides that complexity underneath it or tucks it away.
Starting point is 00:10:39 So your application code can remain blissfully ignorant of that sharding strategy. I assume you could even change strategies or deploy multiple strategies. And your code that your developers are writing does not have to get spaghettied around, right? It doesn't have to have all those concerns the whole time. Is that what you're saying? Yeah, that's exactly right. So the strategy you first described where the application code itself figures out which shard to address its queries to. That's called manual sharding, or you can call it custom sharding in some ways.
Starting point is 00:11:12 But like you said, the application has to know and nothing is transparent. What if you initially had two shards and you named them one and two, and they've grown so big that you want to split them further. Now, what do you do in the application? So all of that logic of how the data is stored has to now be spread through every component that needs to know it, including the application code. So that's the downside of the manual sharding approach. It's very high maintenance. But in YouTube's particular case, they didn't even shard at first. They simply, the test started off as like this connection pooling thing first.
Starting point is 00:11:54 Right, right. And so that took them a long ways, right? That's like, that provides some scale. Yeah, yeah. So what you get with connection pooling is that instead of every request, keeping a connection open me the metadata for the video. And then I'm going to watch the video. So for, I don't know, 5, 10, 20 minutes, I may not need to go back to the database at all, just to give a very crude example, right? So what connection pooling gets you is instead of thousands of connections to the backend database, you can say, I only need a handful of connections, whether that's 20 or 50 or 100 for a specific application
Starting point is 00:12:51 that is going against the backend database. But on the front end, you can still serve tens of thousands of concurrent users because there is this natural property to user behavior that database requests are always intermittent. They are not continuous. So you're fighting against those peak moments. You have to guard against the peaks. Yes, yes.
Starting point is 00:13:14 Actually, they're almost the only thing that matters is you have to have the capacity at the peaks because the valleys are just going to be there. So one of the things you can do by having a connection pool is that you can convert load into latency. So if you can serve 10,000 users without any noticeable lag, and suddenly you get 100,000, if you can make them wait, then maybe they will get a response after two seconds instead of 10 milliseconds or 20 milliseconds. But you get the ability to handle that load
Starting point is 00:13:49 just by delaying some of that work. Obviously, if you have a sustained peak, if your traffic doubled overnight and it stays double, then you have to start provisioning more capacity at the backend. But transient peaks you can handle with a connection pooling strategy. Which is nice because on the web, transient peaks are normal and overnight double your capacity is abnormal, right?
Starting point is 00:14:14 Like you don't expect that. Maybe when the COVID lockdowns hit, many companies found themselves suddenly having twice as much traffic at a sustained clip. But other times it's like maybe an Instagram influencer sends a bunch of traffic your way, but they, they're very fickle and they leave quickly.
Starting point is 00:14:31 Right. Okay. So the test starts as this connection pool or inside of YouTube, but then it adds sharding, which is I think a huge deal and probably highly complicated to implement. And then also I I assume, requires a lot of like setup and definitions of like their strategies there. In terms of just the features of Vitesse over the years to now, maybe like the flagships, like those are some things. What are
Starting point is 00:14:58 some other scaling features that it provides? Or are those the two big ones that Vitesse gives you? The other feature I want to talk about is schema changes. So prior to Vitesse providing integrated schema changes, people were using all kinds of tools to manage schema changes to large MySQL deployments. Because you can't just on a production database, which is of any significant size, you can't just on a production database, which is of any significant size, you can't just directly execute schema changes because the effect of those is unknown. How many rows might be affected, whether a table will get locked up
Starting point is 00:15:38 and that translates to application downtime. These are all the sorts of problems that people running MySQL in production faced. And then they built tools to work around these problems. So at GitHub, they built a tool called GitHub Online Schema Tool. And Percona built something called PTOSC. So all of these were meant to work around how schema changes worked in MySQL, especially in older versions, which was not very well. Not only are schema changes not transactional, so you can't roll them back, but their effect on database performance was unpredictable.
Starting point is 00:16:20 So in WITES, in about 2019, we actually started integrating Ghost and PT online schema change. And then we also built our own WITES native way of doing schema changes on MySQL in a safe manner. And the foundation for this online schema change technology or WITES Online Schema Change Technology is the same foundation that underlies sharding. And that's something we call vReplication. So the way sharding evolved, when it was first built, you had to add a column to every table that you wanted to shard and store a value in that and use that value to define which shard that row would go to. But over time, that became transparent in the sense that you no longer added to need an extra column to a table. You could take an existing column in a table that you're trying to shard
Starting point is 00:17:18 and define a function on it and use that function to define how the table is going to be sharded. So I think starting in about 2018, we built this vReplication technology, which now underlies sharding. So there was a previous generation of sharding code, which still exists, but it's called legacy, and we will phase it out. And there is the new generation of the sharding code. And what vReplication does is it leverages MySQL's binlog replication. So in MySQL, when you have a primary replica configuration, there are something called binary logs, which the primary that is taking the rights will write out and replicas will subscribe to those logs and
Starting point is 00:18:03 they will receive those logs from the primary and then they'll apply them to their own databases. And that's how replication works. So what VITAS can do is it can subscribe to those binary logs and filter them. So by filtering them, you're saying, okay, maybe I'm in a resharding operation. So I have this one database, I want to break it into two, I can look at the bin logs and say, this change should go to this new shard, this other chain should go to the other shard. And that enables many different types of workflows. So you may want to do what we used to call vertical sharding, where you want to take a whole table and move it to a
Starting point is 00:18:46 different database. Or you may want to do horizontal sharding where you have a big table and you want to break it into multiple shards. Or you may just want to do something like a materialized view. So there is a table with data, you're interested in a subset of the data for some particular application, and maybe you also want to do some aggregations on that. Maybe you just want a count or a sum of some column, number of orders or total value of orders, right? So materialization is something you can do with V replication. So this technology we built that can look at MySQL's binary logs and process them in different ways for different applications eventually became the foundation for also doing schema changes in a very robust way. That sounds super slick. How long did it take to come to that implementation? You said
Starting point is 00:19:46 2019 was when these things took hold, this vReplication? Most of the vReplication code in its initial incarnation was written by Sugu. And I think it took him about a year to do that. After that, we've had more people working on it. And I would say it took another year for it to get to the point where it was stable and Vitas users were using these V replication based workflows to do sharding on their production systems. And it was at that point that we started building the online schema change functionality on top of it. I just find it fascinating that a tool that began nine years prior, I mean, assuming it's around 2019, like the the aha moment or the idea of like, hey, let's take these binary logs and provide this filtering mechanism.
Starting point is 00:20:34 And that will be how we shard. But then also, by the way, this is a really great way of doing schema changes. And then a year of effort by one person, or I'm sure there's other people involved along the way, like lots of effort to roll that out to develop it and then test it and actually integrate it now it's being used in massive scale it's just amazing to me that sometimes it takes that long of doing it differently or of just toiling or working on other things and then being like oh here's a much better way of doing it that actually solves two problems at once. It's really cool.
Starting point is 00:21:05 So the amazing thing about vReplication is that it was originally created to do sharding at a table level or at a database level, right? But once it came out, and this is something we see every once in a while with Vitesse, we may develop something for a particular use case, but people will start using it for other use cases because you can never think of all the use cases.
Starting point is 00:21:30 Right. So vReplication, for example, we have someone who's using it to create a development copy of their production database that developers can have access to with all the PII redacted. So they select a subset of the columns from the production database tables so that there's no user information, there's no addresses, credit cards, what have you. And developers can then use that redacted database for testing out new features. And they don't have to access the production database directly. Super cool.
Starting point is 00:22:03 You've mentioned Sugu a couple of times now. Can you kind of just share who this infamous slash famous person is behind the scenes? Who's totally, as Jared mentioned, and as you mentioned, DT, can you kind of impart to our listeners who that person is? Sure. So Sugu was at YouTube when Wittes was created. So he was one of the co-creators of Wittes. He was instrumental in open sourcing it. So he says that the reason they open sourced it is because they never wanted to reinvent it. And you know how things are these days. No one stays at the same company forever. So they were looking forward to the day when they would eventually leave YouTube and might have to solve database scaling problems elsewhere. So he was instrumental in open sourcing Vitesse
Starting point is 00:22:52 to start with in 2012. He came up with the name Vitesse. The project at YouTube was called Voltron, but when they open sourced it, they had to change the name because Voltron is copyrighted by whoever owns that character. For sure. Yeah. 1980s nostalgia for sure. And they had written code where components had names like VT something. There were directories in the code with VT in them. So they were like, OK, we have to find a name that still can be contracted to VT.
Starting point is 00:23:30 And Sugu happens to be someone who's fluent in French. So he took the word Vitesse from French. It's spelled with an E at the end and it means fast. And then he took the E out and called it Vitesse without the E. So Sugu was an engineer at YouTube. He was instrumental in creating Wittes. And eventually there came a time when Wittes was a technology, not product. So in that sense, it wasn't a revenue generating project for YouTube.
Starting point is 00:24:03 And there was probably not a lot of appetite to invest into it. At the same time, there were outside users, users who were using the open source version of Wittes, and they needed support. So how do you deal with this, right? So it was at that point that Sugu went and talked to some of the people in the Kubernetes community who had started the Cloud Native Computing Foundation. And they recommended that WITES should be donated to CNCF. And Google was on board with that. Google, from what I know, from what I have seen, is extremely open source friendly. They are a part of CNCF.
Starting point is 00:24:47 They are a major sponsor to many, many open source projects and conferences and communities. So they were on board with that. And Sugu saw the project with us through its donation or adoption to CNCF. And at that point, he left Google and he co-founded PlanetScale with Jitain, who was also a former YouTuber. This episode is brought to you by our friends at FireHydrant. FireHydrant is the reliability platform for every developer. Incidents impact everyone, not just SREs.
Starting point is 00:25:29 Fire Hydrant gives teams the tools to maintain service catalogs, respond to incidents, communicate through status pages, and learn with retrospectives. What would normally be manual, error-prone tasks across the entire spectrum of responding to an incident,
Starting point is 00:25:42 this can all be automated in every way with Fire Hydydrant. FireHydrant gives you incident tooling to manage incidents of any type with any severity with consistency. You can declare and mitigate incidents all inside Slack. Service catalogs allow service owners to improve operational maturity and document all your deploys in your service catalog. Incident analytics like extract meaningful insights about your reliability over any facet of your incident
Starting point is 00:26:07 or the people who respond to them. And at the heart of it all, incident run books, they let you create custom automation rules to convert manual tasks into automated, reliable, repeatable sequences that run when you want. Create Slack channels, Jira tickets, Zoom bridges, instantly after declaring an incident. Now your processes can be consistent and automatic.
Starting point is 00:26:28 Try Fire Hydrant free for 14 days. Get access to every feature, no credit card required. Get started at firehydrant.io. Again, firehydrant.io. so vites has had a long journey inside and now outside of youtube at a certain point you came along and got attached to the project can you tell us your journey, maybe even in your career, and then how it got attached to Vitesse along the way? So I started working with databases back in, I would say, 2000. I did not study databases at all as part of my formal education, but I had to start working with them because that was the way the world was going. I was working at a company that built
Starting point is 00:27:25 supply chain software. And specifically, we were building software for retailers and retailers have a lot of data and they store them in databases. So whatever software we built had to work with databases. And those were the days when we still used to ship CDs to customers. And we were trying to build this large-scale supply chain planning system for retailers, and it had to work with Oracle. It had to work with DB2. At some point, we thought that it would need to work with Informix,
Starting point is 00:27:56 but slowly we realized that no one actually was using Informix anymore at that point. So I had to learn how to write code that worked against databases. All of the data is going to be stored in the database. You're going to fetch some of it into memory, do something with it, write the results back into the database. So along the way, I learned how to write SQL. I ended up writing, I didn't write the full parser, but I borrowed a SQL parser that somebody else had written, and I was using it to understand SQL that was part of our application and emit the two variants, Oracle and DB2, because we did not
Starting point is 00:28:34 want to maintain two versions of queries, and they were queries that had to be written differently for the two different databases. Time went on. I continued working with Oracle and I did that for about 15 years. I switched jobs a couple of times, but everywhere I went, the database of choice was Oracle. So I was working with Oracle databases and I learned how to write SQL. I learned how to tune queries and how to manage schemas. There were always DBAs, but there was always stuff you had to do as an application developer at the database level as well. And we were writing database access objects, database access layers, all those kinds of things. Then in 2018, I was coming off of like a one-year hiatus from work. I had taken time off for personal reasons, family reasons.
Starting point is 00:29:30 So I was unemployed and I was looking to get back into a job. And I was even debating whether I wanted to keep doing software engineering. I had been a software engineer, a tech lead, an engineering manager in previous gigs, or whether I wanted to do product management because I had done a little bit of product management as well before. So I had sort of just started looking around what would I want to do? What sort of companies do I want to apply to? And coincidentally, my husband was actually working at YouTube at that time. And this was after Sugu and Jitain had already started PlanetScale. And YouTube was already on its way to migrating off WITUS. They had started their migration project.
Starting point is 00:30:20 So my husband was at YouTube and he heard about the migration project of Vitas. So he looked up Vitas and he happened to go look at Sugu's LinkedIn page. And then LinkedIn showed him Jitain's LinkedIn page. And so he looked at it and he was done. So the next time Jitain logged into LinkedIn, LinkedIn showed him someone from YouTube has viewed your profile. So he looked at the plot thickens. Yeah. So he looked at Venkat's LinkedIn profile and he was like, oh, maybe this will be a good hire for PlanetScale. We should talk to this guy. So they met, they talked and my husband was not ready to leave Google at that time. But he said, oh, by the way, my wife is looking for a job. So that's how I ended up interviewing with PlanetScale.
Starting point is 00:31:12 And they ended up making me an offer. That's awesome. Before I spoke to them, I was actually not looking at startups. I felt that in terms of workload, career progression, I would actually be much better off working for a big company because tech startups, Silicon Valley startups have a reputation of requiring a lot of hours. And I did not want to work a lot of hours. I did not want to work nights and weekends. But after I talked to the people at PlanetScale, I met Sugu. He gave me a very early demo of vReplication.
Starting point is 00:31:47 It was just a demo. The code had not been written yet. And I was just blown away by what Wittes was and what Wittes could do. And I was like, I must work on this because this is just so cool, right? So I ended up accepting the offer from PlanetScale and I went to work at PlanetScale. And for the first couple of months, I actually worked on the database as a service side, because what PlanetScale was trying to do was to launch a database as a service built on Wittes. So I went to work on the Kubernetes operator to start with. But pretty soon they needed more than one person to be working on Wittes. Sugu was the only person working
Starting point is 00:32:25 on WITS at that time at PlanetScale. And the engineering team was literally four people. So we had Sugu, we had an engineering lead, and two other people. So they said, okay, the engineering lead needs to focus on the PlanetScale side of things, the database as a service side of things. So you are the next logical person to start working on WITS. And that's how I started working on WITS. Awesome. And that was three and a half years ago now. Wow. It's incredible how you can step away from a career for a bit and come back into probably, I can imagine just by the joy you're sharing here as you describe your story. The listeners aren't getting to see your face, but I can see a lot of joy in your face as you describe this journey of your
Starting point is 00:33:08 own to step away and then come back into, you know, not so much a boring big tech job, but something that seems to be startup exciting and maybe the opposite of what you thought a startup could be or would do for you. Right. So I think I had definitely been through some not fun times at an earlier point in my career. I had somewhat burnt out a little bit. I was even questioning whether I wanted to be in the tech industry anymore. But working on WITES, I think, has really brought that joy back. It's given me back the zest
Starting point is 00:33:47 of working on something interesting. You're working on hard problems, but it's not hopeless, right? Because there is progress being made on these basic computer science problems on an ongoing basis. So in WITES, we are grappling with distributed systems and the theory and practice of distributed systems keeps evolving and we can learn from what other people are doing. And maybe we can do something that others can learn from. So to me, that is awesome. Something you said before we took the break and this kind of dovetails a bit into more of the journey is this idea of the test being open source not wanting to write it again in the future i mean that that to me is kind of the core component of open source right like especially something
Starting point is 00:34:35 that's born inside of a large organization like YouTube has become. And I could just see, like, you know, how would the world be right now, given what PlanetScale is doing, and then how it's also supporting Vitess and its journey through CNCF from incubation to graduation, Sugu's journey personally, and the team that's grown up around it. Like, that idea to open source it was profound, because: I don't want to write this thing again somewhere else. I'm going to leave eventually, because, you know, that's how things work. Eventually you're going to go somewhere else and do something else. What about for you?
Starting point is 00:35:11 What do you think about the idea of open sourcing Vitess? Was it just genius to do that? What do you think? I do think it was genius because it happened. Open source is now, I guess, 30 years old. But in 2012, it wasn't as intuitive or as much of a default as it is today. Today, engineers working at any company, they would love to open source their work. It wasn't that way back then. Most companies kept their work proprietary. There is an additional bit of hassle involved in open sourcing something. I recall way back, this is sort of tangential in a way, but this was 2009.
Starting point is 00:35:56 I was at Future of Web Apps in Orlando. I think it was Miami, Florida. And Blaine, I can't remember his last name at the moment. Jared, you may remember. He was CTO, I believe, at the time of Twitter. And Twitter was falling over left and right. And it was based on MySQL. And I can remember, like, he had to leave his talk on stage to go and deal with, like,
Starting point is 00:36:18 a sharding issue, essentially, because it was just constantly falling down. The fail whale was a big meme, all that good stuff. And, like, I just think about Twitter then: had Vitess been a thing in open source, they wouldn't have had to rewrite or do their own thing. That's what I think about. It's like now you've got sort of ultra-massive scale applications happening because just the state of the internet, the state of the web,
Starting point is 00:36:40 the state of applications has just ballooned in terms of adoption, whether it's because of COVID or other things that have happened that made people like sort of like gravitate toward the internet, but, you know, not having to recreate that wheel because it was open source could have saved Twitter in those days. Like if you could have, if Twitter would have been now and that problem would have been now, you know, they would have just used that open source tooling versus like Blaine leaving the talk to go and shard Twitter's database,
Starting point is 00:37:05 you know? Right. That's so true. And so many of the biggest companies were using MySQL. So Google was using MySQL. They had their own build of MySQL. Facebook still uses MySQL. They still have their own MySQL build.
Starting point is 00:37:23 Twitter still uses MySQL. And they continue to do that today, right? Well, Google's probably done with MySQL. I don't know. But Twitter, Facebook... Facebook, we know, still is running their own MySQL. Compare that to companies like Square and Slack, who got to the scale where they needed to shard in, say, 2016. Or they were already sharding in a custom way and it wasn't really scalable. The strategy wasn't scalable.
Starting point is 00:37:57 Operating the system wasn't scalable. And at that point, when they looked around, they were able to say, well, there is this piece of software, which is open source. So if it doesn't quite work for us, we can contribute and make it better. And it's already been proven at YouTube scale. So why not? Right? Yeah. And that's, in fact, exactly what happened when companies like Square and Slack adopted Vitess.
Starting point is 00:38:23 Vitess was built for YouTube's use case. So there were definitely going to be things that didn't work for them. And they actually started contributing back to the code base for the particular use cases that didn't work. And that made things easier for the next adopter who came along.
Starting point is 00:38:45 So that we are now at the point where typically people can adopt Vitess without having to contribute anything back to it. If they want to, sure. But the sort of gaps people used to find back in 2015, when the first companies, the first non-Google users, came along, don't exist anymore. And that's because the various adopters have contributed back to open source and made it better for everyone who comes after them. What if you were starting a business today? So Vitess seems like it unlocks operational horizontal scale at the cost of what starts off as complexity, additional moving parts, setup, time investment.
Starting point is 00:39:35 And so for many companies, perhaps it is a premature optimization. This is where I'm asking, I'm not saying, to start with Vitess right away. Yeah, I have to agree that starting with Vitess and trying to run it yourself is definitely a premature optimization. Anyone who's starting something today is better off choosing whatever is easiest to deploy and manage. And it's not that MySQL was the easiest thing to deploy and manage back when these companies that are huge today were startups 20, 25 years ago, but it was cheap. And that's why they started with MySQL, right? But today people have a lot more options. You don't have to run your own database anymore. And in fact, a lot of startup founders are developers who have grown up with AWS
Starting point is 00:40:27 and they instinctively go to RDS. They'll say, I'll just provision an RDS MySQL or an RDS Postgres and use that. Right. And let Amazon deal with the scale. I'll pay Amazon to deal with my scaling problems, right? Right. So that works sometimes.
Starting point is 00:40:46 Sometimes you do the math and you get to a certain scale where that AWS bill is unwieldy and you decide, okay, maybe we could run our own infrastructure, or maybe we could take this off AWS and run it better ourselves, cheaper, et cetera. Are there companies doing that? I'm just trying to think of who adopts Vitess. Is it the entrenched, large, super-scale companies that have MySQL? Are there new MySQL companies that are using it today and they just showed up out
Starting point is 00:41:22 of nowhere? Is the community well set? Most Vitess adopters tend to come to Vitess because they have a scaling problem. Right. So even if you're using RDS, it won't scale beyond a certain point because it's MySQL. It is not sharded MySQL. Right. So that's sort of the path to Vitess for open source adopters of Vitess. Yeah. So you know you need it by the time you need it.
Starting point is 00:41:47 You feel the pain. It's probably the leading option in the space. I don't know if there's competitors to Vitess in terms of other ways of horizontally scaling MySQL. I don't believe there is anyone else or anything else you can use to horizontally scale MySQL right now. Vitess is compatible with MySQL-alikes like MariaDB and such. But how MySQL-specific is it?
Starting point is 00:42:11 Like, could it be abstracted to all relational databases and you get Postgres users, or no? It is actually very MySQL-specific, because we are managing MySQL, right? So we are saying: in Vitess, you're always running in a replicated mode. You have a primary and you have replicas, and Vitess knows which is the primary and what are the replicas. And if there is a failure and you're recovering, then you are changing that configuration.
Starting point is 00:42:39 You now have a different primary. So Vitess is managing all of that, which means Vitess is managing replication at the MySQL level. And we are hooking into that replication to do resharding. And we are providing that illusion of a single database,
Starting point is 00:42:55 which means clients talk to us using the MySQL protocol, or they can use gRPC, but we masquerade as a single MySQL server. So that SQL grammar and query language support is also MySQL-specific. It does not have any non-MySQL constructs in it. So it is fairly well tied to MySQL. Fair enough. Sometimes that's what you do. You just carve out your area of the world and you say, we're going to do it this way and we're going to serve this group of people and it's going to be awesome.
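A rough sketch of the bookkeeping Deepthi describes, where Vitess tracks which MySQL instance is the primary and rewires that on failure, might look like the toy model below. This is not Vitess's actual code or API; the class name, instance names, and failover policy are invented purely for illustration.

```python
# Toy model of the per-shard replication topology Vitess tracks:
# one primary, N replicas, and a failover that promotes a replica.
# Names and structure here are illustrative, not real Vitess APIs.

class ShardTopology:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def failover(self):
        # When the primary fails, promote the first available replica
        # and update the topology so clients keep seeing "one database".
        if not self.replicas:
            raise RuntimeError("no replica available to promote")
        self.primary = self.replicas.pop(0)
        return self.primary

shard = ShardTopology(primary="mysql-1", replicas=["mysql-2", "mysql-3"])
shard.failover()       # mysql-1 is gone; the first replica is promoted
print(shard.primary)   # mysql-2
```

In a real Vitess deployment this state lives in a shared topology service (etcd, ZooKeeper, or Consul) rather than in process memory, so every proxy agrees on who the primary is.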
Starting point is 00:43:27 And that's just the end of that story. Right. So this is a question that we get asked on the PlanetScale side as well. Are you going to support Postgres? And the answer to that is, at some point, you will stop caring about the particular SQL dialect and you will choose it as a data store. This is a database. It's not very difficult to learn the particular language
Starting point is 00:43:53 that it understands. It's SQL. Right. You just use that. Just got to wait it out. You only do that if you're getting something else in return. Yeah. And that something else is ease of use.
Starting point is 00:44:07 It has to be easy to use, easy to deploy, easy to run. Yeah, reliable, seamless. You just don't care anymore because it's always there. It scales for you. It's easy to use. And you just use the language that it provides. One thing, Jared, when you asked about would you start using Vitess: as you know, Sam Lambert (Sam is the CEO of PlanetScale)
Starting point is 00:44:29 was on Founders Talk, and he shared a sentiment on that front, which is essentially, like, if you're going to use Vitess, and as you had said, Deepthi, to not start there, well, PlanetScale is designed to be the beginning. So rather than begin a brand new application on Vitess,
Starting point is 00:44:46 with all the complexity, when what you really want in the early stages of a startup or an idea is simplicity, he was saying to begin with PlanetScale, because it is Vitess as a service, plus plus, DX, UX, all that good stuff; that that's the starting point. Not to promote that by any means, but that's something he had said on the show
Starting point is 00:45:05 in terms of a starting point, and that's by design. And it's because of Vitess being a thing that PlanetScale exists. So just to put that there. When someone's trying to adopt Vitess, what does it take to adopt it? What's the stack like? How does the stack change? What is it like to deploy it? How does it run? Is it its own server? Like, what is it like?
Starting point is 00:45:31 So there's definitely a hardware cost to it when you start deploying Vitess. So typically, what people will do is they have their MySQL instances already, and they'll just put Vitess in front of it and start routing the application queries through Vitess. So obviously, you have to run these additional components that Vitess brings in. You run a bunch of proxies, which will receive the MySQL queries and connections from clients, and then pass them to the tablets, which will eventually execute those queries on MySQL. So there is a hardware cost associated with it. There is also compatibility. So Vitess is MySQL-compatible, but MySQL syntax, query syntax, and constructs keep evolving. So Vitess started, I think, maybe on MySQL 5.5. And we now support almost all the constructs in 5.7, but not everything that is in 8.0,
Starting point is 00:46:43 because a lot of new syntax was introduced in 8.0. And where this becomes hard: it's easy if you're unsharded. You sort of just have to understand the syntax and you can pass it through. But in a sharded system, what you end up doing is that you take each query and you have to plan that query. And you have to say, which of the backing MySQLs should this query eventually go to? Because it's not going to go to all of them. It's going to go to either one of them or a subset of them. In rare cases, yes, if you do select star from a sharded table,
Starting point is 00:47:18 you have to go to all of them. But usually you want to go to a subset or a single one. And query planning is where we figure this out. So what that means is that any new syntax or construct that MySQL introduces, we have to be able to understand that. Because maybe it's a function and the function is being applied to the results of multiple rows, and you have to know how to implement that function in Vitess in order to still provide that transparency. What's another example? Or it's a join, and if it's going to be a cross-shard join,
Starting point is 00:47:57 then you have to do some of the processing in memory after fetching the results from different individual MySQL databases. So all of this is happening in Vitess, which means that to this date, there is a compatibility gap between Vitess and MySQL. And we are trying to close that. So we started the project, the compatibility project, formally in, I have to think, I think January of 2020. So it's been two years. And we are still working on it. And we have a sub-team at PlanetScale, all of whom are Vitess maintainers, who focus on closing this compatibility gap. And the way we did it was we
Starting point is 00:48:43 said, okay, we'll take some popular developer frameworks like Ruby on Rails, or try to run WordPress on Vitess, and look at the queries that are being executed. And especially with frameworks, they execute some preamble queries, some information schema, metadata queries, and so on. And then we start adding support for those things. But to come back to the original question,
Starting point is 00:49:04 which is what does that journey of adopting Vitess look like for someone who starts on Vitess? You may find that there are some queries that you are using which don't work with Vitess in a sharded mode. It's much rarer with unsharded, but there was a user on open source who said, oh, I'm using common table expressions. They don't work with Vitess.
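The planning step Deepthi described a moment ago, where each query is routed to one shard, a subset, or all of them, and cross-shard results are merged in memory, can be sketched roughly like this. It is a toy illustration only: the shard map, the hash choice, and every function name here are invented for the example. Real Vitess resolves the sharding key through a vindex to a keyspace ID and routes by keyspace ID ranges.

```python
# Simplified sketch of sharded query routing (not Vitess's real planner).
import hashlib

SHARDS = ["-40", "40-80", "80-c0", "c0-"]  # shard names in Vitess range style

def shard_for(sharding_key):
    # Deterministically map a key to one shard via a stable hash.
    digest = hashlib.md5(str(sharding_key).encode()).digest()
    return SHARDS[digest[0] % len(SHARDS)]

def plan(query, sharding_key=None):
    # If the query pins the sharding key, route to a single shard;
    # otherwise scatter the query to every shard.
    if sharding_key is not None:
        return [shard_for(sharding_key)]
    return list(SHARDS)

def scatter_count(partial_counts):
    # Cross-shard aggregate: each shard returns a partial COUNT(*),
    # and the proxy layer merges them in memory.
    return sum(partial_counts)

print(plan("SELECT * FROM users WHERE user_id = 42", sharding_key=42))  # one shard
print(plan("SELECT * FROM users"))   # scatters to all four shards
print(scatter_count([10, 3, 7, 0]))  # 20
```

The point of the sketch is the planning decision itself: a query that carries its sharding key stays cheap, while anything else fans out and pays for an in-memory merge, which is exactly why new MySQL syntax has to be understood by the planner before it can be supported in sharded mode.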
Starting point is 00:49:25 And I had to say, yeah, they don't work with Vitess because we haven't added support yet. But most of the time, it's not unsharded, but sharded that causes the incompatibilities. So people will take a test environment. They'll have some test data. They'll put Vitess in front of their MySQLs. They will start sending some test traffic through Vitess. And then they'll discover what are the things that don't work. At that point, they have a choice of either contributing a fix upstream to make it work
Starting point is 00:49:56 or changing how their application works so that those queries are not produced anymore. So that's the sort of journey people tend to go through when they come to Vitess. And then once they know that, okay, either all the queries work, or we've made the usually not big, minor changes needed for the application to work, then you can start moving from test into production. Mm-hmm. How often in these scenarios, whenever someone adopts or attempts to adopt Vitess and hits this lack of compatibility in a sharded mode or something like that, do they often contribute, or at least provide some guidance on their specific concerns
Starting point is 00:50:41 and how it works, to enable that compatibility? Because if this project has been a year, you know, a year or more, right? Like if it was early 2020, I'm trying to remember what year this is. 2022. Okay, cool. Because COVID has got my brain still yet. I can't remember years anymore. Okay, so like two years at least, right?
Starting point is 00:50:57 This is a long project. So I'd imagine that you've got limited bandwidth, limited core contributors. Maybe we could talk about how open your team is, or the team is, to more contributors and how that works. But how often does someone come to Vitess and leave bummed out, but then maybe they find a way to actually contribute back to make that compatibility possible? It's hard to know. We only know about the big ones, the big companies where they put in a sustained effort because they really had no other option.
Starting point is 00:51:31 You're the only game in town when it comes to scaling MySQL. You said so yourself. What other options do you have? Build it yourself or use Vitess? So why not just build features in Vitess, right? Yeah, so in fact, some people built it themselves. We just don't hear about all of them. I know of one,
Starting point is 00:51:50 but there must be many others who built their own custom way of sharding on top of MySQL because they needed to. But if you don't want to build it yourself, then yeah, Vitess is pretty much the only option. Unless, I don't know, some people may just migrate out of MySQL at that point. I don't know what people are doing.
Starting point is 00:52:09 So I think this sort of contributing back upstream and sticking to it in spite of hitting roadblocks probably happened a lot more in the early days. The gap is much narrower now in terms of compatibility. So my sense is that it's actually rarer for people to hit those things and then say, oh no, we can't deal with this. The gap is small enough that, first of all, we have a list of things that don't work that's in the public report that anybody can look at. And if somebody says, what are the queries that are not supported, we can just point them to that so they can upfront scan through them and look at their own application and make some decisions. Beyond that, we are obviously open to contributions. But I think the fact that
Starting point is 00:53:01 the people at PlanetScale who do this do it full-time, versus someone who's trying to adopt Vitess is either a database infrastructure engineer or some other type of infrastructure engineer. They have their day job, right? So we don't see at this point a whole lot of contributions to Vitess in terms of compatibility. Most of it is happening from PlanetScale. This episode is brought to you by our friends at WorkOS. When it comes to adding enterprise-ready features
Starting point is 00:53:43 or selling to enterprise customers, product teams, engineering departments, developers, they're all faced with a choice: ignore and focus on viable features, or get distracted and learn how to integrate with complex legacy systems. And I'm here with Michael Grinich, the founder and CEO of WorkOS, who knows there's a better way. Michael, what do teams at Vercel or PlanetScale know about the world of enterprise features that no one else knows? The world of enterprise features is full of acronyms.
Starting point is 00:54:13 Typically, they're like these three or four letter acronyms, like SAML or SCIM or SIEM, which is security information and event management. There are these long, kind of really obscure acronyms that most developers aren't familiar with. They've never really heard of them. And this is what IT teams require you to build integrations around.
Starting point is 00:54:31 They say, hey, we need SAML or we have to have a SCIM integration, et cetera. And for companies like, you know, PlanetScale or Vercel that are building on really modern stacks, building with React and like, you know, cutting edge JavaScript technology and like web applications, they're really having to go integrate with these old legacy platforms and systems like SAML's built
Starting point is 00:54:48 around like XML several generations before. And so I think when those companies looked at what to do in this scenario, they have deals that are getting blocked because they don't have something like SAML single sign-on. And their engineering team is like, do we really want to spend all the time to go read the spec and learn how this works and dive into all the different ways this can break? And in the case of SAML, there's a bunch of instances of security vulnerabilities that have happened over the years. Do they really want to spend time on that? Or do they want to spend time building, you know, the unique features that power Vercel, you know, like focusing on Next.js and focusing on those applications. And for these companies, they don't. They don't want to spend
Starting point is 00:55:23 the time thinking about SAML. They want to be able to hand it off to someone who can really go deep in that and obsess over it. And so we're sort of like, you know, the partners to all these companies that goes really, really deep around, you know, these acronyms or obscure technologies, making sure they don't just work really well, but they work everywhere with every single system. And we've tested it end to end. And it even has this kind of compounding effect. The more people using WorkOS, kind of the more stable and more robust it becomes. And what it really does is lets companies like Vercel or PlanetScale or Hopin or Webflow focus on those product features and for their best engineers to spend time still delighting their customers and not necessarily doing these enterprise IT integrations. That's awesome. Thank you, Michael. So even if your team
Starting point is 00:56:04 isn't focused on enterprise, you can still leverage WorkOS, so you're not turning enterprise away. Learn more and get started at workos.com. They have a simple pay-as-you-grow pricing plan that scales with your usage needs. No credit card is required. Again, workos.com. And by our friends at Sourcegraph. They recently launched Code Insights. Now you can track what really matters to you and your team in your code base, transform your code into a queryable database to create customizable visual dashboards in seconds. Here's how engineering teams are using Code Insights.
Starting point is 00:56:36 They can track migrations, adoption, and deprecation across the code base. They can detect and track versions of languages or packages. They can ensure the removal of security vulnerabilities like log4j. You can understand code by team, track code smells and health, and visualize configurations and services. Here's what the engineering manager at Prezi has to say about this new feature. Quote, as we've grown, so has a need to better track and communicate our progress and our goals across the engineering team and the broader company. With Code Insights, our data and migration tracking is accurate across our entire code base and our engineers and our managers can shift out of manual spreadsheets
Starting point is 00:57:15 and spend more time working on code, end quote. The next step is to see how other teams are using this awesome feature. Head to about.sourcegraph.com slash code dash insights. This link will be in the show notes. Again, about.sourcegraph.com slash code dash insights. So, Deepthi, I'm not sure if you have read or know about Nadia Eghbal's Working in Public book, but in this book, after she's done lots of research, she categorizes four types of open source projects. According to kind of two strata, you have user growth and contributor growth. And if a project is high user growth, but low contributor growth, she calls that a stadium, a project where like there's a rock star on stage and there's a whole bunch of people in the crowd
Starting point is 00:58:24 and they're looking at the rock star, you know, waiting for that person to do their rock star thing. And then if it's high user growth but high contributor growth, she calls that a federation, so like the Rust Foundation or these kinds of things. If there's low user growth, like something that you just built for yourself, it's low user growth, low contributor growth. So not very many people use it, but also not very many people find it useful. She calls those toys. So maybe you have your own dotfiles, you have a command line app that you use just for yourself, scripts, et cetera.
Starting point is 00:59:01 And then there's a fourth group, which is low user growth, but high contributor growth. And she calls those clubs. So they're never going to be that big; the people who get involved tend to be contributors. Like, if you're using that, you're probably contributing. And I've been trying to think about Vitess in light of one of those four categorizations. It kind of seems like a club, because you have probably not that many users, because your users are kind of de facto scaled organizations, which there aren't that many of those in the world. But then when I look at the contributors, it seems like mostly it's PlanetScale. And so it's kind of a stadium in that sense. It's kind
Starting point is 00:59:36 of a club. I'm just curious what kind of project you think it is. Maybe there's been large contributions by the folks at Slack, by other users around. But just curious, like, how many of the users are also contributing, and how that all breaks out, and how it feels as a community? So given that we are a CNCF project, CNCF actually tracks these statistics and we have some numbers. Oh, great. So in terms of users, it's actually very difficult for us to know whether someone is running Vitess in production. Right. Because they don't have to share that information. Unless they come out and tell you they're doing it. Right. So it's
Starting point is 01:00:15 completely voluntary. Some people will talk about it. They will go to conferences and talk about it. They will actually add their logo to the Vitess homepage, and others will never talk about it. So we do know of companies where people have told us in confidence that they're using Vitess, but they will never be public about it. They'll never tell anybody. Yeah.
Starting point is 01:00:38 So if I look at our Slack workspace, we have 2,000-plus people on there. They are not all active on an ongoing basis, but it's a decent-sized community. It's not tiny. I think earlier on, Vitess was where the user growth and contributor growth was similar. Every new user ended up contributing. And there were very significant contributions from Flipkart, from Slack, from Square. But I think now it is at a point where there are more users than contributors. Most users who come in and try to use Vitess don't actually need to contribute. And also because there is PlanetScale, and the initial revenue stream for PlanetScale was doing support for people doing their own Vitess deployments. So for
Starting point is 01:01:33 companies that were trying to run Vitess on their own, but wanted some help, PlanetScale would do that. So at that point, they would funnel any bug fixes or features they needed through PlanetScale. So to the outside world, it looks like PlanetScale is making this contribution, but it's motivated by the needs of a particular user. So I think we have evolved to a point where PlanetScale is the main contributor and maintainer of Vitess. We still get contributions from other maintainers. We still get contributions from, like, random people, people that I don't know. But it's definitely now, I think, more of the stadium type of project. There are, I think, other reasons too, besides the one I talked about, which is PlanetScale was supporting people who
Starting point is 01:02:25 wanted to do this. So then that just became the easiest way for them to get stuff done in Vitess versus contributing it themselves. I think the other reason also is that initially, when I started working on Vitess, there were only two people at PlanetScale working on Vitess. So obviously, we couldn't do everything that open source users needed, and people had to contribute. So Slack, for instance, contributed many significant things, which they found because of the way they were trying to run Vitess. They didn't expect Sugu and me to do all of those things, because it was physically impossible to do so. But the maintainer team at PlanetScale has grown to the point where we can do the bulk of things that are flowing either from users of PlanetScale databases or open source users. So it just seems easier
Starting point is 01:03:20 sometimes for people to say, oh, here is a bug. And maybe somebody will jump in and say, I know how to fix it. It'll take me 10 minutes. I'll do it. Right. It's interesting, because the motivation from PlanetScale's side, obviously, is that it's core to PlanetScale. Like, as a commercial open source company, to have the expertise and then to fix and improve Vitess. So it's natural.
Starting point is 01:03:48 Yeah. Although at some point everyone is benefiting. Those users that are non-PlanetScale companies are benefiting from their contribution back to Vitess, and they may not be PlanetScale customers. So it could be waste, or, I don't know, maybe it works out in the end. Maybe that's just how open source works: you just trust, you know, the contribution. I don't know how to describe it, but, like, you just trust the give-back, the generosity, so to speak.
Starting point is 01:04:15 I don't know how to describe it, but like you just trust the give back, the generosity, so to speak. So there are different open source governance models. I was reading an article written by the founder of Drupal about this, how in any open source project, there are givers and takers. And if that balance goes out of whack, if there are community members that are mostly takers, and they're not enough givers, then eventually the project may not survive long term. So those are things that all open source maintainers have to think about. So for Vitus, right now, PlanetScale is backing it. So the number of givers in terms of individuals is high, but in terms of corporate entities is pretty low. Yeah, it's pretty low.
Starting point is 01:05:07 Is that something that, and I don't know how you separate yourself, because you work for PlanetScale, but you also are a core contributor to Vitess. And I want to get into some of your particular contributions, because you're quite a subject matter expert. And I want to know more about your actual contributions beyond your wealth of knowledge of its story
Starting point is 01:05:23 and how it's been told to you. But I've got to imagine that there's some desire there. So how do you personally separate your psychological balance between working for PlanetScale and being pro-Vitess and a core contributor there? How do you want to see the balance change and shift? Like, if there's listeners out there at Netflix and they use it, or at XYZ and they adopt Vitess, how do you want corporate contributions back to balance out the PlanetScale give and backing of it? So if I think back to how I got more and more involved in Vitess as a PlanetScale employee. So I started off with, okay, there's some bugs that need to be fixed. Let's fix them. There are some features that PlanetScale support customers want. Let's add those features. And slowly, because of my own desire to learn more about Vitess, I started spending a lot of time in the open source Slack, where people ask questions, and I would look up the answers to those questions.
Starting point is 01:06:26 Basically, if somebody asks a question and I don't know the answer to it, I'll go and run Vitess myself in my local environment, try it out and understand how it works, and then trace it back to the code. And then at that point, I'll be able to answer that question. Or I may just search through the Slack messages to find the answer. So I had to build my competence in Vitess at the code level, but I also had to build my competence in Vitess as a user, from the user perspective. And when you do that, you actually develop empathy for people who are trying to use the software, because you're not just looking at it as the writer of the software, but as someone who has to use it, and you start empathizing with how hard it can be sometimes.
Starting point is 01:07:13 So that to me is what drives the balance. Vitess is open source, which implies a certain contract between the project and its users. And the project is embodied by the maintainer team. And I'm part of that maintainer team. I'm just fortunate that PlanetScale pays me a salary so that I can do this full-time. Most people don't have that luxury. So, like you said, I have to balance what I owe
Starting point is 01:07:56 without expecting anything in return. Yeah, precisely. So that is a difficult balance. The best thing, though, is that everything we do in Vitess for PlanetScale is usable by anyone. It goes upstream. So for PlanetScale, to start with, we started the compatibility project. And we worked on adding constructs, adding syntax, and so on and so forth for a year. And we did some very basic testing with various frameworks. Ruby on Rails, okay, it works. Can you deploy WordPress? It works. Django, it works, and so on.
Starting point is 01:08:34 But we had to, for PlanetScale, we had to actually make sure that anything you did in Ruby on Rails would work, because the PlanetScale app is built using Ruby on Rails. And PlanetScale uses PlanetScale, or Vitess, as its own data store. Sure. So PlanetScale runs on PlanetScale. We went through and looked at all the Ruby on Rails guides and proved that all of those constructs that you can use in Ruby on Rails, the Active Record guides, will actually work against Vitess. Now, we did that for PlanetScale, but it benefits anybody who tries to write a Ruby on Rails app
Starting point is 01:09:20 against Vitess or tries to write a Ruby on Rails app against PlanetScale. So that's just one example of something we did because we wanted it for PlanetScale. And some of this work was done by, a lot of the testing was done by non-core Vitess maintainers, but the bug fixes were done by the Vitess maintainers. So I'm a bit torn, because it seems like Vitess is in good hands, but then I also see all these other corporations who are using it. And I think, like, shouldn't they also be pitching in? Is there, there's a foundation, I guess,
Starting point is 01:09:54 as part of CNCF, where it's graduated. Maybe tell everybody what that means for Vitess, having graduated from CNCF, and maybe, I guess, some of the financial side of the open source commons here. So Vitess, when it joined CNCF in, I think, early 2018, the process probably started in 2017, but it was January 2018 that Vitess became a part of CNCF as an incubating project. And then graduation was in 2019. So CNCF has certain criteria for graduation. And one of them is that the project has to be supported by multiple entities so that it doesn't just disappear if one of those corporate entities disappears. And when they had that, the meeting or the review, Michael Demmer from Slack was in that
Starting point is 01:10:47 meeting to say that even if PlanetScale goes away, Vitess is so important, so foundational for Slack at this point, that we will maintain it. So because Vitess is running in production at some of these huge companies, and it is storing their business data, right? Slack, Square, HubSpot. Right now, PlanetScale is doing most of the maintenance, and that's fine. But if PlanetScale were to go away, other people would have to step in. They would not have a choice. Mm-hmm. Mm-hmm. Yeah, one of the criteria, I'm not going to read them all, but one of the criteria for the graduation stage for a CNCF project, the very first line is: have committers from at least two organizations. And a few other things, essentially, for this due diligence, like you had said, to balance the need for the technology and also the support of the technology.
Starting point is 01:11:42 And I assume that all maintainers are not employed by PlanetScale, given that graduation stage, right? Yeah, yeah. So about 50% of the maintainers are now PlanetScale employees. And a lot of it is because PlanetScale staffed up the Vitess team. And over time, people just commit so much when they're doing it full time that they end up becoming maintainers. Their contributions reach a level where you can give them write permissions.
Starting point is 01:12:10 They know some part of the code well enough to review other people's work and to make decisions on how that should evolve in the future and things like that. So that's sort of why about 50% of the maintainer team is now from PlanetScale. But we do have maintainers from Slack, from Square, HubSpot, a few other companies. So there are about 10 people who are not PlanetScale employees. Also on the list of users, Pinterest, GitHub, New Relic. These are companies that are not going to allow it to disappear. Maybe they're not maintaining today. But like you said, if PlanetScale scaled back its maintenance quite a bit, somebody would step up. Right.
Starting point is 01:12:55 If it weren't Slack, it's going to be Square. If it's not Square, it's going to be Pinterest, and so on. There's just too much vested interest in the project for it to be abandoned at this phase. Exactly. Yeah. In terms of support from the foundation, they pay for our GitHub repositories. So they provide, obviously the project itself has no money, so any budget, anything that requires spending, either comes from PlanetScale or from the foundation. So GitHub repositories; Docker, we have a Docker team plan. They've given us Equinix Metal hardware on which we run daily benchmarks, and we publish those benchmarks on a separate website, benchmark.vitess.io.
Starting point is 01:13:39 So that's continuously updated, and these benchmarks are running all the time. We needed some dedicated servers for running some portion of our CI, and the foundation gave that to us. So they do provide a lot of support in terms of whatever the project needs to keep going. Since you mentioned contributions, I've got to hear more. Can you boast a bit? Can you share some of your contributions in particular? Okay, so a few things stand out. Vitess has backup and restore functionality; for any database, you need that.
Starting point is 01:14:16 And specifically in the case of Vitess, if you want to add more replicas, you will restore from a backup and then catch up to the current primary, and then you are ready to serve. So it has had its own way of doing backups, which was basically: shut down the MySQL, copy everything over, and then restart it. A number of members of the community wanted XtraBackup support. So they wanted to be able to use XtraBackup to take backups so that they don't have to bring down the MySQL. Because with XtraBackup, you can take backups on a running instance. So that was my first major enhancement to Vitess. And that's why I remember it so clearly. And this was actually sponsored by Slack. But I ended up being the person doing the work. And then once we did that, we also did a feature where you could do point-in-time recovery.
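To make the restore-then-catch-up flow Deepthi describes concrete, here's a toy sketch in Python. This is an illustration of the general idea only, not Vitess's actual code; all names and data structures here are made up. A "backup" is modeled as a snapshot of the data plus the replication position it was taken at, and a new replica restores the snapshot and then replays the binlog events recorded after that position.

```python
# Toy model of restore-from-backup-then-catch-up (not real Vitess code).
# A "backup" is a snapshot of the data plus the binlog position at which
# it was taken; a new replica restores the snapshot, then replays the
# events recorded after that position to catch up to the primary.

def take_backup(data, binlog):
    """Snapshot the current state and remember the binlog position."""
    return {"data": dict(data), "position": len(binlog)}

def restore_and_catch_up(backup, binlog):
    """Restore the snapshot, then replay everything after its position."""
    data = dict(backup["data"])
    for key, value in binlog[backup["position"]:]:
        data[key] = value
    return data

# The primary's state evolves as a stream of (key, value) writes.
binlog = [("a", 1), ("b", 2)]
primary = {"a": 1, "b": 2}

backup = take_backup(primary, binlog)

# Writes continue on the primary after the backup was taken...
binlog.append(("c", 3))
primary["c"] = 3

# ...yet a new replica restores the backup and still catches up fully.
replica = restore_and_catch_up(backup, binlog)
assert replica == primary
```

Replaying from a recorded position is also what makes the XtraBackup approach she mentions attractive: because the backup carries its position, it can be taken from a running instance without stopping writes.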
Starting point is 01:15:06 So let's say somebody did something bad, lost some data. You want to go back to a good known time. You want to be able to go backwards in time. Most of the time, this is for the purposes of forensics. You want to go back and say, okay, what happened? What was the state of the data at the time, or whatever? But point-in-time recovery was something else where I didn't do all of the work, but myself and another engineer together did that. And then health check. So in Vitess, there is a component called health check.
Starting point is 01:15:40 So what you're doing is that if you look at the Vitess architecture, there's a metadata store, a topo server, which stores the list of all the tablets. But you don't rely on that when you are serving queries. So you have the proxy layer called VTGate, which receives the MySQL connections and queries and sends them to the tablets. But it needs to know what is the primary for a given shard. And the way it keeps track of that is that it establishes connections to all the tablets, and it receives periodic health checks. So this health check code was actually very complex. And we would get bug reports, and they were very obscure, very hard to track down. They would only happen under certain conditions, because mostly they were race conditions or
Starting point is 01:16:29 issues with locking and stuff like that. So I ended up rewriting all of that. And that is one of the hardest software projects I've done in my life because that code was very complex. It was almost impossible to understand what it was doing at a code level. So Sugu sat down and he told me what it should do. He said, don't think about what the code looks like right now. I'll tell you what it should do. And then you can implement that. It was so difficult that parts of it were done by three people together. So we would get onto a Zoom call or some sort of a video call
Starting point is 01:17:06 and screen share and start looking at the code together and start writing code together. So the health check rewrite was a big deal. And what that enabled us to do was to support replica transactions. Prior to rewriting the health check, we did not have a way to do read-only transactions on replicas. You could do read transactions on the primary, but there were some users who said, we basically want a snapshot when we begin the transaction. We don't want to see the commits that happen while we are reading data. So we really want to do a begin, read, read, read, read, read, and end. So replica transactions was something that we were able to do because we rewrote the health checks.
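The snapshot semantics she's describing for read-only replica transactions can be sketched like this. Again, this is a toy illustration of the guarantee, not the Vitess implementation: reads inside the transaction see the state as of "begin", even if commits land in the meantime.

```python
# Toy illustration of snapshot semantics for a read-only transaction:
# reads inside the transaction see the state frozen at "begin", even
# while new commits continue to land on the underlying store.

class ReadOnlyTransaction:
    def __init__(self, store):
        self._snapshot = dict(store)  # state frozen at begin time

    def read(self, key):
        return self._snapshot.get(key)

store = {"balance": 100}
txn = ReadOnlyTransaction(store)  # begin

store["balance"] = 250  # a commit lands while we're still reading

# Every read inside the transaction sees the begin-time value...
assert txn.read("balance") == 100
assert txn.read("balance") == 100
# ...while new transactions see the committed value.
assert store["balance"] == 250
```

That "begin, read, read, read, end" pattern is exactly the sequence described above: the point of the feature is that none of the intermediate commits leak into the transaction's reads.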
Starting point is 01:17:53 So these are some things that I can remember. That's a big deal of stuff there. Deepthi, congrats. Coming back from a year hiatus, back into the swing of things, and kicking butt, from what I can tell. Let's talk about the future then. So I know on the blog you had mentioned, or at least Florin had mentioned, back last month, the announcement of general availability of Vitess 13. So what's in that?
Starting point is 01:18:16 What's in the future? What's the future of Vitess? Is it separated from PlanetScale's roadmap? How do you map out Vitess's roadmap? Obviously, there's a lot of users, so there's some user demand as well. And it's not all PlanetScale demand or support demand back into Vitess. But give a snapshot of 13 and maybe into the future. I think in 13, we keep working on compatibility. And one of the things our query serving team realized at some point was that they needed to
Starting point is 01:18:47 rewrite how the query planning was being done. Query planning was being done in a particular way at the code level, and they sort of had to tear the whole thing apart and put it back together. But that would have been very risky. So what they did instead was that they wrote a new version of the query planner while keeping the old one. And when the new version reached parity with the old one, we were able to say, okay, now the new planner is GA, and it supports constructs that the old one simply could not. It was too complex at a code level to add support for certain things in the old planner.
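That parity-before-GA approach is a classic shadow rollout, and it can be sketched roughly like this. This is a toy harness, not the actual Vitess planner code, and the two "planners" here are stand-ins: run every input through both the old and the new implementation, compare, and only flip the default once there are no mismatches left.

```python
# Toy sketch of a shadow rollout: run the old and new implementations
# of a component side by side, and only promote the new one once it
# reaches parity with the old one on every input.

def old_planner(query):
    # Stand-in for the legacy implementation.
    return query.lower().split()

def new_planner(query):
    # Stand-in for the rewrite that must match the old output exactly
    # before it can become the default.
    return [token.lower() for token in query.split()]

def parity_report(queries):
    """Count inputs where the new implementation disagrees with the old."""
    mismatches = [q for q in queries if old_planner(q) != new_planner(q)]
    return {"total": len(queries), "mismatches": len(mismatches)}

queries = ["SELECT id FROM users", "SELECT COUNT(*) FROM orders"]
report = parity_report(queries)
assert report == {"total": 2, "mismatches": 0}  # zero mismatches: safe to flip
```

The payoff, as described above, comes after the flip: once the new implementation is the default, it can grow support for constructs the old one could never handle.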
Starting point is 01:19:24 So the new query planner is GA. And because we spent like a couple of releases building that up. So first you have to reach parity, then you can start doing new stuff. So in this release, we were able to add support to a number of constructs that didn't work earlier because towards the end of the previous release, we actually completed the work of bringing it up to parity. So that's a big thing. And then the online schema changes. A lot of improvements have been done to the online schema changes over time, and a number of them happened in the last release, and there will still be a few going forward, but that is
Starting point is 01:20:05 getting quite stable now. So looking ahead to the next release, the big things that we want to go into GA in the next release are the Vitess native online schema changes, which are still marked experimental. We started, actually, I don't know if I should say we. The maintainers from Slack started building a replacement UI for what we had. So Vitess has, like, a very primitive management UI, which was written back in 2012 or something like that. It looks very ugly. It's still using some old versions of AngularJS, which you can't upgrade from, and so on and so forth. So the team at Slack actually started building a UI for their own Slack internal usage. And at some point, they were able to open source it. So they were able to get their management to agree to open source it. And then they talked to myself and a few others who were in the maintainer team about how they can go about doing that.
Starting point is 01:21:12 So this new-generation UI, called VTAdmin, the Vitess Admin UI, is expected to go GA in the next release. And they've been working on it for almost two years now. So it's been a long process. But I'm actually looking forward to that, because when I do demos, it's like, how can I show this really old and ugly UI, right? But I don't have a choice. Right. That's exciting. The other thing that was sort of a missing piece in Vitess for a long time was automatic failure detection. So we didn't talk about Kubernetes, but if you are running Vitess in Kubernetes
Starting point is 01:21:49 and a Vitess component goes down, Kubernetes will bring it back up. But what if a MySQL instance goes down, right? Kubernetes will bring it back up, but it will take time. And maybe people are running it with systemd or systemctl or something else that can bring it back up. But when a MySQL goes down, bringing it back up takes time. And that time could range from 30 seconds to several minutes. And for most production systems, that's too long. So what we really need to do is for Vitess to monitor these and to do an automatic failover
Starting point is 01:22:28 when a MySQL instance goes down. And for the longest time, Vitess didn't do that itself. There was another open source project called Orchestrator, which people would integrate with to get this ability. Because Vitess might be running in Kubernetes, outside Kubernetes. You can run it in many different ways. But what we've started doing, and this is again a project that started in 2020, I want to say, so it's been in the works for over a year, was to take what Orchestrator does. It's open source, right? We can copy it in and change it the way we want to. So we took what
Starting point is 01:23:06 Orchestrator does, brought in some of the code, rewrote some of the code to do it the Vitess way, so that within Vitess, there is a component that is watching all the MySQL databases. And if the primary goes down, the failover is automatic. There doesn't have to be human intervention. So that self-healing mode, or autopilot, is where we want Vitess to go more and more as we look out beyond this year and look into the future for the next two to five years. Self-healing, autopilot, usability, these are things that we want to worry about for the community. Some of these are relevant for PlanetScale, but usability is not a big issue for PlanetScale, because PlanetScale has its own UX. It is really a backend component as far as PlanetScale is concerned.
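As a rough sketch of what that watching component does during the automatic failover Deepthi just described: track health check results from each MySQL instance, and when the primary goes unhealthy, promote the most caught-up healthy replica. This is a toy model, not Vitess's or Orchestrator's real failover logic, and all the tablet names and fields are invented for the example.

```python
# Toy model of automatic failover: a watcher tracks periodic health
# checks from each tablet; if the primary is unhealthy, it promotes
# the healthy replica with the highest replication position, so the
# least data is at risk of being lost.

def pick_new_primary(tablets):
    """Return the serving primary, failing over if the current one is down."""
    primary = next(t for t in tablets if t["role"] == "primary")
    if primary["healthy"]:
        return primary["name"]  # nothing to do
    candidates = [t for t in tablets
                  if t["role"] == "replica" and t["healthy"]]
    # Promote the replica that has applied the most of the binlog.
    best = max(candidates, key=lambda t: t["position"])
    return best["name"]

tablets = [
    {"name": "tablet-100", "role": "primary", "healthy": False, "position": 42},
    {"name": "tablet-101", "role": "replica", "healthy": True,  "position": 41},
    {"name": "tablet-102", "role": "replica", "healthy": True,  "position": 42},
]

# The primary missed its health checks, so the watcher fails over
# to the most caught-up replica, with no human intervention.
assert pick_new_primary(tablets) == "tablet-102"
```

The "30 seconds to several minutes" window mentioned above is exactly what this kind of watcher removes: instead of waiting for the dead instance to come back, traffic is redirected to an already-running replica.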
Starting point is 01:24:00 The VTAdmin UI, maybe, maybe not. So it's a mix of what is driven by PlanetScale and what is driven just as project vision or driven by the community. I guess one more question on the release schedule. Is there, since you mentioned 13 and the next release, what is the schedule? Is there a schedule? Is it monthly?
Starting point is 01:24:20 Is the next version 14? Is it 13.5? Is it, how do you version? How do you release? What's the schedule? There is a release schedule. The next version 14? Is it 13.5? Is it, how do you version? How do you release? What's the schedule? There is a release schedule. The next version is 14. So a couple of years ago,
Starting point is 01:24:32 one of our maintainers wrote this up as a Vitess enhancement proposal. So we have a repo called enhancements. So this is modeled after Python, Kubernetes. Everybody does this now. If you want to do some fundamental change, you write it up as an enhancement proposal. So there is a published release cadence. We used to do four releases a year. So every quarter, effectively. And starting this year, we are
Starting point is 01:24:59 going to do three releases a year. And the releases come with certain guarantees that the open source maintainer team will maintain a release for a year after the release date. And maintaining means that any major vulnerabilities will be patched. If there is a critical bug that can lead to data loss or system downtime, that will be fixed and we will do patch releases if necessary. We don't typically do more than one or two patch releases for a major release, but we do end up doing a couple of patch releases for each major release. Gotcha. Anything we didn't ask you, Deepthi? Anything that we left on the table that we didn't get a chance to ask you in closing? We didn't talk about Kubernetes. So I guess two things. So one is
Starting point is 01:25:46 Vitess has been cloud native since 2015. You have been able to run Vitess in Kubernetes since 2015. And that is a non-trivial thing, because like I said, you can't just kill a MySQL pod and restart it on another node and expect it to work, because it needs its storage. Right. Right? Yeah. So that's one thing. I think the other thing is that the reason Vitess or PlanetScale or anything like this
Starting point is 01:26:14 is relevant and will continue to be relevant is that if I look at the database market, there are some trends, right? One of them is the move to cloud. Everybody wants to run everything in the cloud. They no longer want to deal with their own data centers. That is a clear trend. And with cloud comes Kubernetes. So there is a Data on Kubernetes community. They do surveys, and more and more companies are getting comfortable with putting their data into Kubernetes. So the combination of this means that people are going to go to managed services. So Vitess is a project that people have traditionally run on their own.
Starting point is 01:26:58 But I expect that most people will actually prefer to either run Vitess on some sort of a Kubernetes service, like from Amazon or Google or whoever, or to pay someone else to do it. Yeah. Why run it yourself when you can have somebody else run it for you, right? You just stick to your actual guns,
Starting point is 01:27:20 so to speak, and build your own product instead of, like, supporting infra. Why would you, unless you absolutely had to, really, you know, unless you had a specific use case for doing so? Yeah. And the theme, so this is the trend towards specialization that we have seen already with hardware, right? People used to run their own servers for everything, and now most of that happens in the cloud, because it's just easier to pay Amazon or Google to run your servers for you, except under certain conditions.
Starting point is 01:27:57 Well, Deepthi, it's been a great journey with you, learning about Vitess, yourself, Sugu, all of the corporate community involved in Vitess, of course, and all the maintainers involved. I'm excited for what the future is for it. We don't personally use MySQL, which is why Jared had the question of Postgres, because we're kind of a Postgres family around here. Although we're not haters, we just have preferences. That's right. So, but I appreciate you sharing the journey, your journey.
Starting point is 01:28:23 Appreciate your time. Thank you. I really enjoyed this conversation. I always love talking about Vitess, and I in particular enjoyed the direction in which you took the conversation. So thank you. Awesome. Good deal. Thank you.
Starting point is 01:28:38 Your enthusiasm is infectious. Yes, very much so. That's it. This show's done. Thanks for tuning in. If you haven't yet subscribed, now is the time. Head to changelog.fm for all the ways to subscribe. And if you dig this show, we have other pods you may enjoy as well, such as Ship It.
Starting point is 01:28:58 We recently had Kelsey Hightower on where he drops a ton of wisdom. Here's a sampler. My entire career, my rule has always been document the manual process first, always. Okay. Because if you go and do everything in Puppet, now I got to read Puppet code to see what you're doing. How can I suggest anything better?
Starting point is 01:29:16 So if you write it down manually and you say, first get a VM, install Changelog, then take this load balancer, put this certificate here, then get this credential, put it in this file, then connect to Postgres, this version, with these extensions. So now I can see the entire thing that you're doing. And then the next thing I do is say,
Starting point is 01:29:36 okay, now that we understand all the things that we're required to run this app, I want to see the manual steps that you're doing. All of them. We build the app using this make file. We create a binary. We take the binary and we put it where? You're not storing the binaries anywhere? Oh no, we're just making this assumption that we could just push the binary to the target environment. You need to fix that. That's a bad assumption. You need to take the binary and preserve it so that way we can troubleshoot later in different environments and we can use it to
Starting point is 01:30:03 propagate. Oh, okay, Kelsey, good idea. So we're just going to fix the manual process until it looks the best we can do for what we know at the time. Now, once we have that, I'm going to give that process a version. This is version one of everything. We've cleaned some things up. We saw some bad security practices. We've cleaned up the app.
Starting point is 01:30:21 So now go automate that. All right. Continue to listen to that conversation at changelog.com slash ship it slash 44. That's episode 44. Thanks once again to our friends and partners at Fastly for making all our assets and our pods super fast globally. Check them out at fastly.com. And also to Breakmaster Cylinder for keeping our beats fresh and friendly.
Starting point is 01:30:44 And of course, to you for listening, we appreciate you. And don't forget that bonus 12 minutes for our plus plus subscribers. Make sure you stick around. That's it for this show. Thanks again for tuning in. We'll see you next week.
