The Changelog: Software Development, Open Source - Shopify’s massive storefront rewrite (Interview)

Episode Date: October 16, 2020

Maxime Vaillancourt joined us to talk about Shopify's massive storefront rewrite from a Ruby on Rails monolith to a completely new implementation written in Ruby. It's a fairly well known opinion that... rewrites are "the single worst strategic mistake that any software company can make" and generally something "you should never do." But Maxime and the team at Shopify have proved successful in their efforts in this massive storefront rewrite and today's conversation covers all the details.

Transcript
Starting point is 00:00:00 One of the main things I keep in mind of this is I've read so many articles and blog posts and opinions everywhere on the internet over the years that say, don't do rewrites. That's like the main takeaway to remember for all of those articles. And I want to be the person who does the opposite and says, rewrites are possible. You can do rewrites if you do them right. And how do you do them right? There's many key things involved in there, but it's not a thing where you have to kind of
Starting point is 00:00:28 push the rewrite option aside if it's something that you don't think is possible. Being with her changelog is provided by Fastly. Learn more at fastly.com. We move fast and fix things here at changelog because of Rollbar. Check them out at rollbar.com. And we're hosted on Linode cloud servers.
Starting point is 00:00:48 Head to linode.com slash changelog. What up, friends? You might not be aware, but we've been partnering with Linode since 2016. That's a long time ago. Way back when we first launched our open source platform that you now see at changelog.com, Linode was there to help us and we are so grateful. Fast forward several years now and Linode is still in our corner behind the scenes helping us to ensure we're running on the very best cloud infrastructure out there.
Starting point is 00:01:16 We trust Linode. They keep it fast and they keep it simple. Get $100 in free credit at linode.com slash changelog. Again, $100 in free credit at linode.com slash changelog. Again, $100 in free credit at linode.com slash changelog. What's up? Welcome back, everyone. This is the ChangeLoggle podcast featuring the hackers, the leaders, and the innovators in the world of software. I'm Adam Stachowiak, Editor-in-Chief here at ChangeLog. On today's show, we're joined by Maxime Viancou to talk about Shopify's massive storefront rewrite from a Ruby on Rails monolith to a completely new implementation written in Ruby. As you may know, it's a well-known opinion that rewrites are the single worst strategic mistake that any software company can make. And generally, it's something you should never do. But Maxime and the team at Shopify have proved successful in their efforts to rewrite Shopify's storefront. And today's Today's conversation covers all the details.
Starting point is 00:02:35 So, Maxime, we teased this conversation when we had your colleague, Simon Eskildsen, on the show a few weeks back, episode 412. But he was here to talk about napkin math, and he was part of this Shopify storefront rewrite. But we didn't want to talk about it with him because we knew you were coming up and we want to talk about it with you. So first of all, thanks for joining us on the ChangeLog to talk about this project of yours. My pleasure. Thanks for the invite. So it's a big thing to rewrite anything, but especially something as big and powerful
Starting point is 00:02:58 and successful as Shopify. I think the main piece of the news that people latched on to was the monolith versus microservices implications here as Shopify has been a poster child of Rails monoliths at scale, right? Yep. And it's something that we're learning about every day i think the more we kind of figure out what should go in the monolith and what shouldn't and that kind of led to the decision to split that specific domain into a separate application yeah so maybe as we lay the foundation for this conversation we're going to go through the decision-making process for the rewrite.
Starting point is 00:03:45 We're going to go through the process of actually getting it done because it was a couple-year endeavor. And at scale, you have to move slowly and carefully. And you guys built some really cool stuff with some NGINX Lua rewriting things to make sure you're doing it right along the way. We're going to dive into those details.
Starting point is 00:04:04 But first of all, lay the groundwork of what the monolith looks like, maybe before you started, what all is involved. People know Shopify as an e-commerce platform where people can run their own online shops, and so it's multi-tenant in that way. I assume a lot of our listeners know what Shopify is, but what does the monolith look like and what all is it doing? Right, so Shopify was started almost 15 years ago now and this all started as a simple Rails application, the majestic monolith approach where everything would be into that one application, which we're still using today, 15 years later.
Starting point is 00:04:40 And that's the main place where most Shopify developers tend to ship code into. Of course, with scaling, you run into challenges with how you get to handle multiple hundreds of developers working on the same platform, shipping that code base at scale, but also dealing with what should go into Monolith. And as we eventually ran into a point where it wasn't possible to use Rails as it is into its base form, we hit the limits of what we could do with basic Rails. So as you mentioned, Shopify is a multi-tenant application. So just dealing with the implications of that and making sure that there's no cross-tenant access for data. That
Starting point is 00:05:25 sort of thing required some patches to the monolith. So the monolith is an application where most of the code is present in there. We're now splitting it into separate components so that we have business domain specific components in the monolith. And that makes it so that, for example, storefront and online store specifically was this own component where everything would be into that directory. So it kind of gives us a nice way to create these boundaries between the different components so that there's no cross-component access. And hopefully, eventually, everything has this sort of interface
Starting point is 00:05:59 between each component and there's no violation. There was one article on the Shopify engineering blog that we've just posted about this that explains how we're starting to enforce more the boundaries between those components to make sure that we don't run into issues with different class names that don't make sense or that sort of stuff.
Starting point is 00:06:24 So these are code boundaries, though. You're still inside of a process or inside of a code base. These are not services that talk over a network, correct? Right, same process, correct. So it's really more of a developer experience thing more than it is a topology thing for networking, for example. Okay, and so if we talk about the monoliths parts, we mentioned storefront.
Starting point is 00:06:46 You can explain exactly what storefront is. Then there's the admin. If anybody's run a Shopify shop, they know what the admin is, at least to a certain degree. Surely there's more to it. There's the payment processing and the checkout part. So there are some logical sections. Am I missing any?
Starting point is 00:07:02 I'm sure I'm missing some. I'm sure there's tons on the back end. There's, yeah, yes. What else is there? So there's a ton of them. Yeah, I mean, the big ones. So there's, of course, the billing stuff is in there. There's payment processing, as you mentioned. There's, of course, storefront, everything that's about pricing, returns, that kind of stuff is all separated into its own components. And it typically lines up with a specific team owning that component to build it. So the way those components work is essentially a tiny Rails app within each component with its own app directory, tests, and everything that's kind of wrapped up into this one component. And Online Store is one of them for storefront-specific stuff.
Starting point is 00:07:46 Okay. It's worth noting that the storefront, which is the topic of conversation here, has been rewritten with this strenuous process that we're going to go through here. It's still a Ruby app. Is it still its own Rails app, or is it somehow different than that? We started with a base Ruby application. We are using some parts of Rails, but not Rails itself directly. And the reason for that is the way that Shopify storefronts are implemented
Starting point is 00:08:16 don't really line up with the CRUD pattern that typically a Rails app would use. So if you go on a Shopify storefront, you go on the index page, for example, you're going to have the index pages, the product pages, the collection pages, and all of those different endpoints that you would see on a storefront. Now, all of those things could be implemented as Rails actions to be rendered. But starting out from scratch, we kind of realized that we don't need everything that Rails provides
Starting point is 00:08:48 and we could simplify this with a Ruby application to get started with at first. So the storefront is kind of a simplified part of a full stack insofar as, I'm assuming now, so correct me here, it's taken in a request and then it's like basically once it determines which storefront it is then it says okay get all my templates which are writable by the storefront owner right like they're liquid templates by whoever owns that theme or whatever
Starting point is 00:09:16 it happens to be grab that those templates grab all the data merge them together and spit out some html and like in the most simplified form is what it's trying to accomplish, right? That's exactly what it is. Because of that, the goal of that specific application is to do that really well, really quickly, and at scale. Because we don't necessarily want the same performance criterias that we would for the admin, for example, separating that application gives us
Starting point is 00:09:44 a do one thing and do it well, kind of a Unix philosophy thing for that specific service to do. So what were the goals with the rewrite? You mentioned there was like three aspects that you wanted to accomplish by going through this process. Right. So of course, success criteria. The first one was to make sure that we had the same features and the same behavior in
Starting point is 00:10:07 the new one than we did with the older application. So by that, we say that for a given request for the same input, if both are treated as black boxes, you get the same output and they're just behaving the same way for whatever input you give them. That's where we use the verifier mechanism to make sure that we for a given input, we get the same output and make sure that we never serve something to buyers that is not equivalent. And that's incorrect or invalid. The second one would be to improve performance, of course. So improving performance with the new application, we're able to really focus and drill down
Starting point is 00:10:45 into the performance criteria that we've set for this application. But not only the application itself, in terms of infrastructure, we've kind of thought about what we want out of this in the next 10, 20, 30 years in terms of how we want to set up Shopify storefronts to scale with time.
Starting point is 00:11:03 So for example, running on an active-active replication setup allows us to read data from different locations without needing to go all around the world if we don't have to. And thinking about how we write the code within the Ruby application is something that we're using different Ruby patterns that you usually wouldn't. So it's not really idiomatic Ruby. It's not really something that you kind of just write your pretty Ruby
Starting point is 00:11:29 as you usually would. We are thinking about certain things that do have an impact on performance in the end. So something like thinking about the memory allocations underneath is something that I know I certainly didn't do before that project in Ruby, but now we think about those things to make sure that there isn't anything that we're... Basically make sure that we're doing the right thing
Starting point is 00:11:52 for performance. And finally, the last one was to improve resilience and capacity. So Shopify has this major kind of like Super Bowl part of the year, which is Black Friday and Cyber Monday. Coming up. Yes, coming up in a month, roughly now, or even two months, I think. So end of November.
Starting point is 00:12:12 So that's typically the place at time when we kind of figure out or find out if everything we did during the year was good enough. And so far, it's been going well. This year is going to be the first year that we're powering most of the storefronts on the platform with that new implementation. So it's our first real kind of game day for us for actual big time events in the wild. And hopefully everything goes well. This may be famous last words, but hopefully everything goes very well. And we're doing great.
Starting point is 00:12:45 But that's the third goal of the application, to basically take what we had with the monolith and make it even better because we're optimizing this exactly just for that one use case. So a big question that I would have at that point when you decide here are our goals and they're around performance, they're around scalability, resiliency, and you're extracting this section which has, well, it's called a limited scope. I'm sure it's very complicated, but a limited scope logically is, did you consider other languages, other runtimes altogether? Yes, we did.
Starting point is 00:13:20 The one thing that made us decide to keep using Ruby is one, like all of Shopify is on Ruby. So in terms of developer knowledge transfer, that's the most accessible thing to do as an organization. Another thing is that the liquid library that we're using to render storefronts is Ruby based. So keeping that runtime is something that at the very beginning of the project kind of made sense to get started with. And the other thing that we're now starting to see is we're just starting
Starting point is 00:13:50 to explore alternative runtimes for Ruby. So Truffle Ruby is something we're trying to explore to see if that could help with performance in terms of storefront rendering. So it's not something that we've really wanted to move away from. We're committed to Ruby, committed to Rails, and that decision still makes sense today. Maybe eventually we'll start to think about a different runtime for this, but so far it's been working for us. Well, the amount of stuff that you'd have to rewrite, your rewrite is already hazardous enough.
Starting point is 00:14:22 And we all know the treachery that is a big rewrite. This is a pretty substantial rewrite. And it took you two years, soup to nuts. You're not 100% on it, but you're getting close, right? Yep, correct. Imagine how much longer that would have been if you had a rewrite liquid in a new language and rewrite. I'm sure you're pulling in lots of shared code throughout Shopify into the new storefront
Starting point is 00:14:42 that you can just build on top of. It's like starting with a bunch of Lego blocks. If you had to switch languages altogether, you'd have to build each Lego block, and you may never finish. Exactly, and so I think in this sort of paradigm, extracting everything into this different application I think is the first step, and once that's done,
Starting point is 00:15:04 we're able to work with it and do something different eventually. But the first step, of course, is to take everything out and have this isolated thing that we can then play around with and experiment with. And it's so much easier to do that once you have that out of the, as a separate service that you can really have the smaller scope of, rather than trying to work with different languages into the monolith, which would definitely be a bit harder to do. I'm curious why the storefront was chosen, or maybe it was just your team's project. Are each part of this monolith going to give a similar treatment,
Starting point is 00:15:37 or are they parallelizing that effort, or was it just storefront first and see what happens? So far it's only storefront, and I don't think there's any other plans to extract any other major blocks from there. So something to keep in mind is Shopify's platform receives a lot of traffic and the majority of that traffic is for storefront requests. Admin, of course, takes a good chunk of that,
Starting point is 00:15:59 but mainly is storefront traffic. So it made sense for us to optimize that specific place of the code, simply because that's where most of traffic goes. And that's something that we could leverage in terms of impact. The other thing is that storefronts don't necessarily have the same requirements as you would need from the admin, for example. So the admin is something where you want to have
Starting point is 00:16:25 valid information at all times. For example, payment processing and billing is something where you want the right amount of dollars being taken in and out of your accounts. Performance is less of a criteria there because you want to ensure that you have proper calculations and logic going on there. On the storefront, there is a bit less of that strong requirement.
Starting point is 00:16:47 Yeah, I think that's spot on. I mean, a storefront is rendering correct information, obviously. It's not about not correct or incorrect. It's more like an admin is for a limited scope of type of person, whereas a storefront is literally anybody on the internet. Exactly, yeah. And if you look at Shopify as a up or down scenario, for a limited scope of type of person. Whereas a storefront is literally anybody on the internet. Exactly, yeah. And if you look at Shopify as a up or down scenario,
Starting point is 00:17:11 the majority of that up or down scenario is likely looking at storefront, not so much admin or others. Like you might get a limited and a small portion of the world is going to care, but the majority is going to care about storefront being open and fast. Exactly. So storefront, the main criteria there is performance,
Starting point is 00:17:27 especially as people run onto mobile devices from all around the world. You want to get people to have their storefronts to be loading as fast as possible from any sort of circumstances. So in this case, there's a bit less requirement to get the right data at all times to be precise at any given second.
Starting point is 00:17:45 It's really more to get the response in their devices to start doing something. You think about this from a standpoint, you mentioned that Rails CRUD scenarios didn't really fit in with the criteria of this, which is we're kind of defining what was it that sort of pinpointed this, as Jared mentioned, why storefront?
Starting point is 00:18:04 Why would you rewrite this? And would you parallel other opportunities inside of Shopify? And I think you think about the right tool for the right job. And while we're not dismissing the fact that Rails isn't amazing, as you said, you're committed to both Ruby and Rails. So this isn't a matter of like done with Rails, see you buy. It's more like, well, maybe in this scenario, performance and speed and optimization, all these different things outweigh that. And I think the bigger play here might be to help us understand why Rails didn't fit anymore and why a rewrite made more sense.
Starting point is 00:18:38 And in particular, to Jared's question, why Ruby still yet, which makes sense because, hey, you've got a lot of Ruby knowledge inside of Shopify. It would make very little sense to move away from unless you had a really good long-term plan for that. But more so, why the rewrite for Ruby, but so much the right tool for the right job? Ruby is still the right tool for the right job, but Rails didn't fit anymore in that realm. What did you gain by going pure Ruby and non-idiomatic Ruby and all those things? One thing that's interesting is we're still using a fair amount of Rails components in that new implementation that aren't necessarily the whole Rails app itself, but we are using
Starting point is 00:19:15 some bits of active support to compatibility purposes to make sure that what we have in the monolith still works in the new application. There's various gems that are still used in Rails that we do works in the new application. There's various gems that are still used in Rails that we do use in the new implementation. So the way I see it, it's more of a hybrid in between a bare-bones Ruby app and what Rails would provide. What we're kind of putting behind is everything that's the Rails routing, for example, with something that we didn't necessarily need for that implementation because of how storefronts are routed can be implemented fairly simply without going with everything that Rails provides with routing. But there's a fair amount of behavior and features that Rails does provide
Starting point is 00:19:55 that we are using still in the application through gems and modules that we've imported there. It sounds like to me when you talk about the non-idiomatic ruby when i read the blog post a lot of the the things you're doing is like using the self-destructive style method calls like map bang where it's not going to return you a new array or whatever it happens to be of objects it's going to actually modify itself in place and the reason why you do that inside the storefront is because you are optimizing the heck out of memory consumption. You're trying to get memory
Starting point is 00:20:29 consumption to as small as it could possibly be. And so there's your why not Rails right there. It's like if we can load as little, I mean one reason of course, there's plenty of reasons, but if you could load as little bit of that code into memory as possible of what you need of Rails and not the entire stack,
Starting point is 00:20:46 you are undoubtedly going to save in memory allocation all those objects you're not using. Exactly. There was a, I think it was Sam Saffron from Discourse who posted a memory article, or it was an article about how ActiveRecord takes so much memory, and I think he compared it to SimpleSQL
Starting point is 00:21:06 or I don't remember the name. It's a library he wrote himself. I'm assuming it's a library where you're basically writing around a SQL with some help, I guess. How I learned to stop worrying and write my own ORM. There's this one, but there's also an analysis of memory bloat in ActiveRecord 5.2 which is a different one, which is interesting. And so that's a good example of memory usage that we've kind of skipped with the new implementation.
Starting point is 00:21:33 And for example, because Storefront is, almost all of it is read traffic, there's no writes involved, there's no deletes involved, there's no updates involved. It's really, I have a request, generate a read response and you know you just have to get data from the database and send it back um that sort of thing does not necessarily warrant using active record or anything that's heavier in terms of oram to read data straight sql kind of works to get the data out of there and uh you know having access to that data directly through reads is enough. So in this case, it's a matter of reducing memory allocations
Starting point is 00:22:08 to getting the garbage collectors to run for either less time or less often, and that has a major impact on their response times that we're seeing. What was it that really drew you to this rewrite? When did the wars begin to show, so to speak? Obviously, Rails has worked quite well for many years. You've IPO'd'd you're worth lots of money in terms of a company you're doing amazing what were the main things that started to prop up that said you know what we really need to get this down was it simply speed and uptime was it memory was it servers falling over was it like
Starting point is 00:22:41 servers on fire what was it that really like struck this chord and said, we need to really fix this two years ago? I think it was a progressive pain point that kind of, it never was a big thing that kind of appeared in one night. It's just something that with time we started to see performance slowly degrading with in terms of response times on the server. And eventually, we kind of had to do something about it to improve things. And interesting story is the initial commits for that applications were Toby himself, who took it upon himself to start
Starting point is 00:23:19 something and as a prototype, get something up and running and make it as lean as possible to get started. And then eventually that became a team and we picked it up and that became the project that we're working on. But there never was really one thing that kind of said, okay, that's it. We're doing this thing now. It was a slow process that eventually kind of arrived at the conclusion that we had to do something. Why not try something that's a bit different than what we would usually do and
Starting point is 00:23:47 let's see where this goes, where this leads and that's where we are now. And we kind of realized that the approach made sense and we kind of went along with it and we're still there now. That makes sense too why you mentioned the black box approach in terms of one of the success criteria meaning that feature parity. Obviously if you're going to replace something you want to be replacing it as equally as possible so that as you begin to switch over, they act very similar. So not only similar, but also ideally byte equal. So we want the exact same responses to be returned.
Starting point is 00:24:17 Similar is too vague then. Identical. It is, but it depends. It depends. So that's a good question. In some cases, we had to ignore certain things to go closer to what you're saying closer to equivalent rather than byte equal. So one example of that is and that's something that's in blog post when we do the verification
Starting point is 00:24:36 requests to send the initial request to both backends to see if the output is equivalent or ideally the same. What would sometimes happen is on some storefronts, you can try to render some random values and those values may change on a render by render basis. So if I try to do a request to one of the applications and then I do the same to the other application, but both use different random values, that's not going to be bite equal. And because of that, then to us, that would be a verification failure and would be, hey, those two applications are not doing the same thing. There's an issue, what's wrong? But then looking into those, we realized that it wasn't really an issue. It's more of the output was
Starting point is 00:25:19 using something that relied on either time or randomness based values. And because of that, that's something we have to ignore and say, that's a false negative. We have to just accept it and it's fine and move on to real issues. Our friends at Pixie are solving some big problems for applications running on Kubernetes. Instantly troubleshoot your applications on Kubernetes with no instrumentation, debug with scripts, and everything lives inside Kubernetes. But don't take it from me. Kelsey Hightower is pretty bullish on what Pixie brings to the table.
Starting point is 00:26:01 Kelsey, do me a favor and let our listeners know what problems Pixie solves for you. Yeah, I did this keynote at KubeCon where we talked about this brings to the table. Kelsey, do me a favor and let our listeners know what problems Pixie solves for you. Yeah, I did this keynote at KubeCon where we talked about this path to serverless. And the whole serverless movement is really about making our applications simpler, removing the boilerplate, and pushing it down into the platform. Now, one of the most kind of prevalent platforms today
Starting point is 00:26:20 is Kubernetes. It works on-prem, works on your laptop, works in the cloud, but it has this missing piece around data and observability. And this is where Pixie comes in to make that platform even better. So the more features we can get from our platform, things like instrumentation, ad hoc debugging, auto telemetry, I can keep all of that logic out of my code base and keep my app super simple. The simpler the app is, the easier it is to maintain.
Starting point is 00:26:47 Well said. Thanks, Kelsey. Well, Pixie is in private beta right now, but I'm here to tell you that you're invited to their launch event on October 8th, along with Kelsey, where they'll announce and demo what they're doing with Pixie. Check this show notes for a link to the event and the repo on GitHub or head to pixielabs.ai to learn more. Once again, pixielabs.ai to learn more. Once again, pixielabs.ai. all right maxine so you and the team defined your success criteria
Starting point is 00:27:34 feature parity improved performance and improving resilience and capacity and then you set out to rewrite the thing you had to somehow have some guide rails to know whether or not you were totally screwing up or not. We've talked about it a little bit. But guide us through the whole process. This has taken a while, and you had to invent a few tricks and tech just to help you get this rewrite written
Starting point is 00:28:00 and deployed out into the greater Shopify ecosystem. So tell us how you tackled this problem. So the initial vision around this was to, of course, when you're starting from scratch, there's always like a transitional period where you don't have anything. And up to a certain point, you don't have anything working well enough to say, we have something that's equivalent to the previous implementation. So that whole starting phase of the project is very much exploratory, and you don't really know if what you're doing works. To reduce the risk of that, we've implemented what we call a verifier mechanism. So what that allows us to do is to compare real-world production data and see if what we're doing is close to our reference baseline,
Starting point is 00:28:55 which is a monolith in this case. So the previous implementation lived in the monolith. We wanted to make sure that what we're doing in the new service is doing the equivalent behavior or the equivalent output in terms of what it does. And that mechanism allows us to say, okay, we're looking at the responses from both applications and we see discrepancy online. I don't know, 117, there's a missing module that's not included or something. And so that gives us an idea, okay, we have to fix this. Either most of the time, it's something that we haven't implemented in the new application, or it's something that's a bug that we've noticed that we didn't know was there. But because we have this verification mechanism that now we realize, oh, maybe there's been a bug in that application
Starting point is 00:29:38 in the monolith that we didn't really know about. And that's something that we've ran into of, like, it's a bug that's been there for six years, it we never really knew it was there. But upon doing that verification process, we realized that we had implemented something that was the right thing in the new application. And upon comparison realized, it's never been the right thing in the previous one. So it helped us figure out how to go and implement what's most impactful to get towards completion and parity in terms of features as quickly as possible. So that's why at first, when we first started the project, we started to look at a single shop to say, okay, that's the one shop we want to try to support and target for the
Starting point is 00:30:20 release of that new thing. Running the verifier mechanism gets us to a point where we're able to say we're that close to getting that response to be exactly the same for the new application. And from there, move on to other endpoints, other shops, and then figure out how to scale to the rest of the platform. So I've done this at tiny scale, where I take one endpoint, I curl the endpoint, take the response, take the other endpoint, the one that's in development or the one I'm building, curl the same thing, take the response, pivot the diff,
Starting point is 00:30:52 and then I look at the diff, and I hope diff says these two files are identical or whatever it says, right? You never quite got there because of all this randomness and stuff, but was your verifier essentially a differ that you just lodged into your request pipeline? Tell us how that worked.
Starting point is 00:31:08 Exactly. So that's exactly what the mental model around this is. And it's just, instead of us doing the curls and the diffs manually by hand in the command line, it's something that's happening. Your customers did it. Exactly. And automatically for us, we're getting some data coming back from everyone just requesting storefronts from all around the world. Right. So of course, there's that part of the process, but there's also the one that we're doing
Starting point is 00:31:32 locally on our own machines. So this thing runs in production where the verifier gives us data about, okay, there's that many failures in terms of verification that you have to fix for shop XYZ and then over these different endpoints. But also once we know that there's an issue on a given storefront or a given endpoint, we then have to go in our machines and try to figure out, okay, what's the issue? How can I fix it?
Starting point is 00:31:56 And how can I bring it back to parity with the baseline implementation? So that specific mechanism is into a NGINX routing module where it's written in lua we're using open resty and what it essentially does is at the very beginning of the project all traffic was going all storefront traffic was going to the monolith for storefront traffic nothing was really going to the new implementation as we were just getting started. The nice thing about that is for every, I don't remember exactly what sampling rate we had, but for example, something like for one in every thousand requests that are coming into Shopify for storefronts, take it, but also do the request to the new implementation and compare the output of those two requests, do a diff on them, and then upload whatever diff results happen
Starting point is 00:32:47 to an external store so we can look at them later and see what was the issue. So that helped us figure out that certain shops have more diffs than others, certain endpoints have more diffs than others, but also seeing the traffic patterns of where we should try to tackle at first was a super nice signal for us to say, okay, there's that many failures there, let's try to do this one first to get as much impact as we can there, and then move on to the other ones eventually. What kind of diffs are we talking about here? What kind of non-parity, what's an extreme example and maybe a non-extreme example of a diff gone wrong?
Starting point is 00:33:26 What's to say? I could talk about this for hours. So, one of the most extreme examples is you try to open the page. So, I'm talking about a buyer's perspective. You open up a storefront and assuming that the
Starting point is 00:33:41 new implementation was going to render that page, all you would see is a blank page, nothing else. And that's one of the most extreme examples where you're like, okay, this cannot go out in production, right? And one of the reasons behind that could be a missing head tag. For whatever reason, there's a missing thing that the page does not work at all. Some of the more extreme examples in terms of, not in terms of how a buyer would perceive it, but in terms of
Starting point is 00:34:07 how we would perceive it is there's a missing new line, but that's it. There's just one missing new line somewhere. For some reason, we're not returning the same string for whatever method. And the verifier screams, Hey, that's not the same thing. There's a missing new line there, which is not in the old one. And that's something that we have to deal with so of course all of those non-problematic issues we were seeing we started to say oh that's not actually an issue we can just ignore that away and say that new line is not really a problem or fix it of course but there's some cases where there's issues that we realize just like the time-based and the randomness-based issues that we didn't really want to block us as we started to get more and more support for certain requests.
Starting point is 00:34:51 And we were able to say, okay, well, these patterns that look like this, for example, if there's a timestamp in the responses, we can just ignore that away. If there's a script ID that's being generated by the script or something, ignore that away. And then as more patterns started to come up, we came up with a pool of patterns that we knew weren't problematic, and knowing those, we were able to focus on the actual issues.
Starting point is 00:35:16 Let's talk about that thing, because this isn't like a typical error tracker that you're doing. This is parity, and I'm curious how you log these things. How do you keep track of and organize this so not only you, but others can triage
Starting point is 00:35:31 this and say, okay, these are the ones we should pay attention to, these are the ones we shouldn't. So like, it's probably not your typical bug tracker that's doing this for you, or maybe it is, I don't know. How did you log these things and then organize them? Great question, and it actually ties into how would you do this? So assuming I'm asking you the question, you have to do this project where you implement
Starting point is 00:35:51 this new thing and you have a million merchants on your platform that are using the like so many storefronts on the platform, you have to get parity for all of them. Do you go with a breadth first approach where you try to support as many shops as you can for, say, a single endpoint, like all index pages, and you support all index pages across all of the merchants? Or you try to go for a single merchant and cover everything for that single merchant, but maybe that merchant has some features that the other merchant does not use, and you have to think about, well, maybe I should consider other merchants first. Should I consider the bigger merchants, the smaller merchants, because there's more people?
Starting point is 00:36:32 Do I want to look at it as a theme-based approach? Because usually there's going to be a mass amount of shops that use the same theme. Different ways of looking at this. And in this case, it kind of ended up being a thing where we started with a handful of shops at first that were the most, I guess I could say problematic shops in terms of performance, where they would cause a high amount of load on the platform because of their storefront traffic. And from there, getting the diffs from them to fix them eventually. But there are two ways of seeing this. So it's either breadth-first or depth-first.
Starting point is 00:37:08 So to analyze the actual diffs that we see in terms of parity, we upload that to Cloud Storage where all the diffs are kind of aggregated. And later on, we can figure out how those are. And then the other side of this is that we keep track of where the diffs happen in terms of, is it shop X, Y is it endpoint abc and based on that we can run through our logging pipeline to see where do most of the issues
Starting point is 00:37:33 happen is it on that shop isn't on that endpoint and that gives us an idea of what we should try to tackle at first so on splunk we have so many dashboards that are just trying to figure out you should look at this first, because this is where most of the issues are happening. Datadog is also giving us a bunch of information in terms of where we should focus on first. And later on, what's happening is that on the developer side of things, we have tooling locally to be able to kind of comb through the diffs that we have stored on the cloud storage part of it, and read through what are the most frequent ones.
Starting point is 00:38:06 I don't know what I would have done here, honestly. It seems, as you described it, quite overwhelming. As you mentioned, I might have gone down both roads and tested both sides of the water and sort of drawn some consensus from the team to see, okay, which direction do we feel is better, breadth or depth? And I think I might have done a little bit of both to get a sampling of each direction. But it seems like just daunting. Millions of merchants,
Starting point is 00:38:33 unlimited amounts of traffic, tons to dig through, and anomalies everywhere. So I have extreme empathy for you. It seems like a daunting task. I would send, there's a team, right? So yeah, it's not just me, of course. We're a team of- Of course, yeah. I mean, the proverbial you, meaning like you many. So then you like divide and conquer, right?
Starting point is 00:38:51 So like you said, one team depth and one team breadth, and then you meet in the middle. That's like when you're- Interesting. Yeah, that's something we did try to do. So some people were focusing on a single shop while others were trying to cover as many shops as possible. And I think what eventually happened is, you know, when you look at things, you have to
Starting point is 00:39:09 do everything at some point anyway. So you're going to have to go through both paths kind of as a balance, try to do both at the same time. And eventually you'll reach a critical mass of supported requests from where you can kind of move on to go into the more specialized things for depth. So this is my kind of problem, by the way. I'm a completionist. I love this kind of problem. Here's a big goal, right?
Starting point is 00:39:32 We know what the end looks like. It's called 100%. We're not there yet. We know we're at 32% or whatever your numbers indicate. And we have a clear path to gain there. What do you do? Well, you check the next diff and then you fix that problem. And then you, oh, now we're at 30. OK.
Starting point is 00:39:47 And you try to find the ones where you can implement a module and chop off a whole leg. And you're like, oh, look at that. That module just solved these 15 problems. And you're just on your way to the end. It's like a good video game where you're like, I'm trying to find, I got my main quest and my side quests. I've got to get them all. So let's just start hacking away and making some progress. video game where you're like i'm trying to find like i got my main quest on my side quests i've
Starting point is 00:40:05 got to get them all so let's just start hacking away and making some progress i would i would enjoy this and i'm sure it gets you get down in the mucky muck and you're like oh these new lines are killing me you know yeah very much yeah but still i'm sure they're those huge wins where you just slice off one big chunk and you see all these different stores go to the parity. It has to be pretty cool. And that's super interesting you say that. So the percentage-based thing, we do have that. We have a dashboard that says, this is the current support we have.
Starting point is 00:40:32 We're going towards 100%. And one of the funnier moments during this project where at the very beginning, it was easier, I think, than it is now because we're running into this. We're seeing diminishing returns now because it's more edge cases and we're trying to fix all the tiny, small things that are left to be fixed. But in the very beginning of the project, every single PR could potentially unlock so many more shops
Starting point is 00:40:56 and so many more endpoints. So on every deploy, we kind of look at that percentage metric and say, how much is my PR going to do there? Is it going to bump by 0.5%? Is it going to be 1%? Like 2%? That would be amazing. So that kind of gamification of the thing
Starting point is 00:41:13 also made it fun and helped us run towards 100%. So two questions about Monolith in the meantime. The first one is, and maybe I guess they are related. So was Monolith a moving target, or is it pretty static in its output? As you build it, because it takes a couple years,
Starting point is 00:41:32 were there changes going into the Monolith storefront so you had to play catch up? We did. So play catch up for different things. So you have to play catch up with internal changes in terms of other developers working on the Monolith and us trying to catch up with that. There's also catching up in terms of what merchants do. So if merchants start using a given feature that we just do not support in the new application yet,
Starting point is 00:41:54 then that's another source of potential catching up we have to do. So you could go backward. You wake up in the morning, you're at 33%, now you're at 28% because somebody used a new feature. Exactly. That's happened multiple times where you, exactly that. You go out in the evening, you're like, oh, nice. We're at like 37%. And then the next morning was something like 17%. We didn't do anything. What happened, right? Or you go to lunch and you're like, we left for lunch at 32%. We came back at 20. What happened here? That's demoralizing. Exactly. Exactly, exactly.
Starting point is 00:42:25 So that sort of thing was one of the harder parts. Of course, you have to deal with how do you get people to onboard your new project to get them to help and support that new project as well as you trying to get them to work on it. So eventually, I decided to make it kind of my quest to make it as easy as possible for others to start contributing to that project by making the documentation amazing, as welcoming as possible to reduce the friction, and to basically get people to say, hey, look, it looks more exciting to work on that new thing than it is to come onto the project and on their own kind of contribute to the new thing rather than only keep working on the previous implementation. So that's something that really helped. I think in terms of how we drew the line was, if it's something that was already in the monolith by the time we started the project, that's something that we would have to take on ourselves, the team working on the rewrite. If, however, it's something that's not in the monolith yet, it's not anywhere, it's a new project, then that team should be able to consider both projects
Starting point is 00:43:35 because they know it exists. They know they have to build for the future and make it into the new application as well. And that's how we kind of got that line drawn to say who's handling what. Eventually we reached a point in the project where most people were also writing that code in the new application. They knew they had to do it to be future-proof. So my second question about the monolith is because you were going for parity,
Starting point is 00:44:01 did you ever have to re-implement bugs or suboptimal aspects of the monolith because you had to have the exact same output? Yes. That also sucks. It's like, this is my brand new thing, I have to put the bad stuff in the new thing? Yeah, well, so it's a bad thing, I guess, for us as developers in terms of, it doesn't feel good. Yeah, demoralizing. But in terms of how a buyer or a merchant would see it, people using Shopify, to them, that's good.
Starting point is 00:44:25 So one of the goals we have is to basically make it so that for storefront, specifically for online store, if you have a theme archive that you have from eight years ago, for example, and you try to upload that today, it should work the same way it did eight years ago. We try to be as backwards compatible as possible. So, of course, if there's something that was introduced eight years ago, we have to be as backwards compatible as possible. So of course, if there's something that was introduced eight years ago, we have to make sure that's still there. And the other thing is Liquid, the gem that we use for and that we built and we use for storefront templates is almost train complete. So the possibilities of what Liquid can do is almost infinite. So there's features that we shipped at some point that kind of became used in ways that we did not expect and that we did not really think about
Starting point is 00:45:13 that we still have to support today. Even though that's not what we want it to be used for, we have to keep it this way. So of course, we had to port some bugs that unfortunately are kind of, it doesn't feel good. But for the people using that, it's a, I think, a service to them to say, look, your thing you had from a few years ago still works today.
Starting point is 00:45:33 And there's no breaking change in there. What's up, friends? When was the last time you considered how much time your team is spending building and maintaining internal tooling? And I bet if you looked at the way your team spends its time, you're probably building and maintaining those tools way more often than you thought, and you probably shouldn't have to do that. I mean, there is such a thing as retool. Have you heard about retool yet?
Starting point is 00:46:09 Well, companies like DoorDash, Braggs, Plaid, and even Amazon, they use retool to build internal tools super fast, and the idea is that almost all internal tools look the same. They're made up of tables, drop-downs, buttons, text inputs, search, and all this is very similar. And Retool gives you a point, click, drag and drop interface that makes it super simple to build internal UIs like that in hours, not days. So stop wasting your time and use Retool. Check them out at retool.com slash changelog. Again, retool.com slash changelog. Again, retool.com slash changelog. so max over the process of the rewrite you have this verifier in place it's getting some traffic
Starting point is 00:47:14 to it via the nginx routing module but this is for your learning right so it's still going to the main monolith it's also going to the verifier running your code, doing the diffs. At a certain point, I assume, since the blog post is out there and we've got Black Friday coming up, you are confident enough, you have a high enough percentage on a high enough number of stores or themes
Starting point is 00:47:36 that you say, we're going to start rolling this thing out. Tell us how, I think the routing module played a role here. You kind of automated this process. Tell us about that because the routing module played a role here. You kind of automated this process. Tell us about that because I think it's pretty neat. Yeah. So along with the process of verifying traffic, we wanted to start rendering that traffic for real people out there buying stuff to serve that traffic using the new implementation. And we simply leveraged to the existing verifier mechanism to say, if you have a certain amount of requests that have been equivalent, and that's happened in a given timeframe, then consider that endpoint for that specific shop to be eligible to be rendered by the new implementation. So that was all kept tracked as a very, it's very much a stateful thing, a stateful system to keep track of what those requests are, should they be considered,
Starting point is 00:48:31 and if they are, start rendering. So all of that was being kept track in the verifier mechanism, again, in NGINX, storing that into a key value store to just keep track of how many requests we're getting, whether they're the same or they're not. And we have this whole routing mechanism that we control to say, okay, assuming that we have that amount of traffic in that amount of time that is equivalent to the baseline reference implementation, then the routing module would start to send traffic to the new implementation instead. From there came the need to do some verifications as well against the monolith this time. So because we now start to route traffic
Starting point is 00:49:12 to the new implementation, you also want to make sure that what you're doing is still valid. So you're not just sending traffic to the other place and saying, okay, we're done. We're just moving on to the other thing. You want to keep making sure that what you're doing is still valid.
Starting point is 00:49:26 So we keep verifying traffic, kind of reverse verification, where you're doing the verification usually from the monolith to the other one. Now you're doing the opposite, because you're serving traffic from the other application. And that kind of started out as a few shops that we wanted to take care of, because those were the main winners from what the new application was doing in terms of impact on performance and resilience and everything. And once we started to gain confidence, I think that's when we kind of opened up the, like we pressed on the throttle and just moved it up to a lot more shops to say, okay, that mechanism works.
Starting point is 00:50:03 We're seeing that it scales. We're comfortable with opening it up to a lot more shops to say, okay, that mechanism works. We're seeing that it scales. We're comfortable with opening it up to more merchants. And that's where we kind of reached a critical mass of serving the traffic for more merchants on that new implementation. So it's shop by shop. It is shop by shop, also split up by endpoint itself. So for a single shop, we're not necessarily rendering the entire shop with the new implementation. We're maybe rendering half of that storefront's endpoints with the new one and still the other half with the old one.
Starting point is 00:50:32 That should not change anything for the people browsing that storefront. To them, it's really the same thing. We're just maybe a bit faster depending on what endpoint they're on. How long do you keep that up in terms of this reverse parity? Because kind of going back to the last segment when you mentioned you're enticing people to write new features in, you know, monolith and new application. And I suppose to kind of keep this reverse parity at some point, you got to keep the same, I guess, features in both code bases. So do you sort of you reverse the idea of,
Starting point is 00:51:05 okay, we're going to build a new, but we're also going to build an old too for a certain period to keep that. Is that what you've done, or did that force you to do that, to keep that overlap in terms of parity over time? Pretty much. And that's, I think, one of the main challenges
Starting point is 00:51:19 in terms of, if I'm a developer at Shopify that has to ship something, for a specific period of time, you had to write the thing in both applications, which is not ideal. There's additional work pressure added onto this. And the goal was to make that period as short as possible. So now our focus is really on removing the old code
Starting point is 00:51:37 from the monolith to say, okay, well, there's only one canonical source of truth now. That's a new implementation. This is where the code should be going. And reducing that period of time where there's two implementations going on. How long will that be then, you think? What's your anticipation for that overlap? It's been interesting because of the way we kind of advertise the project internally.
Starting point is 00:52:02 There's a bit of time where it was more considered as an experiment. And we didn't really want to go and shout off rooftops to say, hey, there's this new thing. It was more of a thing where we're kind of internally experimenting with something, seeing if it works. And there was a point eventually we realized, okay, this is going to be the future of Storefront at Shopify internally. And this is how we should be doing things. From that moment on is when we kind of wanted to get people to be aware that they should be writing their features with that new implementation in mind, to think of how they would build it based on the fact that we now have this new thing. It's not the old one anymore. You may still write the thing in the older one if you need it to be available for the
Starting point is 00:52:38 previous implementation. But in terms of making sure that people don't have to write that code twice, of course, they have to for some period of time. It's not great, and we wanted to make that as short as possible. But that's something that did happen for a while and still happens for certain parts of the storefront based on what we do support and what we don't at the moment. This is a trade-off, really. I mean, anytime you do a rewrite, you have to ask yourself,
Starting point is 00:53:06 what am I willing to sacrifice to do this rewrite? Should I do it at all? Which is kind of the bigger picture. We haven't really asked you that. I mean, to some degree it's in between all the lines we're drawing here, but this is a trade-off. Like, having to do that is a trade-off of, you know, something that's worth pouring into
Starting point is 00:53:21 in the future. So to me, it's like, well, it's not ideal, but in order to rewrite in a smart way, all the steps you've talked about, the verification service, the diffing, all the work, the double implementation for a time period, the overlap and the continued reverse verification, to me, that's necessary.
Starting point is 00:53:41 Maybe in particular with your style of application, in terms of your customer base, all the routing you have to do just generally. But that's a necessary trade-off that you have to do when you say, okay, we have to solve this problem for the future. And unfortunately, this lab experiment gone right is the future, so we have to do it this way in order to rewrite this thing. To me, that's just a necessary trade-off. It's like a tiny bit of pain now that is going to be so much better later on.
Starting point is 00:54:12 So why not do it now? Ripping off the Band-Aid and just dealing with it now. Of course it's a bit painful, but there's going to be so many benefits from this, so let's not think about it too much for now. So are you beyond the pain, or are you still in the pain in terms of the double implementation? Your fellow devs? We're way beyond the pain.
Starting point is 00:54:28 So Monolith doesn't get any new features? It does in rare cases. In very, very rare cases. And again, I said earlier about how the storefront implementation we're doing now is mostly a read application. The rest
Starting point is 00:54:44 of the features would be mostly everything that updates or writes, and that may still be going in the monolith for now. And with time, we may be thinking about doing something where writes would also go to the new implementation. Not clear yet if that's something we want to do, but at the moment we're really focusing on getting those reads served by the new implementation. So here you are, you've ripped off the Band-Aid, you're at the end of this process, you've made the necessary trade-offs. Was it all worth it? Would you do it again? What are the wins? What are the takeaways from Shopify and your team?
Starting point is 00:55:27 Right. So I think one of the main things I keep in mind in this is I've read so many articles and blog posts and opinions everywhere on the internet over the years that say, don't do rewrites. That's like the main takeaway to remember for all of those articles. And I want to be the person who does the opposite and says, rewrites are possible. You can do rewrites if you do them right. And how do you do them right? There's many key things involved in there, but it's not a thing where you have to kind of push the rewrite option aside if it's something that you don't think is possible. The main thing, of course, I think is communicating early and often with the people that are involved in that process. So, getting the right people, of course, in the example of Shopify, the developers, it's getting them aware
Starting point is 00:56:07 that there's going to be this new application coming out that you have to think about, so that your features and your roadmaps and everything are aware of that new application. And one mistake I made myself was trying to send an email at some point to say, hey, this is the new implementation of how storefronts work. One email is not going to cut it. You have to be frequently getting in touch with people and making sure that they're aware. And eventually the word kind of gets around, people get excited about the thing. And that's when you kind of get this excitement going on for that new implementation. And along with making sure that you have the most frictionless process to
Starting point is 00:56:45 work on that new thing, when you get those two things combined, that's where magic happens. People want to work on the new thing because they also realize that it's easier, it's more fun to work on. So it becomes a self-fulfilling kind of prophecy, where you want it to happen, but people kind of make it happen for you. And that happens through communication, I think. So that's one thing. Of course, in terms of how people using Shopify benefit from this, we're seeing some great results in terms of performance. On average, over all traffic coming up on the platform,
Starting point is 00:57:20 we're seeing around 3 to 5x performance improvements for storefront response times on cache misses. So of course, we're focusing on cache misses because cache hits are almost always fast. But whenever there is a cache miss, that's what we want to optimize and make sure is always fast. So that was kind of the first rule, I guess, of the project to say, don't really think about caching because we want to first have very, very fast cache misses. And only when we know that we have some very fast cache misses do we
Starting point is 00:57:51 want to start thinking about caching on top of that. So we don't want to cheat away the cache misses by saying, oh, it's not a problem because we just add caching onto there. Sure, but what happens when you don't have a cache hit? That's when we want to be extra fast. And in this case, we're seeing some good performance there. In some but what happens when you don't have a cash hit? That's when we want to be extra fast. And in this case, we're seeing some good performance there. In some cases, we're seeing some store funds kind of surprised us in terms of how they performed, where we were seeing up to 10 times faster cash misses. So that's something we were very surprised to see in the early process. And we kind of knew that this was the right way to go because we were seeing those good results. If you guys let me listen to this show right now, they're thinking, you know what, Maxine's right.
Starting point is 00:58:29 I think we could probably do a rewrite. You've obviously made them believe in the potential if done right. What are some of the steps you would take to do it right? We've talked about a lot of them, but if you can kind of distill it down to like three or five core pieces of advice, what might those be for our listeners? So I think one of them is to make sure you have the shortest feedback loop possible when you kind of get off track. So for example, our verifier mechanism gets us there, in a way that if we implement something that's not equivalent to the previous implementation, we're aware of it really quickly. In terms of minutes, when we deploy, we know, okay, we shipped something that's different. Let's look into it, figure out what's going on, and then move on to the next thing. So that's number one.
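As a rough illustration of that verifier idea, here is a minimal Ruby sketch that renders the same path with both implementations and reports any mismatch right away; every name and interface here is hypothetical, not Shopify's actual verifier.

```ruby
# Hypothetical sketch only: render the same path with the old and the new
# implementation, compare the results, and surface any difference immediately
# so the feedback loop stays short.
class StorefrontVerifier
  def initialize(old_renderer, new_renderer, reporter:)
    @old_renderer = old_renderer
    @new_renderer = new_renderer
    @reporter = reporter # called whenever the two outputs differ
  end

  # Returns true when both implementations agree on the response.
  def verify(path)
    old_html = @old_renderer.call(path)
    new_html = @new_renderer.call(path)
    return true if old_html == new_html

    @reporter.call(path: path, old_html: old_html, new_html: new_html)
    false
  end
end

# Stand-ins for the monolith and the new implementation, differing on purpose.
old_impl = ->(path) { "<h1>All products</h1>" }
new_impl = ->(path) { "<h1>All Products</h1>" }

verifier = StorefrontVerifier.new(
  old_impl, new_impl,
  reporter: ->(path:, old_html:, new_html:) { puts "Parity diff on #{path}" }
)
verifier.verify("/collections/all") # => false, and the diff is reported
```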
Starting point is 00:59:13 The second one would be to start with a tiny scope, to scope it down to a small thing that kind of validates the approach, to say, is this a road we even want to go down? Or is it not something that's realistic? If it is something that's realistic, having that smaller scope will get you to a point where you know, okay, this seems to make sense so far. I don't see why it wouldn't be okay as well
Starting point is 00:59:41 for the rest of the project. Let's make it happen. So that is the second one. I think the third thing, as I said, is communicating early and often, but also making it easy and enjoyable to work on the new thing, like reducing friction. The bar to entry to work on the new thing is a critical thing to get adoption internally for people that need to ship their features and their roadmaps on the different applications. If it's enjoyable for them to do it on the new one,
Starting point is 01:00:13 they're going to do it. You don't have to force them to do anything. It's just going to happen by itself. If you're going to cause them pain, cause them the least amount of pain, right? Exactly. And we're trying to balance this out, right? So if you have to get people to write code in both applications, at least make it enjoyable
Starting point is 01:00:29 in the new one. Right. Or more enjoyable than it would be in the previous one. Maybe distill that one a bit more. Is that, you mentioned documentation being a critical piece for you. What beyond documentation helps that? I think the one thing I'd say is if you have a Slack channel, a public Slack channel, be the best customer service person you can be. If people arrive with questions, be available,
Starting point is 01:00:51 help them. They are the people that are eventually going to make that a success or not. You are kind of the messenger to say, hey, we have this new thing we're trying to build. Can you help us? And if they are there and they receive the help, that's going to help tremendously to get everyone on board. And that person that receives the help is going to talk about the project to their own team. And the team will eventually get to that point where, because someone on their team is familiar with it,
Starting point is 01:01:19 they kind of become that expert on the team. And the word spreads. So helping people, creating this kind of expert network within the company, people that are aware of your app and excited about it, who kind of share the word for you. So helping people, I think, would be the best thing. And being a good customer service person to just get them closer to what they need to achieve. One reason that a big rewrite is scary is because it's difficult or even impossible to bound or bind your risk, right? It's unbounded risk. And there are steps you can take to fix that. You guys had very clear goals, I think that was one of the reasons you succeeded, and you had a
Starting point is 01:02:00 way of measuring that as you went. So that's clear and awesome. Were there any failure mechanisms, like a kill switch, a "we're not going to make it"? Because you said it was an experiment. But the other thing is you can say, well, what would failure look like? Obviously you succeeded, but did you have failure thresholds where it's like, you know what, we're going to abandon this and go back to the monolith because we'll try something totally new instead? Or were those things not thought about?
Starting point is 01:02:32 No, they definitely were. They definitely were. Internally, we call those tripwires. So eventually you get to a point where you have to figure those out. If you don't, you'll just keep going and you're going to get yourself deeper and deeper into the problem. So figuring out what those tripwires are, at what point you're comfortable with those tripwires, and saying, okay, we're just getting too deep now, and now we have to come back and go to something else. That's something we did. So for example, some of them, we talked a bit about, you know, how we have some catching up to do with internal changes. If we're not able to catch up in time with whatever's being changed in the internal monolith, that's something we have to be
Starting point is 01:03:10 careful with. If, by rolling out the application in production, for example, we're causing some incidents in production, that's a tripwire. Because if the monolith is not doing that, and we are, then potentially we're not ready to go in production, and we have to be careful of that. So there's multiple tripwires we set to make sure that this did not happen. And thankfully, we didn't hit those. We didn't hit them too often to say, okay, it's a problem we should look into and maybe reconsider the entire approach. So in this case, figuring out those tripwires way before you get to a point where you want to roll out is critical to make sure you're doing the right thing.
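As a rough illustration of the tripwire idea, here is a minimal Ruby sketch that checks a rollout against limits agreed on up front; the metric names and thresholds are made up for illustration.

```ruby
# Hypothetical sketch only: tripwires as explicit, pre-agreed limits that are
# checked while the rollout progresses, giving a clear signal to stop and
# reassess instead of digging deeper into the problem.
Tripwire = Struct.new(:name, :limit, :current) do
  def tripped?
    current > limit
  end
end

tripwires = [
  Tripwire.new("production incidents caused by the new renderer", 0, 0),
  Tripwire.new("unresolved parity diffs", 50, 12),
  Tripwire.new("days behind the monolith's internal changes", 7, 3)
]

tripped = tripwires.select(&:tripped?)
if tripped.empty?
  puts "All tripwires clear, keep rolling out."
else
  tripped.each { |t| puts "Tripwire hit: #{t.name} (#{t.current} > #{t.limit}), stop and reassess." }
end
```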
Starting point is 01:03:50 Yeah. I like the word tripwire. That's a very good, concise word for exactly what it means in this case. Is there anything else we haven't asked you that you're like, man, I want to talk about this before we tail out? We're at that point, so what else you got? One experience that was quite interesting to me personally was when we started the project, we were around five people, four or five people. And that's in the first few months,
Starting point is 01:04:18 we start to get progress. We eventually realized that we're getting closer to rendering certain pages for certain merchants. And that's only internally. We're not in production yet. And at some point, we get a request to showcase it to everyone, all developers and everyone at Shopify internally during our Friday town hall meeting. So everyone's kind of watching this, we're kind of sharing context and everything. And this is streamed to all employees at Shopify. So the one thing we did was to have a simple webpage that shows you the same page being rendered by the monolith and then by the new implementation we're doing. So side-by-side, two iframes,
Starting point is 01:04:57 just requesting the same page and seeing which one's faster. And when that thing ran during the live stream when we showed it, the new implementation was way faster than the old one. And I think that's when it kind of clicked in my head and in many people's heads, okay, this is the right path forward. They're doing the right work and we have to continue doing this thing.
Starting point is 01:05:19 So from that moment on, I think there was kind of a, it was kind of a turning point in the heads of developers working on this to say, this is the thing we're now focusing on in terms of the future to make sure this happens. That's awesome. I can imagine that feeling, because you weren't expecting that call, like, hey, can you demo this to everybody?
Starting point is 01:05:38 Yeah, we weren't. Were you about to throw up? Were you antsy? Of course we had tried it before a few times to make sure it was okay, but seeing the real thing, like the live demo of clicking the button and both pages come up, seeing the new implementation coming up way faster was like, okay, good. We're in a good place, we're going to be fine, and we can keep working on the next endpoints
Starting point is 01:06:05 to make sure that we're doing this for everyone on Shopify. That's a cool story, I'm glad you shared that. Congratulations to you and the rest of the team making this happen. I know this has been a multi-year road with many facets, opportunities for tripwires, obviously we hit success, which is great. You're at 90-plus percent parity right now, on your way still to 100, right?
Starting point is 01:06:29 Is that still the case based on your blog post? Or are you closer to 100 now? We're getting towards 100. Exact numbers vary by the day, again, because we're catching up with certain things and there's external circumstances. But we're getting very, very close to 100%. The majority of traffic is being served by that new implementation now.
Starting point is 01:06:48 It's really a matter of fixing the last few diffs and figuring out how we can get to that place faster. Of course, like you said earlier, Jared, it's a game where you have to pick up the issues one by one and just fix them until you get to a point where everything is fine. But yeah, massive team effort to get there. A lot of infrastructure work, a lot of parity diff fixing, a lot of external communications as well with the merchants. It's a Shopify-wide initiative, and seeing the thing take off and work is super rewarding in the end.
Starting point is 01:07:27 Well, we've appreciated this conversation and thank you so much for sharing the story with our, with Jared and I and the rest of the audience here on ChangeLog. Thanks so much for everything. It was very fun. Thank you. That's it for this episode of the ChangeLog. Thank you for tuning in. If you haven't heard yet, we have launched ChangeLog++. It is our membership program that lets you get closer to the metal, remove the ads, make them disappear, as we say, and enjoy supporting us. It's the best way to directly support this show and our other
Starting point is 01:07:56 podcasts here on ChangeLog.com. And if you've never been to ChangeLog.com, you should go there now. Again, join ChangeLog++ to directly support our work and make the ads disappear. Check it out at changelog.com slash plus plus. Of course, huge thanks to our partners who get it, Fastly, Linode, and Rollbar. Also, thanks to Breakmaster Cylinder for making all of our beats. And thank you to you for listening. We appreciate you. That's it for this week. We'll see you next week. Bye.
