The Changelog: Software Development, Open Source - Shopify’s massive storefront rewrite (Interview)
Episode Date: October 16, 2020

Maxime Vaillancourt joined us to talk about Shopify's massive storefront rewrite from a Ruby on Rails monolith to a completely new implementation written in Ruby. It's a fairly well-known opinion that rewrites are "the single worst strategic mistake that any software company can make" and generally something "you should never do." But Maxime and the team at Shopify have proved successful in their efforts in this massive storefront rewrite, and today's conversation covers all the details.
Transcript
One of the main things I keep in mind of this is I've read so many articles and blog posts
and opinions everywhere on the internet over the years that say, don't do rewrites.
That's like the main takeaway to remember for all of those articles.
And I want to be the person who does the opposite and says, rewrites are possible.
You can do rewrites if you do them right.
And how do you do them right?
There's many key things involved in there,
but it's not a thing where you have to kind of
push the rewrite option aside
if it's something that you don't think is possible.
Bandwidth for Changelog is provided by Fastly.
Learn more at fastly.com.
We move fast and fix things here at changelog
because of Rollbar.
Check them out at rollbar.com.
And we're hosted on Linode cloud servers.
Head to linode.com slash changelog.
What up, friends?
You might not be aware, but we've been partnering with Linode since 2016.
That's a long time ago.
Way back when we first launched our open source platform that you now see at changelog.com,
Linode was there to help us and we are so grateful.
Fast forward several years now and Linode is still in our corner behind the scenes helping us
to ensure we're running on the very best cloud infrastructure out there.
We trust Linode. They keep it fast and they keep it simple. Get $100 in free credit at
linode.com slash changelog. Again, $100 in free credit at linode.com slash changelog.
What's up? Welcome back, everyone. This is the Changelog podcast featuring the hackers, the leaders, and the innovators in the world of software.
I'm Adam Stachowiak, Editor-in-Chief here at ChangeLog.
On today's show, we're joined by Maxime Vaillancourt to talk about Shopify's massive storefront rewrite from a Ruby on Rails monolith to a completely new implementation written in Ruby. As you may know, it's a well-known
opinion that rewrites are the single worst strategic mistake that any software company
can make. And generally, it's something you should never do. But Maxime and the team at Shopify
have proved successful in their efforts to rewrite Shopify's storefront. And today's conversation covers all the details.
So, Maxime, we teased this conversation when we had your colleague, Simon Eskildsen, on the show a few weeks back, episode 412.
But he was here to talk about napkin math, and he was part of this Shopify storefront rewrite.
But we didn't want to talk about it with him because we knew you were coming up and we want to talk about it with you.
So first of all, thanks for joining us on the ChangeLog
to talk about this project of yours.
My pleasure. Thanks for the invite.
So it's a big thing to rewrite anything,
but especially something as big and powerful
and successful as Shopify.
I think the main piece of the news that people latched on to was the monolith versus microservices
implications here as Shopify has been a poster child of Rails monoliths at scale, right?
Yep.
And it's something that we're learning about every day, I think, the more we kind of figure out what should go in the monolith and what shouldn't. And that kind of led to the decision to split that specific domain into a separate application.
Yeah. So maybe as we lay the foundation for this conversation, we're going to go through the decision-making process for the rewrite.
foundation for this conversation we're going to go through the decision-making process for the rewrite.
We're going to go through the process
of actually getting it done
because it was a couple-year endeavor.
And at scale, you have to move slowly and carefully.
And you guys built some really cool stuff
with some NGINX Lua rewriting things
to make sure you're doing it right along the way.
We're going to dive into those details.
But first of all, lay the groundwork of what the monolith looks like, maybe before you started,
what all is involved. People know Shopify as an e-commerce platform where people can run their
own online shops, and so it's multi-tenant in that way. I assume a lot of our listeners know
what Shopify is, but what does the monolith look like and what all is it doing?
Right, so Shopify was started almost 15 years ago now and this all started as a simple Rails application,
the majestic monolith approach
where everything would be into that one application,
which we're still using today, 15 years later.
And that's the main place where most Shopify developers
tend to ship code into.
Of course, with scaling, you run into challenges with how you get to handle multiple hundreds of
developers working on the same platform, shipping that code base at scale, but also dealing with
what should go into the monolith. And we eventually ran into a point where it wasn't possible to use Rails as-is in its base form; we hit the limits of what we could do with basic Rails.
So as you mentioned, Shopify is a multi-tenant application. So just dealing with the implications
of that and making sure that there's no cross-tenant access for data. That
sort of thing required some patches to the monolith. So the monolith is an application
where most of the code is present in there. We're now splitting it into separate components so that
we have business domain specific components in the monolith. And that makes it so that,
for example, storefront, and online store specifically, was its own component where everything would be in that directory.
So it kind of gives us a nice way to create these boundaries
between the different components
so that there's no cross-component access.
And hopefully, eventually, everything has this sort of interface
between each component and there's no violation.
There was one article on the Shopify engineering blog
that we've just posted about this
that explains how we're starting to enforce more
the boundaries between those components
to make sure that we don't run into issues
with different class names that don't make sense
or that sort of stuff.
So these are code boundaries, though.
You're still inside of a process or inside of a code base.
These are not services that talk over a network, correct?
Right, same process, correct.
So it's really more of a developer experience thing
more than it is a topology thing for networking, for example.
Okay, and so if we talk about the monoliths parts,
we mentioned storefront.
You can explain exactly what storefront is.
Then there's the admin.
If anybody's run a Shopify shop,
they know what the admin is, at least to a certain degree.
Surely there's more to it.
There's the payment processing and the checkout part.
So there are some logical sections.
Am I missing any?
I'm sure I'm missing some.
I'm sure there's tons on the back end. There's, yeah, yes. What else is there? So there's a ton of them. Yeah, I mean, the big
ones. So there's, of course, the billing stuff is in there. There's payment processing, as you
mentioned. There's, of course, storefront, everything that's about pricing, returns,
that kind of stuff is all separated into its own components.
And it typically lines up with a specific team owning that component to build it.
So the way those components work is essentially a tiny Rails app within each component with its own app directory, tests, and everything that's kind of wrapped up into this one component.
And Online Store is one of them for storefront-specific stuff.
Okay.
It's worth noting that the storefront, which is the topic of conversation here, has been
rewritten with this strenuous process that we're going to go through here.
It's still a Ruby app.
Is it still its own Rails app, or is it somehow different than that?
We started with a base Ruby application.
We are using some parts of Rails, but not Rails itself directly.
And the reason for that is that the way Shopify storefronts are implemented
doesn't really line up with the CRUD pattern
that typically a Rails app would use.
So if you go on a Shopify storefront, you go on the index page, for example, you're
going to have the index pages, the product pages, the collection pages, and all of those
different endpoints that you would see on a storefront.
Now, all of those things could be implemented as Rails actions to be rendered.
But starting out from scratch, we kind of realized
that we don't need everything that Rails provides
and we could simplify this with a Ruby application
to get started with at first.
So the storefront is kind of a simplified part
of a full stack insofar as, I'm assuming now,
so correct me here: it's taking in a request, and then basically, once it determines which storefront it is, it says, okay, get all my templates, which are writable by the storefront owner, right? Like, they're Liquid templates by whoever owns that theme or whatever it happens to be. Grab those templates, grab all the data, merge them together, and spit out some HTML. In the most simplified form, that's what it's trying to accomplish, right?
That's exactly what it is.
Because of that, the goal of that specific application
is to do that really well, really quickly, and at scale.
Because we don't necessarily want the same
performance criteria that we would for the admin,
for example, separating that application gives us
a do one thing and do it well, kind of a Unix philosophy
thing for that specific service to do.
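To picture the simplified render path being described here, this is a minimal sketch using the open source Liquid gem. The shop lookup, template string, and product data are hypothetical stand-ins for illustration, not Shopify's actual code.

    # Minimal sketch of the storefront render loop described above, using the
    # open source Liquid gem. Shop lookup, template, and data are stand-ins.
    require "liquid"

    def render_storefront(request_host, template_source, products)
      # 1. Resolve which shop (tenant) the request belongs to; real multi-tenant
      #    routing is far more involved -- here it's just a label from the host.
      shop = request_host.split(".").first

      # 2. Parse the merchant-editable Liquid template for the active theme.
      template = Liquid::Template.parse(template_source)

      # 3. Merge template and data, spit out HTML.
      template.render("shop" => shop, "products" => products)
    end

    html = render_storefront(
      "example-shop.myshopify.com",
      "<h1>{{ shop }}</h1><ul>{% for p in products %}<li>{{ p.title }}</li>{% endfor %}</ul>",
      [{ "title" => "Snowboard" }, { "title" => "Wax" }]
    )
    puts html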
So what were the goals with the rewrite?
You mentioned there was like three aspects that you wanted to accomplish by going through
this process.
Right.
So of course, success criteria.
The first one was to make sure that we had the same features and the same behavior in
the new one as we did with the older application.
So by that, we say that for a given request for the same input, if both are treated as
black boxes, you get the same output and they're just behaving the same way for whatever
input you give them.
That's where we use the verifier mechanism to make sure that, for a given input, we get the same output, and that we never serve something to buyers that is not equivalent, or that's incorrect or invalid. The second one would be to improve performance,
of course. So improving performance with the new application, we're able to really focus and drill down
into the performance criteria
that we've set for this application.
But not only the application itself,
in terms of infrastructure,
we've kind of thought about what we want out of this
in the next 10, 20, 30 years
in terms of how we want to set up Shopify storefronts
to scale with time.
So for example, running on an active-active replication setup
allows us to read data from different locations
without needing to go all around the world if we don't have to.
And in thinking about how we write the code within the Ruby application, we're using different Ruby patterns than you usually would. So it's not really idiomatic Ruby.
It's not really something that you kind of just write your pretty Ruby
as you usually would.
We are thinking about certain things that do have an impact
on performance in the end.
So something like thinking about the memory allocations underneath
is something that I know I certainly didn't do before that project in Ruby,
but now we think about those things
to make sure that there isn't anything that we're...
Basically make sure that we're doing the right thing
for performance.
And finally, the last one was to improve
resilience and capacity.
So Shopify has this major kind of like Super Bowl
part of the year, which is Black Friday and Cyber Monday.
Coming up.
Yes, coming up in a month, roughly now, or even two months, I think.
So end of November.
So that's typically the time of year when we kind of figure out or find out if everything
we did during the year was good enough.
And so far, it's been going well. This year is going to be the first year that
we're powering most of the storefronts on the platform with that new implementation. So it's
our first real kind of game day for us for actual big time events in the wild. And hopefully
everything goes well. This may be famous last words, but hopefully everything goes very well.
And we're doing
great.
But that's the third goal of the application, to basically take what we had with the monolith
and make it even better because we're optimizing this exactly just for that one use case.
So a big question that I would have at that point when you decide here are our goals and
they're around performance, they're around scalability, resiliency, and
you're extracting this section which has, well, it's called a limited scope.
I'm sure it's very complicated, but a limited scope logically is, did you consider other
languages, other runtimes altogether?
Yes, we did.
The one thing that made us decide to keep using Ruby is one, like all of Shopify is
on Ruby.
So in terms of developer knowledge transfer, that's the most accessible thing to do as
an organization.
Another thing is that the liquid library that we're using to render storefronts is Ruby
based.
So keeping that runtime is something that at the very beginning of the project kind
of made sense to get started with. And the other thing that we're now starting to see is we're just starting
to explore alternative runtimes for Ruby. So Truffle Ruby is something we're trying to explore
to see if that could help with performance in terms of storefront rendering. So it's not something
that we've really wanted to move away from. We're committed to Ruby, committed to Rails,
and that decision still makes sense today.
Maybe eventually we'll start to think about a different runtime for this,
but so far it's been working for us.
Well, the amount of stuff that you'd have to rewrite,
your rewrite is already hazardous enough.
And we all know the treachery that is a big rewrite.
This is a pretty substantial rewrite.
And it took you two years, soup to nuts.
You're not 100% on it, but you're getting close, right?
Yep, correct.
Imagine how much longer that would have been if you had to rewrite Liquid in a new language as well.
I'm sure you're pulling in lots of shared code
throughout Shopify into the new storefront
that you can just build on top of.
It's like starting with a bunch of Lego blocks.
If you had to switch languages altogether,
you'd have to build each Lego block,
and you may never finish.
Exactly, and so I think in this sort of paradigm,
extracting everything into this different application
I think is the first step, and once that's done,
we're able to work
with it and do something different eventually. But the first step, of course, is to take everything
out and have this isolated thing that we can then play around with and experiment with. And it's so
much easier to do that once you have that out of the, as a separate service that you can really
have the smaller scope of, rather than trying to work with different languages into the monolith,
which would definitely be a bit harder to do.
I'm curious why the storefront was chosen, or maybe it was just your team's project.
Are each part of this monolith going to give a similar treatment,
or are they parallelizing that effort, or was it just storefront first and see what happens?
So far it's only storefront, and I don't think there's any other plans
to extract any other major blocks from there.
So something to keep in mind is
Shopify's platform receives a lot of traffic
and the majority of that traffic
is for storefront requests.
Admin, of course, takes a good chunk of that,
but it's mainly storefront traffic.
So it made sense for us to optimize
that specific place of the code, simply because that's where
most of the traffic goes.
And that's something that we could leverage in terms of impact.
The other thing is that storefronts don't necessarily have the same requirements as
you would need from the admin, for example.
So the admin is something where you want to have
valid information at all times.
For example, payment processing and billing
is something where you want the right amount of dollars
being taken in and out of your accounts.
Performance is less of a criteria there
because you want to ensure that you have proper
calculations and logic going on there.
On the storefront, there is a bit less of that strong requirement.
Yeah, I think that's spot on.
I mean, a storefront is rendering correct information, obviously.
It's not a matter of correct or incorrect.
It's more like an admin is for a limited scope of type of person,
whereas a storefront is literally anybody on the internet.
Exactly, yeah.
And if you look at Shopify as an up or down scenario,
the majority of that up or down scenario is likely looking at storefront,
not so much admin or others.
Like, a limited and small portion of the world is going to care,
but the majority is going to care about storefront
being open and fast.
Exactly.
So storefront, the main criteria there is performance,
especially as people run onto mobile devices
from all around the world.
You want to get people to have their storefronts
to be loading as fast as possible
from any sort of circumstances.
So in this case, there's a bit less requirement
to get the right data at all times
to be precise at any given second.
It's really more to get the response in their devices
to start doing something.
You think about this from a standpoint... you mentioned that Rails CRUD scenarios didn't really fit in with the criteria here, which is what we're kind of defining: what was it that sort of pinpointed this? As Jerod mentioned, why storefront?
Why would you rewrite this?
And would you parallel other opportunities inside of Shopify?
And I think you think about the right tool for the right job.
And we're not saying Rails isn't amazing; as you said, you're committed to both Ruby and Rails.
So this isn't a matter of, like, done with Rails, see ya, bye.
It's more like, well, maybe in this scenario, performance and speed and optimization, all these different things outweigh that.
And I think the bigger play here might be to help us understand why Rails didn't fit anymore and why a rewrite made more sense.
And in particular, to Jerod's question, why still Ruby? Which makes sense because, hey, you've got a lot of Ruby knowledge inside of Shopify.
It would make very little sense to move away from it unless you had a really good long-term plan for that. But more so, why the rewrite, and why the right tool for the right job? Ruby is still the right tool for the job, but Rails didn't fit anymore in that realm.
What did you gain by going pure Ruby and non-idiomatic Ruby
and all those things?
One thing that's interesting is we're still using a fair amount of Rails components in that new implementation that aren't necessarily the whole Rails app itself, but we are using
some bits of Active Support for compatibility purposes, to make sure that what we have in the monolith still works in the new application.
There's various gems that are used in Rails that we do use in the new implementation. So the way I see it, it's more of a hybrid
in between a bare-bones Ruby app and what Rails would provide. What we're kind of putting behind
is everything that's the Rails routing, for example, which is something that we didn't necessarily need for that implementation, because how storefronts are routed can be implemented fairly simply without going with everything that Rails provides with routing. But there's a fair amount of behavior and features that Rails does provide
that we are using still in the application through gems and modules that we've imported there.
It sounds to me, when you talk about the non-idiomatic Ruby, when I read the blog post, like a lot of the things you're doing is using the destructive-style method calls, like map!, where it's not going to return you a new array (or whatever it happens to be) of objects; it's going to actually modify itself in place. And the reason why you do that inside the storefront is because you are optimizing the heck out of memory consumption. You're trying to get memory consumption to as small as it could possibly be. And so there's your why not Rails right there.
It's like, if we can load as little... I mean, that's one reason; of course, there's plenty of reasons. But if you can load as little of that code into memory as possible, only what you need of Rails and not the entire stack, you are undoubtedly going to save, in memory allocations, all those objects you're not using.
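As a rough illustration of that allocation point (not Shopify code, just a generic MRI Ruby sketch), the destructive variant avoids building a second array:

    # Rough illustration of the map vs. map! allocation difference.
    # Generic sketch for MRI Ruby; numbers will vary by version and machine.
    def allocations_during
      before = GC.stat(:total_allocated_objects)
      yield
      GC.stat(:total_allocated_objects) - before
    end

    titles = Array.new(10_000) { |i| "product #{i}" }

    # map builds and returns a brand-new array of results.
    with_copy = allocations_during { titles.map { |t| t.upcase } }

    # map! mutates the receiver in place -- no second array to allocate.
    in_place = allocations_during { titles.map! { |t| t.upcase } }

    puts "map:  #{with_copy} objects allocated"
    puts "map!: #{in_place} objects allocated"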
Exactly.
There was a, I think it was Sam Saffron from Discourse
who posted a memory article,
or it was an article about how ActiveRecord
takes so much memory, and I think he compared it
to SimpleSQL
or I don't remember the name. It's a library he wrote himself.
I'm assuming it's a library where you're basically writing raw SQL with some help, I guess.
How I learned to stop worrying and write my own ORM.
There's this one, but there's also
an analysis of memory bloat in ActiveRecord 5.2
which is a different one, which is interesting.
And so that's a good example of memory usage that we've kind of skipped with the new implementation.
And for example, because Storefront is, almost all of it is read traffic, there's no writes
involved, there's no deletes involved, there's no updates involved. It's really, I have a request,
generate a read response. You just have to get data from the database and send it back. That sort of thing does not necessarily warrant using Active Record or anything that's heavier in terms of ORM to read data; straight SQL kind of works to get the data out of there, and having access to that data directly through reads is enough.
So in this case, it's a matter of reducing memory allocations to get the garbage collector to run for either less time or less often, and that has a major impact on the response times that we're seeing.
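The read-only query path being described might look roughly like this; a sketch using the mysql2 client with a made-up connection and table, just to show rows coming back as plain hashes rather than full ActiveRecord models:

    # Sketch of an ORM-free, read-only query path. Connection settings and the
    # products table are hypothetical; the point is plain hashes, not models.
    require "mysql2"

    client = Mysql2::Client.new(
      host: "127.0.0.1",
      username: "storefront_reader",
      database: "shop_example"
    )

    # Read only the columns the template actually needs.
    rows = client.query(
      "SELECT title, price FROM products WHERE shop_id = 42 LIMIT 50",
      symbolize_keys: true
    )

    rows.each do |row|
      # Each row is a lightweight Hash, e.g. { title: "Snowboard", price: 4900 }
      puts "#{row[:title]}: #{row[:price]}"
    end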
What was it that really drew you to this rewrite?
When did the warts begin to show, so to speak?
Obviously, Rails has worked quite well for many years. You've IPO'd, you're worth lots of money in terms of a company, you're doing amazing.
What were the main things that started to crop up that said, you know what, we really need to get this down? Was it simply speed and uptime? Was it memory? Was it servers falling over? Was it, like, servers on fire? What was it that really struck this chord and said, we need to really fix this, two years ago?
I think it was a progressive pain point that kind of, it never was a big thing that kind
of appeared in one night.
It's just something that, with time, we started to see performance slowly degrading in terms of response times on the server. And eventually,
we kind of had to do something about it to improve things. And interesting story is the
initial commits for that application were from Toby himself, who took it upon himself to start
something and as a prototype, get something up and running and make it as lean as possible to get started.
And then eventually that became a team and we picked it up and that became the project
that we're working on.
But there never was really one thing that kind of said, okay, that's it.
We're doing this thing now.
It was a slow process that eventually kind of arrived at the conclusion that we had to
do something.
Why not try something that's a bit different than what we would usually do and
let's see where this goes, where this leads and that's where we are now.
And we kind of realized that the approach made sense and we kind of went along with it and we're still there now.
That makes sense too, why you mentioned the black box approach as one of the success criteria, meaning feature parity.
Obviously if you're going to replace something you want to be replacing it as equally as possible
so that as you begin to switch over, they act very similar.
So not only similar, but also ideally byte equal.
So we want the exact same responses to be returned.
Similar is too vague then.
Identical.
It is, but it depends.
It depends.
So that's a good question.
In some cases, we had to ignore certain things to go closer to what you're saying: closer to equivalent rather than byte equal. So one example of that, and that's something that's in the blog post: when we do the verification requests, we send the initial request to both backends to see if the output is equivalent or
ideally the same. What would sometimes happen
is on some storefronts, you can try to render some random values and those values may change
on a render by render basis. So if I try to do a request to one of the applications and then I do
the same to the other application, but both use different random values, that's not going to be
byte equal. And because of that, to us, that would be a verification failure, and it would be, hey,
those two applications are not doing the same thing. There's an issue, what's wrong? But then
looking into those, we realized that it wasn't really an issue. It's more that the output was using something that relied on either time- or randomness-based values. And because of that,
that's something we have to ignore and say,
that's a false negative.
We have to just accept it and it's fine and move on to real issues.
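One way to picture that handling of false negatives: scrub the known time- and randomness-based values out of both response bodies before comparing them byte for byte. The patterns below are illustrative, not the actual ignore list:

    # Generic sketch: normalize away time/randomness-based values before
    # comparing responses, so they don't register as verification failures.
    IGNORE_PATTERNS = {
      /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?/ => "<TIMESTAMP>",
      /nonce-[A-Za-z0-9+\/=]+/                          => "<NONCE>",
      /cache_buster=\d+/                                => "<CACHE_BUSTER>"
    }.freeze

    def normalize(body)
      IGNORE_PATTERNS.reduce(body) { |b, (pattern, label)| b.gsub(pattern, label) }
    end

    def equivalent?(old_body, new_body)
      # Byte-equal after normalization counts as equivalent.
      normalize(old_body) == normalize(new_body)
    end

    puts equivalent?(
      "<p>Rendered at 2020-10-16T12:00:00Z</p>",
      "<p>Rendered at 2020-10-16T12:00:05Z</p>"
    ) # => true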
Our friends at Pixie are solving some big problems for applications running on Kubernetes.
Instantly troubleshoot your applications on Kubernetes with no instrumentation, debug with scripts, and everything lives inside Kubernetes.
But don't take it from me.
Kelsey Hightower is pretty bullish on what Pixie brings to the table.
Kelsey, do me a favor and let our listeners know what problems Pixie solves for you.
Yeah, I did this keynote at KubeCon
where we talked about this path to serverless.
And the whole serverless movement
is really about making our applications simpler,
removing the boilerplate,
and pushing it down into the platform.
Now, one of the most kind of prevalent platforms today
is Kubernetes.
It works on-prem, works on your laptop,
works in the cloud,
but it has this missing
piece around data and observability. And this is where Pixie comes in to make that platform even
better. So the more features we can get from our platform, things like instrumentation, ad hoc
debugging, auto telemetry, I can keep all of that logic out of my code base and keep my app super
simple. The simpler the app is, the easier it is to maintain.
Well said.
Thanks, Kelsey.
Well, Pixie is in private beta right now, but I'm here to tell you that you're invited
to their launch event on October 8th, along with Kelsey, where they'll announce and demo
what they're doing with Pixie.
Check the show notes for a link to the event and the repo on GitHub or head to pixielabs.ai
to learn more.
Once again, pixielabs.ai.

Alright, Maxime, so you and the team defined your success criteria: feature parity, improved performance, and improved resilience and capacity. And then you set out to rewrite the thing. You had to somehow have some guide rails to know whether or not you were totally screwing up.
We've talked about it a little bit.
But guide us through the whole process.
This has taken a while,
and you had to invent a few tricks and tech
just to help you get this rewrite written
and deployed out into the greater Shopify ecosystem.
So tell us how you
tackled this problem. So the initial vision around this was to, of course, when you're starting from
scratch, there's always like a transitional period where you don't have anything. And up to a certain
point, you don't have anything working well enough to say, we have something that's equivalent to the previous implementation.
So that whole starting phase of the project is very much exploratory, and you don't really know
if what you're doing works. To reduce the risk of that, we've implemented what we call a verifier mechanism. So what that allows us to do is to
compare real-world production data and see if what we're doing is close to our reference baseline,
which is a monolith in this case. So the previous implementation lived in the monolith. We wanted
to make sure that what we're doing in the new service is doing the equivalent behavior or the
equivalent output in terms of what it does. And that mechanism allows us to say, okay,
we're looking at the responses from both applications and we see a discrepancy on line, I don't know, 117; there's a missing module that's not included or something. And so that gives us
an idea, okay, we have to fix this. Most of the time, it's either something that we haven't implemented in the new application,
or it's something that's a bug that we've noticed that we didn't know was there. But because we have
this verification mechanism that now we realize, oh, maybe there's been a bug in that application
in the monolith that we didn't really know about. And that's something that we've run into: like, it's a bug that's been there for six years, and we never really knew it was there. But upon doing that verification
process, we realized that we had implemented something that was the right thing in the new
application. And upon comparison realized, it's never been the right thing in the previous one.
So it helped us figure out how to go and implement what's most impactful to get towards completion
and parity in terms of features as quickly as possible.
So that's why at first, when we first started the project, we started to look at a single
shop to say, okay, that's the one shop we want to try to support and target for the
release of that new thing.
Running the verifier mechanism gets us to a point where we're able to say we're that close to getting that response to be exactly the same for the
new application. And from there, move on to other endpoints, other shops, and then figure out how
to scale to the rest of the platform. So I've done this at tiny scale, where I take one endpoint,
I curl the endpoint, take the response, take the other endpoint,
the one that's in development or the one I'm building,
curl the same thing,
take the response, diff the two,
and then I look at the diff,
and I hope diff says these two files are identical
or whatever it says, right?
You never quite got there because of all this randomness
and stuff, but was your verifier
essentially a differ that you
just lodged
into your request pipeline? Tell us how that worked.
Exactly. So that's exactly what the mental model around this is. And it's just, instead
of us doing the curls and the diffs manually by hand in the command line, it's something
that's happening.
Your customers did it.
Exactly. And automatically for us, we're getting some data coming back from everyone just requesting
storefronts from all around the world.
Right.
So of course, there's that part of the process, but there's also the one that we're doing
locally on our own machines.
So this thing runs in production where the verifier gives us data about, okay, there's
that many failures in terms of verification that you have to fix for shop XYZ and then
over these different endpoints.
But also once we know that there's an issue on a given storefront
or a given endpoint, we then have to go in our machines
and try to figure out, okay, what's the issue?
How can I fix it?
And how can I bring it back to parity with the baseline implementation?
So that specific mechanism is in an NGINX routing module written in Lua; we're using OpenResty. And what it essentially does is, at the very beginning of the project, all storefront traffic was going to the monolith; nothing was really going to the new implementation, as we were just getting started. The nice thing about that is, for every, I don't remember exactly what sampling rate
we had, but for example, something like for one in every thousand requests that are coming
into Shopify for storefronts, take it, but also do the request to the new implementation
and compare the output of those two requests, do a diff on them, and then upload whatever diff results happen
to an external store so we can look at them later and see what was the issue. So that helped us
figure out that certain shops have more diffs than others, certain endpoints have more diffs than
others, but also seeing the traffic patterns of where we should try to tackle at first was a super nice signal
for us to say, okay, there's that many failures there, let's try to do this one
first to get as much impact as we can there, and then move on to the other
ones eventually.
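The real mechanism lives in that NGINX routing module written in Lua on OpenResty; purely to show the shape of the idea, here is the same sampling-and-diffing logic sketched in Ruby, with hosts, sampling rate, and the diff store all made up:

    # The real verifier is an NGINX/Lua (OpenResty) module; this Ruby sketch
    # only mirrors the idea. Hosts, sampling rate, and storage are made up.
    require "net/http"
    require "uri"

    SAMPLE_RATE = 1_000 # roughly one in every thousand storefront requests

    def fetch(host, path)
      Net::HTTP.get(URI("https://#{host}#{path}"))
    end

    def verify(path)
      return unless rand(SAMPLE_RATE).zero? # sample a fraction of live traffic

      old_body = fetch("monolith.internal.example", path)   # reference baseline
      new_body = fetch("storefront.internal.example", path) # new implementation

      return if old_body == new_body # equivalent -- nothing to record

      # Persist the discrepancy for later analysis (stand-in for cloud storage).
      File.open("diffs.log", "a") do |f|
        f.puts({ path: path, old: old_body, new: new_body }.inspect)
      end
    end

    verify("/products/snowboard")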
What kind of diffs are we talking about here? What kind of non-parity, what's an extreme example
and maybe a non-extreme example of a diff gone wrong?
What's to say?
I could talk about this for hours.
So,
one of the most extreme examples is
you try to open the page.
So, I'm talking about a
buyer's perspective. You open up a storefront
and assuming that the
new implementation was going to render that page,
all you would see
is a blank page, nothing else.
And that's one of the most extreme examples where you're like, okay, this cannot go out
in production, right?
And one of the reasons behind that could be a missing head tag.
For whatever reason, there's a missing thing that the page does not work at all.
Some of the more extreme examples in terms of, not in terms of how a buyer would perceive it, but in terms of
how we would perceive it is there's a missing new line, but that's it. There's just one missing new
line somewhere. For some reason, we're not returning the same string for whatever method.
And the verifier screams, Hey, that's not the same thing. There's a missing new line there,
which is not in the old one. And that's something that we have to deal with. So of course, all of those non-problematic issues we were seeing, we started to say, oh, that's not actually an issue; we can just ignore that away and say that new line is not really a problem, or fix it, of course. But there are some cases where there are issues, just like the time-based and the randomness-based issues, that we didn't really want to block us as we started to get more and more support for certain requests.
And we were able to say, okay, well, these patterns that look like this, for example,
if there's a timestamp in the responses, we can just ignore that away.
If there's a script ID that's being generated by the script or something, ignore that away.
And then as more patterns started to come up,
we came up with a pool of patterns
that we knew weren't problematic,
and knowing those, we were able to focus
on the actual issues.
Let's talk about that thing,
because this isn't like a typical error tracker
that you're doing.
This is parity, and I'm curious
how you
log these things. How do you keep track of
and organize this so not only you, but others
can triage
this and say, okay, these are the ones we should pay attention
to, these are the ones we shouldn't. So like,
it's probably not your typical bug tracker
that's doing this for you, or maybe it is, I don't know.
How did you log these things and then
organize them?
Great question, and it actually ties into how would you do this?
So assuming I'm asking you the question, you have to do this project where you implement
this new thing, and you have a million merchants on your platform using so many storefronts, and you have to get parity for all of them.
Do you go with a breadth first approach where you try to support as many shops as you
can for, say, a single endpoint, like all index pages, and you support all index pages across all
of the merchants? Or you try to go for a single merchant and cover everything for that single
merchant, but maybe that merchant has some features that the other merchant does not use,
and you have to think about, well, maybe I should consider other merchants first.
Should I consider the bigger merchants, the smaller merchants, because there's more people?
Do I want to look at it as a theme-based approach?
Because usually there's going to be a mass amount of shops that use the same theme.
Different ways of looking at this.
And in this case, it kind of ended up being a thing where
we started with a handful of shops at first that were the most, I guess I could say problematic
shops in terms of performance, where they would cause a high amount of load on the platform
because of their storefront traffic. And from there, getting the diffs from them to fix them
eventually. But there are two ways of seeing this. So it's either breadth-first or depth-first.
So to analyze the actual diffs that we see
in terms of parity, we upload that to Cloud Storage
where all the diffs are kind of aggregated.
And later on, we can figure out what those are.
And then the other side of this is that we keep track
of where the diffs happen in terms of,
is it shop XYZ, is it endpoint ABC? And based on that, we can run through our logging pipeline to see where most of the issues happen. Is it on that shop? Is it on that endpoint? And that gives us an idea of what we should try to tackle at first. So on Splunk, we have so many dashboards that are just trying to figure out: you should look at this first, because this is where most of the issues are happening.
Datadog is also giving us a bunch of information in terms of where we should focus on first.
And later on, what's happening is that on the developer side of things, we have tooling
locally to be able to kind of comb through the diffs that we have stored on the cloud
storage part of it, and read through what are the most frequent ones.
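Combing through the stored diffs essentially boils down to a frequency count; a toy version of that kind of local tooling might look like this, with an entirely made-up log format:

    # Toy version of the local diff-combing tooling: tally stored diff records
    # by shop and endpoint, surface the biggest offenders first. The JSON-lines
    # log format here is made up for illustration.
    require "json"

    records = File.readlines("diffs.jsonl").map { |line| JSON.parse(line) }

    def top(groups, n = 5)
      groups.map { |key, rs| [key, rs.size] }.sort_by { |_, count| -count }.first(n)
    end

    puts "Shops with the most diffs:"
    top(records.group_by { |r| r["shop"] }).each { |shop, count| puts "  #{shop}: #{count}" }

    puts "Endpoints with the most diffs:"
    top(records.group_by { |r| r["endpoint"] }).each { |endpoint, count| puts "  #{endpoint}: #{count}" }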
I don't know what I would have done here, honestly.
It seems, as you described it, quite overwhelming.
As you mentioned, I might have gone down both roads
and tested both sides of the water
and sort of drawn some consensus from the team to see,
okay, which direction do we feel is better, breadth or depth?
And I think I might have done a little bit of both to get a sampling of each
direction. But it seems like just daunting. Millions of merchants,
unlimited amounts of traffic, tons to dig through,
and anomalies everywhere. So I have extreme empathy for you.
It seems like a daunting task.
I would say, there's a team, right?
So yeah, it's not just me, of course.
We're a team of- Of course, yeah.
I mean, the proverbial you, meaning like you many.
So then you like divide and conquer, right?
So like you said, one team depth and one team breadth,
and then you meet in the middle.
That's like when you're-
Interesting.
Yeah, that's something we did try to do.
So some people were focusing on a single shop
while others were trying to cover as many shops as possible.
And I think what eventually happened is, you know, when you look at things, you have to
do everything at some point anyway.
So you're going to have to go through both paths kind of as a balance, try to do both
at the same time.
And eventually you'll reach a critical mass of supported requests from where you can kind
of move on to go into the more specialized things for depth.
So this is my kind of problem, by the way.
I'm a completionist. I love this kind of problem.
Here's a big goal, right?
We know what the end looks like. It's called 100%.
We're not there yet. We know we're at 32%
or whatever your numbers indicate.
And we have a clear path to gain there.
What do you do? Well, you check the next diff
and then you fix that problem.
And then you, oh, now we're at 30.
OK.
And you try to find the ones where you can implement a module
and chop off a whole leg.
And you're like, oh, look at that.
That module just solved these 15 problems.
And you're just on your way to the end.
It's like a good video game where you're like,
I'm trying to find, I got my main quest and my side quests.
I've got to get them all. So let's just start hacking away and making some progress. I would enjoy this, and I'm sure you get down in the mucky muck and you're like, oh, these new lines are killing me, you know?
Yeah, very much, yeah.
But still, I'm sure there are those huge wins where you just slice off one big chunk and you see all these different stores go to parity.
It has to be pretty cool. And that's super interesting you say that.
So the percentage-based thing, we do have that.
We have a dashboard that says,
this is the current support we have.
We're going towards 100%.
And one of the funnier moments during this project was at the very beginning. It was easier, I think, than it is now, because we're running into this...
We're seeing diminishing returns now because it's more edge cases
and we're trying to fix all the tiny, small things that are left to be fixed.
But in the very beginning of the project,
every single PR could potentially unlock so many more shops
and so many more endpoints.
So on every deploy, we kind of look at that percentage metric
and say, how much is my PR going to do there?
Is it going to bump by 0.5%?
Is it going to be 1%?
Like 2%?
That would be amazing.
So that kind of gamification of the thing
also made it fun
and helped us run towards 100%.
So two questions about Monolith in the meantime.
The first one is,
and maybe I guess they are related.
So was Monolith a moving target,
or is it pretty static in its output?
As you build it, because it takes a couple years,
were there changes going into the Monolith storefront
so you had to play catch up?
We did.
So play catch up for different things.
So you have to play catch up with internal changes
in terms of other developers working on the Monolith
and us trying to catch up with that. There's also catching up in terms of what merchants do. So
if merchants start using a given feature that we just do not support in the new application yet,
then that's another source of potential catching up we have to do.
So you could go backward. You wake up in the morning, you're at 33%, now you're at 28%
because somebody used a new feature.
Exactly. That's happened multiple times where you, exactly that. You go out in the evening,
you're like, oh, nice. We're at like 37%. And then the next morning it was something like 17%.
We didn't do anything. What happened, right? Or you go to lunch and you're like, we left for lunch
at 32%. We came back at 20. What happened here? That's demoralizing.
Exactly. Exactly, exactly.
So that sort of thing was one of the harder parts.
Of course, you have to deal with how do you get people to onboard your new project to
get them to help and support that new project as well as you trying to get them to work
on it.
So eventually, I decided to make it kind of my quest to make it as easy as possible for others to start contributing to that project, by making the documentation amazing and as welcoming as possible to reduce the friction, and to basically get people to say, hey, look, it looks more exciting to work on that new thing, so that they come onto the project and on their own kind of contribute to the new thing, rather than only keep working on the previous implementation.
So that's something that really helped.
I think in terms of how we drew the line was, if it's something that was already in the monolith by the time we started the project, that's something that we would have to take on ourselves, the team working on the rewrite. If, however, it's something that's not in the monolith
yet, it's not anywhere, it's a new project, then that team should be able to consider both projects
because they know it exists. They know they have to build for the future and make it into the new
application as well. And that's how we kind of got that line drawn to say who's handling what.
Eventually we reached a point in the project
where most people were also writing that code
in the new application.
They knew they had to do it to be future-proof.
So my second question about the monolith
is because you were going for parity,
did you ever have to re-implement bugs
or suboptimal aspects of the monolith because you had to have the exact same output?
Yes.
That also sucks.
It's like, this is my brand new thing, I have to put the bad stuff in the new thing?
Yeah, well, so it's a bad thing, I guess, for us as developers in terms of, it doesn't feel good.
Yeah, demoralizing.
But in terms of how a buyer or a merchant would see it, people using Shopify, to them, that's good.
So one of the goals we have is to basically make it so that for storefront, specifically for online store, if you have a theme archive that you have from eight years ago, for example, and you try to upload that today, it should work the same way it did eight years ago.
We try to be as backwards compatible as possible.
So, of course, if there's something that was introduced eight years ago, we have to make sure that's still there. And the other thing is Liquid, the gem that we built and that we use for storefront templates, which is almost Turing complete. So the possibilities of what Liquid can do are almost infinite. So there's features that we shipped at some point
that kind of became used in ways that we did not expect
and that we did not really think about
that we still have to support today.
Even though that's not what we want it to be used for,
we have to keep it this way.
So of course, we had to port some bugs
that unfortunately are kind of, it doesn't feel
good.
But for the people using that, it's a, I think, a service to them to say, look, your thing
you had from a few years ago still works today.
And there's no breaking change in there.

What's up, friends?
When was the last time you considered how much time your team is spending building and
maintaining internal tooling?
And I bet if you looked at the way your team spends its time, you're probably building
and maintaining those tools way more often than you thought, and you probably shouldn't
have to do that.
I mean, there is such a thing as retool.
Have you heard about retool yet?
Well, companies like DoorDash, Brex, Plaid, and even Amazon,
they use retool to build internal tools super fast,
and the idea is that almost all internal tools look the same.
They're made up of tables, drop-downs, buttons, text inputs, search, and all this is very similar.
And Retool gives you a point, click, drag and drop interface that makes it super simple to build internal UIs like that in hours, not days.
So stop wasting your time and use Retool.
Check them out at retool.com slash changelog.
Again, retool.com slash changelog.

So, Max, over the process of the rewrite, you have this verifier in place. It's getting some traffic to it via the NGINX routing module, but this is for your learning, right? So traffic is still going to the main monolith; it's also going to the verifier, running your code, doing the diffs.
At a certain point, I assume,
since the blog post is out there
and we've got Black Friday coming up,
you are confident enough,
you have a high enough percentage
on a high enough number of stores or themes
that you say, we're going to start rolling this thing out.
Tell us how, I think the routing module played a role here.
You kind of automated this process. Tell us about that because the routing module played a role here. You kind of automated this process.
Tell us about that because I think it's pretty neat.
Yeah. So along with the process of verifying traffic, we wanted to start rendering that
traffic for real people out there buying stuff to serve that traffic using the new implementation.
Yeah. So along with the process of verifying traffic, we wanted to start rendering that traffic for real people out there buying stuff, to serve that traffic using the new implementation. And we simply leveraged the existing verifier mechanism to say, if you have a certain amount of requests that have been equivalent, and that's happened in a given timeframe, then consider that endpoint for that specific shop eligible to be rendered by the new implementation. So that was all kept track of; it's very much a
stateful thing, a stateful system to keep track of what those requests are, should they be considered,
and if they are, start rendering. So all of that was being kept track in the verifier mechanism,
again, in NGINX, storing that into a key value store to just keep track of how many requests
we're getting, whether they're the same
or they're not. And we have this whole routing mechanism that we control to say, okay, assuming
that we have that amount of traffic in that amount of time that is equivalent to the baseline
reference implementation, then the routing module would start to send traffic to the new implementation instead. From there came the need to do some verifications as well
against the monolith this time.
So because we now start to route traffic
to the new implementation,
you also want to make sure that what you're doing
is still valid.
So you're not just sending traffic to the other place
and saying, okay, we're done.
We're just moving on to the other thing.
You want to keep making sure
that what you're doing is still valid.
So we keep verifying traffic, kind of reverse verification,
where you're doing the verification usually from the monolith
to the other one.
Now you're doing the opposite, because you're serving traffic
from the other application.
And that kind of started out as a few shops
that we wanted to take care of, because those were the main winners from what the new application was doing in terms of impact on performance and resilience and everything.
And once we started to gain confidence, I think that's when we kind of opened up the, like we pressed on the throttle and just moved it up to a lot more shops to say, okay, that mechanism works.
We're seeing that it scales. We're comfortable with opening it up to a lot more shops to say, okay, that mechanism works. We're seeing that it scales.
We're comfortable with opening it up to more merchants.
And that's where we kind of reached a critical mass of serving the traffic for more merchants
on that new implementation.
So it's shop by shop.
It is shop by shop, also split up by endpoint itself.
So for a single shop, we're not necessarily rendering the entire shop with the new implementation. We're maybe rendering half of that storefront's
endpoints with the new one and still the other half with the old one.
That should not change anything for the people browsing that storefront. To them, it's really the same
thing. We're just maybe a bit faster depending on what endpoint they're on.
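Again, the real decision lives in that NGINX/Lua layer backed by a key-value store; the rule itself has roughly this shape, sketched in Ruby with made-up thresholds and an in-memory stand-in for the store:

    # Sketch of the per-shop, per-endpoint routing rule. The real thing is
    # stateful NGINX/Lua code backed by a key-value store; thresholds and the
    # in-memory STORE below are made up for illustration.
    THRESHOLD = 100     # equivalent requests needed...
    WINDOW    = 60 * 60 # ...within this many seconds

    STORE = Hash.new { |h, k| h[k] = [] } # "shop:endpoint" => timestamps

    def record_equivalent!(shop, endpoint)
      STORE["#{shop}:#{endpoint}"] << Time.now
    end

    def eligible?(shop, endpoint)
      cutoff = Time.now - WINDOW
      STORE["#{shop}:#{endpoint}"].count { |t| t >= cutoff } >= THRESHOLD
    end

    def route(shop, endpoint)
      eligible?(shop, endpoint) ? :new_implementation : :monolith
    end

    105.times { record_equivalent!("example-shop", "/products/snowboard") }
    puts route("example-shop", "/products/snowboard") # => new_implementation
    puts route("example-shop", "/collections/all")    # => monolith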
How long do you keep that up in terms of this reverse
parity? Because kind of going back to the last segment when you mentioned you're enticing people
to write new features in, you know, monolith and new application.
And I suppose to kind of keep this reverse parity at some point, you got to keep the
same, I guess, features in both code bases.
So do you sort of reverse the idea of,
okay, we're going to build a new,
but we're also going to build an old too
for a certain period to keep that.
Is that what you've done,
or did that force you to do that,
to keep that overlap in terms of parity over time?
Pretty much.
And that's, I think, one of the main challenges
in terms of, if I'm a developer at Shopify
that has to ship something,
for a specific period of time,
you had to write the thing in both applications,
which is not ideal.
There's additional work pressure added onto this.
And the goal was to make that period as short as possible.
So now our focus is really on removing the old code
from the monolith to say,
okay, well, there's only one canonical source of truth now.
That's a new implementation.
This is where the code should be going.
And reducing that period of time where there's two implementations going on.
How long will that be then, you think?
What's your anticipation for that overlap?
It's been interesting because of the way we kind of advertise the project internally.
There's a bit of time where it was more considered as an experiment.
And we didn't really want to go and shout it from the rooftops to say, hey, there's this new thing. It was more of a thing where we're kind of internally experimenting with something, seeing if it works. And there was a point eventually we
realized, okay, this is going to be the future of Storefront at Shopify internally. And this is how
we should be doing things. From that moment on is when we kind of wanted to get people to be aware
that they should be writing their features with that new implementation in mind, to think of how they would build
it based on the fact that we now have this new thing.
It's not the old one anymore.
You may still write the thing in the older one if you need it to be available for the
previous implementation.
But in terms of making sure that people don't have to write that code twice, of course,
they have to for some period of time.
It's not great, and we wanted to make that as short as possible.
But that's something that did happen for a while and still happens for certain parts of the storefront
based on what we do support and what we don't at the moment.
This is a trade-off, really.
I mean, anytime you do a rewrite, you have to ask yourself,
what am I willing to sacrifice to do this rewrite? Should I do it at all? Which is kind of the bigger picture; we haven't really asked you that. I mean, to some degree it's in between all the lines we're drawing here. But this is a trade-off. Like, having to do that is a trade-off for something that's worth pouring into in the future. So to me, it's like, well, it's not ideal,
but in order to rewrite in a smart way,
all the steps you've talked about,
the verification service, the diffing,
all the work, the double implementation for a time period,
the overlap and the continued reverse verification,
to me, that's necessary.
Maybe in particular with your style of application,
in terms of your customer base, all the routing you have to do just generally.
But that's a necessary trade-off that you have to do when you say,
okay, we have to solve this problem for the future.
And unfortunately, this lab experiment gone right is the future,
so we have to do it this way in order to rewrite this thing.
To me, that's just a necessary trade-off.
It's like a tiny bit of pain now that is going to be so much better later on.
So why not do it now?
Ripping off the Band-Aid and just dealing with it now.
Of course it's a bit painful,
but there's going to be so many benefits from this,
so let's not think about it too much for now.
So are you beyond the pain, or are you still in the pain
in terms of the double implementation? Your fellow devs. We're way beyond the pain or are you still in the pain in terms of the double implementation?
Your fellow devs. We're way beyond the pain.
So Monolith doesn't get any new features?
It does in
rare cases. In very, very
rare cases. And again, I said
earlier about how
the storefront implementation we're doing now
is mostly a read
application. The rest
of the features would be mostly for everything that
updates or writes, and that may still be going in the monolith for now. And with time, we may be thinking about doing something where writes would also go to the new implementation. It's not clear yet if that's something we want to do, but at the moment we're really focusing on getting those reads served by the new implementation.
So here you are, you've ripped out the bandaid, you're at the end of this process,
you've made the necessary trade-offs. Was it all worth it? Would you do it again?
What are the wins? What are the takeaways from Shopify and your team?
Right. So I think one of the main things I keep in mind in this is I've read so many articles and blog posts and opinions everywhere on the internet over the years that say,
don't do rewrites. That's like the main takeaway to remember for all of those articles. And I want
to be the person who does the opposite and says, rewrites are possible. You can do rewrites if you
do them right. And how do you do them right?
There's many key things involved in there, but it's not a thing where you have to kind of
push the rewrite option aside if it's something that you don't think is possible.
The main thing, of course, I think is communicating early and often with the people that are involved
in that process. So, in the example of Shopify, it's getting developers aware
that there's going to be this new application coming out that they have to think about,
so that their features and their roadmaps and everything are aware of that new
application.
And one mistake I made myself was trying to send an email at some point to say, hey, this
is the new implementation of how storefronts work. One email is not going to cut it. You have to be frequently getting in touch with
people and making sure that they're aware. And eventually the word kind of gets around, people get
excited about the thing. And that's when you kind of get this excitement going on for that new
implementation. And along with making sure that you have the most frictionless process to
work on that new thing, when you get those two things combined, that's where magic happens.
And people want to work on the thing. People also realize that it's easier,
it's more fun to work on the new thing. So it becomes a self-fulfilling kind of prophecy where
you want it to happen, but people kind of make it happen for you. And that happens through communication, I think.
So that's one thing. Then, in terms of how people using Shopify
benefit from this,
we're seeing some great results in terms of performance.
On average, over all traffic coming onto the platform,
we're seeing around 3 to 5x performance improvements
for storefront response times on cache misses.
So of course, we're focusing on cache misses because cache hits are almost always fast.
But whenever there is a cache miss, that's what we want to optimize and make sure is
always fast.
So that was kind of the first rule, I guess, of the project to say, don't really think
about caching because we want to first have
very, very fast cache misses. And only when we know that we have some very fast cache misses do we
want to start thinking about caching on top of that. So we don't want to cheat away the
cache misses by saying, oh, it's not a problem because we just add caching onto there. Sure,
but what happens when you don't have a cache hit? That's when we want to be extra fast.
And in this case, we're seeing some good performance there. In some cases, some storefronts kind of surprised us in terms of how they performed, where we were seeing up to
10 times faster cache misses. So that's something we were very surprised to see early in the process.
And we kind of knew that this was the right way to go because we were seeing those good results.
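As a rough illustration of "optimize the miss, not the hit", here is a small Python sketch that times only cache misses. The in-memory cache and the simulated render time are made up; the point is simply that reporting miss latency separately keeps a cache from hiding a slow render path.

```python
import statistics
import time

cache: dict[str, str] = {}
miss_latencies_ms: list[float] = []


def render_page(path: str) -> str:
    """Pretend render; in reality this hits the database and templates."""
    time.sleep(0.005)  # simulate 5 ms of real work on a miss
    return f"<html>{path}</html>"


def fetch(path: str) -> str:
    if path in cache:                # cache hit: fast, not what we optimize here
        return cache[path]
    start = time.perf_counter()      # cache miss: this is the number we track
    html = render_page(path)
    miss_latencies_ms.append((time.perf_counter() - start) * 1000)
    cache[path] = html
    return html


for _ in range(3):
    fetch("/collections/all")        # first call misses, the rest hit

print(f"miss count: {len(miss_latencies_ms)}")
print(f"median miss latency: {statistics.median(miss_latencies_ms):.1f} ms")
```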
So folks listening to this show right now, they're thinking, you know what, Maxime's right.
I think we could probably do a rewrite.
You've obviously made them believe in the potential if done right.
What are some of the steps you would take to do it right?
We've talked about a lot of them, but like if you can kind of distill it down to like three or five kind of core pieces of advice, what might those be for our listeners?
So I think one of them is to make sure you have the shortest feedback loop possible when you kind of get off track. So for example, our verifier mechanism gets us there in a way that
if we implement something that's not equivalent to the previous implementation, we're aware of
it really quickly. In terms of minutes, when we deploy, we know, okay, we shipped something that's different.
Let's look into it, figure out what's going on, and then move on to the next thing. So that's
number one. The second one would be to start with a tiny scope, to scope it down to a small thing
that kind of validates the approach to say, is this something we even want to go down?
Is this a road we even want to go down towards?
Or is it not something that's realistic?
So if it is something that's realistic,
having that smaller scope will get you to a point
where you know, okay, this seems to make sense so far.
I don't see why it wouldn't be okay as well
for the rest of the project.
Let's make it happen.
So that is the second one.
I think the third thing is, like I said, communicating early and often, but also
making it easy and enjoyable to work on the new thing, like reducing friction. The bar to entry to
work on the new thing is a critical thing to get adoption internally for people that need to ship their features
and their roadmaps on the different applications.
If it's enjoyable for them to do it on the new one,
they're going to do it.
You don't have to force them to do anything.
It's just going to happen by itself.
If you're going to cause them pain,
cause them the least amount of pain, right?
Exactly.
And we're trying to balance this out, right?
So if you have to get people to write code in both applications, at least make it enjoyable
in the new one.
Right.
Or more enjoyable than it would be in the previous one.
Maybe distill that one a bit more.
You mentioned documentation being a critical piece for you.
What beyond documentation helps that?
I think the one thing I'd say is if you have a Slack channel, a public Slack channel,
be the best customer service person you can be. If people arrive with questions, be available,
help them. They are the people that are eventually going to make that a success or not.
You are kind of the messenger to say, hey, we have this new thing we're trying to build.
Can you help us? And if they are there and they receive the help,
that's going to help tremendously to get everyone on board.
And because that person that receives the help
is going to talk about the project to their own team.
And the team will eventually get to that point where,
because someone on their team is familiar with it,
they kind of become that expert on the team.
And the word spreads.
So helping people, creating this kind of expert network within the company, people who are aware of your app,
excited about it, and who kind of spread the word for you. So helping people, I think, would be the best thing.
And being a good customer service person to just get them closer to what they need to achieve.
One reason that a big rewrite is scary is because it's difficult or even impossible to
bound or bind your risk, right? It's unbounded risk, and there are steps you can take to fix that.
You guys had very clear goals. I think that was one of the reasons you succeeded, and you had a
way of measuring that as you went. So that's clear and awesome.
Were there any failure mechanisms like kill switch?
We're not going to make it.
Because you said it was an experiment.
But the other thing is you can say, well, what would failure look like?
Obviously you succeeded, but did you have failure thresholds where it's like, you know what, we're going to abandon this
and go back to the monolith because we'll try something totally else new?
Or were those things not thought about?
No, they definitely were. They definitely were. Internally, we call those tripwires.
So eventually you get to a point where you have to figure those out. If you don't,
you'll just keep going and you're going to get yourself deeper and deeper into the problem.
So figuring out what those tripwires are, at what point you're comfortable saying, okay,
we're just getting too deep now and we have to come back and go do something else,
that's something we did. So for example, some of them we talked a bit about, like how we have some catching up to do with internal
changes. That's a tripwire: if we're not able to catch up in time with whatever's being
changed in the internal monolith, that's something we have to be careful with.
Another one is about rolling out the application in production, for example:
if we're causing incidents in production, that's a tripwire. Because if the
monolith is not doing that, and we are, then potentially we're not ready to go in production, and we have to be careful of that.
So there's multiple tripwires we set to make sure that this did not happen.
And thankfully, we didn't hit those. We didn't hit them often enough to say, okay, it's
a problem we should look into and maybe reconsider the entire approach.
So in this case, figuring out those tripwires way before you get into a point
where you want to roll out is critical to make sure you're doing the right thing.
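A tripwire can be as simple as a guard that compares the new implementation's error rate against the monolith's and rolls traffic back when it is clearly worse. The sketch below is hypothetical, with made-up thresholds and names, and is not Shopify's rollout tooling; it just shows the shape of the kill-switch idea.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    percent_to_new: int = 10  # share of traffic served by the new implementation


# Tripwire: the new implementation may not error more than 1.5x the monolith's rate.
MAX_ERROR_RATIO = 1.5


def check_tripwire(rollout: Rollout,
                   monolith_error_rate: float,
                   new_error_rate: float) -> Rollout:
    """Send all traffic back to the monolith if the new implementation trips the wire."""
    if monolith_error_rate > 0 and new_error_rate / monolith_error_rate > MAX_ERROR_RATIO:
        return Rollout(percent_to_new=0)  # kill switch: everything back to the monolith
    return rollout


print(check_tripwire(Rollout(50), monolith_error_rate=0.001, new_error_rate=0.01))
# -> Rollout(percent_to_new=0), because 0.01 / 0.001 = 10, which exceeds 1.5
```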
Yeah.
I like the word tripwire.
That's a very good, concise word for exactly what it means in this case.
Is there anything else we haven't asked you that you're like,
man, I want to talk about this before we tail out?
We're at that point, so what else you got?
One experience that was quite interesting to me personally was when we started the project,
we were around five people, four or five people. And then, in the first few months,
we started to get progress. We eventually realized that we're getting closer to rendering certain
pages for certain merchants. And that's only internally. We're not in production yet. And at some point, we get a request to
showcase it to everyone, all developers and everyone at Shopify internally during our
Friday town hall meeting. So everyone's kind of watching this, we're kind of sharing context
and everything. And this is streamed to all employees at Shopify. So the one thing we did was to have a simple webpage
that shows you the same page being rendered by the monolith
and then by the new implementation we're doing.
So side-by-side, two iframes,
just requesting the same page and seeing which one's faster.
And when that thing ran during the live stream, when we showed it,
the new implementation was way faster than the old one.
And I think that's when it kind of clicked in my head
and in many people's heads,
okay, this is the right path forward.
They're doing the right work
and we have to continue doing this thing.
So from that moment on, I think it was kind of a turning point
in the heads of developers working on this to say,
this is the thing we're now focusing on in terms of the future
to make sure this happens.
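For flavor, the side-by-side demo Maxime describes could be as little as a static page with two iframes. The sketch below serves such a page with Python's standard library; the two URLs are placeholders standing in for the old and new storefront endpoints, not real Shopify addresses.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder endpoints; the real demo pointed at the monolith and the new renderer.
LEGACY_URL = "https://legacy.example.com/collections/all"
NEW_URL = "https://new.example.com/collections/all"

PAGE = f"""<!doctype html>
<title>Storefront race</title>
<h1>Same page, both implementations</h1>
<iframe src="{LEGACY_URL}" width="49%" height="600"></iframe>
<iframe src="{NEW_URL}" width="49%" height="600"></iframe>
"""


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Open http://localhost:8000 to watch both pages load side by side.
    HTTPServer(("localhost", 8000), DemoHandler).serve_forever()
```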
That's awesome. I can imagine that feeling,
because you weren't expecting that call, like,
hey, can you demo this to everybody?
Yeah, we weren't.
Were you about to throw up? Were you antsy?
Of course we had tried it before a few times to make sure it was okay,
but seeing the real thing, like the live demo of clicking the button
and both pages come up, seeing the new implementation coming up
way faster was like, okay, good.
We're in a good place, we're going to be fine,
and we can keep working on the next endpoints
to make sure that we're doing this for everyone on Shopify.
That's a cool story, I'm glad you shared that.
Congratulations to you and the rest of the team making this happen.
I know this has been a multi-year road
with many facets, opportunities for tripwires,
obviously we hit success, which is great.
You're at 90-plus percent parity right now.
Still on your way to 100, right?
Is that still the case based on your blog post?
Or are you closer to 100 now?
We're getting towards 100.
Exact numbers vary by the day, again,
because we're catching up with certain things
and there's external circumstances.
But we're getting very, very close to 100%.
The majority of traffic is being served by that new implementation now.
It's really a matter of fixing the last few diffs and
figuring out how can we get to that place faster.
Of course, like you said earlier, Jerod,
it's a game where you have to pick up the issues one by one and just fix them
until you get to a point where everything is fine. But yeah, massive team effort to get there. There's a lot of
infrastructure work, a lot of parity diff fixing, a lot of external communication as well
with the merchants. It's a Shopify-wide initiative, and seeing the thing take off
and work is super rewarding in the end.
Well, we've appreciated this conversation, and thank you so much for sharing the story
with Jerod and me and the rest of the audience here on the ChangeLog.
Thanks so much for everything. It was very fun.
Thank you.
That's it for this episode of the ChangeLog. Thank you for tuning in. If you haven't heard yet,
we have launched ChangeLog++. It is
our membership program that lets you get closer to the metal, remove the ads, make them disappear,
as we say, and enjoy supporting us. It's the best way to directly support this show and our other
podcasts here on ChangeLog.com. And if you've never been to ChangeLog.com, you should go there
now. Again, join ChangeLog++ to directly support our work
and make the ads disappear. Check it out at changelog.com slash plus plus. Of course,
huge thanks to our partners who get it, Fastly, Linode, and Rollbar. Also, thanks to Breakmaster
Cylinder for making all of our beats. And thank you to you for listening. We appreciate you.
That's it for this week. We'll see you next week. Bye.