The Changelog: Software Development, Open Source - Securing ecommerce: "It's complicated" (Interview)
Episode Date: March 20, 2025
Ilya Grigorik and his team at Shopify have been hard at work securing ecommerce checkouts from sophisticated new attacks (such as digital skimming) and he's here to share all the technical intricacies... and far-reaching implications of this work.
Transcript
Welcome to the changelog where each and every week we sit down with the hackers, the leaders,
and the innovators of the software world to pick their brain, to learn from their mistakes,
to get inspired by their accomplishments, and to have a lot of fun along the way.
On this episode, I'm joined by Ilya Grigorik, Distinguished Engineer and Technical Advisor
to the CEO at Shopify.
Ilya has been hard at work securing e-commerce checkouts from sophisticated new attacks such
as digital skimming, and he's here to share all the technical intricacies and far-reaching
implications of this work.
But first, a mention of our partners at Fly.io, the public cloud built for developers who
ship.
You know we love Fly, and you might too.
Check them out at Fly.io.
Okay, Ilya Grigorik on the changelog.
Here we go.
Well, friends, I'm here with a good friend of mine, David Shu, the founder and CEO of Retool. So David, I know so many developers who use Retool to solve problems, but I'm curious,
help me to understand the specific user, the particular developer who is just loving Retool? Who's
your ideal user?
Yeah, so for us, the ideal user of Retool is someone whose goal first and foremost is
to either deliver value to the business or to be effective. Where we candidly have a
little bit less success is with people that are extremely opinionated about their tools. If, for example, you're like, hey, I need to go use WebAssembly.
And if I'm not using WebAssembly, I'm quitting my job, you're probably not the best Retool
user, honestly. However, if you're like, hey, I see problems in the business and I want
to have an impact and I want to solve those problems, Retool is right up your alley. And
the reason for that is Retool allows you to have an impact so quickly. You could go from an idea, you go from a meeting like, hey, you know,
this is an app that we need, to literally having the app built in 30 minutes,
which is super, super impactful in the business.
So I think that's the kind of partnership or that's the kind of impact
that we'd like to see with our customers.
You know, from my perspective, my thought is that, well, Retool is well known.
Retool is somewhat even saturated.
I know a lot of people who know Retool, but you've said this before.
What makes you think that Retool is not that well known?
Retool today is really quite well known amongst a certain crowd.
Like I think if you polled engineers in San Francisco, or engineers in Silicon Valley
even, I think it'd probably get like a 50, 60, 70%
recognition of Retool.
I think where you're less likely to have heard of Retool
is if you're a random developer at a random company
in a random location like the Midwest, for example,
or like a developer in Argentina, for example,
you're probably less likely.
And the reason is I think we have a lot
of really strong word of mouth
from a lot of Silicon Valley companies
like the Brexes, Coinbases,
DoorDashes, Stripes, etc. of the world. Airbnb is another customer, and Nvidia is another customer.
So there's a lot of chatter about Retool in the Valley.
But I think outside of the Valley, I think we're not as well known.
And that's one goal of ours to go change that.
Well, friends, now you know what Retool is, you know who they are.
You're aware that Retool exists. And if you're trying to solve problems for your company, you're in a meeting, as David
mentioned, and someone mentions something where a problem exists and you can easily
go and solve that problem in 30 minutes, an hour or some margin of time that is basically
a nominal amount of time.
And you go and use Retool to solve that problem.
That's amazing.
Go to Retool.com and get started for free or book a demo.
It is too easy to use Retool.
And now you know, so go and try it. Once again, Retool.com.
So I'm here with Ilya Grigorik from Shopify back on the show after years and years.
You've been on the show, I think, four or five times.
Welcome back.
Thank you.
Glad to be back.
What have you been up to, man?
I think it was 2021, last time you were on the show,
we were talking Hydrogen, you're still at Shopify,
so you've been there a very long time,
what have you been up to?
So I think last time we talked about custom storefronts
and a big mission we had at Shopify to enable developers
to build customized storefronts
using their own application stack. Since then, I've spent a lot of time diving into our APIs and infrastructure.
And then also, kind of in a roundabout way, ended up spending a lot of time in checkout,
which at the end of the day is kind of the engine of the entire e-commerce operation, right?
Like an analogy, perhaps an apt analogy, is it's kind of like the air traffic controller within your commerce operation,
because everything, all the planes, have to land there.
You can think of different pieces in isolation.
You have taxes, you have shipping, you have fulfillment concerns, you have inventory,
but all of that has to come together during checkout where you have all the different
policies, all the different negotiations, all the UI that needs to be in place. And that has been a really interesting and complex domain
to kind of wrap your head around and navigate through.
And of course, I think today we're gonna dive
into one particular aspect of it,
which is the compliance aspect,
which I admit is not something
that I thought I'd be working on,
but it turned out to be
a really interesting technical challenge.
So you've been at Shopify for how long now?
In dog years, it feels like forever.
In chronological time, I think it's been four years, but it's been a pressure cooker.
Yeah, and before that, I think, was it GitHub and Google? Or you started PostRank?
Can you tell us just briefly your travels? Sure, let's see how far do we wanna rewind.
I started my professional career as a founder of a startup.
This was back in the 2011 era.
And our insight at the time was on the heels of web two
and all of the social things that are happening,
blogs at their heyday and all the rest.
We figured that we could create a better search algorithm.
So if you think of the original PageRank, treating links to perform the
ranking, effectively that's a thumbs up, right?
Except that when we approached this problem, and actually it wasn't 2011, it
was 2008, we observed that there were a lot of extra signals available. Like there
were literal thumbs up from different social platforms, you
could leave comments, you could share them on different
surfaces. So if we could aggregate all of those signals,
we could build a better kind of human driven algorithm for
identifying what are the interesting topics. So that was
the kind of the technical underpinning. And then we went
on to build a number of products around it,
which were analytics for publishers
to help them understand where their audience is,
where the content is being discussed,
where people are engaging.
There was a product for marketing agencies,
which kind of worked in reverse, which is, hey,
if I have a thing that I'd like to seed,
who are the folks that I should be engaging,
what are the communities, and all the rest? And through that work, that led us to Google, which acquired the company.
And I ended up working on Google Analytics at the time, integrating a lot of this kind
of social analytics know-how that we acquired into the product. And later took a hard pivot
into infrastructure, technical infrastructure within Google, where we did a lot of fun things
like building radio towers to figure out if we could build a faster and better radio network
and then learning that that's a hard problem. But then later that actually became Google
Fi, which is an overlay network. And in the process, I picked up the ambiguous problem
of, hey, we keep talking about performance
and measuring performance like we want to make it better,
but how do you objectively quantify it?
It's one of those things where you kind of know it
when you see it.
It's like that felt slow.
But if I just asked you to put a technical metric on it,
how do you actually measure that?
So we spent a lot of time with browser developers in the W3C.
I was the co-chair of the W3C
Web Performance Working Group, just wrestling
with that problem of, how do you measure fast?
Is it the onload time?
No, not really.
Okay, if it's like, when did pixels paint on the screen?
But how do you measure that
and which pixels are most important?
So this leads you down this interesting cascade
of questions.
So that took a while.
That was a good five or six years
of my life of working on standards
and working with browser developers,
which was a lot of fun.
And later, I decided to join Shopify
because commerce was clearly an interesting area
and a deep domain area.
And that's been the last four years of my life here where I got to
work on building custom storefronts, which I think we covered in our last show.
Yep.
Like, what is the Shopify opinionated toolkit for building custom experiences? So today
this is actually Hydrogen, which has evolved quite a bit since we last talked. It's a Remix-based stack
with a lot of built-ins for building beautiful
customized experiences powered by Shopify APIs.
From there, that work also led me into API infrastructure.
So looking at our GraphQL APIs and trying to understand
first of all, do we expose the right capabilities there?
But second, also once again,
performance capabilities and all the rest.
We have buyers all around the world.
We wanna deliver great user experience to all the buyers.
So like, how do you deploy a global cart?
And how do you deliver the right experience reliably?
And then finally, that led me into the guts of technical
infrastructure, like how do we actually stand up app servers in
our Ruby stack?
Shopify is a Ruby primarily company, right?
So rebooting our application stack and also working on
checkout, which brings us back to the earlier part of the
conversation.
Yeah, full circle.
So we're definitely going to talk checkouts.
Since you've somewhat moved on from, at least, your web performance years,
I'm curious to get your take on recent work,
specifically Core Web Vitals.
Is that something you've been tracking
and do you have a hot take?
Do you like that?
Do you think it hits the mark, misses the mark?
What are your thoughts on that as a metric?
Core Web Vitals was one of my key projects
when I worked at Google.
So it's-
Well that was you.
At least in part, yeah. Well, it wasn't purely me, but it was one of the key things that we incubated.
And part of it was, it's actually the same question. Like, what is the definition of a vital,
right? The intent behind the vital was like a vital
sign, just like you have a heartbeat in a human body. Like what are those things that you measure
about a website that tells you that it's a good experience?
And the key problem that we wanted to solve was first,
come up with some shared agreement across browsers
of how we can measure that reliably
and not just in a lab environment
because the thing that we keep learning time and time again
is that the outside world is just so unpredictable that you have
to measure what happens in the real world. You can bake in all kinds of assumptions into
your model and then you get consistently surprised when you release your application or website
into the public and you're like, wow, I never expected this amount of traffic to come from
this particular region, which happens to have this routing topology,
and my CDN just doesn't account for it.
Or my API is located in North America,
but I have this tidal wave of users coming from,
I don't know, Europe or somewhere else.
And for them, the experience is just that much worse.
So RUM, or real user measurement metrics, are critical.
And Web Vitals was our attempt to, first of all,
define what those are, like what is that subset? And second, what are the recommended thresholds?
Right? Because everyone has a different definition, like is fast 100 milliseconds,
is fast one second. We tried to align on that. So I'm really glad to see that the Web Vitals has
continued to evolve. Like the initial set when we first published, I believe was back in 2020 or so, it focused
on loading metrics.
But we knew even when we were walking into that announcement that we really need to also
talk about interactivity.
Like it's not just that the pixels rendered fast, right?
It's also, hey, is it responsive?
Is the page locking up when I'm trying to interact with it?
How about scrolling?
Like how smooth is that?
So Web Vitals continues to evolve and add those metrics.
And I think that's great.
And it's really important for us as an industry
to have that shared definition of what good looks like.
Shows how fast the internet moves.
Cause I thought Core Web Vitals was still relatively new.
And it turns out it's like, you know, five years ago
and you're the one working on it.
So crazy.
I think it highlights the complexity of the problem.
It takes a long time to propagate
kind of those practices and metrics.
But yes, it's been a long journey.
And it was a great capstone of all the work that I did
at Google on web performance. Shipping Web Vitals felt like a good milestone, and
that allowed me to kind of give myself permission to shift attention to other things.
Right. Well, let's do that. Let's shift attention to checkouts, compliance,
security, PCI, some of these things that honestly scare away, or maybe
lull to sleep, many of us when we start talking about compliance matters. PCI
version 4 is burgeoning, or maybe it's out now. I don't know. Tell us the skinny.
What is PCI? Why does it matter? And then we'll get to what's new in the
latest versions. Sounds good. So first, let's unpack the acronyms,
because those are always not helpful.
PCI stands for payment card industry.
It defines a set of security requirements
that you have to comply with as someone who processes
sensitive credentials.
So an example would be your primary account number,
or your credit card number plus the CVV
and all the other data, right?
There's a set of protections put in place around that:
if you're handling that data,
how it should be treated,
what kind of security precautions those services
must comply with, and all the rest.
And it is a fairly burdensome set of requirements
to comply with; you have to get periodically audited and show that you're in good standing and
all the rest.
So as a consumer, this is great, because fraud on the internet is definitely a big thing.
It's still very much an unsolved problem, right?
Entering a credit card number into a random web form is not inherently a secure undertaking,
right?
But we've managed to build a relatively reliable experience for consumers, thankfully.
Now what's different about PCI v4?
In PCI v3, a key requirement was that you had to protect the service or the surface area
where you're entering your credentials.
So technically how we've solved that as an industry is we said, okay, well, if I want
to accept payments on the web, I'm not going to do the obvious thing, which is to
put a random form on my page and start accepting credit cards, because then I'm accepting
that data. My service will get a POST request,
and I'm going to have the unencrypted payment credentials, and now I'm liable for all of the compliance.
Instead, why don't we outsource this problem? And hey, we actually have a great tool in the web platform.
It's called iframes. Iframes give us the ability to embed an external service that can basically do
this. And we can skin it in a way that it looks seamless,
right? Most pages that you visit on the internet, the payment
form, if you actually open up your dev tools, you'll see that
it's iframed for this specific reason. But it doesn't look
janky, it looks integrated into the website. And the nice
property of the solution is that
basically all of the inputs, all of the mouse events,
are obfuscated from the parent page.
So that means you can wholly delegate the responsibility
for PCI, V3 at least, to your provider.
A common example of that would be someone like Stripe.
If you wanna accept payments on your website,
they provide Stripe Elements:
you import a web component,
pass it a few props, and boom, you have a checkout form.
Under the hood, it'll inject an iframe
and do all the things on your behalf.
So that's great.
And that's been effectively where we've settled.
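To make the pattern concrete, here's a minimal sketch of the iframe approach described above. The payments origin, element IDs, and message shape are hypothetical, not any specific provider's API:

```ts
// Minimal sketch of the PCI v3-era iframe pattern (hypothetical origin and
// message shape, not any particular provider's API). The card inputs live on
// the payment provider's origin, so the parent page never sees keystrokes
// or raw card data.
const frame = document.createElement("iframe");
frame.src = "https://payments.provider.example/card-fields?merchant=m_123";
frame.style.border = "none"; // skinned so it blends into the page
frame.style.width = "100%";
frame.title = "Secure card entry";
document.querySelector("#payment-slot")?.appendChild(frame);

// Cross-origin isolation means the parent can only talk to the frame via
// postMessage, and only receives what the provider chooses to expose,
// typically an opaque token rather than card data.
window.addEventListener("message", (event) => {
  if (event.origin !== "https://payments.provider.example") return;
  if (event.data?.type === "card-tokenized") {
    console.log("token:", event.data.token); // submit this to your server
  }
});
```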
I think PCI v3 came into existence around 2013 or 2014. So the last, let's say 10 years or so, that's
how we've solved that problem as an industry. Now, that is really good, but it's not sufficient.
So in the interim, what we've observed is, hey, sure, you've isolated this particular input into a secure sandbox.
But what happens if your top level page gets compromised?
Let's say you have a supply chain attack, or you have an XSS hole in your checkout page and someone injects a malicious script.
What could they do with that?
Well, what stops them from
removing your secure input form and replacing it with a fake one?
Or maybe providing an overlay and then tricking the user, effectively, into entering information into an insecure
form that then exfiltrates the data and then swaps in the
original, right? This class of attacks is called skimming
attacks, also known as
Magecart attacks, which is a nod to Magento, not to cast shade on Magento. But I think one of the
first published large-scale instances of this attack was against Magento, which had
some flaw. Magento is an open source e-commerce platform. And for better or for worse,
that Magecart attack
name has stuck. Now, to be clear, this is a problem that spans all platforms regardless,
as long as you have some sort of vector for attack. So PCI v4 tries to solve for this particular
problem. It tries to tighten the perimeter to say, it's no longer sufficient to protect
the payment page, you also have to protect or provide some
guarantees around the parent page, the thing that is
embedding this payment form. Okay. And specifically, there's
a set of provisions in the spec itself,
which is very long. I will zoom in on one particular
aspect of this whole conversation, which is section 6.4.3. It's one of these random numbers that you just
remember once you've been sitting long enough in the PCI game. Yep. 6.4.3 defines, in high-level
terms, three requirements. It says, hey, for the parent page, you have to maintain
an inventory of all scripts that are being executed on the page.
And also, please document why they're necessary and how they're being used.
Right. So just give me an inventory. That's one. Two, once you have that inventory,
have some mechanisms to ensure that only those scripts, the authorized scripts are being loaded.
And then finally, have some way to guarantee or check
the integrity of each loaded script, right?
Because you could say, hey, this is my inventory.
These are the scripts I've authorized.
But what if that thing got compromised, as an example?
Right?
Somebody replaced it with a malicious script
because of a supply chain attack or otherwise.
So those three things combined give you
strong assurances about what's executing
on a top-level page, which is great.
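For illustration, here's a sketch of what that bookkeeping could look like at build time for first-party scripts. The file layout and record shape are hypothetical; this is one way to satisfy the spirit of 6.4.3, not anything the spec prescribes:

```ts
// Sketch of build-time bookkeeping for the three 6.4.3 duties (hypothetical
// file layout): inventory each script, document why it exists, and compute
// an SRI hash so each load can be integrity-checked.
import { createHash } from "node:crypto";
import { readFileSync, writeFileSync } from "node:fs";

interface ScriptRecord {
  path: string;          // where the script is served from
  justification: string; // why it's allowed on the payment parent page
  integrity: string;     // SRI-formatted sha384 hash
}

function sriHash(path: string): string {
  const digest = createHash("sha384").update(readFileSync(path)).digest("base64");
  return `sha384-${digest}`;
}

const inventory: ScriptRecord[] = [
  { path: "dist/checkout.js", justification: "first-party checkout UI", integrity: sriHash("dist/checkout.js") },
  { path: "dist/telemetry.js", justification: "performance telemetry", integrity: sriHash("dist/telemetry.js") },
];

// Persist the inventory; a later build step can inject these hashes into
// script tags and fail CI if an unreviewed script shows up.
writeFileSync("script-inventory.json", JSON.stringify(inventory, null, 2));
```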
Now the practical reality of how you go about implementing
that as you can imagine is complicated.
Sounds like a lot of work.
Yeah, exactly.
And it's complicated for two reasons.
First, we should partition this problem.
Like, if the two of us had a checkout page
and we were to sit down and try to think through, okay,
we need to meet these compliance requirements, I would partition the problem into first-party and
third-party scripts. First of all, right? Like, okay, for the first-party content, yes,
we can define a process: we audit which scripts we include, we audit their dependencies,
we have some security review, we have a release process, we have CI checks. Sure, I can give you an inventory of those scripts, right? Also, because they're
first party, I can put a content security policy and maybe even put in subresource integrity,
which are hashes that effectively fingerprint the specific version of that script. So maybe
during my build, I can just enforce the CSP, snapshot the hashes, put those on the thing.
And great, now we have strong assurances that the scripts that are being executed on the page
are mine and tied to a specific version. So far, so good. Now, what about the third parties?
One of the challenges with checkout is it's one of the most important
pages in the entire ecommerce operation. This is where instrumentation is critical,
right? You want performance telemetry. You want to know which elements the user is interacting
with, because that affects conversion. You are likely running A/B tests. So you have
either first party or a set of third party vendors
doing that.
Of course, you have all the conversion pixels
that need to be executed because of all the ad campaigns
and all of the analytics that you
need to drive that entire up funnel, down funnel loop.
And what about all of the other marketing pixels
that you may need?
It's not uncommon.
If you just take an average checkout page
and you open up the network tab, you'll
probably find hundreds of scripts on many of them.
And oftentimes, it'll be, hey, we
load a tag manager that then allows our marketing
and other teams to inject whatever they need to drive the whole process.
But now we're staring at that problem and we're saying, so how exactly do I apply an inventory to all these things?
Because first of all, my partner asked me to put in a tag manager so they can load things.
Well, I need to unwind that decision. Right? Because now
I need to know exactly everything that's being executed. And I need to have kind of
a full transitive chain of all the dependencies. I need to be able to account for that. Second,
how do I know what CSP policy I should define? Can I just
say only load from partner.com, or is the partner also loading from some other CDNs?
Well, you know, I need to go ask the partner what those assurances are.
And then lastly, if I want to ensure integrity, that's not my content.
How do I obtain the hash of the thing?
And then if that partner wants to rev the version of their script, how do I get the hash so I can put the thing inside, when I'm not the
one injecting the content into the page?
So it becomes this like a really complicated rigmarole of like, actually I just cannot
do this.
Yeah, sounds not possible.
Precisely.
Which is one of those things where the standard was written with good intent, right?
And they, in passing, mentioned, hey, you have these tools: you have content security policy,
you have subresource integrity. In principle, in theory, you have the right things to do this job.
In practice, if you unpack your average checkout page on the web, it's like,
I don't know how I would achieve this. I could guarantee maybe a slice of it for the first party,
but how do I solve this for third parties? Right. So turns out it's complicated, right?
I'm here with Scott Dietzen, CEO of Augment Code.
Augment is the first AI coding assistant that is built for professional software engineers
and large code bases.
That means context, aware, not novice, but senior level engineering abilities.
Scott Flexfermy, who are you working with? Who's getting real value from using Augment code?
So we've had the opportunity to go into hundreds of customers over the course of the past year
and show them how much more AI could do for them. Companies like Lemonade, companies like Kodem, companies like Lineage and Webflow.
All of these companies have complex code bases.
If I take Kodem, for example, they help their customers modernize their e-commerce infrastructure.
They're showing up and having to digest code they've never seen before
in order to go through and make these essential changes to it. We cut their migration time in half, because they're able to much more rapidly ramp, find the areas
of the code base, the customer code base that they need to perfect and update in order to take
advantage of their new features and that work gets done dramatically more quickly and predictably as
a result. Okay that sounds like not novice right? Sounds like senior level engineering abilities.
Sounds like serious coding ability required from this type of AI to be that effective.
100%. You know, these large code bases, when you've got tens of millions of lines in a code base,
you're not going to pass that along as context to a model, right?
That would be so horrifically inefficient.
Being able to mine the correct subsets of that code base in order to deliver AI insight to help tackle
the problems at hand. How much better can we make software? How much wealth can we release
and productivity can we improve if we can deliver on the promise of all these feature
gaps and tech debt? AIs love to add code into existing software.
Our dream is an AI that wants to delete code,
make the software more reliable rather than bigger.
I think we can improve software quality,
liberate ourselves from tech debt and security gaps
and software being hacked
and software being fragile and brittle.
But there's a huge opportunity
to make software dramatically better,
but it's gonna take an AI that understands your software, not one that's a novice.
Well, friends, augment taps into your team's collective knowledge, your code base,
your documentation, dependencies, the full context. You don't have to prompt it with context.
It just knows. Ask it the unknown unknowns and be surprised.
It is the most context aware developer AI that you can even tap into today.
So you won't just write code faster.
You'll build smarter.
It is truly an ask me anything for your code.
It's your deep thinking buddy.
It is your stay in flow antidote.
And the first step is to go to augmentcode.com.
That's A-U-G-M-E-N-T-C-O-D-E.com.
Create your account today, start your free 30 day trial,
no credit card required.
Once again, augmentcode.com.
So how did we approach this at Shopify?
I think there's, let me take first a branch into Shopify
and then we can talk about kind of the broader landscape.
We've been on a mission to provide stronger control over behavior in checkout, not just
because of compliance, but because we want upgrade safety, reliability, performance, and
security in checkout.
And our observation is, first of all, for those not familiar, Shopify provides a hosted checkout experience,
where you don't get access to the underlying HTML.
We provide the base UI, and we allow you to configure it.
And it's a very flexible system.
You can customize the branding.
You can introduce custom components.
You can install apps that introduce components.
You can do a lot of
customizations to make it feel like your own. But a key principle that we've been operating on is that
we want a set of predefined results. We're going to define the UI elements because we want to
preserve consistency and experience, and we want to optimize for performance, security, and all the rest.
What that allows us to do is to say,
actually, we're not going to allow any third-party scripts
in our top-level page.
And that is a very consequential and big decision.
This has been work that we've been on an arc
for about three years, if not more, to achieve,
and we're finally there.
And now we're reaping the benefits of that.
So then the question is, wait a second.
So you excluded all third-party scripts,
but what about all those shiny things
that you just mentioned earlier, right?
The analytics, the customizations,
the everything else.
And this is where sandboxing comes in.
So our decision was to say, effectively,
the moment you introduce a third-party script into the top-level page, you have untrusted content and you've compromised all integrity of the top-level page.
We cannot provide any assurances on its integrity.
Right? Because in the past, when we did allow our merchants to bring their own JavaScript into the top-level page, they
end up doing things that break compatibility. Like they'll hook into a specific selector,
right? To inject an element, knowing full well that we've never defined a contract for it.
And then if we change that, we will break them. And then security is compromised as well,
because they're introducing their own scripts, and we can't provide any
assurance. So we took away that capability and said, instead,
we're going to give you a sandbox. So we're going to spin
up a set of web workers and give you a bridge. So for example,
we built a library and open sourced it, called Remote DOM,
which allows you to construct an element tree in an isolated worker that operates off the main thread.
And then that UI is reflected back for you in the parent page.
So it feels like ergonomically, DX-wise, it feels still very straightforward because you're just manipulating elements.
And we provide a predefined set of UI elements that fit into the Checkout
UI and work with all the branding primitives. But we do that work on your behalf. And the
critical part is because we control the bridge between the web worker and the top level page,
we have tight control over what kind of mutations can be pushed between the parent and the isolated worker. So you can't just arbitrarily inject JavaScript
or perform unsafe operations on the parent page.
So in that way, we can take any third party script,
put it into a sandbox, and say, you know what?
You can do whatever the heck you want in that environment.
Because you can load a transitive chain
of other dependencies.
We don't particularly care because all we know
that the operations that you can pass back
to the parent page are safe and approved set
that we will allow.
And we also control what data is exposed to you.
So for example, if you have an extension
that wants access to some sensitive buyer data.
First, that application and then the worker itself needs to have the right consent.
So a worker that has not been granted the right consent by the merchant or the buyer will just not have access to that data.
So that is our solution for extensibility and allows us to partition the problem
of first party and third party content.
It's based on remote DOM.
And then we use the same technology for our pixels or analytics as well,
where we define an event bus, we emit all the events,
analytics providers are executed in the sandbox as well.
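As a rough illustration of the pattern (not Remote DOM's actual API, which lives at github.com/Shopify/remote-dom), here is how a parent page might reflect UI from a sandboxed worker while honoring only a vetted set of operations:

```ts
// Conceptual sketch of the sandbox bridge pattern; the operation and
// component names are made up for illustration. Third-party code runs in a
// worker; the parent applies only a small, vetted set of mutations, so
// arbitrary DOM access or script injection is impossible.
const ALLOWED_COMPONENTS = new Set(["Banner", "Checkbox", "TextBlock"]);
const worker = new Worker("/sandboxed-extension.js");

// Inside the worker, the extension would request UI via postMessage, e.g.:
//   self.postMessage({ op: "append", component: "Banner", text: "Free shipping!" });

worker.onmessage = (event) => {
  const { op, component, text } = event.data ?? {};
  // Only an approved operation on an approved component is reflected.
  if (op !== "append" || !ALLOWED_COMPONENTS.has(component)) return;
  const el = document.createElement("div");
  el.dataset.component = component;
  el.textContent = String(text); // textContent, never innerHTML: no markup or script injection
  document.querySelector("#extension-slot")?.appendChild(el);
};
```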
Is that a compromise in terms of functionality?
Do you get 100% of what you could do before
in terms of what you all are providing?
Or are you constraining people
and losing features along the way?
Yeah, you asked exactly the right question.
So the answer is we've had to rebuild a lot of stuff
because a web worker, if you're familiar,
is not the same thing as working in the top-level page.
It doesn't give you access to the DOM.
It doesn't expose all the same events.
So the reason it took us as long as it
did to layer all of this infrastructure
is because we had to work with partners and replicate.
So what do you actually need?
Instead of raw access to the DOM tree,
what are you looking for?
For example, if you were building a heat map
solution, as an example, some of our merchants
are really keen on having very visual, clear understanding
of how users are behaving on their checkout page,
you need a lot of different access
to a lot of different events and elements.
OK, well, let's work through that and figure out
what is the right subset that we can expose via this bridge.
So over time, we've built up a collection of these APIs
and primitives, some of which effectively replicate
what is available on the parent page.
One of the challenges here, by the way,
is if you ever worked with web workers,
is they use asynchronous communication.
So you have to post message between a web worker
and the top-level page, whereas a lot of the DOM APIs
are synchronous APIs.
So if you're just naively writing code expecting to be executed on top- level page, whereas a lot of the DOM APIs are synchronous APIs. So if you're just naively writing code,
expecting to be executed on top level page,
you would use synchronous APIs.
So we had to shim some of that and like in places,
we try to keep it as close as we can
to what you would expect as a developer
because we don't want to impose additional friction.
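To illustrate the shim idea, here's a sketch of wrapping the asynchronous postMessage bridge in a promise-based API inside the worker; the property names and message protocol are hypothetical:

```ts
// Sketch of shimming a synchronous-feeling read over the async worker bridge
// (hypothetical protocol). The sandboxed code awaits a promise instead of
// reading, say, document.title synchronously, which a worker cannot do.
let nextId = 0;
const pending = new Map<number, (value: string) => void>();

// Worker side: turn a request/response pair over postMessage into a promise.
function query(prop: "title" | "url"): Promise<string> {
  const id = nextId++;
  return new Promise((resolve) => {
    pending.set(id, resolve);
    self.postMessage({ type: "query", id, prop });
  });
}

// The parent page answers each query and posts back a "query-result".
self.onmessage = (event: MessageEvent) => {
  const { type, id, value } = event.data ?? {};
  if (type === "query-result") {
    pending.get(id)?.(value);
    pending.delete(id);
  }
};

// Usage inside the sandboxed extension: async, but ergonomically close to
// the DOM API the developer expected.
async function run() {
  console.log("page title:", await query("title"));
}
```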
But in certain places, we had to provide the replacement APIs
where we said,
look, you're building for Shopify.
It will operate across a scale of millions of merchants
if you're building an application.
It is worth it for you to do this extra step,
because then you have all of these assurances in place.
So a lot of handholding with partners
and getting the developers to adopt all of those APIs.
But the benefit of all of that work today is,
I'm not gonna say we're done
because there's still more things to build,
but we're in a really good place
because now all of our merchants
are running on the sandboxed primitive
that I've described.
And what we can provide is, first of all, upgrade safety.
We can safely roll forward our capabilities in checkout,
knowing that customizations that you've deployed
will not break as you move forward, right?
Because we control the bridge, we control the API interface.
So if we change the underlying API on our side,
we can still provide guarantees about that.
We have reliability.
We know that.
So for example, we saw cases where merchants would inject scripts where a partner
would just time out. So they would have some logic, and for
some reason their service goes down, and then the checkout is
broken, because, well, it's just waiting to render. It's like,
Shopify, you've broken the checkout. And it's like, actually,
it's your script that you injected, of a partner
that failed to scale to your flash sale.
So now we have assurances about that.
And then finally, performance and security.
Another benefit of putting work into the sandbox is it moves all the work off the main thread.
So you can't have code that monopolizes the main thread and renders the UI unresponsive,
which gets back to our Web Vitals conversation, right?
Like we can make better performance guarantees about how the page is loaded, how responsive it is, and the rest.
And finally, there's a security bit, which is we know that you can't inject arbitrary content on top-level page and exfiltrate data.
And then finally, you have PCI compliance, because now we have a clean partition where we say as Shopify as a platform,
we will provide all of the inventory, authorization, and integrity checks for the first-party scripts
that are executed on the top-level page. And oh, by the way, you can totally bring third-party content,
but we will execute it in this isolated context,
which allows us to punt that problem and not have to worry about all of the integrity problems that happen
when you just include it in the top-level page.
Hmm.
So did I hear you right that you said all your merchants
are already using this?
You're able to deploy that without,
or did you not say all?
Yeah, yes, all.
So this has been a long journey to move all of our merchants onto this new platform.
But as of earlier this year, like 99.9% of our merchants are on this platform.
There may be like one or two exceptions, but effectively any Shopify power checkout that
you visit today as a consumer is running on this infrastructure.
And was that something that they had to opt into, or did you just do it for them?
Like, how'd you all roll that out?
You said it took a long time, but what did it look like?
Well, it took a long time because of the right question that you asked, which is,
Hey, what did you have to take away?
Right.
And the answer is, we had to rebuild a lot of the capabilities because we've
created this isolated environment.
We had to recreate a lot of APIs.
So a lot of our work was working with other developers, partners who provide capabilities that merchants want in checkout
to make sure that they can bring the same capabilities into this new world of sandbox execution. That was the long haul. And then for some merchants that had ability
to manipulate content in the top level page,
it was a combination of documentation, handholding,
consulting, and just getting them to move to the new world
so they can benefit from all of these capabilities.
But we're there, and the time is right,
because now you have PCI v4 compliance
effectively taken care of for you.
And do you think that PCI v4 compliance means
you cannot be skimmed in the way that you could prior?
Or do you think it could still happen
in new and exciting ways?
Right, right.
So I think this actually is another layer that we should add here.
What I've described is runtime compliance or runtime guarantees.
Right.
So the thing that we've built actually allows us to provide assurance, or
extend some guarantees: we just know that it's not possible
to inject third-party content.
So if you have a supply chain attack on that, it's isolated into a thing where it doesn't
matter.
Right.
In practice, I think what a lot of other players and e-commerce providers will end up using
or how they will provide compliance is retroactive monitoring. So PCI does not enforce a requirement
that you have to have runtime guarantees.
What it says is, hey, you should have a process
that provides an inventory,
make sure that scripts are authorized
and you have the integrity.
It doesn't specify that it needs to be guaranteed.
So practically, how could you implement this?
And how do most, like if you go and search
for PCI compliance security products,
you will find plenty that will basically say,
hey, I know a great solution for your PCI problem.
You know what it is?
Deploy my JavaScript into your page
because more JavaScript is always a solution.
And I will instrument the page and listen
for all the things that are happening. I will observe all the other scripts. I will build an
inventory. I will monitor if it changes. I will try to provide hashes and effectively, I'll like,
you can delegate this problem to me. Now you can see a flaw in that reasoning, right? It's like,
how do you know that your script is not gonna get compromised either?
Who watches the watcher?
Well, there's that.
And how do you know that the malicious thing
doesn't come up with a clever way
to obfuscate itself from you, right?
It's the antivirus problem.
Right, like cat and mouse.
Virus hiding, exactly, virus hiding
from the antivirus problem.
But that is likely a solution that many
will adopt as a retroactive solution. So effectively, you observe if anything has changed. It's like,
oh, well, that's odd. I'm seeing a set of reports for a script that I did not expect relative to my
inventory as I defined. Does that indicate that I have a problem on my side?
Probably, right? So there's some guardrails that PCI sets for like how long that period can be
and how you need to react to it, but it is a strictly lesser and less secure experience.
Which gets back to your question. Like if you have these assurances, does it mean that the class of
attacks is eliminated?
The answer is it depends on how you implement it.
Right.
So in our case as Shopify, I would feel pretty strongly about extending a promise of like,
yeah, unless our content, first party content is compromised, it would be very hard to compromise
this page.
Now, what we cannot control at Shopify is if
the buyer has installed a browser extension
that injects arbitrary scripts into the page.
That is outside of our control,
because that operates at a higher layer.
Or maybe you even have malware on your computer
that does things and inject content into the page
or otherwise intercepts, like when you're typing.
Like those things are still possible.
It's not that we've completely eliminated this type of attack,
but it certainly makes it a heck of a lot harder,
because now it means that, at a minimum,
merchants are required to detect
these changes or these attacks and remediate,
so they can't just go unnoticed.
So this all sounds like a lot of really good work you all have done at Shopify for Shopify
and Shopify's customers.
Thinking bigger, it would be great if your hard work and years of rethinking this runtime
and sandboxing, actually providing the security that PCI v4 wants everybody to have, whether or not they do it to be compliant...
can't some of that get into the browser? Like, couldn't we just build it in?
Could your work extend beyond Shopify's borders and help other people too?
This is not just about Shopify, it's about improving the buyer experience on the web holistically.
Two things to answer that. First of all, the Remote DOM library that I mentioned,
it's an open source project that we've built. So if you go to github.com/Shopify/remote-dom, you'll find it there.
Take a look at it, use it. That's the technology that powers Shopify checkout.
Other large companies have already adopted it. I believe Stripe is using it for their apps.
Actually, fun story. When we built the project, I think Stripe beat us to using it in a production
product. Really? Even though we were the ones developing it for our checkout.
But it is used at Shopify and by other big players to provide this form
of isolation. And the general pattern is, hey, I have a trusted first-party surface
into which I want to bring in third-party content, and I don't want to compromise
the integrity of my first-party top-level surface.
Well, Remote DOM is one of the technical solutions for that. So please take a look at that. That's
answer number one. Second though, and coming back to the browser conversation, absolutely.
The primitives that we have in browsers today, content security policy and SRI, we can make
better. And we've actually done a bunch of work
on exactly that at Shopify.
We don't want to do work in JavaScript
that we could push into the browser
because the browser is just much more efficient
and it has capabilities that we otherwise would be very hard
for us to replicate.
So first let's enumerate some trivial examples
of gaps: script integrity.
So subresource integrity,
for those not familiar: on your script tag, you can pass in, effectively, a hash. So when you inject
the tag into your HTML, you can pass in a hash that is a fingerprint. And when the browser loads
the script before it executes it, it can compare the hash of the thing that it fetched versus what
you've defined and say, hey, those two things match, great, I will execute the script.
Otherwise, I'm going to raise a violation and not execute it.
That's a big capability that exists in browsers today.
It's not simple to deploy, but it is doable, right?
Because you need to figure out how do I get these hashes and how do I inject them at the
right place?
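As a small worked example, here's one way to compute an SRI hash for a script at build time (Node; the file and CDN paths are placeholders):

```ts
// Sketch: computing a subresource integrity hash for a script (paths are
// placeholders). The hash is the base64 digest of the file's bytes,
// prefixed with the algorithm name.
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

const body = readFileSync("dist/analytics.js");
const integrity = `sha384-${createHash("sha384").update(body).digest("base64")}`;

// The resulting value goes on the script tag, e.g.:
//   <script src="https://cdn.example.com/analytics.js"
//           integrity="sha384-..." crossorigin="anonymous"></script>
// If the fetched bytes don't hash to this value, the browser refuses to run it.
console.log(integrity);
```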
But then one of the gaps that existed for a long time
was module imports.
So SRI worked for top level scripts,
but if you're building a JavaScript application
and you're using an import,
you just could not pass in an integrity hash.
Why?
Well, because module imports came
after subresource integrity was designed.
It was just never a thing.
That was a pain point for us because we used module imports
at Shopify, so we worked with Chrome and Safari
to upstream some patches to get that supported
for module imports.
So the good news is that's now baked in,
I believe as of May of 2024,
I think when Safari shipped it in their release, both Chrome and Safari
support SRI for module imports.
So that's one.
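One place this surfaces, if I understand the shipped feature correctly, is the integrity section of an import map, which pins module URLs to hashes since plain import statements can't carry an attribute. A sketch, with placeholder specifier, path, and hash:

```ts
// Sketch: SRI for module imports via an import map "integrity" section
// (placeholder specifier, path, and hash; double-check the exact surface
// against current browser documentation). The map pins each module URL to
// a hash, closing the gap where `import` statements couldn't carry one.
const importMap = {
  imports: {
    "checkout-ui": "/assets/checkout-ui.mjs",
  },
  integrity: {
    "/assets/checkout-ui.mjs": "sha384-REPLACE_WITH_REAL_HASH",
  },
};

// Emitted into the page before any module loads:
const tag = `<script type="importmap">${JSON.stringify(importMap)}</script>`;
console.log(tag);
```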
Another thing that came up in our thinking
when we were exploring CSP compliance
and how do we make our own life simpler
is this idea of require-sri-for.
So what if you could express content security policy that
says, hey, all scripts must have an SRI or integrity hash?
Gotcha.
All right.
And why is that useful?
Well, then you can make a strong claim
that if you have that policy being enforced by the browser,
then if for some reason
you sneak through by accident or malicious act, a script that doesn't have it, they would just be
rejected, right? Which today would just execute normally without any questions. And even though
that might be hard to deploy in an enforcement mode, it could totally work and be really useful
in report only mode. So for those not and be really useful in report-only mode.
So for those not familiar with content security policy, you have an enforcement mode and a
report-only mode where you can get violations, which is incredibly useful because you could say,
hey, this is a policy I would like to enforce. What are the violations?
So with require-sri-for, you could deploy this in report-only mode and say,
great, now I'm going to get reliable reports from the browser
any time it detects that a script is missing an SRI attribute.
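A sketch of what deploying that in report-only mode might look like from a Node server; the directive spelling follows the discussion here, and the reporting origin is a placeholder:

```ts
// Sketch of emitting require-sri-for in report-only mode (placeholder
// reporting origin; verify the directive spelling against the current CSP
// spec before relying on it). Scripts without an integrity attribute still
// run, but each one fires a violation report.
import { createServer } from "node:http";

createServer((req, res) => {
  res.setHeader(
    "Content-Security-Policy-Report-Only",
    "require-sri-for script; report-uri https://reports.example.com/csp"
  );
  res.setHeader("Content-Type", "text/html");
  res.end("<!doctype html><title>checkout</title>");
}).listen(8080);
```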
This is great, because sophisticated attackers would not inject these scripts on every single page load. They might target specific users or a class of users, or maybe they target a specific browser,
or maybe, if it's an extension, it'll apply some sort of other heuristic, right?
It's very hard to... This kind of mirrors our conversation on why RUM is important,
real user measurement.
Gathering violation reports from real users gives you a much better and reliable signal
for where the problems are.
So require-sri-for is another capability that we've shipped into Chrome, and that allows
you to get violations on missing SRI attributes, which allows you to build an inventory of,
like, this is the list for me to burn down and figure out why, right? And if anything changed, how should I react to it?
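For reference, the classic violation report body that lands at the endpoint looks roughly like this; the exact field set varies by browser and by reporting API version:

```ts
// Approximate shape of a classic CSP violation report body (fields
// abridged; exact set varies by browser and reporting API version).
interface CspViolationReport {
  "csp-report": {
    "document-uri": string;       // page where the violation occurred
    "violated-directive": string; // e.g. a require-sri-for or script-src directive
    "blocked-uri": string;        // e.g. "https://example.com/xyz.js"
    "original-policy": string;    // full policy that was in effect
  };
}
```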
Another example is, okay, great. Now we have these reports coming in. Wouldn't it be nice
if we could also get the hash of the content, right? Today you would just get a report saying, hey, I detected a script from example.com/xyz.js.
But what was the content of that?
You don't know.
Right.
Wouldn't it be nice if you could also
get a hash in the report such that you could audit it
and say, oh, well, maybe that's totally OK
because the partner revved their version
and it just happens to be the v2. I just put that into my approved list and everything's fine versus
I have no idea if that was a compromised version or a legitimate version of the script.
Interesting. So pardon my ignorance for a moment, but where does the reporting take
place, or post to? The browser's doing the reporting,
is it?
Who gets the report, and how?
Does the browser send it off somewhere?
Yep. So on the wire, when you emit a page,
you can define a content security policy,
a CSP policy, in a header.
And you would define, for script-src,
for example,
a list of origins from which you're allowed to fetch,
same for images and all the rest.
You also have a report-to target, and a separate reporting
header where you specify
the endpoint to which you want the violation
report to be sent.
And as good hygiene, that reporting endpoint
should ideally be a distinct origin and all
the rest.
But you provide a destination.
So you can find services that will do this for you.
They'll just say, point your reports to us,
and we will provide a dashboard in which you can drill down into
reports.
We will aggregate.
We'll give you metrics and all the rest.
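Wiring that up might look like the following sketch, with placeholder origins; the page's response names a reporting endpoint and the policy references it by name:

```ts
// Sketch of wiring a CSP to a reporting destination (placeholder origins).
import { createServer } from "node:http";

createServer((req, res) => {
  // Name an endpoint; as good hygiene, host it on a distinct origin.
  res.setHeader(
    "Reporting-Endpoints",
    'csp-reports="https://reports.example.com/ingest"'
  );
  // The policy references the endpoint by name via report-to.
  res.setHeader(
    "Content-Security-Policy",
    "script-src 'self' https://cdn.example.com; report-to csp-reports"
  );
  res.setHeader("Content-Type", "text/html");
  res.end("<!doctype html><title>page</title>");
}).listen(8080);
```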
That's something that we do in-house at Shopify and I think many other large providers will do
on their own, but you could outsource that problem. But just having the ability to even get the report
with, hey, a report has been emitted because the script is missing an integrity hash is by itself
a really useful capability because otherwise
you'd probably have to set up some sort of crawling infrastructure that periodically
checks your page and says, you know, I access this page from five different points on the
globe every 24 hours and I observed that nothing has changed. Well, that's good, but we could
do much better by just actually observing what the real users are seeing
and getting the actual reports of violations.
Gotcha.
So this new require-sri-for would work in like manner
to the CSP violations in terms of reporting.
So the CSP policy is require-sri-for script.
Right, so you're saying all of my script resources must have a hash.
And then you can configure that to be a report only policy such that it would still execute
if the script is missing the hash, but you would get the violation fired in the background.
And the browser has its own logic for prioritizing and batching delivery, and doing all of that
to get you the report.
Now, do you deploy this one in Shopify?
Yep.
And do you use it in report mode
or do you use lockdown mode, or how do you use it?
So for this one, it would be a report mode,
but it depends on the shape of your checkout, on how much control you have over
your first party or third party content. Just to double back on that, for Shopify, for our
checkout, we enforce a CSP policy. Actually, let me run through the whole list. For our
first party content, we have a well-defined process for vetting all the dependencies and
a process for updates, auditing to make sure that we provide some guarantees over if the
library that we depend on has been compromised, how can we detect that?
We have change management process for it.
So this is the reviews, testing, CI, all the things that you would expect.
That allows us to create the inventory.
We know from where it's served, which means that we can define a strict CSP policy that
says you should only fetch from these sub-origins that we trust.
In our build step, we can inject the hashes, the SRI hashes for known content.
And we can also emit the require-sri-for policy to ensure that if, for some
reason, we missed some script, we will get a violation on it. Because we
don't want to break checkout, but we want to be notified immediately if those things
are detected, so we can react to it.
And we have our own reporting endpoint, which we aggregate.
We look at the reports. This is a thing that merchants don't have to worry about because we do this work on
their behalf.
And we can provide this guarantee of overall integrity.
And then finally, we've protected the parent page, but the payment credentials page or
the payment form itself is also isolated into its own iframe, just as it was before.
So this is a defense in depth, right?
We protected the parent,
but we also have our own implementation of the iframe
and like the full PCI compliance
behind that particular form.
Well, that's a lot of stuff for PCI compliance, Ilya.
What happens with V5?
How many years are you gonna put into that one?
I don't know.
That's a good question.
I'm pretty sure that V4 will keep us busy for a long while.
Yeah, because this is only section 6.4.3, right?
That's all we're talking about right here.
That's right.
There's still all the others.
Okay, so interesting stuff.
It sounds like you've solved some really difficult technical challenges in order to do this in
a way that's not just compliant, but actually in the spirit of the compliance as well, like
trying to actually make it more secure.
What are some takeaways for listeners out there?
Maybe they're doing their own checkout.
Maybe they have compliance they need to do.
Maybe they just want some more secure websites.
Like what do you think they could be thinking walking away from this?
If they're not in the actual situation that Shopify is in and having to implement
this stuff, what could they learn from this conversation?
Yeah, I think the meta pattern and message takeaway here is broadly the integrity and security of first party
versus third party content.
We mix first party and third party in most contexts.
But even outside of checkout, there are many surfaces.
Let's say you have an admin surface or a privileged surface that you only want certain users to access,
and you want some extensibility in there.
You want to bring in third-party content or customization in all the rest.
The pattern that we're describing with isolating third-party content
is a generic pattern that you can deploy there.
We use the same sandboxing technology in checkout,
and we use the same technology in our admin. So for merchants, we allow customizations
and third-party developers to bring in their custom UI and other aspects. As you can imagine,
that's a very sensitive surface. Order data is there, customer data is there. You don't
want to just open up a Pandora's box of injecting arbitrary
JavaScript, because who knows where that data might travel. So the isolation primitive,
it may be remote DOM, it may be something else, but this way of thinking of isolating into either
an iframe or a worker, I think is a pattern that we should be adopting
more widely.
And it has these additional benefits.
You have better assurances about security, yes, performance as well, because you're isolating
content and moving it off the main thread.
You get to define the API contract, so you have better upgradability if you need to maintain
that. And I think that's just something that we need to get
better at on the web.
Now, the challenge I think for all of us
and kind of as industry practitioners is to think through
boy, the worker is kind of this like naked environment.
We should think about
how we figure out some better set of APIs
where we don't have to reinvent the entire wheel, just as we did at Shopify. For, great,
now I want to build a heat map thing.
What does that mean?
How do I mirror the entire stream of events from top level page into this isolated environment?
I think we can do some thinking and innovation there.
Very cool. Anything else that's on your mind that we haven't discussed in this
context or honestly in any developer context, I always love to hear your
opinions on stuff. Anything else on your mind?
I think one really interesting topic coming,
coming back to the world of checkout and commerce is of course agents and how
agents will interact or how they might affect any of these behaviors.
Yes.
MCP, are you down with MCP?
That's the newest acronym, Model Context Protocol.
It's burgeoning.
Yep, yep, MCP is definitely top of mind
and we're looking at it intently.
We're using it for a number of tools
and internal services at Shopify.
We're also considering if and how we should be exposing
MCP as a protocol and endpoint as a service on behalf of merchants. So imagine you could
have a merchant storefront as a remote MCP endpoint. But more broadly, let's imagine
you interacting with an agent, asking it, hey, I'd like a pair of white sneakers, size 10,
$50 to $100 range.
Please go find me a pair and check out.
Under the hood, the agent might crawl the web,
find the storefront, add to cart, head to checkout.
And what does it do then as it's looking at a payment form?
Is it a responsibility of the agent
to hold onto your payment credentials?
And what are the implications of that? How does it enter those credentials? Are there any security and compliance
problems or challenges in that? I think that's a wide open question that we as an industry are yet
to figure out an answer to. Is the human required in that loop? What if it's an accelerated checkout
where maybe information is vaulted?
I think there's a range of questions and answers
that we need to figure out in this space.
What's your personal thought on is the human required
in the loop?
How do you feel, confidence-wise, on removing
the human from that loop?
I think it's context-dependent.
I think there's definitely a class of commerce
in certain types of transactions where I know
exactly what I want.
There's very low risk and it's kind of a predefined flow
where I just say, look, I need another carton of milk.
You know exactly what I'm looking for.
You know where to shop and please go finish it.
And I just want it at my front door.
And then there's other types of experiences where maybe this is your first time engaging
with a merchant. Maybe the merchant has a set of rules where they actually require you, or require the
agent, to decelerate, because, hey, for compliance reasons, I may need to verify your age, or I need
you to read this disclaimer
on this product before you purchase it. You can't just have the agent blindly ignore that
context or click approve and then proceed with the transaction. So I think we'll need
to define some protocol or shared mechanism to signal to agents, like, hey, in this particular case, I
need you to pause and ask for a human to either confirm or take over control and complete
the transaction.
There's so many questions there.
I just don't feel like I even have the brain right now to analyze all the things that have
to be considered.
I'm glad that you're on it. Are you going to be working on this for Shopify?
Are you going to stay on the PCI island?
What's next for you inside of Shopify?
Is this an active thing that you're thinking about for Shopify?
It is definitely an active area of exploration for us.
That is one of the things I'm looking at with our team
and many of our partners who are building these agents,
who are trying to figure out what is the future of checkout where agents drive some meaningful
portion of that experience.
What does a good experience even look like in that context?
So I think those are all very interesting and pertinent questions, given where we are
today.
Hmm.
Well, I'll have to have you come back in a year or two and let us know what you end up building
as you've figured it all out.
You seem to have figured out at least this hairy technical problem that comes with this
new PCI stuff.
So I'm sure you'll figure out something.
Yeah, we'd love to be back.
And at the rate that we're moving in the AI world in a year or two for now, who knows
what will be there?
So yes, I'm trying to think of the most recent person who said,
six to nine months and LLMs will be writing a hundred percent of code.
So, I mean, who knows, man. You and I will be out on the street
corner talking about this stuff.
I doubt that is the case, but... Yeah, me too.
But, you know, not a week goes by that somebody doesn't declare software engineering is dead or dying.
So, I had to squeeze that one in there.
Yes, I think what we're actually saying is the definition of what software engineering is is changing.
Right?
I am constantly amazed by what AI is capable of doing in terms of software development.
But I'm also constantly surprised
by the silly and stupid mistakes that it makes.
And oftentimes those mistakes are actually due to misunderstanding, or poor definition
of the problem that's being solved.
It's kind of putting the mirror back to yourself, right?
Because oftentimes I'll find that like, actually, you know what?
You did exactly the right thing, the way I expressed
it. But that's not what I meant. And I didn't even know what I
meant when I typed it. Because now that I've seen the mistake,
I understand what I was actually trying to get to. So it is this
art of defining the problem, and rubber duck
programming. And I think we're heading more and more towards
the world where we're actively collaborating instead of hands
on keyboard, typing if statements.
Yeah, the best rubber duck programmers might be the best programmers of the future.
The ones who can just talk it out the best, you know, figure it out as you go.
All right, Ilya, appreciate you coming on the show and chatting with us and looking forward to having you back soon.
Thank you, Jared.
Okay, so it turns out securing e-commerce checkouts has never been more complicated.
But thankfully, brilliant engineers like Ilya and his team at Shopify are putting in the
work and some of that work is making its way back into the web platform.
I love when that happens.
And when you think about it, the complicated nature of it all makes sense.
The stakes have never been higher.
I read the other day that last year, e-commerce sales soared
to a record one point two trillion dollars.
That's a lot of moolah being transferred.
And if you can hack it, you can jack it.
So, yeah, it's complicated for a good reason.
Let's give one more thanks to our sponsors of this episode.
Retool, Augment Code, and of course, Fly.io.
Check out their wares to support their work, which supports our work, which we appreciate.
Thanks also to our Beatmaster in residence, Breakmaster Cylinder.
Did you know our next full-length album is almost ready? I'll tell you right now,
it's called Afterparty.
And I'll also tell you right now that I've been bumping it all week.
I dig it.
Hopefully you will too.
Soon.
So soon.
Alright, that's all from me, but we'll talk to you again on Changelog & Friends on
Friday.
Bye y'all.