The Changelog: Software Development, Open Source - Securing ecommerce: "It's complicated" (Interview)

Episode Date: March 20, 2025

Ilya Grigorik and his team at Shopify have been hard at work securing ecommerce checkouts from sophisticated new attacks (such as digital skimming) and he's here to share all the technical intricacies... and far-reaching implications of this work.

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to the changelog where each and every week we sit down with the hackers, the leaders, and the innovators of the software world to pick their brain, to learn from their mistakes, to get inspired by their accomplishments, and to have a lot of fun along the way. On this episode, I'm joined by Ilya Grigorik, Distinguished Engineer and Technical Advisor to the CEO at Shopify. Ilya has been hard at work securing e-commerce checkouts from sophisticated new attacks such as digital skimming, and he's here to share all the technical intricacies and far-reaching implications of this work.
Starting point is 00:00:45 But first, a mention of our partners at Fly.io, the public cloud built for developers who ship. You know we love Fly, and you might too. Check them out at Fly.io. Okay, Ilya Grigorik on the changelog. Here we go. Well, friends, I'm here with a good friend of mine, David Shu, the founder and CEO of Retool. So David, I know so many developers who use Retool to solve problems, but I'm curious, help me to understand the specific user, the particular developer who is just loving Retool? Who's
Starting point is 00:01:27 your ideal user? Yeah, so for us, the ideal user of Retool is someone whose goal first and foremost is to either deliver value to the business or to be effective. Where we candidly have a little bit less success is with people that are extremely opinionated about their tools. If for example, you're like, Hey, I need to go use WebAssembly. And if I'm not using WebAssembly, I'm quitting my job, you're probably not the best ritual user, honestly. However, if you're like, Hey, I see problems in the business and I want to have an impact and I want to solve those problems. Retool is right up your alley. And the reason for that is ritual allows you to have an impact so quickly. You could go from an idea, you go from a meeting like, hey, you know,
Starting point is 00:02:07 this is an app that we need to literally having the app built at 30 minutes, which is super, super impactful in the business. So I think that's the kind of partnership or that's the kind of impact that we'd like to see with our customers. You know, from my perspective, my thought is that, well, Ritual is well known. Ritual is somewhat even saturated. I know a lot of people who know Retool, but you've said this before. What makes you think that Retool is not that well known?
Starting point is 00:02:32 Retool today is really quite well known amongst a certain crowd. Like I think if you had a poll like Engineers in San Francisco or Engineers in Silicon Valley even, I think it'd probably get like a 50, 60, 70% recognition of Retool. I think where you're less likely to have heard of Retool is if you're a random developer at a random company in a random location like the Midwest, for example, or like a developer in Argentina, for example,
Starting point is 00:02:58 you're probably less likely. And the reason is I think we have a lot of really strong word of mouth from a lot of Silicon Valley companies like the Brex Brexit, Coinbase, Doordash, Stripes, etc. of the world. There's a lot of chat, Airbnb is another customer and Nvidia is another customer. So there's a lot of chatter about Retool in the Valley. But I think outside of the Valley, I think we're not as well done.
Starting point is 00:03:16 And that's one goal of ours to go change that. Well, friends, now you know what Retool is, you know who they are. You're aware that Retool exists. And if you're trying to solve problems for your company, you're in a meeting, as David mentioned, and someone mentions something where a problem exists and you can easily go and solve that problem in 30 minutes, an hour or some margin of time that is basically a nominal amount of time. And you go and use retail to solve that problem. That's amazing.
Starting point is 00:03:47 Go to Retail.com and get started for free or book a demo. It is too easy to use, Retail. And now, you know, so go and try it once again. Retail.com. So I'm here with Ilya Grigorik from Shopify back on the show after years and years. You've been on the show, I think, four or five times. Welcome back. Thank you. Glad to be back.
Starting point is 00:04:26 What have you been up to, man? I think it was 2021, last time you were on the show, we were talking to Hyrogen, you're still at Shopify, so you've been there a very long time, what have you been up to? So I think last time we talked about custom storefronts and a big mission we had at Shopify to enable developers to build customized storefronts
Starting point is 00:04:42 using their own application stack. Since then, I've spent a lot of time diving into our APIs and infrastructure. And then also, kind of in a roundabout way, ended up spending a lot of time in checkout, which at the end of the day is kind of the engine of the entire e-commerce operation, right? Like an analogy, perhaps an apt analogy is kind of like air traffic controller within your commerce operation because everything, all the planes have to land there. You can think in different pieces in isolation. You have taxes, you have shipping, you have fulfillment concerns, you have inventory, but all of that has to come together during checkout where you have all the different
Starting point is 00:05:18 policies, all the different negotiations, all the UI that needs to be in place. And that has been a really interesting and complex domain to kind of wrap your head around and navigate through. And of course, I think today we're gonna dive into one particular aspect of it, which is the compliance aspect, which I admit is not something that I thought I'd be working on, but it turned out to be
Starting point is 00:05:41 at a really interesting technical challenge. So you've been at Shopify for how long now? In dog years, it feels like forever. In chronological time, but it turned out to be a really interesting technical challenge. In dog years, it feels like forever. In chronological time, I think it's been four years, but it's been a pressure cooker. Yeah, and you before that, I think, was it GitHub and Google, or you started PostRank? Can you tell us just briefly your travels? Sure, let's see how far do we wanna rewind. I started my professional career as a founder of a startup. This was back in the 2011 era.
Starting point is 00:06:13 And our insight at the time was on the heels of web two and all of the social things that are happening, blogs at their heyday and all the rest. We figured that we could create a better search algorithm. So if you think of PageRank as the original PageRank of treating links to perform the ranking, effectively that's a thumbs up, right? Except that when we approached this problem, and actually it was not, it wasn't 2011, it was 2008, we observed that there was a lot of extra signals available, like there
Starting point is 00:06:46 was literal thumbs up from different social platforms, you could, you could leave comments, you can share them on different surfaces. So if we could aggregate all of the signals, we could build a better kind of human driven algorithm for identifying what are the interesting topics. So that was the kind of the technical underpinning. And then we went on to build a number of products around it, which were analytics for publishers
Starting point is 00:07:07 to help them understand where their audience is, where the content is being discussed, where people are engaging. There was a product for marketing agencies, which kind of worked in reverse, which is, hey, if I have a thing that I'd like to seed, who are the folks that I should be engaging, what are the communities, and all the rest? And through that work, that led us to Google, which acquired the company.
Starting point is 00:07:30 And I ended up working on Google Analytics at the time, integrating a lot of this kind of social analytics know-how that we acquired into the product. And later took a hard pivot into infrastructure, technical infrastructure within Google, where we did a lot of fun things like building radio towers to figure out if we could build a faster and better radio network and then learning that that's a hard problem. But then later that actually became Google Fi, which is an overlay network. And in the process, I picked up the ambiguous problem of, hey, we keep talking about performance and measuring performance like we want to make it better,
Starting point is 00:08:08 but how do you objectively quantify it? It's one of those things where you kind of know it when you see it. It's like that felt slow. But if I just asked you to put a technical metric on it, how do you actually measure that? So we spent a lot of time with browser developers in a W3C. I was the co-chair of the W3C,
Starting point is 00:08:25 what performance working group just wrestling with that problem of, how do you measure fast? Is it the onload time? No, not really. Okay, if it's like, when did pixels paint on the screen? But how do you measure that and which pixels are most important? So this leads you down those interesting cascade
Starting point is 00:08:42 of questions. So that took a while. That was a good five or six years of my life of working on standards and working with browser developers, which was a lot of fun. And later, I decided to join Shopify because commerce was clearly an interesting area
Starting point is 00:08:59 and a deep domain area. And that's been the last four years of my life here where I got to work on building custom storefronts, which I think we covered in our last show. Yep. Like what is the Shopify opinionated toolkit for building custom experiences? So today this is actually Hydrogen that has evolved quite a bit since we last talked it as a remix based stack for with a lot of built-ins for building like beautiful customized experiences powered by Shopify APIs.
Starting point is 00:09:34 From there, that work also led me into API infrastructure. So looking at our GraphQL APIs and trying to understand first of all, do we expose the right capabilities there? But second, also once again, performance capabilities and all the rest. We have buyers all around the world. We wanna deliver great user experience to all the buyers. So like, how do you deploy a global cart?
Starting point is 00:09:57 And how do you deliver the right experience reliably? And then finally, that led me into the guts of like technical infrastructure, like how do we actually stand up app servers experience reliably. And then finally, that led me into the guts of technical infrastructure, like how do we actually stand up app servers in our like Ruby stack? Shopify is a Ruby primarily company, right? So rebooting our application stack and also working on checkout, which brings us back to the earlier part of the
Starting point is 00:10:18 conversation. Yeah, full circle. So we're definitely going to talk checkouts. Since you somewhat moved on from your, at least web performance years, I'm curious to get your take on recent work, specifically Core Web Vitals. Is that something you've been tracking and do you have a hot take?
Starting point is 00:10:35 Do you like that? Do you think it hits the mark, misses the mark? What are your thoughts on that as a metric? Core Web Vitals was one of my key projects when I worked at Google. So it's- Well that was you. At least to me this is definitely, yeah. Well it wasn't purely me, but it was one of the key things that we incubated.
Starting point is 00:10:49 And part of it was, it's actually the same question. Like it, what is the definition of a vital, right? Like a vital is like a vital. Then the incentive behind the vital was like a vital signal, just like you have a heartbeat in a human body. Like what are those things that you measure about a website that tells you that it's a good experience? And the key problem that we wanted to solve was first, come up with some shared agreement across browsers of how we can measure that reliably and not just in a lab environment
Starting point is 00:11:18 because the thing that we keep learning time and time again is that the outside world is just so unpredictable that you have to measure what happens in the real world. You can bake in all kinds of assumptions into your model and then you get consistently surprised when you release your application or website into the public and you're like, wow, I never expected this amount of traffic to come from this particular region, which happens to have this routing topology, and my CDN just doesn't account for it. Or my API is located in North America,
Starting point is 00:11:51 but I have this tidal wave of users coming from, I don't know, Europe or somewhere else. And for them, the experience is just that much worse. So ROM, or real user measurement metrics, are critical. And Web Vitals was our attempt to, first of all, define what those are, like what is that subset? And second, what are the recommended thresholds? Right? Because everyone has a different definition, like is fast 100 milliseconds, is fast one second. We tried to align on that. So I'm really glad to see that the Web Vitals has
Starting point is 00:12:22 continued to evolve. Like the initial set when we first published, I believe was back in 2020 or so, it focused on loading metrics. But we knew even when we were walking into that announcement that we really need to also talk about interactivity. Like it's not just that the pixels rendered fast, right? It's also, hey, is it responsive? Is the page locking up when I'm trying to interact with it? How about scrolling?
Starting point is 00:12:51 Like how smooth is that? So Web Vitals continues to evolve and add those metrics. And I think that's great. And it's really important for us as an industry to have that shared definition of what good looks like. Shows how fast the internet moves. Cause I thought Core Web Vitals was still relatively new. And it turns out it's like, you know, five years ago
Starting point is 00:13:09 and you're the one working on it. So crazy. I think it highlights the complexity of the problem. It takes a long time to propagate kind of those practices and metrics. But yes, it's been a long journey. And it was a great capstone of like all the work that I did at Google and web performance. Shipping with Vitals felt like a good milestone and
Starting point is 00:13:31 that allowed me to kind of give myself permission to shift attention to other things. Soterios Johnson Right. Well, let's do that. Let's shift attention to checkouts, compliance, security, PCI, some of these things that honestly scare away or maybe lull to sleep. Many of us we start talking about compliance matters. PCI version 4 is burgeoning or maybe it's out now. I don't know tell us the skinny. What is PCI? Why does it matter? And then we'll get to what's new in the latest versions. Sounds good. So first, let's unpack the acronyms, because those are always not helpful.
Starting point is 00:14:08 PCI stands for payment card industry. And it provides us, it defines a set of security requirements that you have to comply with as someone who processes sensitive credentials. So an example would be your personal access number or your credit card number plus the CVV and all the other data, right? There's a set of protections put in place around that
Starting point is 00:14:32 for if you're handling that data, then how you should be treated, what kind of security precautions those services must comply by and all the rest. And it is a fairly burdensome set of requirements to comply with, then you have to get periodically audited and show that you're in good standing and all the rest. So as a consumer, this is great because fraud on the intranet is definitely a big thing.
Starting point is 00:14:56 It's still very much an unsolved problem, right? It is entirely entering a credit card number into a random web form is not a secure undertaking, right? But we've managed to build a relatively reliable experience for consumers, thankfully. Now what's different about PCI v4? In PCI v3, a key requirement was that you had to protect the service or the surface area where you're entering your credentials. So technically how we've solved that as an industry is we said, okay, well, if I want
Starting point is 00:15:33 to accept payments on the web, I'm not going to do the obvious thing, which is you've got to put a random form on my page and start accepting credit cards because then I'm accepting those data. and start accepting credit cards because then I'm accepting that data, like my service will get a post request, and I'm going to have the unencrypted payment credentials, and now I'm liable for all of the compliance. Instead, why don't we outsource this problem? And hey, we actually have a great tool in the web platform. It's called iframes. Iframes can provide us to, can give us an ability to embed an external service that can basically do this. And we can skin it in a way that it looks seamless, right? Most pages that you visit on the internet, the payment
Starting point is 00:16:14 form, if you actually open up your dev tools, you'll see that it's iframed for the specific reason. But it doesn't look janky, it looks integrated into the website. And the nice property of the solution is you can then just basically all of the inputs, all of the mouse events are obfuscated from the parent page. So that means you can wholly delegate the responsibility for PCI, V3 at least, to your provider.
Starting point is 00:16:43 A common example of that would be someone like Stripe. If you wanna accept payments on your website, they provide a like Stripe elements, you import a web component, pass out a few props and boom, you have a checkout form. Under the hood, it'll inject a knife frame and do all the things on your behalf. So that's great.
Starting point is 00:17:01 And that's been effectively where we've settled. I think PCI v3 came into existence around 2013 or 2014. So the last, let's say 10 years or so, that's how we've solved that problem as an industry. Now, that is really good, but it's not sufficient. So in the interim, what we've observed is, hey, sure, you've isolated this particular input into a secure sandbox. But what happens if your top level page gets compromised? Let's say you have a supply chain attack or you have an access hole in your checkout page and someone injects a malicious script. What could they do with that? Well, what stops them from
Starting point is 00:17:48 removing your secure input form and replacing it with a fake one? Or maybe providing an overlay and then tricking the user effectively into entering information into an insecure form that then expulsates the data and then swaps in the the original, right? This class of attack is called like skimming attacks, also known as mage card attacks, which is a nod to Magento, not to cast shade on Magento. But I think one of the first like published large scale instances of this attack was against Magento, which had some, some flaw. Magento is an e-commerce platform open source. And for better or for worse,
Starting point is 00:18:23 that like mage card attack name has stuck. Now, to be clear that this is a problem that spans all platforms regardless, as long as you have some sort of vector for attack. So PCIe4 tries to solve for this particular problem. It tries to tighten the perimeter to say, it's no longer sufficient to protect the payment page, you also have to protect or provide some guarantees around the parent page, the thing that is embedding this payment form. Okay. And specifically, there's a set of set of provisions. I think that in the spec itself,
Starting point is 00:19:00 which is very long, like I will zoom in on one particular aspect of this whole conversation, which is like sex section 643. It's one of these random numbers that you just remember once you've been sitting long enough in the PCI game. Yep. 643 defines in high level terms, the three requirements. It says, hey, for the parent page, you have to maintain an inventory of all scripts that are being executed on the page. And also, please document why they're necessary and how they're being used. Right. So just like give me an inventory. One. Two, once you have that inventory, have some mechanisms to ensure that only those scripts, the authorized scripts are being loaded.
Starting point is 00:19:41 And then finally, have some way to guarantee or check the integrity of each load of script, right? Because you could say, hey, this is my inventory. These are the scripts I've authorized. But what if that thing got compromised, as an example? Right? Somebody replaced it with a malicious script because of a supply chain attack or otherwise.
Starting point is 00:19:59 So those three things combined give you strong assurances about what's executing on a top-level page, which is great. Now the practical reality of how you go about implementing that as you can imagine is complicated. Sounds like a lot of work. Yeah, exactly. And it's complicated for two reasons.
Starting point is 00:20:16 First, we should partition this problem into, like if the two of us had a check out page and we were to sit down and try to think through like, okay, we need to meet these compliance requirements. think through like, okay, we need to meet these compliance requirements. I would partition the problem into first party and third party scripts. First of all, right? Like, okay, for the first party content, yes, like we can define a process for, we audit which scripts we include, we audit their dependency, we have some security review, we have a release process, we have CI checks. Sure, I can give you inventory of those scripts, right? Also, because they're
Starting point is 00:20:49 first party, I can put a content security policy and maybe even put a sub resource integrity, which are hashes that effectively fingerprint the specific version of that page. So maybe during my build, I can just enforce the CSP, snapshot the hashes, put those on the thing. And like, great, now we have strong assurances that the scripts that are being executed on the page are mine and tied to a specific version. So far, so good. Now, what about the third parties? One of the challenges with checkout is they're one of the most important pages for the entire ecommerce operation. Like this is where instrumentation is critical, right? You want to know performance telemetry. You want to know which elements user is interacting
Starting point is 00:21:38 with because that affects conversion. You are likely running A-B tests. So you have either first party or set of third party vendors doing that. Of course, you are likely running AB tests. So you have either first party or a set of third party vendors doing that. Of course, you have all the conversion pixels that need to be executed because of all the ad campaigns and all of the analytics that you need to drive that entire up funnel, down funnel loop. And what about all of the other marketing pixels
Starting point is 00:22:02 that you may need? It's not uncommon. If you just take an average checkout page and you open up the network tab, you'll probably find hundreds of scripts on many of them. And oftentimes, it'll be, hey, we load a tag manager that then allows our marketing and other teams to inject whatever they need to drive the whole process.
Starting point is 00:22:29 But now we're staring at that problem and we're saying, so how exactly do I apply an inventory in all these things? Because first of all, like my partner asked me to put a tag manager so they can load things. Well, I need to unwind that decision. Right? Because now I need to know exactly what everything that's being executed. And I need to have kind of a full transitive chain of all the dependencies. I need to be able to account for that. Second, you need to provide, how do I know what the, what CSP policy should I define? Can I just say only load from partner.com or is the partner also loading from some other CDNs? Well, you know, that's I need to go ask the partner for what those assurances are.
Starting point is 00:23:11 And then lastly, if I want to ensure integrity, that's not my content. How do I obtain the hash of the thing? And then if that partner wants to rev the version of their script, how do I get the hash so I can put the thing inside and then I'm not the one injecting the content into the page. So it becomes this like a really complicated rigmarole of like, actually I just cannot do this. Yeah, sounds not possible. Precisely.
Starting point is 00:23:35 Which is one of those things where the standard was written with good intent, right? And they in passing mentioned, hey, you have these tools, you have content security policy, you have sub tools, you have content security policy, you have sub-research integrity. In principle, in theory, you have the right things to do this job. In practice, if you unpack your average checkout page on the web, it's like, I don't know how I would achieve this. I could guarantee maybe a slice of it for the first party, but how do I solve this for third parties? Right. So turns out it's complicated, right? I'm here with Scott Deaton, CEO of Augment Code. Augment is the first AI coding assistant that is built for professional software engineers and large code bases.
Starting point is 00:24:45 That means context, aware, not novice, but senior level engineering abilities. Scott Flexfermy, who are you working with? Who's getting real value from using Augment code? So we've had the opportunity to go into hundreds of customers over the course of the past year and show them how much more AI could do for them. Companies like Lemonade, companies like Kodem, companies like Lineage and Webflow. All of these companies have complex code bases. If I take Kodem, for example, they help their customers modernize their e-commerce infrastructure. They're showing up and having to digest code they've never seen before. In order to go through and make these essential changes to it. We cut their migration time in half because they're able to much more rapidly ramp, find the areas
Starting point is 00:25:30 of the code base, the customer code base that they need to perfect and update in order to take advantage of their new features and that work gets done dramatically more quickly and predictably as a result. Okay that sounds like not novice right? Sounds like senior level engineering abilities. Sounds like serious coding ability required from this type of AI to be that effective. 100%. You know, these large code bases, when you've got tens of millions of lines in a code base, you're not going to pass that along as context to a model, right? That is, would be so horrifically inefficient. Being able to mine the correct subsets of that code base in order to deliver AI insight to help tackle
Starting point is 00:26:10 the problems at hand. How much better can we make software? How much wealth can we release and productivity can we improve if we can deliver on the promise of all these feature gaps and tech debt? AIs love to add code into existing software. Our dream is an AI that wants to delete code, make the software more reliable rather than bigger. I think we can improve software quality, liberate ourselves from tech debt and security gaps and software being hacked
Starting point is 00:26:37 and software being fragile and brittle. But there's a huge opportunity to make software dramatically better, but it's gonna take an AI that understands your software, not one that's a novice. Well, friends, augment taps into your team's collective knowledge, your code base, your documentation, dependencies, the full context. You don't have to prompt it with context. It just knows ask it the unknown unknowns and be surprised. It is the most context aware developer AI that you can even tap into today.
Starting point is 00:27:08 So you won't just write code faster. You'll build smarter. It is truly an ask me anything for your code. It's your deep thinking buddy. It is your stay in flow antidote. And the first step is to go to augment code.com. That's A U G M E N T C O D E-T-C-O-D-E.com. Create your account today, start your free 30 day trial,
Starting point is 00:27:29 no credit card required. Once again, augmentcode.com. So how did we approach this at Shopify? I think there's, let me take first a branch into Shopify and then we can talk about kind of the broader landscape. We've been on a mission to provide stronger control and behavior over checkout, not just because of compliance, but because we want upgrade safety, reliability, performance and security in checkout.
Starting point is 00:27:59 And our observation is, first of all, for those not familiar, Shopify provides a hosted checkout experience, where you don't get access to the underlying HTML. We provide the base UI, and we allow you to configure it. And it's a very flexible system. You can customize the branding. You can introduce custom components. You can install apps that introduce components. You can do a lot of
Starting point is 00:28:25 customizations to make it feel like your own. But a key principle that we've been operating on is that we want a set of predefined results. We're going to define the UI elements because we want to preserve consistency and experience, and we want to optimize for performance, security, and all the rest. What that allows us to do is to say, actually, we're not going to allow any third-party scripts in our top-level page. And that is a very consequential and big decision. This has been work that we've been on an arc
Starting point is 00:29:01 for about three years, if not more, to achieve, and we're finally there. And now we're reaping the benefits of that. So then the question is, wait a second. So you excluded all third-party scripts, but what about all those shiny things that you just mentioned earlier, right? The analytics, the customizations,
Starting point is 00:29:16 the everything else. And this is where sandboxing comes in. So our decision was to say only effectively, the moment you introduce a third-party script into top top level page, you have untrusted content and you've compromised all integrity of the top level page. Like we cannot provide any assurances on integrity of the top level page. Right? Because in the past, when we did allow folks to our merchants to bring their own JavaScript into top level page. They're just doing and end up doing things that break compatibility. Like they'll hook in a specific selector, right? To inject an element knowing full well that we've never defined a contract for it.
Starting point is 00:29:58 And then if we change that, we will break them. And then security is compromised as well, because they're introducing their own scripts, and we can provide any assurance. So we took away that capability and said, instead, we're going to give you a sandbox. So we're going to spin up a set of web workers and give you a bridge. So for example, we've we built a library and open source called remote dumb, which allows you to construct an element tree in an isolated worker that operates off the main thread. And then that UI is reflected back for you in the parent page.
Starting point is 00:30:33 So it feels like ergonomically, DX-wise, it feels still very straightforward because you're just manipulating elements. And we provide a predefined set of UI elements that fit into the Checkout UI and work with all the branding primitives. But we do that work on your behalf. And the critical part is because we control the bridge between the web worker and the top level page, we have tight control over what kind of mutations can be pushed between the parent and the isolated worker. So you can't just arbitrarily inject JavaScript or perform unsafe operations on the parent page. So in that way, we can take any third party script, put it into a sandbox, and say, you know what?
Starting point is 00:31:16 You can do whatever the heck you want in that environment. Because you can load a transitive chain of other dependencies. We don't particularly care because all we know that the operations that you can pass back to the parent page are safe and approved set that we will allow. And we also control what data is exposed to you.
Starting point is 00:31:37 So for example, if you have an extension that wants access to some sensitive buyer data. First, that application and then the worker itself needs to have the right consent. So a worker that has not been granted the right consent by the merchant or the buyer will just not have access to that data. So that is our solution for extensibility and allows us to partition the problem of first party and third party content. It's based on remote DOM. And then we use the same technology for our pixels or analytics as well,
Starting point is 00:32:13 where we define an event bus, we emit all the events, analytics providers are executed in the sandbox as well. Is that a compromise in terms of functionality? Do you get 100% of what you could do before in terms of what you all are providing? Or is it like are you constraining people through and losing features along the way? Yeah, you asked exactly the right question.
Starting point is 00:32:35 So the answer is we've had to rebuild a lot of stuff because a web worker, if you're familiar, is not the same thing as working in the top level page, right, like it doesn't give you access to the DOM, it doesn, is not the same thing as working on the top level page. It doesn't give you access to the DOM. It doesn't expose all the same events. So the reason it took us as long as it did to layer all of this infrastructure
Starting point is 00:32:53 is because we had to work with partners and replicate. So what do you actually need? Instead of raw access to the DOM tree, what are you looking for? For example, if you were building a heat map solution, as an example, some of our merchants are really keen on having very visual, clear understanding of how users are behaving on their checkout page,
Starting point is 00:33:12 you need a lot of different access to a lot of different events and elements. OK, well, let's work through that and figure out what is the right subset that we can expose via this bridge. So over time, we've built up a collection of these APIs and primitives, some of which effectively replicate what is available on the parent page. One of the challenges here, by the way,
Starting point is 00:33:33 is if you ever worked with web workers, is they use asynchronous communication. So you have to post message between a web worker and the top-level page, whereas a lot of the DOM APIs are synchronous APIs. So if you're just naively writing code expecting to be executed on top- level page, whereas a lot of the DOM APIs are synchronous APIs. So if you're just naively writing code, expecting to be executed on top level page, you would use synchronous APIs.
Starting point is 00:33:51 So we had to shim some of that and like in places, we try to keep it as close as we can to what you would expect as a developer because we don't want to impose additional friction. But in certain places, we had to provide the replacement APIs where we said, look, you're building for Shopify. It will operate across scale of millions of merchants
Starting point is 00:34:10 if you're building an application. It is worth for you to do this extra step because then you have all of these insurances in place. So a lot of handholding with partners and getting the developers to adopt all of those APIs. But the benefit of all of that work today is, I'm not gonna say we're done because there's still more things to build,
Starting point is 00:34:29 but we're in a really good place because now all of our merchants are running on the sandboxed primitive that I've described. And what we can provide is, first of all, upgrade safety. We can safely roll forward our capabilities in checkout, knowing that customizations that you've deployed will not break as you move forward, right?
Starting point is 00:34:52 Because we control the bridge, we control the API interface. So if we change the underlying API on our side, we can still provide guarantees about that. We have reliability. We know that. So for example, we saw examples where merchants would inject scripts, where a partner would just timeout. So they would have some logic. And for some reason, their service goes down and then the checkout is
Starting point is 00:35:13 broken. Because, well, it's just waiting to render right like Shopify, you've broken the checkout. It's like, actually, it's your part, it's your script that you injected of a partner that failed to scale to your flash sale. So now we have assurances about that. And then finally, performance and security. Another benefit of putting work into the sandbox is it moves all the work off the main thread. So you can't have code that monopolizes the main thread and renders the UI unresponsive,
Starting point is 00:35:44 which gets back to our Web Vitals conversation, right? Like we can make better performance guarantees about how the page is loaded, how responsive it is, and the rest. And finally, there's a security bit, which is we know that you can't inject arbitrary content on top-level page and exfiltrate data. And then finally, you have PCI compliance, because now we have a clean partition where we say as Shopify as a platform, we will provide all of the inventory, authorization, and integrity checks for the first-party scripts that are executed on top-level page. And oh, by the way, it can totally bring third-party content, but we will execute it in this isolated context that allows us to punt that problem and not have that allows us to punt that problem and not have to worry about all of the integrity problems that happen
Starting point is 00:36:29 when you just include it in top level page. Hmm. So did I hear you right that you said all your merchants are already using this? You're able to deploy that without, or did you not say all? Yeah. Yes.
Starting point is 00:36:41 Yeah, yes, all. Yeah, so this has been a long journey to move all of our merchants onto this new platform, Yeah. Yes. Yeah. Yes. Oh, yeah. So this has been a long journey to move all of our merchants onto this new platform. But as of earlier this year, like 99.9% of our merchants are on this platform. There may be like one or two exceptions, but effectively any Shopify power checkout that you visit today as a consumer is running on this infrastructure. And that was something that they had to opt into or that you just did on it.
Starting point is 00:37:07 Like how'd you all roll that out? You said it took a long time, but what was, what it looked like? Well, it took a long time because of the right question that you asked, which is, Hey, did you, what, what did you have to take away? Right. And the answer is, we had to rebuild a lot of the capabilities because we've created this isolated environment. We've had to recreate a lot of the capabilities because we've created this isolated environment. We had to recreate a lot of APIs.
Starting point is 00:37:27 So a lot of our work was working with other developers, partners who provide capabilities that merchants want in checkout to make sure that they can bring the same capabilities into this new world of sandbox execution. That was the long haul. And then for some merchants that had ability to manipulate content in the top level page, it was a combination of documentation, handholding, consulting, and just getting them to move to the new world so they can benefit from all of these capabilities. But we're there and the time is right because now you have PCIev4 compliance
Starting point is 00:38:07 effectively taking care for you. And do you think that PCIev4 compliance means you cannot be skimmed in the way that you could prior? Or do you think it could still happen in new and exciting ways? Right, right. So I think this actually is another layer that we should add here. What I've described is runtime compliance or runtime guarantees.
Starting point is 00:38:35 Right. So the thing that we've built actually allows us to provide assurance or like extend some guarantees over, we just know that it's not, it's not possible to inject third party content. So if you have a supply chain attack and on that, that it's not possible to inject third party content. So if you have a supply chain attack on that, like it's isolated into a thing that doesn't matter. Right.
Starting point is 00:38:52 In practice, I think what a lot of other players and e-commerce providers will end up using or how they will provide compliance is retroactive monitoring. So PCI does not enforce a requirement that you have to have runtime guarantees. What it says is, hey, you should have a process that provides an inventory, make sure that scripts are authorized and you have the integrity. It doesn't specify that it needs to be guaranteed.
Starting point is 00:39:22 So practically, how could you implement this? And how do most, like if you go and search for PCI compliance security products, you will find plenty that will basically say, hey, I know a great solution for your PCI problem. You know what it is? Deploy my JavaScript into your page because more JavaScript is always a solution.
Starting point is 00:39:42 And I will instrument the page and listen for all the things that are happening. I will observe all the other scripts. I will build an inventory. I will monitor if it changes. I will try to provide hashes and effectively, I'll like, you can delegate this problem to me. Now you can see a flaw in that reasoning, right? It's like, how do you know that your script is not gonna get compromised either? Your watch is the watcher. Well, there's that. And how do you know that the malicious thing
Starting point is 00:40:11 doesn't come up with a clever way to obfuscate itself from you, right? It's the antivirus problem. Right, like cat and mouse. Virus hiding, exactly, virus hiding from the antivirus problem. But that is likely a solution that many will adopt as a retroactive solution. So effectively, you observe if anything has changed. It's like,
Starting point is 00:40:33 oh, well, that's odd. I'm seeing a set of reports for a script that I did not expect relative to my inventory as I defined. Does that indicate that I have a problem on my side? Probably, right? So there's some guardrails that PCI sets for like how long that period can be and how you need to react to it, but it is strictly lesser and less secure experience. Which gets back to your question. Like if you have these assurances, does it mean that the class of attacks is eliminated? The answer is it depends on how you implement it. Right.
Starting point is 00:41:08 So in our case as Shopify, I would feel pretty strongly about extending a promise of like, yeah, unless our content, first party content is compromised, it would be very hard to compromise this page. Now what we can control at Shopify is the buyer has installed a browser extension that injects arbitrary scripts into the page. Like that is outside of our control because that operates at a higher layer.
Starting point is 00:41:34 Or maybe you even have malware on your computer that does things and inject content into the page or otherwise intercepts, like when you're typing. Like those things are still possible. It is not a, we've completely eliminated this type of attack, but it certainly makes it a heck of a lot harder because now it means that at least there, at a minimum, there's a way the merchants are required to detect
Starting point is 00:41:59 these changes or these attacks and remediate so they can't just go unnoticed. So this all sounds like a lot of really good work you all have done at Shopify for Shopify and Shopify's customers. Thinking bigger, it would be great if your hard work and years of rethinking this runtime and sandboxing and actually providing the security that PCIV4 wants everybody to have, whether or not they do or not to be compliant. Can't some of that get into the browser? Like, couldn't we just build it? Like, could your work extend beyond Shopify's borders and help other people too? Could your work extend beyond Shopify's borders and help other people too?
Starting point is 00:42:50 This is not just about Shopify, it's about improving the buyer experience on the web holistically. Two things to answer that. First of all, the remote DOM library that I mentioned, it's an open source project that we've built and open source. So if you go to github.com slash Shopify slash remote Dom, you'll find that there take a look at it, use it. This is, that's the technology that powers Shopify checkout. Other large companies have already adopted it. I believe Stripe is using it for their apps. Actually fun story. When we built the project, I think Stripe beat us to using it in a production product. Really? Even though we were the ones developing it for a checkout. But like it is, it is used at Shopify and by other big players to provide this form
Starting point is 00:43:33 of isolation. And the general pattern is, Hey, I have a trusted first party surface into which I want to bring in third party content. And I, I don't want to compromise integrity of my first party top level service. Well remote DOM is one of the technical solutions for that. So please take a look at that. That's answer number one. Second though, and coming back to the browser conversation, absolutely. The primitives that we have in browsers today, content security policy and SRI, we can make better. And we've actually done a bunch of work on exactly that at Shopify.
Starting point is 00:44:08 We don't want to do work in JavaScript that we could push into the browser because the browser is just much more efficient and it has capabilities that we otherwise would be very hard for us to replicate. So first let's like enumerate some trivial examples of gaps, script integrity. So sub resource integrity,
Starting point is 00:44:26 for those not familiar on your script tag, you can pass in effectively a hash. So when you inject the tag into your HTML, you can pass in a hash that is a fingerprint. And when the browser loads the script before it executes it, it can compare the hash of the thing that it fetched versus what you've defined and say, hey, those two things match, great, I will execute the script. Otherwise I'm going to raise a violation and not report this. That's a big capability in that existing browsers today. It's not simple to deploy, but it is doable, right? Because you need to figure out how do I get these hashes and how do I inject them at the
Starting point is 00:45:03 right place? But then one of the gaps that existed for a long time was module imports. So SRI worked for top level scripts, but if you're building a JavaScript application and you're using an import, you just could not pass in an integrity hash. Why?
Starting point is 00:45:19 Well, because module imports came after sub-resource integrity, both designs. It was just never a thing. That was a pain point for us because we used module imports at Shopify, so we worked with Chrome and Safari to upstream some patches to get that supported for module imports. So the good news is that's now baked in,
Starting point is 00:45:39 I believe as of May of 2024, I think when Safari shipped it in their release, both Chrome and Safari support SRI for module imports. So that's one. Another thing that came up in our thinking when we were exploring CSP compliance and how do we make our own life simpler is this idea of require SRI for.
Starting point is 00:46:04 So what if you could express content security policy that says, hey, all scripts must have an SRI or integrity hash? Gotcha. All right. And why is that useful? Well, then you can make a strong claim that if you have that policy being enforced by the browser, then if for some reason
Starting point is 00:46:25 you sneak through by accident or malicious act, a script that doesn't have it, they would just be rejected, right? Which today would just execute normally without any questions. And even though that might be hard to deploy in an enforcement mode, it could totally work and be really useful in report only mode. So for those not and be really useful in report-only mode. So for those not familiar with content security policy, you have an enforcement mode and a report-only mode where you can get violations, which is incredibly useful because you could say, hey, this is a policy I would like to enforce. What are the violations? So with the Require SRI 4, you could deploy this in report-only mode and say,
Starting point is 00:47:04 So with the require SRI for you could deploy this in report only mode and say, great, now I'm going to get reliable reports from the browser, from the while, for any time a browser detects that a script is missing an SRI capability. This is great because sophisticated attackers would not emit these scripts on every single page load. They might target specific users or a class of user, or maybe they target specific browser, or maybe if it's an extension, it'll apply some sort of other heuristic, right? It's very hard to... This kind of mirrors our conversation on why ROM is important, real user measurement. Gathering violation reports from real users gives you a much better and reliable signal
Starting point is 00:47:46 for where the problems are. So Require SRI 4 is another capability that we've shipped into Chrome and that allows you to get violations on missing SRI attributes, which allows you to build an inventory of like, this is the list for me to burn down and figure out why, right? And if anything changed, how do I, how should I react to it? Another example is, okay, great. Now we have these reports coming in. Wouldn't it be nice if we could also get the hash of the content, right? Today you would just get a report saying, hey, I detected script from example.com slash xyz.js. But what was the content of that? You don't know.
Starting point is 00:48:33 Right. Wouldn't it be nice if you could also get a hash in the report such that you could audit it and say, oh, well, maybe that's totally OK because the partner revved their version and it just happens to be the v2. I just put that into my approved list and everything's fine versus I have no idea if that was a compromised version or a legitimate version of the script. Interesting. So pardon my ignorance for a moment, but where does the reporting take
Starting point is 00:49:02 place or post to the browsers browser's doing the reporting. Is it? Who gets the report and how? Is it the browser sends it off somewhere or? Yep, so on the wire, you would, when you emit a page, you can define a content security policy, CSP policy in a header. And you would define for script source,
Starting point is 00:49:24 list, for example, a list of origins from which you're allowed to fetch, for images and all the rest. You also have a report to target and a separate report to header that provides a specification for you specify the endpoint to where you want the violation report to be reported. And as good hygiene that reporting endpoint
Starting point is 00:49:45 should ideally be like a distinct origin and all the rest. But you provide a destination. So you can find services that will do this for you. They'll just say, point your report to us, and we will provide a dashboard which you can drill down reports. We will aggregate.
Starting point is 00:50:02 We'll give you metrics and all the rest. That's something that we do in-house at Shopify and I think many other large providers will do on their own, but you could outsource that problem. But just having the ability to even get the report with, hey, a report has been emitted because the script is missing an integrity hash is by itself a really useful capability because otherwise you'd probably have to set up some sort of crawling infrastructure that periodically checks your page and says, you know, I access this page from five different points on the globe every 24 hours and I observed that nothing has changed. Well, that's good, but we could
Starting point is 00:50:43 do much better by just actually observing what the real users are seeing and getting the actual reports of violations. Gotcha. So this new one require SRI for would work in like manner as the CSP violations in terms of reporting. So you would, the CSP policy is require SRI for scripts. Right, so you're saying all of my script resources must have a hash. And then you can configure that to be a report only policy such that it would still execute
Starting point is 00:51:15 if the script is missing the hash, but you would get the violation fired in the background. And the browser has its own logic for prioritizing batching delivery and doing all of that to get you the report. Now, do you deploy this one in Shopify? Yep. And do you use it in report mode or do you let lockdown mode or how do you use it? So for this one, it would be a report mode,
Starting point is 00:51:41 but it depends on the shape of your checkout, on how much control you have for your first party or third party content. Just to double back on that, for Shopify, for our checkout, we enforce a CSP policy. Actually, let me run through the whole list. For our first party content, we have a well-defined process for vetting all the dependencies and a process for updates, auditing to make sure that we provide some guarantees over if the library that we depend on has been compromised, how can we detect that? We have change management process for it. So this is the reviews, testing, CI, all the things that you would expect.
Starting point is 00:52:23 That allows us to create the inventory. We know from where it's served, which means that we can define a strict CSP policy that says you should only fetch from these sub-origins that we trust. In our build step, we can inject the hashes, the SRI hashes for known content. And we can also emit the require SRI for policy to ensure that if anything else, for some reason, if we omit a missed some script, that we will get a violation on that because we don't want to break checkout, but we want to be notified immediately if those things are detected, then we can react to it.
Starting point is 00:53:00 And we have our own reporting endpoint, which we aggregate. We look at the reports. This is a thing that merchants don't have to worry about because we do this work on their behalf. And we can provide this guarantee over overall integrity. And then finally, we've protected the parent page, but the payment credentials page or the payment form itself is also isolated into its own iframe, just as it was before. So this is a defense in depth, right? We protected the parent,
Starting point is 00:53:28 but we also have our own implementation of the iframe and like the full PCI compliance behind that particular form. Well, that's a lot of stuff for PCI compliance, Ilya. What happens with V5? How many years are you gonna put into that one? I don't know. That's a good question.
Starting point is 00:53:46 I'm pretty sure that V4 will keep us busy for a long while. Yeah, because this is only section 6.4.3, right? That's all we're talking about right here. That's right. There's this all the others. Okay, so interesting stuff. It sounds like you've solved some really difficult technical challenges in order to do this in a way that's not just compliant, but actually in the spirit of the compliance as well, like
Starting point is 00:54:15 trying to actually make it more secure. What are some takeaways for listeners out there? Maybe they're doing their own checkout. Maybe they have compliance they need to do. Maybe they just want some more secure websites. Like what do you think they could be thinking walking away from this? If they're not in the actual situation that Shopify is in and having to implement this stuff, what could they learn from this conversation?
Starting point is 00:54:41 Yeah, I think the meta pattern and message takeaway here is broadly the integrity and security of first party versus third party content. We mix first party and third party in most contexts. But even outside of checkout, there are many surfaces. Let's say you have an admin surface or a privileged surface that you only want certain users to access, and you want some extensibility in there. You want to bring in third-party content or customization in all the rest. The pattern that we're describing with isolating third-party content
Starting point is 00:55:21 is a generic pattern that you can deploy there. We use the same sandboxing technology in. So we use the same sandbox and technology in checkout. We use the same technology in our admin. So for merchants, we allow customizations and third-party developers to bring in their custom UI and other aspects. As you can imagine, that's a very sensitive surface. Order data is there, customer data is there. You don't want to just open up a Pandora's box of injector arbitrary JavaScript because who knows where that data might travel. So the isolation primitive, it may be remote DOM, it may be something else, but this way of thinking of isolating into either
Starting point is 00:56:02 an iframe or a worker, I think is a pattern that we should be adopting more widely. And it has these additional benefits. You have better assurances about security, yes, performance as well, because you're isolating content and moving it off the main thread. You get to define the API contract, so you have better upgradability if you need to maintain that. And I think that's just something that we need to get better at on the web.
Starting point is 00:56:28 Now, the challenge I think for all of us and kind of as industry practitioners is to think through boy, the worker is kind of this like naked environment. We can probably figure out, we should think about how do we figure out some better set of APIs where we don't have to reinvent the entire wheel just as we did with, you know, at Shopify for great. Now I want to build a heat map thing. What does that mean?
Starting point is 00:56:53 How do I mirror the entire stream of events from top level page into this isolated environment? I think we can do some thinking and innovation there. Very cool. Anything else that's on your mind that we haven't discussed in this context or honestly in any developer context, I always love to hear your opinions on stuff. Anything else on your mind? I think one really interesting topic coming, coming back to the world of checkout and commerce is of course agents and how agents will interact or how they might affect any of these behaviors.
Starting point is 00:57:24 Yes. MCP, are you done with MCP? That's the newest acronym, Model Context Protocol. It's burgeoning. Yep, yep, MCP is definitely top of mind and we're looking at it intently. We're using it for a number of tools and internal services at Shopify.
Starting point is 00:57:42 We're also considering if and how we should be exposing MCP as a protocol and endpoint as a service on behalf of merchants. So imagine you could have a merchant storefront as a remote MCP endpoint. But more broadly, like if you think of, let's imagine you interacting with an agent asking it's, hey, I'd like to have a pair of white sneakers size 10, $50 to $100 range. Please go find me a pair and check out. Under the hood, the agent might crawl the web, find the storefront, add to cart, head to checkout.
Starting point is 00:58:16 And what does it do then as it's looking at a payment form? Is it a responsibility of the agent to hold onto your payment credentials? And what are the implications of that? For entering? How does it enter those credentials? Are there any security and compliance problems or challenges in that? I think that's a wide open question that we as an industry are yet to figure out an answer. Is the human required in that loop? What if it's an accelerated checkout where maybe information is vaulted? I think it's an accelerated checkout where maybe information is vaulted?
Starting point is 00:58:46 I think there's a range of questions and answers that we need to figure out in this space. What's your personal thought on is the human required in the loop? How do you feel, confidence-wise, on removing the human from that loop? I think it's context-dependent. I think there's definitely a class of commerce
Starting point is 00:59:05 in certain types of transactions where I know exactly what I want. There's very low risk and it's kind of a predefined flow where I just say, look, I need another carton of milk. You know exactly what I'm looking for. You know where to shop and please go finish it. And I just want it at my front door. And then there's other types of experiences where maybe this is your first time engaging
Starting point is 00:59:29 with a merchant. Maybe merchant has a set of rules where they actually require you or require the agent to decelerate because, hey, for compliance reasons, I may need to verify your age or I need you to read this disclaimer on this product before you purchase it. You can't just have the agent blindly ignore that context or click approve and then proceed with the transaction. So I think we'll need to define some protocol or shared mechanism to signal to agents that like, Hey, in this particular case, I need you to pause and ask for human to either confirm or take over control and complete the transaction.
Starting point is 01:00:14 There's so many questions there. I just don't feel like I even have the brain right now to analyze all the things that have to be considered. I'm glad that you're, are you going to be working on this for Shopify? Are you going to stay all the things that have to be considered. Are you going to be working on this for Shopify? Are you going to stay in the NPC island? What's next for you inside of you? Is this an active thing that you're thinking about for Shopify?
Starting point is 01:00:33 It is definitely an active area of exploration for us. That is one of the things I'm looking with our team and many of our partners who are building these agents, who are trying to figure out what is the future of checkout where agents drive some meaningful portion of that experience. What does a good experience even look like in that context? So I think those are all very interesting and pertinent question given where we are today.
Starting point is 01:00:59 Hmm. Well, I'll have to have you come back in a year or two and let us know what you end up building as you've figured it all out. You seem to have figured out at least this hairy technical problem that comes with this new PCI stuff. So I'm sure you'll figure out something. Yeah, we'd love to be back. And at the rate that we're moving in the AI world in a year or two for now, who knows
Starting point is 01:01:20 what will be there? So yes, I'm trying to think of the most recent person who said. Six to nine months and LLM will be writing a hundred percent of code. So, I mean, who knows, man, they will be, you and I will be out on the street corner talking about this stuff. I doubt that is the case, but well? Yeah, me too. But, you know, it's not a week goes by that somebody doesn't declare software engineering is dead or dying. So, how to squeeze that one in there.
Starting point is 01:01:53 Yes, I think what we're actually saying is the definition of what software engineering is is changing. Right? I am constantly amazed by what AI is capable of doing in terms of software development. But I'm also constantly surprised by the silly and stupid mistakes that it makes. And oftentimes those mistakes are actually due to misunderstanding or lack of poor definition of the problem that's being solved. It's kind of putting the mirror back to yourself, right?
Starting point is 01:02:21 Because oftentimes I'll find that like, actually, you know what? You did exactly the right thing the way I expressed it. But that's not what I meant. And I didn't even know what I meant when I typed it. Because now that I've seen the mistake, I understand what I was actually trying to get to. So it is this like art of defining the problem. And rubber duck programming. And I think we're heading more and more towards the world where we're actively collaborating instead of hands
Starting point is 01:02:43 on keyboard, typing if statements. Yeah, the best rubber duck programmers might be the best programmers of the future. The ones who can just talk it out the best, you know, figure it out as you go. All right, Ilya, appreciate you coming on the show and chatting with us and looking forward to having you back soon. Thank you, Jared. Okay, so it turns out securing e-commerce checkouts has never been more complicated. But thankfully, brilliant engineers like Ilya and his team at Shopify are putting in the work and some of that work is making its way back into the web platform.
Starting point is 01:03:19 I love when that happens. And when you think about it, the complicated nature of it all makes sense. The stakes have never been higher. I read the other day that last year, e-commerce sales soared to a record one point two trillion dollars. That's a lot of moolah being transferred. And if you can hack it, you can jack it. So, yeah, it's complicated for a good reason.
Starting point is 01:03:42 Let's give one more thanks to our sponsors of this episode. Retool, Augment Code, and of course, Fly.io. Check out their wares to support their work, which supports our work, which we appreciate. Thanks also to our Beatmaster in residence, Breakmaster Cylinder. Did you know our next full length album is almost ready? I'll tell you right now, it's called Afterparty. And I'll also tell you right now that I've been bumping it all week. I dig it.
Starting point is 01:04:10 Hopefully you will too. Soon. So soon. Alright, that's all from me, but we'll talk to you again on Changelog & Friends on Friday. Bye y'all. So I'm out.