Screaming in the Cloud - How Scaling Turns Rare Occurrences Into Common Ones with Jason Cohen

Starting point is 00:00:00 That's another thing that's true of engineering. You can do anything if you really want, and you can write stuff in any language if you really want. Doesn't mean you should, doesn't mean it's a good fit, but okay. Welcome to Screaming in the Cloud, I'm Corey Quinn. Periodically, I will have people from a variety of different companies

Starting point is 00:00:21 doing different things for different reasons come on the podcast. Every once in a while, I like to track down, okay, who's a vendor that I've used a lot of and often don't necessarily think about? And when I started framing it that way, today's guest became relatively obvious. Jason Cohen is the founder of WP Engine. Jason, thanks for joining me. It's great to be here. I have a painful history with running websites at even small scale, then medium scale, then large scale. And WordPress has been sort of a thing that has taken over the world. It felt like the late 90s.

Starting point is 00:01:01 Now there's still a disgusting percentage of the world that runs on top of WordPress. I've run it myself. It was a terrific demo app for teaching people how to use Puppet. It touches a whole bunch of different things. And when it came time to decide I should have a website that probably is useful to work with, the first iteration that I went with personally was building my own custom thing on serverless. This was a bad idea. When it became an actual real business, I went with WordPress and figured, huh, who can I wind up finding to run this for me that isn't me? And the answer was WP Engine pretty quickly. So I've been your customer for something like seven years now. Thanks for not going down during that time. It's appreciated. Oh, sure. Thanks for giving us money.

Starting point is 00:01:45 That's what we like. Yeah, I mean, WordPress currently powers 43 percent of every domain on earth, which, as you say, is a staggering and unbelievable number, but there's many different data sources who all point to that. And yeah, it's because it's open, it's because there's a community, it's because I think it is true that once you have some success and momentum there, it builds on it because people know it and then they build another site, just like you said. And so there's also a big set of design agencies that use WordPress. So they are almost essentially like a sales force for WordPress.

Starting point is 00:02:21 WordPress is free. So that's in quotes. But, you know, hey, if you're going to build your agency or freelancing business off of it, clearly you're going to advocate for that. So I think you have this set of things like that, which made it so successful. Even 14 years ago, when I started WP Engine, WordPress was already 11 or 12 percent of the web, which is already kind of infinity for a new company to be able to sell to a tenth of the web. That's huge. And so, yeah, it's only grown from there, which is hard to believe. People love to talk smack about it in engineering circles. It's PHP. Who wants to work in PHP?

Starting point is 00:02:54 Great. Cool. I don't want to think about PHP. I don't have to. I care about the content. What the technology stack that powers my website is, is never going to be a determining factor in did my company succeed or not, from what I see. Yeah, but here's the thing. Engineers hate whatever is popular. Like, whenever the language becomes too popular, then everyone hates it. So, like, at first Java was cool because it was new and weird, even though it's slow and actually kind of bad. Then Java became the most popular language and everyone hates Java. Okay. I mean, you just hate whatever it is. Whatever bug tracking system you use, you hate it. Like, almost guarantee you hate it. Okay. I just feel like this is the

Starting point is 00:03:32 standard thing that we do. Also, there's a general thing in engineering where it's not necessarily the highest quality, best thing that wins. There's other factors, like it being easy to try, easy to troubleshoot, easy to understand, easy to dig in under the covers, and so forth. So you have things like open source projects that have all those attributes. And are they as good as some commercial things? In many ways, no. But it has those attributes, so it wins anyways. WordPress has all of these things because it's open source and it's easy and it's accessible for lots of people to use it, and so on and so forth. There's an old article called Worse Is Better on this. And so it shows stuff like, some of these text formats for moving stuff around, it's inefficient. Like, we

Starting point is 00:04:16 should use other formats. Like, I know, but the thing with text is you can write it, you can read it, you can look at it in your packet dumping stuff, you can mess with it easily, you can use grep, you can dump it to a log, and all that other stuff is harder. And so, right, it's worse, but it's better because it's more accessible, it's more observable, and so forth. So I feel like there's a lot of things, and then when those things become popular for good reason, because those other attributes are good, engineers like to say, I hate it because, and then they list those other types of attributes. And they're not wrong that those other attributes are missing or not optimized for. But I just feel like this is very common in many things in engineering, so it doesn't bother me. In fact, maybe it means you're winning.

Starting point is 00:04:57 For me, the big reason to go WordPress is not because I have some deep-seated love affair with it. If anything, just the opposite. Before they wound up dying slash being absorbed by MediaTemple, I worked on large-scale hosted WordPress at MediaTemple for about a year, year and a half. And that was enough to teach me I didn't want to run WordPress myself if I could possibly avoid it. Because running it on a laptop or in a container or who knows, probably someone who's working on Kubernetes these days, is probably not that challenging. But running anything at scale introduces an entire series of separate problems. Right. Yeah, we run it on Kubernetes.

Starting point is 00:05:33 You can run it everywhere if you want. Of course, that's another thing that's true of engineering. You can do anything if you really want. And you can write stuff in any language if you really want. Doesn't mean you should. Doesn't mean it's a good fit, but okay. But yes, doing anything at scale obviously is hard. Yeah, when I got started, I built a bunch of serverless stuff to run the website and power the blog and the newsletter. Then I realized that, you know, at least in the website piece of it, other people could be much

Starting point is 00:05:58 better at doing a lot of these things than I could, and I didn't want all the engineering to be bottlenecked on, at the time, the four people on the planet who understood these technologies that came out last week. But with WordPress, you swing a dead cat, you'll hit 15 people who know how to work on this, basically, no matter what room you happen to be in. Yeah, that's part of why the spoils continue to go to the winner, those kind of things like what you just said. Well, whatever it is, we definitely can hire people, full time or contractors or part time or flex. We can definitely do it. Also, will the cloud support the tech that's behind it? Yeah, of course.

Starting point is 00:06:30 It's 43% of the web. What, are they going to not support it? That's crazy. So it's those kinds of things where you go, okay, well, let's just do that. So exactly as you say, is WordPress, or your marketing website in general, is it incredibly core to your business that it be unique? And the answer is almost always, no, that's not what makes us unique. What CMS we use and how the marketers mess about with an article, that is really far away from what makes the product unique for almost every company. Okay. So for almost every company, like you

Starting point is 00:06:59 shouldn't, you should spend the least amount of time on this and you should spend enough money only so that bad things don't happen. Like the site goes down, the site is slow, the site is hacked. Okay, yeah, we need to spend enough money to where that's not happening because that is bad. But beyond that, there's no additional benefit. Therefore, outsourcing it to us or a competitor of ours for that matter, just simply makes sense. It's not where you're going to get a comparative advantage.

Starting point is 00:07:21 So why are you spending your time on it other than the base, the core is needed for it to be functional and do its job. One of the things that I think is lost on a lot of folks is the idea of scale as being its own particular skill set. As you say, you have a, an awful lot of competitors. You are not the only company in the world that provides managed WordPress hosting. And you are also by a landslide, not the least expensive. The trouble with just getting the results of all the various companies that do these things and sorting by price from low to high is that there's a universe of folks out there who, well, I ran my own website for a couple of years on WordPress, didn't seem that hard. Oh,

Starting point is 00:07:58 it has a multi-tenancy option. I'm going to go ahead and spin that up and then I'll start making money by offering that to other folks. That starts to fall apart extremely quickly. I wanted to trust a company that has been there before when there's something that's going on and the website goes non-responsive for some reason. Okay, there are people who know what they're doing looking at this. It'll be a minute or two and it'll come back as opposed to having to wake someone up in the middle of the night because they didn't realize that that's how computers work. It's interesting. Scale is an interesting topic. It's also interesting to be expensive. If you're used to GoDaddy and you pay $2.9 per month for your website, then paying us $29 a month is 10 times as much. It sounds very expensive. Now, okay, you get what you pay for, you get service, you get it's fast, it's scalable, blah, blah. But it is expensive too. So okay,

Starting point is 00:08:48 fine. On the other hand, we have tens of thousands of larger customers. For them, we're the low cost alternative to what they see as the options for their website development, which is things like Adobe Experience Manager or Drupal or Sitecore, these kinds of things, which are millions and millions of dollars to build a website and then millions of dollars to host it and millions of dollars every time you're going to do marketing campaigns. So for them, we are 10 times cheaper and we're the low cost alternative, as opposed to the GoDaddy side of the market, the other end of the market, where we're the expensive one and we'd better be great at that price. And so it's very interesting, since it's the whole internet and we're at a scale, we have

Starting point is 00:09:26 200,000 customers. So we're at a scale where we do see every kind of person. And so it's interesting, like, are we expensive? It depends who you ask and what's going on. And it's interesting that there's that complexity to it. But yeah, the scale is interesting because I think engineers who haven't done it before have this in mind. They say, look, what we do is we write code and we have these tools that help us automate things, in particular

Starting point is 00:09:49 infrastructure. And so with CloudFormation or Ansible or, you know, Docker containers, we have all these tools to say, I want something that looks like this, or I push a button and it creates a set of services that are connected. And if I can do that once, then I can just keep pushing that button and do it 10 times or write code that pushes the button. And now I have a thousand servers, 10,000 servers.

Starting point is 00:10:11 I have to pay more money to allocate the physical resources, but the scale takes no effort is the thought. So why is that wrong? It is wrong, but it's not obvious why it's wrong. Like it's computers. I should just keep pushing the button and it works. It becomes super obvious the second time, but the first time it completely catches people by

Starting point is 00:10:27 surprise. Yeah. But why is it, what are we missing? So here's, here's the answer. Let's say you have a laptop and let's say it's pretty high quality and pretty stable. And so it only crashes once every four years, not bad. Like it locks up and you're like, eh, you have to reboot it. What was that? Who knows? Some really odd thing, something crashed. The operating system has a bug. A cosmic ray hit it. Who knows? Something that rare.

Starting point is 00:10:51 You're not going to diagnose it. You don't even care to diagnose it because you're like, whatever. Okay, I reboot it every couple of years. Who cares? Like, this is pretty good. That would be a high quality laptop. Now, we have 17,000 servers. Okay. So let's say they're all this good that they only crash once every four years, randomly, unpredictably can't prevent it. Can't

Starting point is 00:11:13 say when, because it's some weird thing. What happens when there's 17,000 of them? And by the way, our servers are doing way more stuff than your laptop. So by all rights, they should crash a lot more than that. But let's suppose. Right. Well, 17,000. You know, four years is what, like 1,200 days-ish, 1,300 days? 17,000 servers. So you start doing the math and you're like, we should have totally random, unpredictable, unpreventable crashes like 10 times a day. Oh, wait. Yeah. Like crap is going to be blowing up constantly. And we just said you could never predict or prevent it. Wait, what? Yeah. And so then you might say, okay, well, fine. We'll reboot them. Yeah, I know you will. And then how many customers will get mad about that? Well, yeah, but it's only this tiny, tiny fraction of our customers. Right. But let's

Starting point is 00:12:04 say hundreds and hundreds of customers a day have downtime from your weird, unpredictable thingy. So what do they do? They all call support and you have a thousand support tickets a day just from this one thing. Wait, what? Or how many go to Twitter and say you suck? I don't know. Every day? What?
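A rough sketch of the arithmetic in this exchange, for readers who want to check it. It assumes every server fails independently and, on average, once every four years, which is the hypothetical used in the conversation; the function name is illustrative only. Four years is about 1,461 days, so 17,000 servers works out to roughly 11 or 12 random crashes per day, in line with the "like 10 times a day" estimate.

```python
# Back-of-envelope: expected random crashes per day across a fleet,
# assuming each server crashes independently at the same average rate.
DAYS_PER_YEAR = 365.25


def expected_crashes_per_day(servers: int, years_between_crashes: float) -> float:
    """Expected fleet-wide crashes per day.

    Each server is assumed to crash, on average, once every
    `years_between_crashes` years (a simplification from the conversation).
    """
    per_server_daily_rate = 1.0 / (years_between_crashes * DAYS_PER_YEAR)
    return servers * per_server_daily_rate


if __name__ == "__main__":
    # One laptop, a small cluster, and the 17,000-server fleet mentioned above.
    for fleet_size in (1, 100, 17_000):
        crashes = expected_crashes_per_day(fleet_size, 4)
        print(f"{fleet_size:>6} servers: {crashes:.3f} expected crashes/day")
```

The per-machine reliability never changes in this sketch; only the multiplier does, which is the point being made about rare events becoming routine.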

Starting point is 00:12:24 Very few people take to social media to say good things about companies, but something goes wrong, oh, it's all over the place. The best outage detector I've ever found. But we just agreed. Well, we didn't agree, but I'm pretending like you're agreeing. This is totally normal and expected. We could be the greatest ever, and this is just going to happen. So how do you summarize that? That's the story that shows why it is in fact true. And you go, oh, okay. So I summarize that by saying rare things become common.

Starting point is 00:12:53 Rare being hard to detect, hard to prevent, hard to, and they become common automatically simply because there's a lot of them. So if you roll dice enough, then things happen, right? Kind of like million monkeys sort of thought. We also see tens of billions of web requests per day across our platform. So what kind of quality percentage would you need to not see any errors? It's like, I don't know.

Starting point is 00:13:18 That's a lot of zeros. I don't know. Yeah. It's like something impossible, clearly impossible. So impossible, that doesn't sound very nice. Now, a couple of things to take away. One is, okay, so when we talk about quality, it's just a whole other level, by which I mean orders of magnitude different, really, really different. So is that going to make us have very different development processes and procedures, and what does testing mean, and blah, blah, blah, blah, blah? Yes, it means those are going to have to be quite different. Not because small companies are dumb.
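To put a number on "a lot of zeros", here is a minimal sketch. It assumes 10 billion requests per day as a stand-in for the "tens of billions" mentioned above, and it treats every request as failing independently with the same probability; both are simplifications, and the function names are illustrative.

```python
# How many "nines" of per-request success would keep expected errors
# at or below one per day? All figures here are illustrative assumptions.
REQUESTS_PER_DAY = 10_000_000_000  # assumed stand-in for "tens of billions"


def expected_errors_per_day(requests_per_day: int, failure_rate: float) -> float:
    """Expected failed requests per day at a given per-request failure rate."""
    return requests_per_day * failure_rate


def nines_for_at_most_one_error(requests_per_day: int) -> int:
    """Smallest n such that a failure rate of 10**-n yields <= 1 expected error/day."""
    nines = 0
    while 10 ** nines < requests_per_day:
        nines += 1
    return nines


if __name__ == "__main__":
    # 3 nines = 99.9% success, 5 nines = 99.999% success, and so on.
    for nines in (3, 5, 7):
        errors = expected_errors_per_day(REQUESTS_PER_DAY, 10.0 ** -nines)
        print(f"{nines} nines of success: about {errors:,.0f} failed requests/day")
    print("nines needed for <= 1 expected error/day:",
          nines_for_at_most_one_error(REQUESTS_PER_DAY))
```

Under these assumptions even a very high success rate still leaves a large absolute number of failures each day, which is why the conversation turns from prevention alone to detection and recovery.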

Starting point is 00:13:48 The small companies would actually be dumb if they implemented all that heavyweight process while they're small. That's wrong too, because that's not a problem for you right now. But if you're at scale, it is. And so the big companies that do all that stuff, that's not dumb. It's mandatory because everything's multiplied by powers of 10. And so things appear that were there. You just didn't see them often enough to do anything about it, rightly so. So yeah, your processes have to get better because you do need more, you know, percentages of quality or however you'd like to measure, you know, different ways of measuring that. But the other thing is, but it's never going to be perfect. And at sufficient scale,

Starting point is 00:14:22 stuff's going to happen. And so you also need this different mindset of, well, given that it's going to happen for sure, then what? Oh, well, then our reaction time has to be faster. The reaction has to be automated. Remember, it's like this kind of meta second layer. Prevent, prevent, prevent, but knowing that prevention completely is impossible, and scale means that it will be common. Oh, what kinds of detection? So you start getting to these numbers like mean time to detect, mean time to recover, as opposed to how many incidents. Of course you do both, you do both,

Starting point is 00:14:52 but the number of incidents you want smaller and smaller as a percentage of everything. But smaller as a percentage of something that's growing, it's still an absolute number growing, and so you still need to know, like, do we detect and recover in like a minute or two, versus an hour and it takes a human? That's a big difference. But detecting and recovering automatically is a totally different question than preventing in the first place, which of course is quote-unquote better, but it's impossible for it to be good enough. So in how you allocate your time or

Starting point is 00:15:21 investment you might say across these things and then we haven't even gotten to security which is a whole nother thing and often hurts things like performance and uptime, et cetera. So that's another thing that can be at odds with scale is security. So I don't mean to overcomplicate it, but it just goes to show these are not only things that you don't think about at first, you shouldn't think about it at first. It would be a waste of your time. It would be premature optimization. So you shouldn't do that at first. But on the other hand, if later you're not doing it, that's bad. And it applies at all layers of the stack too. Easy example. You said a few minutes ago that you have 17,000 servers. Okay, great. That is a significant point of scale. You can almost

Starting point is 00:15:59 certainly get some incredible discounts from Dell or HP, whoever's making servers these days. Super Micro's been on the rise for a while. But you're almost, if not entirely, based on AWS and Google Cloud, based upon what I've seen over the years of various service offerings you have. I get to sometimes pick which one of those two providers I'm hosted out of, which, cool, fantastic.

Starting point is 00:16:21 I don't have a strong preference, believe it or not, for my corporate website. Why don't you run your own servers? You certainly have enough that people can do basic arithmetic and say, okay, if a server costs this much, a calculator tells me this much, and wow, that's a lot of money on instances.

Starting point is 00:16:37 Do you just hate money? No, we love money, but Google and Amazon know that. And so they simply set their prices for us such that it would be more expensive to move, to rebuild and move, and then the ongoing management, let's not forget. It's not the price of the servers. Of course, that's less.

Starting point is 00:16:53 It's not that. It's managing them. And as you say, everything I just said, cross-apply to the physical layer. So you have to be ready for that. Now, you could outsource that, I know, but all that's expense. So when you take the total cost of all of that stuff, what Google does is they know that, and so they set their prices such that we go, okay, if we were to do that, maybe we could save this much per month, but we would have to do this and that, and it's a distraction. And exactly what you just said about WordPress is how we feel about infrastructure. How exactly those SSDs get

Starting point is 00:17:25 racked and powered does not affect our customers. It needs to exist and have high uptime. Beyond that, our customers don't care how that happens. So sure, if we could save tons of money doing it, but they just simply set the prices for us so that it's not worth it. So as we spend more and more with them, you know, it becomes more and more economical for us to do it on our own. But then they change the price so that it's not. Discounting at scale is very much a thing. I've yet to find an AWS environment

Starting point is 00:17:52 that's built out anywhere other than at a startup where the infrastructure costs more than the people working on the infrastructure. It's not that it's hard to reliably replace SSDs at scale in a data center. It's that it's hard to be able to afford the people to be able to do that until you're at a certain inflection point. And again, you folks are terrific at running WordPress at scale.

Starting point is 00:18:15 I don't know, for example, that you would be nearly as effective at remembering to do generator maintenance on a consistent schedule, and only one at a time so you aren't taking down both power rails in various ways and causing site-wide outages. And I wish I were making that one up. It's just not an expertise that we have. And so you could choose to build that expertise or perhaps acquire something, et cetera, et cetera. But then you start asking the normal strategic questions. Is this good for our customers? Does this make us more differentiated in the market? Does this add some innovation that keeps ahead of trends or does something valuable? And the answer is no to all. The best thing it could possibly do is save

Starting point is 00:18:57 us money, which is a good reason. That is a good reason to do something. It's just the least strategic thing you can do to save money, right? Anything you can do for your customers, whether you're charging them more for it or maybe accepting that value rather than in price by things like retention or advocacy. There's many, many ways to trade value with your customer. I like to say what you should do is create more value for the customer and decide how to split it with them. It could be a higher price, you know, but like there's many ways. And anyway, just create more value. That's number one. And then split it. That's the business side. Fine. Saving money is none of that. It's good for us and customers don't care. So we should do it. It's stupid. As you say, it's stupid to burn money for nothing. But again, since the

Starting point is 00:19:37 vendors know that, they just simply set the line such that that isn't a good use of our time. You know, you hear stories like, oh, with Dropbox, they did this and that with disks. Right, because at some point, at some level, at some scale, for some companies, it's a good idea. Of course, of course. No such thing as a law of physics

Starting point is 00:19:55 that's true everywhere. And there are a lot fewer companies with specific large scaled out workloads that are running into capability barriers at that scale than there are people who look at that and say, yeah, we've got several hundred of these things now. We should definitely build a data center. No, please don't. No, no. It's hyper-specialized to want to do that.

Starting point is 00:20:15 If you're Facebook, it makes sense to have data centers in Iceland for long-term storage. That does make sense. At some scale, in some situation, it makes sense. But for almost all of us, including us, and we're a hosting company, so if anyone should, it's us, right? It makes no sense. Again, because at best, it's a cost savings, and P.S., it isn't. And there's value to understanding the market you operate in. A couple of years ago, I was profiled in the New York Times, which was great. But when I called into WP Engine in advance to let you folks know, the response was not, okay, good for you. You just called to gloat or what? It's like, no, no, there's none of that. They understood, oh, great. So you're going to

Starting point is 00:20:54 start potentially seeing some scale. Here's what we can do to mitigate that and make sure the site doesn't go down at the worst possible moment because they're not going to run the profile a second time. And there were processes and procedures set up. There was a migration from shared to dedicated for a four-hour span, but things still stayed up during that time. It was clearly communicated. And everything just worked. That's the sort of thing you only learn to do really well

Starting point is 00:21:17 by doing it really poorly a few times first. Yeah, we've done it so many times. You know, another thing that happens with scale is people. What does not scale is one person doing a thing. What scales is teams of people who do things and they have their policies and procedures and training and teaching each other and so on, so that the whole system is of higher quality and you have checklists. And if one person leaves the team, the team progresses

Starting point is 00:21:45 anyway. That's the kind of thing that you build at scale with humans. And so that's what you're describing too with service. So that's also true. And of course, we can do that because it's amortized over 200,000 customers, and no one customer can do it because it's not amortized over 200,000 customers. It makes no sense for them to try to become an expert. It just doesn't make any sense. So this is true of many things. Like you said, like it's true for us in the cloud. Like we treat Amazon like you're saying you treat us, right? Like we all treat the next layer down on the stack as a, oh, that's not my business.

Starting point is 00:22:16 That's necessary. I need it to be high quality, but that's not my business. It's not what my customers are. That's not how I differentiate it. It's not how I'm going to win. Therefore, it's not strategic. So I need something good.

Starting point is 00:22:28 And that's it. It's like an SLO. And it's in the Google SRE. Like I need it to be, I need to hit the SLO and then stop. Don't, I don't need more. I don't need to pay more. I don't even need you to deliver more past the SLO. We're done here.

Starting point is 00:22:41 If it goes below the SLO, we have a problem. But if it goes up to the SLO or at the SLO, then you say, well, the whole point of having an SLO line is when it's above the line, we all breathe a sigh of relief that there's no current problem. Great. And we agree not to further invest because that's not giving us value. We need to invest in whatever does, which is company specific, obviously. And so that's how we think about the cloud. It's how you're thinking about us as a WordPress provider. And it's correct. That's the correct attitude towards these things. Another way to look at it is this. There are

Starting point is 00:23:15 things in the company that you want to maximize, meaning there's no such thing as good enough. Revenue is one. Profit might be one, but revenue certainly is one. Gross margin is certainly one. But there's many kinds of things, like what kind of customer value we're delivering. There's no such thing as too much. One of our core value propositions is performance, site performance. There's no such thing as a site that's too fast. Jeff Bezos famously talked about how no one will ever say the delivery was too fast. I ordered and it came too quickly? No. Like, the faster the better, probably, you know, roughly speaking. So there are a few, not a lot, but there's a few things in the company that you want to maximize, again, because it's strategic or important in some big way like that. Good. That's where you should be investing. That's kind of what that means. Most things in the

Starting point is 00:23:57 company, even very important things are things you want to satisfy, not maximize. Once they get to some threshold, some level, some whatever, going beyond that is not that valuable. Either it's not valuable at all, or just diminishing returns or otherwise, like it's not a good use of our time or money, whether because the actual return is diminished, like a diminishing return,

Starting point is 00:24:18 or simply the business value of it is not enough. If my website load time improves by 200 milliseconds from start to finish, great. I don't make another dime in consulting revenue. I don't get one more sign up for the newsletter, none of it. At this point, it checks a box. For me, one of the big values of going to you folks is that I come from a background where I used to run these things. I do have the engineering mind where it's fun on some level. I want to set up WordPress and run it across this small cluster of things, but it adds zero value to my business and it's not what I need to focus on.

Starting point is 00:24:52 So please take it off my plate. Exactly. Fun's a whole different thing, right? Like, oh, fun, fun. You can throw away everything I just said if you want to do fun. So many of us learn this stuff on open source software in our spare time, in the evenings and weekends or when we're students. And then money is very dear and hard to come by and our time

Starting point is 00:25:10 distills down to basically free. So in time in business, that turns on its head and some people have trouble with that transition. I did when I started. No, that's absolutely right. Time is certainly the most expensive thing, there's no doubt. So it's really important that you be working on the highest priority thing. What is that? Obviously, it's going to be very dependent. But almost for sure, screwing around with attaching storage is not it. It's almost certainly not the top three, top one most important thing. And so almost everything in the business should be something you're satisfying in that model. And so that means outsourcing to something good. Like again, that threshold for

Starting point is 00:25:46 what's good enough to be satisfied can be high. You can set that up really high and say, for example, website speed does matter to me because I rank higher in SEO if my site's faster. And that does equal more dollars at the end of the day for a media company or an e-commerce company, or perhaps even for a consulting company. And there's a lot of data about e-commerce that shows that with faster sites, more people check out and even, and I don't know why this is, but even have higher average transaction sizes, like put more in the cart. I don't know why, but there's a lot of data, like lots of studies. Yeah. We saw SEO improve when we did the analysis, when we improved website speed by optimizing some things.

Starting point is 00:26:25 And then we checked that. We got it to a point where, yep, this is awesome. There is not believed to be any discernible benefit if we, okay, if we drop this performance yet further, we're already getting A's on all the grades and the tests that spit out. It's, okay, is this where we want to really focus our time? That's right. So once you get to that, so you might say, I have a high bar for performance because I've seen what happens when it's not. And it really does help our business when the performance is high. So I have a high bar, but then saying like, I want to spend 10 times more to push it a slight bit is like, well, no, not that. Like I'm setting a bar and

Starting point is 00:26:58 maybe the bar is low for some things in the business. They just need to exist, but not very good. Sometimes the bar is very high, no worries, but still it just needs to be satisfied. And then we need to move on. And that should be most things in the business because we don't have the time and number of people. None of us have the time and people to do more than that for most things in the business. The bars might be high or low, or wherever they are, but after we meet them, we need to move on to other things, especially the things where there is no limit of how good it is. And then it's OK if you pour forever and ever into that, like Amazon pours forever and ever into delivery times or inventory and that kind of thing. Yeah. My experience with my own website is that it is far slower than I would generally find acceptable.

Starting point is 00:27:39 And the reason behind that is that whenever I'm logged into the admin portal and moving around the site, you do a whole bunch of cache busting. It is going direct. Everything is slow and latent because one of the worst problems in the world is, oh yeah, you fixed the issue, but it's cached somewhere. So it looks like it wasn't. And then you mindlessly destroy your own website, iteratively trying to improve it. Been there, got the entire wardrobe, let alone the t-shirt from those problems. So it's like, yeah, this is slow. Why aren't people complaining? Wait, I'm logged in. Okay. Just to test it, log out, boom, things are loading almost before I click. And okay, good work. It's always fun when that catches you by surprise. Right, right, right, right.

Starting point is 00:28:22 One last topic I want to get into a bit. You mentioned you were running WordPress entirely on top of Kubernetes. WordPress, the last time I looked at it in any seriousness, which was about 15 years ago, it was a product of its times coming from the 90s and PHP. It is the era of servers being physical things. Virtualization was looked at very skeptically in the few places it was deployed in. It is one of the least cloudy packages I've seen in a while. It assumes you're going to have permanent named pet servers running this thing forever, trying to get it to work in a cluster

Starting point is 00:28:53 where it can sustain the outage of one of the nodes, storing assets in object stores rather than on disk, requires a whole bunch of ridiculous patching. So my question for you, given that you are the authoritative experts on running this stuff in the modern era

Starting point is 00:29:07 in a cloud at scale, how vanilla is the WordPress that you folks deploy versus how heavily have you had to either patch or completely fork the thing in order to get it to do the stuff you want? So in terms of the PHP code in WordPress, it's vanilla and there's no forking both for just you could say selfishly in managing the thing because there's always changes and there's plugins and like there's

Starting point is 00:29:31 all kinds of things that otherwise would break. But also, we are also a product of the WordPress community. We have benefited so much and always have from the WordPress community, and it's one of our core values actually to give back. That means several things to us. One of them is to give back to the WordPress community that gave us and continues to give us so much. It's part of what our DNA is. So we're not interested in forking it and I don't know, somehow, whatever. We're interested in helping the community, which means that the product it is. However, everything outside of that is super custom. And of course, that's our secret sauce. So that's what people are paying us for.

Starting point is 00:30:07 And everyone else is free to do the same, by the way. It's not like, you know, right. Oh, in my era, we had so many management scripts for WordPress that were all written in the most obfuscated Perl that you've ever seen. It was awful. Not because we tried to, because we're bad at it and we needed something in a hurry. Written in obfuscated Perl, or as we also say, Perl. Yeah, that's an unnecessary adjective.

Starting point is 00:30:27 You can remove it. The meaning is already implicit. It really is. My joke is always like, no one ever admits they know Perl because then they're going to be the one, oh, can you look at this? And the answer is no, no, no, no, no.

Starting point is 00:30:39 Perl to me kind of looks like when they picked up the modem, your mom picked up the modem and then it went, like that's what a Perl script looks like to me, you know? Yep, suddenly your terminal's printing out complete garbage. Yeah, it's the world's only write-only language. Yeah,

Starting point is 00:30:51 absolutely. Write once, read never. Yeah, so everything outside of that is hyper-customized, so you can do things in Kubernetes. You can mount disk that's read-write. You can recover entire things. You can make a set of containers that act like a sort of VM, but still using containers for things like each of the processes. So that's

Starting point is 00:31:13 easier to manage and test and deploy and all the normal reasons why one containerizes things. You can do that and have it move around like a little, I don't want to say cluster because of course Kubernetes cluster means something else. So this is possible. Now you can also do what you said, which is to try to make WordPress much more natively 12-factor is what I would say, right? That's also a good idea in terms of scale. But as you say, it takes a lot of effort and commitment on the part of the site owner. Because as you say, you have to write the site with that in mind, like object storage, really using cache probably,

Starting point is 00:31:49 thinking about how the database works and how much you're going to hit it, like how much you're going to abuse it if it's not going to be local, making sure, of course, that the disk is read-only and only used to deploy code and doesn't have media.

Starting point is 00:32:00 Part of the problem with the WordPress plugin ecosystem is so many of them are written in ways that are disastrous if you implement them either at any kind of scale or in anything other than exactly the scenario that the plugin author was envisioning. That's right. So like a lot of the plugins aren't available to you. So there's a lot of things you have to accept. If you accept them, then WordPress can be 12 factor and there's plenty of sites that do that. In fact, we also have a product line

Starting point is 00:32:25 called Atlas, which is headless WordPress explicitly. Like, we're running your node in Kubernetes and also running your WordPress so that the whole thing is just what you'd hope, I guess you could say. The node is running as fast as it can, things are

Starting point is 00:32:41 cached really intelligently, but it's also talking to WordPress, which is local to it. So it's very fast. So it's a very fast, very scalable thing that uses all the new things like Node.js and blah, blah, blah. All the new hotness and headless sites. So we have that too. So we have sort of the whole gamut between, like you would say, kind of the old-style monolithic WordPress thing, which is running in Kubernetes, but in a situation where you're like, it's in Kube, but it's kind of like not, right? It's like, yes, that's exactly right.

Starting point is 00:33:08 It looks like to WordPress, like it's not, but it is in Kube. And you might say like, what's the point of that? And there's actually a lot of points. And one of them is exactly what we were just talking about. Mean time to recovery. So if a Kubernetes node goes down, of course, Kubernetes will reconstitute the containers and move the traffic and blah, blah, blah. And it does it pretty damn fast, much faster than any kind of VM thing with detection, da, da, da, da, da. And much more reliably and at scale better as well, especially with things like GKE or other things where that's managed for you. So even making a, you might say, like thing on Kube gains you things like some of these benefits of scale.

Starting point is 00:33:49 You also have things like there's all these advantages of containerization. You get those, there's like, there's various benefits anyway. Plus we have products though that go down the line to, okay, wait, are you willing to really make 12-factor, like headless apps? Because if so, we've got a product for you that takes advantage of all that. So welcome. And so again, we have so many customers. So if it was a startup, you'd say you have to focus, you can't do all these things, that's crazy. But we have thousands of employees. We have been around for 14 years, and slowly we've built that from something simple and focused to, okay, we're going to layer on this product line, but we're going to have 50 people working on it. We're

Starting point is 00:34:24 going to layer on this other kind of customer, as I mentioned, but we're going to have hundreds of people between sales and marketing and support to pay attention to that customer segment so that it's in addition to another customer segment, not instead of or amalgamated. So if you're doing that in the right way, then you can layer on these other things and it's okay. So now we have all that stuff and it's all right. But yeah, you're right. It's one of those things. But once again, it's one of those things where it's hard and in some ways unnatural, but if we solve it, which we do, then we have this competitive edge and we have a product that's useful. Sometimes companies tackle things that are hard and there's not really a big advantage on the other side.

Starting point is 00:35:08 It's just hard. And you go, oh, well, that sucks. That sounds like just a hard business. In this case, doing that hard thing earns us something. Oh, this is a high uptime, high speed, whatever. Like you say, WordPress. Oh, well, and then since it's hard, a lot of competitors won't be able to do it or won't be willing to do it. So it'll be somewhat differentiated, let's say.

Starting point is 00:35:32 And you're like, oh, okay, so if we do the hard thing, there's these rewards. Oh, okay, that's worth doing the hard thing in that case. You've also nailed the pricing as well in that you don't have one of those $4 a month website offerings that I've ever seen, which means that a lot of those very small dollar customers tend to need an awful lot of handholding as they're getting something up and running. And when I was at Media Temple, one of the things that we finally started doing was letting go of the bottom X percent of our customers every year just because you're spending $5 a month or $10 a month, whatever it was at the time, and you're expecting 80 hours of engineering support in a month and that the juice and the squeeze don't align. It's about meeting the right customers and solving the right problems for them. No, you're right. You're right. And it is this there is this ironic inversion where like the less they pay, the more they want.

Starting point is 00:36:20 So it's like, wait a second. Although there's a U there, because then if they pay a lot, they also want to literally be on the phone every week with your product managers and maybe you should. But there is this interesting, I wouldn't say it's ethical dilemma, but it's a business dilemma, but it's not obvious what to do, which is, of course, you're going to have some customers that are more profitable and some customers that are medium and some that are unprofitable like you're describing. To what extent is that just okay? Okay, it's the cost of doing business, and there's value in having a brand that just says we always help no matter what, that even those people who are in that situation, they're going to go and say that to others, and there's this momentum and brand, reviews, Twitter, referenceability,

Starting point is 00:37:03 case studies. There's all this stuff that like, if you, the more people are happy with you, you almost want to say karma works. Not in like precise mathematical way, but in just a hand wavy, it kind of does work. Okay? To what extent do you want to keep that magic?

Starting point is 00:37:21 Even though you can't measure that and you're never gonna have metrics, like I get that. But there's a truth to it. To what extent is it like, look, they're unprofitable? Well, some are unprofitable. Now, if that gets too much, then we got a business model problem. Fair enough. Or maybe some are so crazy on the edge, like so, so, so crazy that, no, you've broken the argument now. You know, that just has to be against our terms of service somehow. Like, we got to write that into our AUP or something. You just can't do that. Okay, so maybe

Starting point is 00:37:48 there's this extreme. Yeah, when you have people go, you're humane about it. You help them transition somewhere else, but you make it clear that you're not able to serve them in the way they need to be served. Even on the consulting side, we have the same policy. We found someone paying us $100 a month, and they were the number one bandwidth user. Okay, well,

Starting point is 00:38:04 you can't do that. There's a limit to where it's like, you can't do that. And you're in cloud too, so bandwidth is very dear. It's very dear. It was not good. So do you want to trim the tail? If you imagine a graph, which we've made, and maybe you did too at Media Temple, of basically the customers by their net profit per customer, of course there's this tail that's bad, like you say. Okay, so the absolute, absolute worst ends of the tail you trim. Fine. And trim doesn't necessarily mean you make them quit.
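The graph described here, customers ranked by net profit with a bad tail at one end, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Customer` fields, the cost model, and the 1% default cutoff are all hypothetical and are not how WP Engine or Media Temple actually measured it.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # All fields are hypothetical; a real cost model would be far richer.
    name: str
    monthly_revenue: float
    monthly_cost: float  # e.g. bandwidth plus support time

    @property
    def net_profit(self) -> float:
        return self.monthly_revenue - self.monthly_cost


def worst_tail(customers, fraction=0.01):
    """Return the least profitable customers, capped at `fraction` of the base.

    Only customers who actually lose money are returned, so a healthy
    customer base yields an empty list.
    """
    ranked = sorted(customers, key=lambda c: c.net_profit)
    limit = max(1, int(len(ranked) * fraction))
    return [c for c in ranked[:limit] if c.net_profit < 0]
```

The output is a list of conversations to have, not accounts to close: as the discussion goes on to say, the fix may be repairing the site, a higher plan, or a move elsewhere.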

Starting point is 00:38:34 In the case of this customer, for example, they had been doing something really dumb with their site. We helped them fix it and they could remain a customer. But you do have to fix this thing. You can't just, you know, fix it. But maybe they have to leave. Maybe not. Maybe they are willing to pay more. Maybe they can fix it. So, okay. It's worth a conversation. It's a data point. It is not an answer in and of itself. Yeah. It's like, Hey, we can't let this continue, but like, there's lots of ways forward here. Kind of a thing, right? My, for the only ones I was glad to see leave were the ones that were just abusive to the

Starting point is 00:39:00 support reps. That was, that was awful. I made a point to at least once a month be on the support floor, taking calls myself, just to experience what other people are actually seeing and talk to customers. Imagine that, it helps with things. Yeah, abuse, you can't allow abuse. The way I look at it is if you do,

Starting point is 00:39:18 sometimes you do that in the name of we love our customers, we care about customers, the customer's never wrong, that kind of stuff. And while that's generally the right attitude, it's the right default attitude, let's say, like in the absence of other information, yes. But do we love our employees less? Do we respect and trust and love and care about our employees less than our customers? Well, if you allow abuse, the answer is yes, you do. And I think the right way to do business is the answer is no, they're all people. And none of them

Starting point is 00:39:46 should be abused, neither our customers nor employees. And so if you're abusing our employees, okay, we'll tell you, etc. But if you can't stop, that's not acceptable. I don't care what you're paying. I don't care what your profit margin is. You just can't do that, because it's not like we don't care about our employees, you know? And so I think that's just the right attitude. That's just a good, a good relationship. These are all human beings. Let's have a, let's have a good, more or less respectful, more or less safe, more or less professional relationship. Right. I mean, that just seems like a good idea, but, but how much do you trim? So you could argue, trim it all the way up to the people, you know, blah, blah, blah. And that's not a bad idea. Like that you're, you're,

Starting point is 00:40:22 you will have a profitable company for sure. And nothing wrong with that. And I'm fine. There's really nothing wrong with it. But I do think there's a magic there that you might want to be careful before you snip it, especially because I don't know that you can measure it.

Starting point is 00:40:37 At least I don't know how. And so I think there's an art there to like what happens over there. It's not obvious, but I think it's quite interesting. Yeah, customer basis is always incredibly important. I mean, especially when you're a hosting company, you have a whole category of problems that many companies just don't realize exist. Stolen credit cards, because effectively, even before cryptocurrency came out: great, I can use this to send spam to two billion inboxes.

Starting point is 00:41:06 And I can do this to run command and control attack sites and all kinds of other nonsense. People will say all the right things. For better or worse, I have not yet found a way for my consulting projects to be turned into something actively harmful to the rest of the world.

Starting point is 00:41:21 But I'm sure someone enterprising will come up with one sooner or later. It's true. That is a constant worry and a constant challenge for us. One way is bad guys taking control of other people's websites. And that could be tech or it could be social engineering, by the way. There's also, like you say, if you can run code, you can do whatever. All the things you said and more. So using a stolen credit card to get a website and then do stuff because you're uploading code and doing stuff. And so that can be arbitrary stuff,

Starting point is 00:41:49 even just as simple as bouncing through the site to something else, just to cover your tracks some more. I mean, whatever, or going to a site and injecting something in the, in the, like some JavaScript in their site. So it's not quite so obvious that they've been hacked, but it's doing whatever click fraud or whatever the heck it's doing in there. So yeah, a constant thing. So we, security has always been a critical thing that we've had to invest a lot in. And one of the reasons why people pay to be with us is that, of course, there's no such thing as quote unquote perfect security, but there is such a thing as having layer after layer, thing after thing. And, and, you know, either you do or don't do all that stuff so that you're at least not being negligent and you're doing as much as you can.

Starting point is 00:42:26 That is true. So we have everything, and we have a whole security department, and we're SOC 2 and ISO, which is not exactly security, as any security person will tell you, but it does show that we're trying to be organized and thoughtful about our processes

Starting point is 00:42:37 and access control and so on. But we do things like every year, everyone at the company has to go through a security training, including social engineering training, especially for sales and support. It's easier for that to happen. And of course, all of our code goes through all these reviews and there's automation as well as humans on that. And I mean, just everywhere you look, there's like stacks of things that have to do with security. So, yeah, security is definitely like performance.

Starting point is 00:43:03 There's no such thing as we're done. We've done all the things that needed to be secure. Hooray, we're finished. But if you're like, well, I installed a firewall, but there's, but, but I've never thought about social engineering. That's negligence. I mean, when you're first starting out, it's not, it's whatever, but like at some scale at something, or especially if you're promising that security is one of your features or benefits or whatever you'd like to call it. Okay. Well then if you're not doing social engineering training then that is negligence if you don't have keylogger detectors on the laptops that's negligence when we installed that by the way it must have been a decade ago we found like a 10 of the laptops had a keylogger on it oops like it's there it's definitely like

Starting point is 00:43:42 all this shit is happening for sure. No doubt. The only question is like, are you looking, do you know to look, are you doing something about it? So if you're doing a ton of stuff like we are, and there's still some crazy side route where this thing happened, it's like,

Starting point is 00:43:55 right. I mean, that's not good. We want to do something about it, but it's not negligence. There's a big difference. Right. You don't inadvertently take something down.

Starting point is 00:44:02 That's important. Just because someone doesn't have a full understanding of what's going on. It's, all this stuff is complicated at massive scale. The way I look at it is, if something happens, I want the story of how it happened to be ludicrous. It's like, they did this and then that, and they used this thing, and we don't even know about that, and then the customer wrote their own code, and that's where they got in in the first place. And then it's like, okay. That, again, doesn't mean there's nothing we want to do about it. There may be things for us to do about it. There may be lessons to learn. No problem, no worries there. But we can rest easy going, like, well, geez, if that's what it takes,

Starting point is 00:44:32 then we're doing our job. Yeah, we've successfully raised the bar of what's required to get into something. Okay, now it requires active, unlikely misbehavior or choices made on behalf of customers. Yeah. Yeah. And, and, and like not obvious. So like, uh, there was a security bug in some library we use and it was reported a month ago and we still haven't upgraded and they got in that way.

Starting point is 00:44:56 Okay. That's, that's negligence. Why didn't we update it in a certain timeframe? Right. Another way is a zero day bug was just reported. We patched it within 12 hours, but it was already exploited in hour one. Now is it negligence? No. We're doing actually way more

Starting point is 00:45:13 than any of our customers would have ever done for themselves. Far, far more. And we prevented it for nearly every single customer. And you have the telemetry and the organization around it to be able to track that down and not say, well, we have no evidence of any compromise. Yeah. 'Cause you don't have logs turned on. Right. Right. Right. So it's like, well, we protected, you know, 99.996% of our customers from it, but a couple of them before it's like, well, then we're really again, then, okay, that's back to it. We have a thing that we use in our values that again, I keep, I keep coming back to that because we actually have them. And you know that because you keep referring to them and using them to make decisions.
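The arithmetic behind a figure like 99.996% is worth a quick sketch, since it is the episode's theme in miniature: a tiny per-customer rate still yields real incidents once the base is large. The customer count used in the example below is purely hypothetical (no figure is given here), and the second function assumes independent events, which real incidents rarely are.

```python
def expected_affected(total_customers: int, protected_fraction: float) -> float:
    """Expected number of customers NOT protected, given the protected share."""
    return total_customers * (1.0 - protected_fraction)


def prob_at_least_one(per_customer_probability: float, total_customers: int) -> float:
    """Chance that at least one customer is hit, assuming independent events."""
    return 1.0 - (1.0 - per_customer_probability) ** total_customers


# Hypothetical base of 100,000 customers; 99.996% is the figure quoted above.
affected = expected_affected(100_000, 0.99996)
any_hit = prob_at_least_one(0.00004, 100_000)
```

Under these assumptions a one-in-25,000 event stops being rare at the company level: across a large enough base, somebody is almost certain to see it.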

Starting point is 00:45:49 So one of them I really like is it's called do the right thing, which doesn't mean anything by itself. That's just nonsense. But what it says next is to define it is if it's right for the customer and right for the company and you're proud of your decision, then you've done the right thing. And this discussion right now with security is an interesting application of this. So when something happens, you look at it and you say, are we proud of this, or are we kind of not proud about this? Well, if it's some crazy ass whatever thing, you're like, yeah, we're fine. As opposed to, oh my God, you guys, we should have gotten that. How did we miss? Like we, that should not have happened. I'm not proud of that happened. I, so what's funny is on the one hand, being proud of something sounds so subjective.

Starting point is 00:46:39 How is that an objective measure? But what's funny is it isn't subjective at all. You know immediately when you're proud. It's hard to codify in a handbook somewhere in a way that you're going to be able to distort into something that fits in a contract. But, you know. Exactly right. If it's in a contract, never mind. Right. But like just from one person to another, are you proud of this? You know the answer immediately.

Starting point is 00:47:01 And so it actually works really well, in fact, because something happens. Everyone's like, oh, everyone knows we're already agreeing here. It is objective. You know, not that there's no gray areas ever, but you know, you get the idea. So I'll give you another example where this is a fun application. Another thing you get in hosting is our customers put all sorts of stuff on the web. Is it okay? Are we okay with it being on the web? How much are we looking? How much can we look? These are all kinds of things where you deploy these same kinds of responses to abuse reports, but that doesn't mean you're basically crawling everyone's website in your spare time to see what they've got.

Starting point is 00:47:31 Yeah. So there's things like, you know, Cloudflare goes through this kind of famously all the time where there's somebody and they're saying stuff and it's really offensive to some people. And some people say Cloudflare should shut them down because this is over the line. And some people say Cloudflare should not shut them down because it's the Internet and they shouldn't make those decisions. It shouldn't be up to Cloudflare. And I think both of them have a point. Hence the dilemma.

Starting point is 00:47:56 It's like they both have a point to make. Cloudflare has a way of finding themselves in the middle of that debate over and over and over again in a way that other providers never seem to. Well, so much of the Internet goes through there and people use that to keep their websites up. So I mean, we have that too. It's not quite as in the news, but we have that as well. So here's my attitude. And of course, you don't have to agree with me, but here's my attitude on that. You don't have to agree with me, but let's say as someone who's had to wrestle with such things, and we as a company have had to figure out procedurally how do we wrestle with such things and we really struggle with it is this i want to see the organization struggle i want them

Starting point is 00:48:32 to, I want to see them go, on the one hand, this, and we value that. And on the other hand, this other thing, we value that too. We, you know, a lot of us don't like what they say, but that can't be the right way to do it. And we do believe in free speech. We do believe in the internet, blah, blah, blah, blah. And that makes sense. And yeah, who are we to make that decision? And yet we do have to make the decision because we're here. We do have that.

Starting point is 00:48:55 And yet, yeah, should we even? And well, we have this in our terms of service and maybe we should invoke it. But you could read it this other way because, of course, there's gray areas. Some things are black and white, but some things are not. None of these conversations are simple. Yeah. And like, so I want to see that. I want to see them going, ah, but we really value this, we value that. Yeah, I want to see the struggle. And then I don't care how they just resolve it. I want to see them because that means they're trying to do the right thing, whatever that means to them. And none of us will always agree on what someone

Starting point is 00:49:30 else ends up deciding in a thing like that. None of us will agree with each other a hundred percent. So that can't be the metric or that can't be the thing that decides whether they're trying to do the right thing, but they're proud of that decision. So to me, if I see our team struggles and struggles and then comes up with an answer, I'm proud of that. I'm proud that we tried really hard and we came up with something. We had some rationale. Of course, not everyone will agree. I'm proud of that. I'm proud of that way of deciding. I think that has to be good enough. I mean, it has to be, of course, genuine, right? If they genuinely, genuinely, humans tried to figure it out, and like, are we

Starting point is 00:50:06 getting a million of those a day, or have we improved and improved and improved our AUP such that only the hardest, craziest things are still hard and everything else is known? Because again, like, if the answer is no, oh, well, then we're being negligent about having a good policy. But if the answer is yes, yeah, almost everything is handled, and this is just one of those very few things that still fell in that gap, again, I'm proud of that. Then I'm proud of our policy. I'm proud that this is rare. And then we struggle. Good. Like, I mean, at least that's my approach. So there was this, there was a, I don't remember when it was, but I remember

Starting point is 00:50:46 one of these times with Cloudflare, they put out this big letter explaining their struggle. And I remember reading the letter and just thinking, there's the struggle. That's what I, personally, I'm like, see, so I'm happy whichever way they went, because I feel like they're trying to do the right thing. They cared enough for it to bother them. Yeah. Yeah. And then they cared enough to tell everyone, like, it's just, and so of course people are

Starting point is 00:51:09 like, no. And some people are like, yay, you know, whatever. What can you do? So that's, of course, that's the outcome. So I really want to thank you for taking the time to speak with me. If people want to learn more about how you view this and so many other things, where's the best place for them to find you these days? Sure. So for me personally, it's asmartbear.com, like the animal.

Starting point is 00:51:35 We both like animals, I guess. My previous company was called Smart Bear. That's why it's called that, because it's my online identity from long ago. And then of course, WP Engine is wpengine.com. And we'll of course put links to both of those things in the show notes. Thanks for having me. And I hope you see I didn't duck any of the questions. No, you did not. It's appreciated. Thanks again for agreeing to do this. I really appreciate you taking the time. It was fun. Great topics. Jason Cohen, founder of WP Engine. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you hated this podcast, please leave a five-star review

Starting point is 00:52:09 on your podcast platform of choice, along with an angry, insulting comment that won't get published correctly because your platform of choice decided to run its own WordPress instance instead.
