Screaming in the Cloud - From Aurora to PlanetScale: Intercom’s Database Evolution with Brian Scanlan

Starting point is 00:00:00 I've been saying this exact thing to many, many people in Amazon over the last while. My excellent account manager has been setting me up with various leaders. They've been asking for documents. They've been asking for examples. You know, they are hungry for this stuff. So I don't doubt that there's a desire to be the leaders or to really satisfy their customers. But, you know, it's execution we care about. And when it comes down to it, we need excellent databases.

Starting point is 00:00:27 We need the best databases to be able to ship world-class product to our customers. Welcome to Screaming in the Cloud. I'm Corey Quinn, and I am here to correct an oversight because I have known Brian Scanlan for many years, and somehow he has slipped through the cracks and not been on this show previously. So let's go ahead and fix that.

Starting point is 00:00:50 Brian Scanlan is a senior principal engineer at Intercom, where he has been for damn near 11 years at this point. Brian, welcome to the show. Thanks so much, Corey. It's great to finally be here. I know. We save the best for the end. Yeah, exactly, except we're not ending the show anytime soon,

Starting point is 00:01:07 much to various people's chagrin because most people want me to shut the hell up a lot more than I do. Not me. I hope we keep going for a long time. Crying Out Cloud is one of the few cloud security podcasts that's actually fun to listen to. Smart conversations, great guests, and zero fluff. If you haven't heard of it, it's a cloud and AI security podcast from Wiz.

Starting point is 00:01:27 Run by CloudSec pros for CloudSec pros. I was actually one of the first guests on the show, and it's been amazing to watch it grow. Make sure to check them out at wiz.io/crying-out-cloud. So Intercom, I have complicated feelings about the company. Originally, I hated the thing because I'm on a website trying to get something done, and it pops up like freaking digital Clippy. Hey, do you want to talk to a person? God, no, I'm a millennial, elder millennial, but still, I don't want to talk to people.

Starting point is 00:01:56 But that's come back around because it turns out the issue is not Intercom; it was bad implementations thereof. Maybe you don't need to pop up "talk to a human" on your landing page at the front of it. But when you're dealing with a support issue and you want to tag someone in and suddenly you're talking to a human right there, it's transformative.

Starting point is 00:02:12 So, yeah, it turns out that anything can be dumb if you hold it wrong. Yeah, I used to introduce Intercom to people to say, yeah, we're one of those chatbots that pop up in the right-hand bottom corner of your website, except we're the good one. I think there are a lot of bad implementations, and, you know, people use it for outreach and for marketing and sales purposes, and, you know,

Starting point is 00:02:34 they want messages in your face. But I think when it comes to basic customer support and actually putting humans and increasingly humans in touch with AI bots that will answer your questions, I think Intercom's definitely got really good properties in the marketplace. It's especially strange in this era that we're in, where everyone is building AI chatbots. And, okay, I have a whole list of angry opinions on it, but where I really get annoyed is when they don't admit that they're a chatbot up front. And it irks me because by the time I finally break down and admit

Starting point is 00:03:08 I have to talk to a human being, I have exhausted the documentation. It is not going to be something as simple as, have you tried jiggling the handle? Instead, it's, okay, there is now a weird corner case, which I'm very good at blindly stumbling into. I need someone to go ahead and fix a thing on the back end or let me know that what I want is simply not possible with the platform or admit that your documentation is rubbish. And that's something that AI historically can't do. But forcing me to go through that filter to hit some arbitrary target of the fewest number of customer contacts, fastest resolution possible, and never let them talk to a human, has been maddening. Where are you right now on that whole gen AI spectrum from bot to

Starting point is 00:03:50 human? So we are very much all in on gen AI chatbots. So much so that, you know, in October, November 2022, when ChatGPT came out, we reorientated the entire company. We saw that it was the future. We had actually been building ML chatbots and had them in the marketplace for a good few years prior to the current explosion of ML chatbots. But the lift or the improvement that we saw with GPT-3.5 when that came out, that was a, okay, we need to change the entire company. And we've changed the entire company and moved strongly towards what we think is like the best AI chatbot in the market. It needs to work well with humans and there's a transition as well. Also, there's just a lot of work that's to be done to get the quality really

Starting point is 00:04:41 great so that people don't hate using it. One of the interesting things that we've seen is many of our customers who have good knowledge bases and who get high resolution rates from using the chatbots is that they'll find that customers actually just start asking more questions because if you can get really fast answers and you don't have to go around documentation sites and the chatbot's actually useful, people get addicted to this.

Starting point is 00:05:05 They're going to use it all the time. And weirdly enough, have increased the number of conversations coming in because people in some cases, and we're seeing more and more of this, where people will ask more questions because they're getting good, fast answers, which is a little bit counterintuitive. But I think we always need to have ways to fall back to humans. Of course, depends on the business and the volume and what makes sense for them.

Starting point is 00:05:28 But I don't think humans are going away. And from what we've seen in the market as well, even for places that have adopted and are deflecting or answering large numbers of questions, we see that they're not reducing their support team sizes at all. They're putting their people on better questions, higher quality work, or just deeper work with customers, as well as, like, feeding the bot with better documentation, that kind of thing. So we're seeing it as, like, actually a net positive for the customer experience.

Starting point is 00:05:55 But there's definitely lots of bad implementations out there as well. It feels like AI-assisted support is, in many cases, better than pure AI support. And this is somewhat controversial among people who want to sell bots. But I find that chatbots are not necessarily a great interface just due to complete lack of discoverability. It's the Alexa problem or the Siri problem or whatever robot assistant you want to use. You ask it a question. Today, I was getting out of the shower and I asked one of the bots. I tried it with both Siri and Alexa. Neither one could do it.

Starting point is 00:06:25 What is the partial pressure of oxygen at 10,000 feet above sea level? And they both dropped the thing completely because I have weird shower thoughts. Roll with it. But the question then is, okay, it can't do it. Of course it couldn't. I will never ask that question again, despite the fact that maybe it does know how to do that. It just didn't hear me properly. Or in two weeks, it will be able to answer that question.

Starting point is 00:06:46 But when you ask a question and it can't answer it, you kind of feel dumb for having thought for a second that it might have been able to. It's the problem that the Alexa group has had forever, which is that humans use something like 98% of all the features they will ever use on their Alexa device within 90 minutes of setting it up the first time.

Starting point is 00:07:04 Play a song, set a timer, turn on lights, and that's about it for most people. And then it gains a bunch of these features. How do you tell people about that? Turns out that finishing every sentence with "by the way" and pitching something unrelated just pisses people off. I think there will be a change as the products get better and as knowledge bases get better, as customers, our customers and customers of other chatbots, know how to work with them better, that the expectations of us as consumers, as users of these things, will improve or change over time and we'll just not assume that

Starting point is 00:07:39 these things are as brain dead as when we started interacting with them first time round. So I'm curious as far as what you've been up to from a technical perspective lately. We've known each other for many years. You have been to my house for dinner, you're friends with my brother, which I think most listeners will be shocked to realize I have one of those. It's true. I do. He lives in Belgium.

Starting point is 00:08:00 Great. But what I found that was so interesting and got me talking to you is I was recently talking to some of the fine folks over at PlanetScale, and they have talked about Intercom in general, and you in particular, as being very pleased with their database offering. Now, I have talked to people in the past where I asked them questions about that. The response has been a, wait, a company is saying what about me now? So, right, you and I go back long enough that I can trust you not to bullshit me on these things. So, okay, is it as good as they're telling me it is? And your response distilled down to, if not better.

Starting point is 00:08:34 You are a big champion of what PlanetScale is up to. Tell me more, please, because I do a lot of things here. Scale is generally not one of them for my own personal shit-posting projects. Yeah, short answer is yes. PlanetScale is great, but I'm going to give you a long answer as well. So Intercom is a Ruby on Rails monolith, and we really, really like this setup. We deploy our Ruby on Rails app onto EC2 computers, not stuffed away in Docker containers and using incomprehensible three-letter acronyms for different parts of the setup.

Starting point is 00:09:07 And we... Oh, you don't tear down and rebuild your stack on top of the latest trendy thing every 18 months? Wow, almost like you're not based in Silicon Valley. Yeah, maybe being in Dublin has its advantages. So, yeah, we run really boring infrastructure and we have stuck with Ruby on Rails. And it's largely been great for us.

Starting point is 00:09:25 You know, you have to do a lot of work to scale it out to millions of lines of code, hundreds of developers working in it. But at some stage, you have to connect the thing to a database. And database scaling has been a large part of the problems, but also the joy of scaling Intercom in the 11 years that I've been there. When Intercom started off, we had a very simple,

Starting point is 00:09:44 nice MySQL database. Then, unfortunately, we hit product-market fit and had explosive growth, and that was very challenging. And even though we were based in RDS, we were cloud first from the very start, native RDS couldn't deal, or deal easily, with what we were doing with it. When Aurora came along, it honestly was a game changer for us.

Starting point is 00:10:06 And we jumped in very aggressively. We got to work with the Aurora team. We had some of the biggest tables on Aurora at the time. They would do all sorts of custom work for us. It was good fun. And just the Aurora architecture itself, the split between compute and storage and how low latency the read replicas were and everything.

Starting point is 00:10:26 This stuff bought us years and years and years of scalability. Up to a certain point, and then we started having to do things like sharding some of our data. We had tables that were so large. We couldn't mutate. We couldn't add new columns or do kind of other database migrations on this data because it was changing so fast without taking a lot of downtime. So we had to take action.

Starting point is 00:10:47 And so we built our own kind of database sharding system. Again, built on top of Aurora, choosing technologies that we were very comfortable with. And this bought us, again, like years of scalability. It bought us the ability to be able to do database migrations on our tables. And life was good for another few years. But over time, these like different sharding patterns and the way we were kind of using Aurora meant that we had 13 clusters connected to one application. And you get into these unfortunate situations where AWS will say, hey, we've got a patch out

Starting point is 00:11:21 and you really need to apply it to all of your clusters. Yes, your downtime will be at some point during this broad window that is inconvenient for you and non-deterministic. So you've got to be able to build a graceful degradation mode into your app from the get-go if you're using this technology, because one of the things you lose with a managed database offering is the ability to be very granular around when and what gets applied where. Yeah, and even some of the upgrades that we would do, most of the time, the cluster will come

Starting point is 00:11:52 back in a minute or two, you know, not bad, but occasionally we'd have, you know, something would get stuck inside of a queue inside of Amazon or something, and it would be 20 minutes. And this kind of stuff starts to wear you down, especially when you've got so many clusters connected to your app. And you are highly critical as far as customer-facing stuff. You're on the front of everyone's website. This is one of those areas where mistakes will show. Absolutely.

Starting point is 00:12:18 And, you know, we have loads of customers who have up to thousands of people whose job it is to be using Intercom all day, replying to their customers. And it's definitely no fun when they have their entire teams unavailable, not able to do their job because we're twiddling our thumbs waiting for an Aurora upgrade to complete.

Starting point is 00:12:42 Well, just do it outside of core business hours. I'm in San Francisco. Lots of people here use it. You're in Dublin. Lots of people there use it. You have customers in Australia, Japan, India, around the world. It's core business hours somewhere

Starting point is 00:12:53 for someone at any given point of the day. There is no, and now it is nighttime. The servers can take a nap now. Absolutely. This is not the DMV in the U.S. or the Social Security website in the United States, which still blows my mind. It has a six-hour maintenance window every night. Like, the last person out of the office turns off the mainframe or something. Probably some legacy batch job or whatnot, and there's

Starting point is 00:13:15 good reason for it. But it feels like the servers keep bankers' hours. So we were aware of Vitess and PlanetScale. So Vitess, like the brief introduction to Vitess is it's a MySQL wrapper as such, or system, that came out of YouTube about 12, 13 years ago. It's an open source project. It's got other large SaaS, B2B SaaS providers, like people like Slack, HubSpot. The Slack folks have been a huge advocate of this.

Starting point is 00:13:46 It makes sense since I think they've talked about this, where effectively all of Slack is basically a giant MySQL database. Sharded heavily, obviously, but yeah, every message is a line in a database. So, yeah, having the database work and not take a nap for a 20-minute upgrade at random times, it's kind of high on their list of it must do this. Yeah, and it's, you know, it's not just downtime. It's we need to be able to do things like, we have to shard a lot of data.

Starting point is 00:14:16 And our customers' data is extremely shardable. We're a multi-tenant application. We have lots and lots of our customers' data. And they don't need to join across their, like, different workspaces or different customers. So we have very, very shardable data. The other thing is we struggle with connection pool management.
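The "very shardable" multi-tenant data described here can be illustrated with a minimal sketch of tenant-based shard routing. This is not Intercom's system and not how Vitess assigns rows to shards; the function name, the shard count, and the workspace ID format are all hypothetical, and the hash-modulo scheme is only one simple way to do it.

```python
import hashlib

# Hypothetical shard count, chosen only for illustration.
NUM_SHARDS = 8


def shard_for_workspace(workspace_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a tenant (workspace) ID to a shard index using a stable hash.

    Every row belonging to one workspace lands on the same shard, so
    queries scoped to a single workspace never need a cross-shard join.
    """
    digest = hashlib.sha256(workspace_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for ws in ("workspace-1", "workspace-2", "workspace-3"):
        print(ws, "-> shard", shard_for_workspace(ws))
```

A plain modulo like this forces data movement whenever the shard count changes, which is one reason real systems tend to use range-based or lookup-based schemes instead.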

Starting point is 00:14:35 We have hundreds of thousands of Ruby on Rails processes that all need to connect to databases that can only take 16,000 connections at most. And so we have to run a layer of ProxySQL connection proxies in between our application and our Aurora database. Sometimes the ProxySQL layer goes wrong. And it's just another layer of complication that we don't want to think of. So we were aware of Vitess and it became increasingly clear that PlanetScale were the way to get Vitess.
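The connection problem described above (far more application processes than the database can accept connections) is what a pooling proxy addresses: many clients share a small, fixed set of backend connections. The following is a toy sketch of that idea only. It does not reflect how ProxySQL or Vitess is implemented; `BoundedPool`, `run_demo`, and the worker and connection counts are hypothetical, and plain Python objects stand in for real database connections.

```python
import queue
import threading
import time
from contextlib import contextmanager


class BoundedPool:
    """Share a fixed number of backend connections among many clients."""

    def __init__(self, make_conn, max_conns):
        self._idle = queue.Queue(maxsize=max_conns)
        for _ in range(max_conns):
            self._idle.put(make_conn())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free, so the backend never sees
        # more than max_conns connections in use at once.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)


def run_demo(workers=50, max_conns=4):
    """Run many workers against a small pool; return (completed, peak in use)."""
    pool = BoundedPool(make_conn=object, max_conns=max_conns)
    lock = threading.Lock()
    stats = {"in_use": 0, "peak": 0, "done": 0}

    def worker():
        with pool.connection():
            with lock:
                stats["in_use"] += 1
                stats["peak"] = max(stats["peak"], stats["in_use"])
            time.sleep(0.001)  # stand-in for running a query
            with lock:
                stats["in_use"] -= 1
                stats["done"] += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["done"], stats["peak"]


if __name__ == "__main__":
    done, peak = run_demo()
    print(f"{done} workers finished; peak connections in use: {peak}")
```

All workers complete even though only a handful of connections ever exist; the cost is that callers wait when the pool is exhausted, which is the trade-off a proxy layer makes on the application's behalf.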

Starting point is 00:15:04 And we have no interest in running it ourselves, as well. Like, we'd rather avoid running our own infrastructure or running our own high-level services. If we can pay somebody to run a database for us, we will absolutely do that. Now, I'm going to stop you there because historically, that has been the entire rallying cry of cloud,

Starting point is 00:15:21 where, oh, great, you don't want to run servers yourself, to a point where some people have now gone so far around the bend that we view running servers in data centers as being a skill set the ancients possessed that has since been lost to modern humanity outside of three hyperscalers. No, but those folks have been the default go-to for a lot of things for a lot of years. Your answer was not to go and yell at the Aurora team to make it better. It was to look somewhere else.

Starting point is 00:15:50 You know, we did talk to the Aurora team about the problems, and certainly Amazon are going in the right direction with the likes of Aurora Limitless, which does have like native sharding. It is a Postgres setup, but they're thinking about it in the right way. And they do have things like RDS Proxy, which could do some of these proxy things. So they do have these building blocks. And some of the problems we could solve or maybe swap out with some Amazon managed services. But really, we were looking for something a bit bigger and better.

Starting point is 00:16:22 And where actually serving queries, actually serving customers, our customers, is the problem of the provider that we have. We don't just want to be getting a proxy service from one part of the company and limited insights or no ability to go in and help us out with bad queries or give us insights into what's going wrong. You know, we really need somebody who's like a partner who can go deeper into our problems and share our problems and not just be hands off with them, which, you know, Amazon, due to scale and due to the way they treat security and a bunch of other reasons, they don't act that way in their day-to-day operations. You can convince them eventually kind of to get into certain things, but certainly they don't have one small solution that fixes all the problems that we want, like connection pooling, sharding, fast failovers and everything. They're kind of just vending a bunch

Starting point is 00:17:18 of building blocks, and maybe it's just because they are two-pizza teams the whole way down. But PlanetScale are a good, healthy, up-and-coming company who we liked the look of. We liked the way that they were talking about providing managed Vitess

Starting point is 00:17:33 to companies like us. I think what we liked about PlanetScale was like they were clearly building for companies like us using a technology built for exactly customers like us. And, you know, they're kind of like a one-stop shop, white glove service. You just show up, send your queries at their database, and they'll do the rest, as opposed to you need to assemble a variety of building blocks and hope for the best.

Starting point is 00:18:01 The one challenge I see coming out of the PlanetScale folks, they have amazing talent there. Richard Crowley, I've known for years, is phenomenal. Sam Lambert is the CEO, and he is there. They have a bunch of terrific folk working there. But I find that the way that they're positioned and the stories they tell are aligned perfectly for folks like you. You are deep in the weeds. You know this stuff cold.

Starting point is 00:18:21 You have been running hyperscale systems for many years. Terrific. There are a lot more people that look like me, by which I mean dumb, out in the universe, than there are people like you. So making it a broader mass market appeal seems like it's not the story they're telling at the moment, which is kind of a shame, because based on the stories I've had with you and others, and the conversations around this,

Starting point is 00:18:43 they're solving a problem that meets an awful lot of people. It meets a lot of people's problems. I will also say that this reinforces a belief I've had for many years, which is as things move up the stack, the value and the margins increase by being able to do it. Amazon has got the low-level infrastructure stuff on lock. No one is going to build a better VM platform than they're doing. Their reliability is untouchable.

Starting point is 00:19:08 They have all kinds of great baseline foundational services. But every time they try to move up the stack into applications or things a little further up the chain, they fail miserably. They've never yet built a good user interface on anything Amazon has ever done. We all learned to use their website, not because it's good, but because we have to. And what we're seeing with things like PlanetScale is that now the rest of the industry is starting to erode some of those things and come further down faster than Amazon is able to go up the stack.

Starting point is 00:19:37 It's not just things like PlanetScale. We see it with Snowflake, Databricks, a whole bunch of other folks out there that are doing these things. People are using Confluent instead of running their own Kafka clusters or MSK. Those companies are eroding AWS. They're charging more in some cases, but delivering vastly superior value. And this tells me in the future, unless they're going to come out with something I can't foresee, Amazon is going to become the equivalent of the tier one backbone providers. They're going to be, like, if NTT goes down, the internet isn't working so well today,

Starting point is 00:20:15 and we're all having a bad time, but most people don't know what the hell that company is. Everything and all the value rides on top of them. And I think that's Amazon's future, given their course. Yeah, this isn't a pattern that we've only seen with the move from Aurora to PlanetScale. We had the exact same with a move from Redshift to Snowflake. And again, we worked closely with Amazon. We tried to resolve Redshift stability problems with them. They gave us a bunch of things to do.

Starting point is 00:20:41 But ultimately, when we moved to Snowflake, not only was the technology just that bit better, they were just able to ship for us, a bit more responsive on solving for our needs, that bit hungrier, where we didn't feel like we were just one out of a million customers for Redshift. With Snowflake, we got stuff turned around quickly. And the thing has been pretty awesome as well and just kind of left Amazon behind. I think once things are business critical for us and they're higher level applications, I think we're at a point now where we'd be considering taking it off Amazon rather than trying to fix it on Amazon. Something that could become important could be, say, if DDoS became really problematic for us.

Starting point is 00:21:29 At the moment, we just use WAF. You know, we use the Amazon stack. It's fine. It's not that big a deal for us. But if we really had to nail the DDoS problem, I'd probably go to Cloudflare and wouldn't stick around with Amazon for too long. Kind of on the understanding that I think Amazon probably do a reasonably good job

Starting point is 00:21:47 and will, you know, they'll take support tickets and whatnot. If you're going up against DDoS, you definitely need to talk to your provider. There's no real way around that. Yeah, and their team is excellent. But the customer touchpoints, they're not, really. I'm sorry, but they aren't. Yeah, and, you know, AWS support, it can be tough to get listened to at times.

Starting point is 00:22:07 Like, I've done on-call, a lot of on-call, and opened a lot of issues with AWS support. And even just knowing how to open a case, it's like, it's pretty difficult. Whereas if I'm opening a support case with some of our providers, like OpenAI or incident.io or Snowflake or PlanetScale, very often it's little more than a message in a Slack channel.

Starting point is 00:22:28 And all of their automation kicks in. You get routed to the right person very, very quickly, and they're able to tell us very quickly if it's our problem or their problem, as opposed to Amazon, where they're going to try and catch you out, asking you, like, which region your problem is in, and that can be frustrating at times. Again, it's a problem of scale, and I kind of get it, but the experience is way more tailored to our needs

Starting point is 00:22:49 from smaller, hungry companies in our experience. This episode is sponsored by my own company, The Duckbill Group. Having trouble with your AWS bill? Perhaps it's time to renegotiate a contract with them. Maybe you're just wondering how to predict what's going on in the wide world of AWS. Well, that's where The Duckbill Group comes in to help. Remember, you can't duck the Duckbill bill, which I am reliably informed by my business partner is absolutely not our motto. I think that you're right, and it's kind of sad.

Starting point is 00:23:22 It also, if I'm reading trends, it feels like Amazon is moving away. on the AWS side, at least, from product-led growth and speaking explicitly to large enterprises. And, okay, maybe it's the right answer for them. Lord knows, they have better strategic insight to their customers and their needs and their growth patterns than I do sitting in the cheap seats. But what attracted me to it

Starting point is 00:23:42 was the fact that I could get started with these things for pennies. And so much of what they're coming out with these days, like a prerequisite is enterprise support, which starts at $180,000 a year and ends nowhere. It never ends. It grows as an unbounded growth problem,

Starting point is 00:23:54 like an AWS bill itself. And that is, that rules out a lot of things that I'd want to kick the tires on unless I start taking hostages again. You know, the firehose of AWS updates and launches and stuff, I think the hit rate for me of where I see something that I'm actually going to try out or where I'm thinking, hey, they're nailing this. They're solving our problems. I think that's gone down over the years.

Starting point is 00:24:21 I'm sure they're crying into their money. I'm sure they've got some pretty good businesses out there. But for the kind of mid-range tech-first company, it seems like they're not the leaders that they used to be. Because I think with the likes of Aurora and Redshift, maybe they had like early mover advantage because they had obviously access to cloud services before the cloud existed. They were able to build like really great cloud-specific services on top of that. But I think they've been outpaced by hungrier competitors at this point. And, you know, it's good for us. We're able to take advantage of these. And so I'm kind of happy to do that.

Starting point is 00:24:58 But I think it's, I'm kind of quietly sad for Amazon as well. I, I am too. Everyone seems to think I have an axe to grind against AWS, but it comes from being close to them for so long. I don't hate the company. If I did what I did for a company, I hated, that's a pathology and I need a diagnosis and probably a restraining order. It's, I like what they do. I want them to be better than they are. I want their offerings to improve over time. I just, I don't see that's the direction it's going in the way that it once was. And it brings me no joy whatsoever to say that. I mean, one of the good things about Amazon is that they do want to hear this stuff.

Starting point is 00:25:33 Like, I've been saying this exact thing to many, many people in Amazon over the last while. My excellent account manager has been setting me up with various leaders. They've been asking for documents. They've been asking for examples. You know, they are hungry for this stuff. So I don't doubt that there's a desire to be the leaders or to really satisfy their customers. But, you know, it's execution we care about.

Starting point is 00:25:57 And when it comes down to it, we need excellent databases. We need the best databases to be able to ship world-class product to our customers. And I think that that's important. It's the need the customers have. And if their cloud provider won't give it to them, they will find ways to meet that need. It's what they do. A last topic that I want to get into.

Starting point is 00:26:15 It's been a recurring theme throughout the years on this show, which is where does the next generation come from? Because people like you and people like me who came up being, you know, support folk in the early days, back when this was all an open field and no one really knew how computers were supposed to work. Not that we do now, but we lie to ourselves. We gathered experience and came from those places to where we are now. That door has been firmly shut. That is not a path that is open, at least as I see it. Where do you come from? How did you get to the place that you are now? Yeah, so the fun part about my career,

Starting point is 00:26:48 I've had just so much luck and fortune and random timing things that have worked out reasonably well. I don't think my career has been too bad to date, but it all started in 1997 or so when I went to university. And we had what was then called a networking society, which was basically a bunch of students running a few Unix boxes. And this was kind of in the pre-social media age where we didn't have WhatsApp or even Facebook or anything to talk to each other. So the obvious thing that we did back then was we had a large proportion of the people in our university log on to a Unix shell on a bunch of servers run by students

Starting point is 00:27:33 and we all struggled to keep these things online. It was pretty tough running these kind of Solaris servers. And we had like instant messaging. We had these wrappers around write, if you're old enough, you might remember. And we had like really healthy newsgroups and IRC and stuff like that. So we had this like super, super awesome community of people who were interested in doing cool stuff with tech,

Starting point is 00:27:59 learning about Unix, learning about networking. And we had a lot of users and just like, we were the largest society on campus. We threw great parties. And it was pretty, pretty cool. But totally coincidentally, or like through fortune, there were also some people who've ended up being like really notable in the tech community since, the likes of, say, John Looney, Tanya Reilly, Colm MacCárthaigh, and the list goes on of people who kind of started off their careers in technology,

Starting point is 00:28:31 just tinkering around on these Unix boxes back in college. And my career largely up until maybe when I joined Intercom, it was all about getting doors opened by knowing people through that community, staying in touch with them, you know, doing things, whether it was our local Linux user group or different activities like that, but really having the good fortune of meeting a bunch of early, I guess, Unix tinkerers or sysadmins back in the day, who all then kind of grew into working in various places in the industry.

Starting point is 00:29:07 But where I went after tinkering around Unix in college was into Solaris technical support. Then that moved into like real sysadmin work, then later into like running, building out nationwide broadband networks in Ireland and connecting every school in the state and building out ISP services. And it was like a mix of classic sysadmin and a bunch of automation, increasingly automated as things got better in technology. Then for a while as well, I was in this small bookseller called Amazon prior to Intercom as well. But I guess I had like a bit of a classic, well, what I consider to be a classic move

Starting point is 00:29:47 kind of up the stack from sysadmin help desk to writing more software, maybe a bit of management leadership, and then ultimately into the kind of tech leadership area that I'm in at the moment. And like, where do people come from? I mean, like, we certainly don't have the pipeline of lots of people sitting around, like building ISPs or building hosting providers. I think these are all, like, really solved problems. You don't have that kind of tinkering or just hands-on work that you need to build and use these services. So that kind of gateway into running services, infrastructure services, networking

Starting point is 00:30:23 and all of that, isn't obviously there as much at the moment, I think. No, that's the problem. The cloud providers abstracted so much of that away that I know a lot of folks at hyperscale, born-in-the-cloud environments, like Intercom. I'll even ask you this. This is not necessarily, please, stab co-workers in the back, but looking at your technical team across the board, what is the depth of networking knowledge at Intercom? Oh, I think I am the networking team. And you're no slouch at it. I want to be very clear.

Starting point is 00:30:55 But I gave a keynote at NANOG about this last year, where this is a perennial problem. I was talking to folks at AWS about this, where a lot of your customers do not have a deep bench of networking knowledge. And they make the very reasonable response of, well, that's not true. We were talking to a customer this morning. And they were as good at this as we are easily. I'm like, great. Just out of curiosity, what sector was that customer in? Oh, they're a telco. Why? Hmm. Wonder if that has anything to do with it. Imagine that. But these born-in-the-cloud

Starting point is 00:31:25 startup companies don't do networking because you don't need to know networking until suddenly you very much need to know networking. But you can go an entire career weaseling your way between the cracks without having to pick it up. You know, some of our recent hires, we're fortunate enough to be close enough to a fairly large Amazon office, and we've hired a good few people from Amazon support. So maybe it's shifted up the stack. You know, it's no longer people who are building ISPs or hosting providers. It's people who work for larger providers in kind of entry-level tech roles or support-type roles. So there's something that maybe looks like the old pipelines.

Starting point is 00:32:00 It's not the exact same. Definitely different shape of people. They tend to be actually better at coding than I was back when I was at their level. But it does seem like with AI coming in as well, it seems like there's going to be a good bit of change to like where people, like what skills people use and grow and need in their careers. There's concerns at the moment that the use of AI in engineering

Starting point is 00:32:30 and to write code will like remove the need for junior engineers or like just will maximize or benefit largely senior engineers or people who can guide the agentic LLM coding tools rather than working your way up by working on small problems and building and shipping things. So I think in tech, you know, there's a lot of change. Certainly there's many entry paths like the one that I took, which I think are gone, with some kind of replacements.

Starting point is 00:33:04 But I do worry about like, especially in areas like networking and low-level Unix and stuff that we're not seeing the kind of depth of knowledge that we used to have and I don't think I'm just being bitter and old about that. I think it is pretty useful stuff to know.

Starting point is 00:33:17 Oh, it is. One thing I want to point out, because this is a recurring theme that I see a lot, where you mentioned a few extraordinary names of people who are terrific and in the space that have been formative influences.

Starting point is 00:33:26 Would it surprise you to know that when I've spoken to multiple of those people your name comes up in the same context? People don't realize that we all learn from each other. It's one of those things: oh yeah, those people are smart. I'm just an idiot sitting here. It's a common pattern,

Starting point is 00:33:40 and I think we internalize it pretty well. But it shows. There's one other aspect I want to get into about Intercom. I was going to mention earlier, but we got on this conversational path. It's one of those interesting things about it. I wound up focusing my skill set, which is not that dissimilar to yours, on AWS bills, because I wanted a specific expensive problem eight years ago when I was getting started down this path. Nine years now, my God. And what? But the reason I did it was I was down to this or IAM. Like, did I know a lot about IAM at the time? No, but I didn't know that much about AWS bills either.

Starting point is 00:34:13 And it turns out when you focus on things, you could pick up a lot. But the reason I went with bills is because there is never a 2 a.m. billing emergency. I've had enough horrifying on-call experiences in my career that I am effectively done with it. Companies across the board have on-call because they need this stuff to work in various ways, and you don't have every team have representatives at every hour around the clock in a follow-the-sun rotation. Intercom takes a unique approach to this, to my understanding. Tell me about it. Yeah, so this is one of the things that I'm proudest of at Intercom. And to be clear, it's not all my work. And arguably, I didn't initiate it, but I was a big influence on it.

Starting point is 00:34:55 And I've certainly spent a lot of time writing about it. And more importantly, talking about it in public and taking loads of credit for it. But we have an on-call system where we use volunteers rather than conscripts. And this means that we put people on-call out of office hours, not because they happen to be on a certain team or are on a rota or know something about maybe networks or systems or anything. We ask for people to volunteer to join this rotation. And so we generally have about six or seven people in this rotation. And we compensate them for their time on call.

Starting point is 00:35:34 So the way we do it is it's you're on call for a week. Not in office hours. The teams who own the alarms that are firing will get those alarms at that time. But outside of office hours, if you're on call in this volunteer team, you get the page for that. But of course, you can't just say these things like let's have a volunteer-based on-call and hope that it works out for the best. We have to put in place a bunch of things, both on the technical and social side of things, to make sure that this thing was sustainable,

Starting point is 00:36:04 that people would feel like the work was valued, and not just because of the compensation, but that the work was rewarding and you might actually learn something and maybe even enjoy doing on-call work. So we insisted on all teams writing runbooks for every alarm that can page somebody. Most importantly, we treat every single page

Starting point is 00:36:22 like a heart attack, kind of using Charity Majors' quote here. And so this means that, say, the next morning after a page goes off in the middle of the night or whatever, our teams take it seriously. In fact, they take it more seriously than if they had paged somebody in their own team out of bed. When you're paging somebody you don't know or is remote from you out of bed in the middle of the night because you set up a bad alarm or because your thing fell over, you feel a lot guiltier about that. Whenever I page anyone, I start the call with, I'm sorry to wake you, but, because just a little

Starting point is 00:36:55 politeness and courtesy can move mountains, but please continue. Without too much effort, we just got excellent buy-in from the teams who own these different areas of the product and could be building a lot and a lot of stuff can go wrong. But we were able to hold a high bar for pages being actually something that a human needs to do and then giving that person the tools to actually fix the problem. We have some technical reasons why this stuff is easier for us than it is compared to other companies, such as having a large Ruby on Rails monolith as opposed to every single team having their own bespoke tech stack. So that stuff helps us, but it's more the culture

Starting point is 00:37:33 and how we also reward and give shoutouts to people, you know, everybody from the CEO down at Christmas, whenever at any kind of time, we always make sure to not just pay the people the money for the time that they spend on call out of hours, but it's recognised socially. And also in things like promotions and things like that, it's something that's really valued in the organization.

Starting point is 00:38:00 So we've had this in place now for seven or eight years. It's hard to remember exactly how long. It's been sustainable. One of the biggest problems we've had is so many people want to join it, so people actually like it. And we've also built people, we've made people better operators. We've made people actually enjoy and learn and learn more about what happens in the company. And it's been actually a great long-term recruitment for my own kind of infrastructure

Starting point is 00:38:26 oriented teams where people get a taste of this kind of work. They might just be a product engineer from some random part of the business. But then when they see this work and they see, they actually see what's going on under the hood, they ask to join our team full time. There's other stuff we have to do as well. You have to have a way for the person who's on call to bring in an expert. So we have an incident commander program as well. And there's support there so that people don't feel like they're isolated on their own.

Starting point is 00:38:55 Out of the PagerDuty playbook. Yeah, and when it comes down to it as well, look, not everybody can solve every kind of problem. And we'll just go to the bat phone. We'll page in as many people as we need to solve a problem, which is even if you had 10 people on call, you might need to do that anyway. And so this has been great. I think having a single person on call for a business the size of Intercom, it can be challenging at times. But we've never been at the point of where we've decided or been at any risk of things falling apart or having to put multiple teams and lots of people on call,

Starting point is 00:39:28 keeping things down to one person on call, ruins fewer lives. We all get a better quality of life. And doing this sustainably gives us something that we can really feel like we're making a difference in our work and that the work isn't just feeding the robots. It's like high quality, it's good, we're learning. And we're setting each other up for success

Starting point is 00:39:49 and not just saying, not just tolerating low quality alarms and stuff like that. Yeah, that's the important part. If it wakes you up, you're empowered to fix it or turn off the alarm or adjust thresholds or something. It's the human element of it. It's the fact that this is a, you are compensated for doing it as a volunteer thing. It's not as part of your job responsibilities. Yes, I know you have a six-month-old who's having trouble sleeping. Get up anyway. None of that. It's a human approach to it. And that is something this industry has lacked historically. Yeah. And I've been

Starting point is 00:40:21 spreading the good word about this, trying to influence other places to improve things and not just accept the status quo. The interesting part has been having conversations with different people from different companies who are interested in doing this, but they have all sorts of other issues, like whether it's many, many tech stacks or different compliance approaches or just other socio-technical problems that can make it difficult. I think we're probably on easy mode at Intercom. We did design it for our culture and our technology stack.

Starting point is 00:40:53 Not everyone can do it so easily, but I would encourage everybody to like not accept, again, like you said, status quo around just because you're on a certain team, you need to carry a pager and be always on call. I think being on call a lot does reduce the quality of your life, even if you're not being paged. And so being deliberate about that, as well as recognizing the work, I think it's very important. And it just gives you a great story that shows you that you actually care about the people who work for the company as opposed to just being part of some machine that needs to satisfy the computers. Which is important.

Starting point is 00:41:27 There's a human piece of it. And that's the thing that gets lost. It's not just a technical problem. I want to thank you for taking the time to speak with me about all of this. If people want to learn more, where's the best place for them to find you? I'm kind of on X Twitter, but not really anymore. I mean, I'm not, so good luck. Some sort of on there.

Starting point is 00:41:51 I mean, you can type in Brian Scanlan and I repost work stuff, I guess. I'm on Bluesky, but not as much as I was on Twitter. Again, it's like you can type my name. Which is probably a healthy thing, but yeah, I hear you. Yeah, I don't know. I'm kind of sad about those things. I mean, I'm on LinkedIn, but who uses LinkedIn? Oh, God, I maintain that LinkedIn remains the world's largest porn site

Starting point is 00:42:13 because it's where business people go to pleasure themselves on the internet. That is the best description I've got of it, and I have no tolerance for it. So, I don't know. Maybe the best place to find me will be if you set up a Unix server, and we all just log on and use write to talk to each other. I figured you'd put up a personal website, just have an Intercom chat with me box in the corner that pops up, because, you know, it's not like you're doing anything else these days, right?

Starting point is 00:42:36 That works. You can find me on Intercom.com. I am the other person on the side. There we go. There we go. I will include links to all of this in the show notes. Brian, thank you so much for taking the time to speak with me. I appreciate it. That's been great. We should do it again. We should. Brian Scanlan, Senior Principal Engineer at Intercom. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice,

Starting point is 00:43:06 along with an angry, insulting comment. But that platform will not be one of Amazon's because that's way too far up the stack for them to do well.
