Screaming in the Cloud - From Aurora to PlanetScale: Intercom’s Database Evolution with Brian Scanlan
Episode Date: September 18, 2025
Brian Scanlan, Senior Principal Engineer at Intercom, the company building Fin.ai, joins Corey Quinn on Screaming in the Cloud to discuss Intercom’s move from AWS Aurora to PlanetScale’s ...managed Vitess after years of scaling challenges with their Ruby on Rails monolith. He explains how 13 Aurora clusters created operational pain and why PlanetScale’s white-glove, partnership-driven model won out over Amazon’s building-block approach.
The discussion also covers Intercom’s volunteer-based on-call system, their pivot to AI agents after ChatGPT’s launch, concerns about the shrinking pipeline of systems engineers, and how companies like PlanetScale and Snowflake are outpacing AWS by delivering superior user experiences.
About Brian: Brian is an engineer based in Intercom’s Dublin office. He fixes problems, builds things, and grows people.
Show Highlights
(01:34) The Digital Clippy Rant
(02:16) The Good Chatbot vs. Bad Chatbot
(03:51) The AI Chatbot Revolution
(04:33) Unexpected Consequences of Good Chatbots
(05:42) AI Support vs. Human Support
(05:59) The Alexa Problem and Feature Discoverability
(19:03) Amazon's Struggles Moving Up the Stack
(26:55) The Unix Networking Society Origins
(34:43) The Global On-Call Challenge
(42:09) LinkedIn: The World's Largest Porn Site
Links
Intercom: intercom.com
Brian on LinkedIn: https://www.linkedin.com/in/scanlanb/
Brian on Bluesky: https://bsky.app/profile/bscanlan.bsky.social
Sponsor
Wiz - Listen to Crying Out Cloud: wiz.io/crying-out-cloud
Transcript
Saying this exact thing to many, many people in Amazon over the last while.
My excellent account manager has been setting me up with various leaders.
They've been asking for documents.
They've been asking for examples.
You know, they are hungry for this stuff.
So I don't doubt that there's no desire to be the leaders or to really satisfy their customers.
But, you know, it's execution we care about.
And when it comes down to it, we need excellent databases.
We need the best databases to be able to ship
world-class product to our customers.
Welcome to Screaming in the Cloud.
I'm Corey Quinn, and I am here to correct an oversight
because I have known Brian Scanlan for many years,
and somehow he has slipped through the cracks
and not been on this show previously.
So let's go ahead and fix that.
Brian Scanlan is a senior principal engineer at Intercom,
where he has been for damn near 11 years at this point.
Brian, welcome to the show.
Thanks so much, Corey.
It's great to finally be here.
I know.
We save the best for the end.
Yeah, exactly, except we're not ending the show anytime soon,
much to various people's chagrin
because most people want me to shut the hell up a lot more than I do.
Not me.
I hope we keep going for a long time.
Crying Out Cloud is one of the few cloud security podcasts
that's actually fun to listen to.
Smart conversations, great guests, and zero fluff.
If you haven't heard of it, it's a cloud and AI security podcast from Wiz.
Run by CloudSec Pros for CloudSec Pros.
I was actually one of the first guests on the show, and it's been amazing to watch it grow.
Make sure to check them out at wiz.io/crying-out-cloud.
So Intercom, I have complicated feelings about the company.
Originally, I hated the thing because I'm on a website trying to get something done,
and it pops up like freaking digital Clippy.
Hey, do you want to talk to a person?
God, no, I'm a millennial, elder millennial, but still, I don't want to talk to people.
But that's come back around because it turns out the issue is not Intercom;
it was bad implementations thereof.
Maybe you don't need to pop up, talk to a human
on your landing page at the front of it.
But when you're dealing with a support issue
and you want to tag someone in
and suddenly you're talking to a human right there,
it's transformative.
So, yeah, it turns out that anything can be dumb
if you hold it wrong.
Yeah, I used to introduce Intercom to people to say,
yeah, we're one of those chatbots
that pop up in the right-hand bottom corner of your website,
except we're the good one.
I think there are a lot of bad implementations,
And, you know, people use it for outreach and for marketing and sales purposes, and, you know,
they want messages in your face.
But I think when it comes to basic customer support and actually putting humans and
increasingly humans in touch with AI bots that will answer your questions, I think Intercom's
definitely got really good properties for that in the marketplace.
It's especially strange in this era that we're in, where everyone is building AI chatbots.
And, okay, I have a whole
list of angry opinions on it, but where I really get annoyed is when they don't admit that
they're a chatbot up front. And it irks me because by the time I finally break down and admit
I have to talk to a human being, I have exhausted the documentation. It is not going to be
something simple like, have you tried jiggling the handle? Instead, it's, okay, there is now a weird
corner case, which I'm very good at blindly stumbling into. I need someone to go ahead and fix
a thing on the back end or let me know that what I want is simply not possible with the
platform or admit that your documentation is rubbish. And that's something that AI historically can't
do. But forcing me to go through that filter to hit some arbitrary target of
the fewest number of customer contacts, fastest resolution possible, and never let them talk to a
human, has been maddening. Where are you right now on that whole gen AI spectrum from bot to
human?
So we are very much all in on gen AI chatbots. So much so that, you know,
in October, November, 2022, when ChatGPT came out, we reorientated the entire company.
We saw that it was the future. We had actually been building ML chatbots and had them in the
marketplace for a good few years prior to the current explosion of ML chatbots. But the lift or the
improvement that we saw with GPT-3.5 when that came out, that was a, okay, we need to change
the entire company. And we've changed the entire company and moved strongly towards what we think
is like the best AI chatbot in the market. It needs to work well with humans and there's a
transition as well. Also, there's just a lot of work that's to be done to get the quality really
great so that people don't hate using it. One of the interesting things that we've seen is
many of our customers who have good knowledge bases
and who get high resolution rates from using the chatbots
is that they'll find that customers actually just start asking more questions
because if you can get really fast answers
and you don't have to go around documentation sites
and the chatbot's actually useful,
people get addicted to this.
They're going to use it all the time.
And weirdly enough, that has increased the number of conversations coming in
in some cases,
and we're seeing more and more of this,
where people will ask more questions because they're getting good, fast answers,
which is a little bit counterintuitive.
But I think we always need to have ways to fall back to humans.
Of course, depends on the business and the volume and what makes sense for them.
But I don't think humans are going away.
And from what we've seen in the market as well, even for places that have adopted
and are deflecting or answering large numbers of questions,
we see that they're not reducing their support team sizes at all.
They're putting their people on better questions,
higher quality work, or just deeper work with customers,
as well as, like, feeding the bot with better documentation, that kind of thing.
So we're seeing it as, like, actually a net positive into the customer experience.
But there's definitely lots of bad implementations out there as well.
It feels like AI-assisted support is, in many cases, better than pure AI support.
And this is somewhat controversial among people who want to sell bots.
But I find that chatbots are not necessarily a great interface just due to complete lack of discoverability.
It's the Alexa problem or the Siri problem or whatever robot assistant you want to do.
You ask it a question.
Today, I was getting out of the shower and I asked one of the bots or out of, I tried it with both Siri and Alexa.
Neither one could do it.
What is the partial pressure of oxygen at 10,000 feet above sea level?
And they both drop the thing completely because I have weird shower thoughts.
Roll with it.
But it's the question then is, okay, it can't do it.
Of course it couldn't.
I will never ask that question again, despite the fact that maybe it does know how to do that.
it just didn't hear me properly.
Or in two weeks, it will be able to answer that question.
But when you ask a question and it can't answer it,
you kind of feel dumb for having thought for a second
that it might have been able to.
It's the problem that the Alexa group has had forever,
which is that humans use something like 98%
of all the features they will ever use
on their Alexa device within 90 minutes
of setting it up the first time.
Play a song, set a timer, turn on lights,
and that's about it for most people.
It's just because it gains a bunch of these
features. How do you tell people about that? Turns out that finishing every sentence with,
by the way, pitching something unrelated, just pisses people off. I think there will be a change
as the products get better and as knowledge bases get better, as customers, our customers and
customers of other chatbots know how to work with them better, that the expectations of us as
consumers, of users of these things will improve or change over time and just not assume that
these things are as brain dead as when we started interacting with them first time round.
So I'm curious as far as what you've been up to from a technical perspective lately.
We've known each other for many years.
You have been to my house for dinner, you're friends with my brother, which I think most listeners
will be shocked to realize I have one of those.
It's true.
I do.
He lives in Belgium.
Great.
But what I found that was so interesting and got me talking to you is I was recently talking
to some of the fine folks over at PlanetScale,
and they have talked about Intercom in general, and you in particular, as being very pleased with their database offering.
Now, I have talked to people in the past where I asked them questions about that.
The response has been a, wait, a company saying, what about me now?
So, right, you and I go back long enough that I can trust you not to bullshit me on these things.
So, okay, is it as good as they're telling me it is, and your response distilled down to, if not better?
You are a big champion of what PlanetScale is up to.
Tell me more, please, because I do a lot of things here.
Scale is generally not one of them for my own personal shit-posting projects.
Yeah, short answer is yes.
PlanetScale is great, but I'm going to give you a long answer as well.
So Intercom is a Ruby on Rails monolith, and we really, really like this setup.
We deploy our Ruby on Rails app onto EC2 computers, not stuffed away in Docker containers
and using incomprehensible three-letter acronyms for different parts of the setup.
And we...
Oh, you can tear down and rebuild your stack
on top of the latest trendy thing every 18 months?
Wow, almost like you're not based in Silicon Valley.
Yeah, maybe being in Dublin has its advantages.
So, yeah, we run really boring infrastructure
and we have stuck with Ruby on Rails.
And it's largely been great for us.
You know, you have to do a lot of work to scale it out
to millions of lines of code,
hundreds of developers working in it.
But at some stage, you have to connect the thing to a database.
And database scaling has been a large part
of the problems, but also the joy of scaling Intercom
in the 11 years that I've been there.
When Intercom started off, we had a very simple,
nice MySQL database.
Then, unfortunately, we hit product market fit,
and having explosive growth was very challenging.
And we were based in RDS,
we were cloud first from the very start,
but native RDS couldn't deal, or deal easily,
with what we were doing with it.
When Aurora came along, it honestly was a game changer for us.
And we jumped in very aggressively.
We got to work with the Aurora team.
We had some of the biggest tables on Aurora at the time.
They would do all sorts of custom work for us.
It was good fun.
And just the Aurora architecture itself,
the split between compute and storage
and how low latency the read replicas were and everything.
This stuff bought us years and years and years of scalability.
Up to a certain point,
and then we started having to do things like sharding some of our data.
We had tables that were so large
we couldn't mutate them.
We couldn't add new columns or do kind of other database migrations on this data
because it was changing so fast without taking a lot of downtime.
So we had to take action.
And so we built our own kind of database sharding system.
Again, built on top of Aurora, choosing technologies that we were very comfortable with.
And this bought us, again, like years of scalability.
It bought us the ability to be able to do database migrations on our tables.
And life was good for another few years.
But over time, these like different sharding patterns and the way we were kind of using Aurora
meant that we had 13 clusters connected to one application.
And you get into these unfortunate situations where AWS will say, hey, we've got a patch out
and you really need to apply it to all of your clusters.
Yes, your downtime will be at some point during this broad window that is inconvenient
for you and non-deterministic.
So you've got to be able to build a graceful degradation
mode into your app from the get-go if you're using this technology, because one of the things
you lose with a managed database offering is the ability to be very granular around when and what
gets applied where.
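Editor's note, not part of the conversation: the graceful degradation idea mentioned above might look something like the sketch below. It is a minimal illustration only. The `ClusterUnavailable` exception, the `with_degradation` helper, and the two stand-in functions are all hypothetical names invented for this example, and nothing here describes how Intercom's application actually behaves during maintenance.

```python
class ClusterUnavailable(Exception):
    """Hypothetical error for a cluster that is mid-maintenance or failing over."""


def with_degradation(primary, fallback):
    """Run primary(); if its cluster is unavailable, serve fallback() instead.

    Returns (result, degraded) so the caller can show reduced functionality
    instead of failing the whole request.
    """
    try:
        return primary(), False
    except ClusterUnavailable:
        return fallback(), True


def load_conversations():
    # Stand-in for a query against a cluster that is currently being patched.
    raise ClusterUnavailable("cluster is restarting")


def cached_conversations():
    # Stand-in for stale-but-usable data, for example from a cache.
    return ["(cached) conversation 1"]


result, degraded = with_degradation(load_conversations, cached_conversations)
```

With many clusters behind one application, a wrapper like this would need to be applied per cluster so that one restart degrades only the features that depend on it.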
Yeah, and even some of the upgrades that we would do, most of the time, the cluster will come
back in a minute or two, you know, not bad, but occasionally we'd have, you know, something
would get stuck inside of a queue inside of Amazon or something, and it would be 20 minutes,
and this kind of stuff starts to wear you down,
especially when you've got so many clusters connected to your app.
And you are highly critical as far as customer-facing stuff.
You're on the front of everyone's website.
This is one of those areas where mistakes will show.
Absolutely.
And, you know, we have loads of customers who have up to thousands of people
whose job it is to be using Intercom all day,
replying to their customers.
And it's definitely no fun
when they have their entire teams unavailable,
not able to do their job
because we're twiddling our thumbs
waiting for an Aurora upgrade to complete.
Well, just do it outside of core business hours.
I'm in San Francisco.
Lots of people here use it.
You're in Dublin.
Lots of people there use it.
You have customers in Australia, Japan, India,
around the world.
It's core business hours somewhere
for someone at any given point of the day.
There is no, and now it is nighttime.
The servers can take a nap now.
Absolutely.
This is not the DMV
in the U.S. or the Social Security website in the United States, which still blows my mind.
It has a six-hour maintenance window every night. Like, the last person out of the office
turns off the mainframe or something. Probably some legacy batch job or whatnot, and there's
good reason for it. But it feels like the servers keep bankers' hours.
So we were aware of Vitess and PlanetScale. So Vitess, like the brief introduction to Vitess is it's a
MySQL wrapper as such, or system,
that came out of YouTube about 12, 13 years ago.
It's an open source project.
It's got other large SaaS, B2B SaaS providers,
people like Slack, HubSpot.
The Slack folks have been a huge advocate of this.
It makes sense, since they've given talks about this,
where effectively all of Slack is basically a giant MySQL database.
Sharded heavily, obviously, but yeah, it is every message is a line in a database.
So, yeah, having the database work and not
take a nap for a 20-minute upgrade at random times,
it's kind of high on their list of it must do this.
Yeah, and it's, you know, it's not just downtime.
It's we need to be able to do things like have, we have to shard a lot of data.
And our customers' data is extremely shardable.
We have, we're a multi-tenant application.
We have lots and lots of our customers' data.
And they don't need to join across their, like, different workspaces or different
customers.
So we have very, very shardable data.
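Editor's note, not part of the conversation: because tenants never need to join across workspaces, each workspace's rows can live entirely on one shard. The sketch below shows one simple way to express that, assuming a hypothetical integer `workspace_id` and a fixed shard count picked only for illustration. It is not Intercom's sharding code, and Vitess uses its own routing mechanism (vindexes) rather than anything written in application code like this.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count, chosen only for illustration


def shard_for_workspace(workspace_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a tenant (workspace) to a shard using a stable hash of its ID."""
    digest = hashlib.sha256(str(workspace_id).encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards


def group_by_shard(workspace_ids):
    """Group workspace IDs by the shard that would hold all of their rows."""
    groups = {}
    for wid in workspace_ids:
        groups.setdefault(shard_for_workspace(wid), []).append(wid)
    return groups
```

Every query scoped to a single workspace then touches exactly one shard, which is what makes this kind of multi-tenant data straightforward to split.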
The other thing we struggle with is
connection pool management.
We have hundreds of thousands of Ruby on Rails processes that all need to connect
to databases that can only take 16,000 connections at most.
And so we have to run a layer of ProxySQL connection proxies in between our application
and our Aurora database.
Sometimes the ProxySQL layer goes wrong.
And it's just another layer of complication that we don't want to think of.
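Editor's note, not part of the conversation: the arithmetic behind the proxy layer can be sketched as follows. The 16,000 connection ceiling comes from the discussion; the 200,000 process count, the proxy count, and the headroom percentage are assumptions made up for this example, and both function names are hypothetical.

```python
def required_multiplexing(app_processes: int, max_db_connections: int) -> float:
    """Client connections that must share each database connection,
    assuming every application process holds one client connection."""
    return app_processes / max_db_connections


def backend_budget_per_proxy(max_db_connections: int, proxies: int,
                             headroom_percent: int = 20) -> int:
    """Database connections each proxy may open, keeping a percentage
    of the database's limit unused as headroom."""
    usable = max_db_connections * (100 - headroom_percent) // 100
    return usable // proxies


ratio = required_multiplexing(200_000, 16_000)        # assumed 200,000 processes
per_proxy = backend_budget_per_proxy(16_000, 10, 20)  # assumed 10 proxies, 20% headroom
```

Under these assumed numbers, each database connection has to be shared by 12.5 application processes on average, which is why a pooling layer of some kind sits between the application and the database.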
So we were aware of Vitess, and it became increasingly clear that PlanetScale
were the way to get Vitess.
And we have no interest in running it ourselves, as well.
Like, we'd rather avoid running our own infrastructure
or running our own high-level services.
If we can pay somebody to run a database for us,
we will absolutely do that.
Now, I'm going to stop you there
because historically, that has been
the entire rallying cry of cloud,
where, oh, great, you don't want to run servers yourself
to a point where some people have now gone so far around the bend
that we view running servers in data centers
as being a skill set the ancients possessed,
that has since been lost to modern humanity outside of three hyperscalers.
No, but those folks have been the default go-to for a lot of things for a lot of years.
Your answer was not to go and yell at the Aurora team to make it better.
It was to look somewhere else.
You know, we did talk to the Aurora team about the problems,
and certainly Amazon are going in the right direction with the likes of Aurora Limitless,
which does have like native sharding.
It is a Postgres setup, but they're thinking about it in the right way.
And they do have things like RDS proxy, which could do some of these proxy things.
So they do have these building blocks.
And some of the problems we could solve or maybe swap out with some Amazon managed services.
But really, we were looking for something a bit bigger and better.
And where actually serving queries, actually serving customers, our customers,
is the problem of the provider that we have.
We don't just want to be getting a proxy service from one part of the company and limited insights or no ability to go in and help us out with bad queries or give us insights into what's going wrong.
You know, we really need somebody who's like a partner who can go deeper into our problems and share our problems and not just be hands off with them, which, you know, Amazon, due to scale and due to the way they treat security and a bunch of other reasons, they don't act that way in their day-to-day operations.
You can convince them eventually kind of to get into certain things, but certainly they don't have one small solution that
fixes all the problems that we want fixed,
like connection pooling, sharding, fast failovers
and everything. They're kind of just vending a bunch
of building blocks, and maybe it's just
because they're two-pizza teams the whole way down.
But PlanetScale are
a good,
healthy, up and coming company
who we like the look of. We liked
the way that they were talking about
providing managed Vitess
to companies like us.
I think what we liked about PlanetScale was like
they were clearly building for companies like us
using a technology built for exactly customers like us.
And, you know, the kind of way, like they're kind of like a one-stop shop,
white glove service, you just show up, send your queries at their database,
and they'll do the rest, as opposed to you need to assemble a variety of building blocks
and hope for the best.
The one challenge I see coming out of the PlanetScale folks, they have amazing talent there.
Richard Crowley, I've known for years, is phenomenal.
Sam Lambert is the CEO, and he is there.
They have a bunch of terrific folks working there.
But I find that the way that they're positioned, or the stories they tell, are aligned perfectly
for folks like you.
You are deep in the weeds.
You know this stuff cold.
You have been running hyperscale systems for many years.
Terrific.
There are a lot more people that look like me, by which I mean dumb out in the universe,
than there are people like you.
So making it a broader mass market appeal seems like it's not the story they're telling
at the moment, which is kind of a shame.
because based on the stories I've had with you and others,
and the conversations around this,
they're solving a problem that meets an awful lot of people.
It meets a lot of people's problems.
I will also say that this reinforces a belief I've had for many years,
which is as things move up the stack,
the value and the margins increase by being able to do it.
Amazon has got the low-level infrastructure stuff on lock.
No one is going to build a better VM platform than they're doing.
Their reliability is untouchable.
they have all kinds of great baseline foundational services.
But every time they try to move up the stack into applications or things a little further up the chain,
they fail miserably.
They've never yet built a good user interface on anything Amazon has ever done.
We all learned to use their website, not because it's good, because we have to.
And what we're seeing with things like PlanetScale now is
the rest of the industry is starting to erode some of those things and come further down
faster than Amazon is able to go up the stack.
It's not just things like PlanetScale.
We see it with Snowflake, Databricks, a whole bunch of other folks out there that are doing these things.
People are using Confluent instead of running their own Kafka clusters or MSK.
It's those companies are eroding AWS.
They're charging more in some cases, but delivering vastly superior value.
And this tells me in the future, unless they're going to come out with something I can't foresee,
Amazon is going to become the equivalent of the layer one backbone providers.
They're going to be, like, if NTT goes down, the internet isn't working so well today,
and we're all having a bad time, but most people don't know what the hell that company is.
Everything and all the value rides on top of them.
And I think that's Amazon's future, given their course.
Yeah, this isn't a pattern that we've only seen with the move from Aurora to PlanetScale.
We had the exact same with a move from Redshift to Snowflake.
And again, we worked closely with Amazon.
We tried to resolve Redshift stability problems with them.
They gave us a bunch of things to do.
But ultimately, when we moved to Snowflake, not only was the technology just that bit better,
they were just able to ship for us, a bit more responsive on solving for our needs,
that bit hungrier, where we didn't feel like we were just one out of a million customers for Redshift.
With Snowflake, we got stuff turned around quickly.
And the thing has been pretty awesome as well and just kind of left Amazon behind.
I think once things are business critical for us and they're higher level applications,
I think we're at a point now where we'd be considering taking it off Amazon rather than trying to fix it on Amazon.
Something that could become important could be, say, if DDoS became really problematic for us.
At the moment, we just use WAF.
You know, we use the Amazon stack.
It's fine.
It's not that big a deal for us.
But if we really had to nail the DDoS problem,
I'd probably go to Cloudflare
and wouldn't stick around with Amazon for too long.
Kind of on the understanding that I think Amazon probably do a reasonably good job
and will, you know, they'll take support tickets and whatnot.
If you're undergoing a DDoS, you definitely need to talk to your provider.
There's no real way around that.
Yeah, and their team is excellent.
But the customer touchpoints, they're not really.
I'm sorry, but they aren't.
Yeah, and, you know, AWS support,
it can be tough to get listened to at times.
Like, I've done on-call, a lot of on-call
and opened a lot of issues with AWS support.
And even just knowing how to open a case,
it's like it's pretty difficult.
Whereas if I'm opening a support case
with some of our providers,
like OpenAI or incident.io or Snowflake or PlanetScale,
very often it's little more than a message in a Slack channel.
And all of their automation kicks in,
you get routed to the right person very, very quickly,
and they're able to tell us very quickly if it's our problem or their problem,
as opposed to Amazon, where they're going to try and catch you out,
asking you, like, which region your problem is in,
and that can be frustrating at times.
Again, it's a problem of scale, and I kind of get it,
but the experience is way more tailored to our needs
from smaller, hungry companies in our experience.
This episode is sponsored by my own company, The Duckbill Group.
Having trouble with your AWS bill? Perhaps it's time to renegotiate a contract with them.
Maybe you're just wondering how to predict what's going on in the wide world of AWS.
Well, that's where The Duckbill Group comes in to help.
Remember, you can't duck the Duckbill bill, which I am reliably informed by my business partner
is absolutely not our motto.
I think that you're right, and it's kind of sad.
It also, if I'm reading trends, it feels like Amazon is moving away,
on the AWS side, at least, from product-led growth
and toward speaking explicitly to large enterprises.
And, okay, maybe it's the right answer for them.
Lord knows, they have better strategic insight
to their customers and their needs
and their growth patterns than I do sitting in the cheap seats.
But what attracted me to it
was the fact that I could get started
with these things for pennies.
And so much of what they're coming out with these days,
like a prerequisite is enterprise support,
which starts at $180,000 a year
and ends nowhere.
It never ends.
It grows as an unbounded growth problem,
like an AWS bill itself.
And that is, that rules out a lot of things that I'd want to kick the tires on unless I start
taking hostages again.
You know, the firehose of AWS updates and launches and stuff, I think the hit rate for me
of where I see something that I'm actually going to try out or where I'm thinking, hey,
they're nailing this.
They're solving our problems.
I think that's gone down over the years.
I'm sure they're crying into their money.
I'm sure they've got some pretty good
businesses out there. But for the kind of mid-range tech-first company, it seems like they're
not the leaders that they used to be. Because I think with the likes of Aurora and Redshift,
maybe they had like early mover advantage because they had obviously access to cloud services
before the cloud existed. They were able to build like really great cloud-specific services on top
of that. But I think they've been outpaced by hungrier competitors at this point. And, you know,
it's good for us. We're able to take advantage of these. And so I'm kind of happy to do that.
But I think it's, I'm kind of quietly sad for Amazon as well.
I am too.
Everyone seems to think I have an axe to grind against AWS, but it comes from being close to
them for so long. I don't hate the company. If I did what I did for a company I hated,
that's a pathology and I need a diagnosis and probably a restraining order. It's, I like what they
do. I want them to be better than they are. I want their offerings to improve over time. I just,
I don't see that's the direction it's going in the way that it once was.
And it brings me no joy whatsoever to say that.
I mean, one of the good things about Amazon is that they do want to hear this stuff.
Like saying this exact thing to many, many people in Amazon over the last while,
my excellent account manager has been setting me up with various leaders.
They've been asking for documents.
They've been asking for examples.
You know, they are hungry for this stuff.
So I don't doubt that there's no desire to be the leaders
or to really satisfy their customers.
But, you know, it's execution we care about.
And when it comes down to it, we need excellent databases.
We need the best databases to be able to ship world-class product to our customers.
And I think that that's important.
It's the need the customers have.
And if their cloud provider won't give it to them,
they will find ways to meet that need.
It's what they do.
A last topic that I want to get into.
It's been a recurring theme throughout the years on this show,
which is where does the next generation come from?
Because people like you and people like me who came up
being, you know, support folk in the early days, back when this was all an open field and
no one really knew how computers were supposed to work. Not that we do now, but we lie to
ourselves. We gathered experience and came from those places to where we are now. That
door has been firmly shut. That is not a path that is open, at least for me. Where do you
come from? How did you get to the place that you are now? Yeah, so the fun part about my career,
I've had just so much luck and fortune and random timing things that have worked out reasonably well.
I don't think my career has been too bad to date, but it all started in 1997 or so when I went to university.
And we had a what was then called a networking society, which was basically a bunch of students running a few Unix boxes.
And this was kind of in the pre-social media age where we didn't have WhatsApp or
even Facebook or anything to talk to each other.
So the obvious thing that we did back then
was we had a large proportion of the people in our university
log on to a Unix shell on a bunch of servers run by students,
and we all struggled to keep these things online.
It was pretty tough running these kind of Solaris servers.
And we had like instant messaging.
We had these wrappers around write,
if you're old enough, you might remember.
And we had like really healthy news groups and IRC and stuff like that.
So we had this like super, super awesome community of people who were,
partially of people who are interested in doing cool stuff with tech,
learning about Unix, learning about networking.
And we had a lot of users and just like we were the largest society on campus.
We threw great parties.
And it was pretty, pretty cool.
But totally coincidentally, or like through fortune,
there were also some people who've ended up being like really
notable in the tech community since, the likes of, say, John Looney, Tanya Reilly, Colm MacCárthaigh,
and the list goes on of people who kind of started off their careers in technology,
just tinkering around on these Unix boxes back in college.
And my career largely up until maybe when I joined Intercom,
it was all about getting doors open by knowing people through that community,
staying in touch with them, you know,
doing things, whether it was our local Linux user group or different activities like that,
but really having good fortune from meeting a bunch of early, I guess,
Unix tinkerers or sysadmins back in the day,
who all then kind of grew into working in various places in the industry.
But where I went after tinkering around Unix in college was into Solaris technical support.
Then that moved into like real sysadmin work,
then later into like running, building out nationwide broadband networks in Ireland
and connecting every school in the state and building out ISP services.
And it was like a mix of classic sysadmin and a bunch of automation,
increasingly automated as things got better in technology.
Then for a while as well, I was in this small bookseller called Amazon prior to Intercom as well.
But I guess I had like a bit of a classic, well, what I consider to be a classic move
kind of up the stack from sysadmin help desk to writing more software, maybe a bit of management
leadership, and then ultimately into the kind of tech leadership area that I'm in at the moment.
And like, where do people come from?
I mean, like, we certainly don't have the pipeline of lots of people sitting around, like
building ISPs or building hosting providers.
I think these are all, like, really solved problems.
You don't have that kind of tinkering or just hands-on work that you need to build and use
these services. So that kind of gateway into running services, infrastructure services, networking
and all of that, isn't obviously there as much at the moment, I think.
No, that's the problem.
The cloud providers abstracted so much of that away that I know a lot of folks at hyperscale
born in the cloud environments, like Intercom. I'll even ask you this. This is not necessarily,
please, stab co-workers in the back, but looking at your technical team across the board,
what is the depth of networking knowledge at Intercom?
Oh, I think I am the networking team.
And you're no slouch at it.
I want to be very clear.
But I gave a keynote at NANOG about this last year, where this is a perennial problem.
I was talking to folks at AWS about this, where a lot of your customers do not have a
deep bench of networking knowledge.
And they make the very reasonable response of, well, that's not true.
We were talking to a customer this morning.
And they were as good at this as we are easily.
I'm like, great. Just out of curiosity, what sector was that customer in? Oh, they're a telco. Why?
Hmm. Wonder if that has anything to do with it. Imagine that. But these born-in-the-cloud
startup companies don't do networking because you don't need to know networking until suddenly you very much need to know networking.
But you can go an entire career weaseling your way between the cracks without having to pick it up.
You know, with some of our recent hires, we're fortunate enough to be close enough to a fairly large Amazon office,
and we've hired a good few people from Amazon support.
So maybe it's shifted up the stack.
You know, it's no longer people who are building ISPs or hosting providers.
It's people who work for larger providers in kind of entry-level tech roles or support-type roles.
So there's something that maybe looks like the old pipelines.
It's not the exact same.
Definitely different shape of people.
They tend to be actually better at coding than I was back when I was at their level.
But it does seem like with AI coming in as well,
it seems like there's going to be a good bit of change
to like where people, like what skills people use and grow
and need in their careers.
There's concerns at the moment that the use of AI in engineering
and to write code will like remove the need for junior engineers
or like just will maximize or benefit largely senior engineers
or people who can guide the agentic LLM coding tools
rather than working your way up
by working on small problems and building and shipping things.
So I think in tech, you know, there's a lot of change.
Certainly there's many entry paths like the one that I took,
which I think are gone, with some kind of replacements.
But I do worry about like, especially in areas like networking
and low-level Unix and stuff
that we're not seeing
the kind of depth
of knowledge that we used to have
and I don't think I'm just being bitter
and old about that.
I think it is pretty useful stuff to know.
Oh, it is.
One thing I want to point out,
because this is a recurring theme
that I see a lot,
where you mentioned a few extraordinary names
of people who are terrific
and in the space
that have been formative influences.
Would it surprise you to know
that when I've spoken
to multiple of those people
your name comes up in the same context?
People don't realize
that we all learn from each other.
It's one of those things
Oh yeah, those people are smart. I'm just an idiot sitting here. It's a common pattern,
and I think we internalize it pretty well. But it shows. There's one other aspect I want to get
into about Intercom. I was going to mention it earlier, but we got on this conversational path. It's one of
those interesting things about it. I wound up focusing my skill set, which is not that dissimilar
to yours, on AWS bills, because I wanted a specific expensive problem eight years ago when I was
getting started down this path. Nine years now, my God. And what?
But the reason I did it was it was down to this or IAM.
Like, did I know a lot about IAM at the time?
No, but I didn't know that much about AWS bills either.
And it turns out when you focus on things, you could pick up a lot.
But the reason I went with bills is because there is never a 2 a.m. billing emergency.
I've had enough horrifying on-call experiences in my career that I am effectively done with it.
Companies across the board have on-call because they need this stuff to work in various ways,
and you don't have every team have representatives at every hour around the clock in a follow-the-sun rotation.
Intercom takes a unique approach to this, to my understanding. Tell me about it.
Yeah, so this is one of the things that I'm most proud of at Intercom.
And to be clear, it's not all my work. And arguably, I didn't initiate it, but I was a big influence on it.
And I've certainly spent a lot of time writing about it.
And more importantly, talking about it in public and taking loads of credit for it.
But we have an on-call system where we use volunteers rather than conscripts.
And this means that we put people on-call out of office hours,
not because they happen to be on a certain team or are on a rota or know something about maybe networks or systems or anything.
We ask for people to volunteer to join this rotation.
And so we generally have about six or seven people in this rotation.
And we compensate them for their time on call.
So the way we do it is you're on call for a week.
Not in office hours.
The teams who own the alarms that are firing will get those alarms at that time.
But outside of office hours, if you're on call in this volunteer team, you get the page for that.
But of course, you can't just say these things, like let's have a volunteer-based on-call,
and hope that it works out for the best.
We have to put in place a bunch of things, both on the technical and social
side of things to make sure that this thing was sustainable,
that people would feel like the work was valued,
and not just because of the compensation,
but that the work was rewarding
and you might actually learn something
and maybe even enjoy doing on-call work.
So we insisted on all teams writing runbooks
for every alarm that can page somebody.
Most importantly, we treat every single page
like a heart attack, kind of using Charity Majors' quote here.
And so this means that, say,
the next morning after a page
goes off in the middle of the night or whatever, our teams take it seriously. In fact, they take
it more seriously than if they had paged somebody in their own team out of bed. When you're
paging somebody you don't know or who is remote from you out of bed in the middle of the night because
you set up a bad alarm or because your thing fell over, you feel a lot guiltier about that.
Whenever I page anyone, I start the call with, I'm sorry to wake you, but, because just a little
politeness and courtesy can move mountains. But please continue.
Without too much effort, we just got excellent buy-in from the teams who own these different areas of the product
and could be building a lot and a lot of stuff can go wrong. But we were able to hold a high bar
for pages being actually something that a human needs to do and then giving that person the tools
to actually fix the problem. We have some technical reasons why this stuff is easier for us
than it is compared to other companies, such as having a large Ruby on Rails monolith as opposed
to every single team having their own bespoke tech stack.
So that stuff helps us, but it's more the culture
and how we also reward and give shoutouts to people,
you know, everybody from the CEO down at Christmas,
whenever at any kind of time,
we always make sure to not just pay the people
the money for the time that they spend on call out of hours,
but it's recognised socially.
And also in things like promotions and things like that,
It's something that's really valued in the organization.
So we've had this in place now for seven or eight years.
It's hard to remember exactly how long.
It's been sustainable.
One of the biggest problems we've had is so many people want to join it,
so people actually like it.
And we've also built people, we've made people better operators.
We've made people actually enjoy and learn and learn more about what happens in the company.
And it's been actually a great long-term recruitment for my own kind of
infrastructure-oriented teams where people get a taste of this kind of work.
They might just be a product engineer from some random part of the business.
But then when they see this work and they see, they actually see what's going on under the hood,
they ask to join our team full time.
There's other stuff we have to do as well.
You have to have a way for the person who's on call to bring in an expert.
So we have an incident commander program as well.
And there's support there so that people don't feel like they're isolated on their own.
Out of the PagerDuty playbook.
Yeah, and when it comes down to it as well, look, not everybody can solve every kind of problem.
And we'll just go to the bat phone.
We'll page in as many people as we need to solve a problem, which is even if you had 10 people on call, you might need to do that anyway.
And so this has been great.
I think having a single person on call for a business the size of Intercom, it can be challenging at times.
But we've never been at the point where we've decided or been at any risk of things falling apart or having to put more,
multiple teams and lots of people on call.
Keeping things down to one person on call
ruins fewer lives.
We all get a better quality of life.
And doing this sustainably gives us something that we can really feel like
we're making a difference in our work
and that the work isn't just feeding the robots.
It's like high quality, it's good, we're learning.
And we're setting each other up for success
and not just saying, not just tolerating low quality alarms
and stuff like that.
Yeah, that's the important
part. If it wakes you up, you're empowered to fix it or turn off the alarm or adjust thresholds
or something. It's the human element of it. It's the fact that this is a, you are compensated
for doing it as a volunteer thing. It's not as part of your job responsibilities. Yes, I know
you have a six-month-old who's having trouble sleeping. Get up anyway. None of that. It's a human
approach to it. And that is something this industry has lacked historically.
Yeah. And I've been spreading the good word about this, trying to influence
other places to improve things and not just accept the status quo.
The interesting part has been having conversations with different people from different
companies who are interested in doing this, but they have all sorts of other issues,
like whether it's many, many tech stacks or different compliance approaches or just other
socio-technical problems that can make it difficult.
I think we're probably on easy mode at Intercom.
We did design it for our culture and our technology stack.
Not everyone can do it so easily, but I would encourage
everybody to like not accept, again, like you said, the status quo around just because you're on a
certain team, you need to carry a pager and be always on call. I think being on call a lot
does reduce the quality of your life, even if you're not being paged. And so being deliberate
about that, as well as recognizing the work, I think it's very important. And it just gives you
a great story that shows you that you actually care about the people who work for the company
as opposed to just being part of some machine that needs to satisfy the computers.
Which is important.
There's a human piece of it.
And that's the thing that gets lost.
It's not just a technical problem.
I want to thank you for taking the time to speak with me about all of this.
If people want to learn more, where's the best place for them to find you?
I'm kind of on X Twitter, but not really anymore.
I mean, I'm not, so good luck.
Some sort of on there.
I mean, you can type in Brian Scanlan and I repost work stuff, I guess.
I'm on Bluesky, but not as much as I was on Twitter.
Again, it's like you can type my name.
Which is probably a healthy thing, but yeah, I hear you.
Yeah, I don't know.
I'm kind of sad about those things.
I mean, I'm on LinkedIn, but who uses LinkedIn?
Oh, God, I maintain that LinkedIn remains the world's largest porn site
because it's where business people go to pleasure themselves on the internet.
That is the best description I've got of it, and I have no tolerance for it.
So, I don't know.
Maybe the best place to find me will be if you set up a Unix server,
and we all just log on and use write to talk to each other.
I figured to put up a personal website,
just have an Intercom chat with me box in the corner that pops up,
because, you know, it's not like you're doing anything else these days, right?
That works. You can find me on Intercom.com. I am the other person on the side.
There we go. There we go. I will include links to all of this in the show notes.
Brian, thank you so much for taking the time to speak with me. I appreciate it.
That's been great. We should do it again.
We should. Brian Scanlan, Senior Principal Engineer at Intercom.
I'm cloud economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice.
Whereas if you've hated this podcast, please, leave a five-star review on your podcast platform of choice,
along with an angry, insulting comment.
But that platform will not be one of Amazon's because that's way too far up the stack for them to do well.
