Screaming in the Cloud - Episode 34: Slack and the Safety Dance of Chaos Engineering

Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This week's episode is sponsored by Datadog. Datadog is a monitoring and analytics platform that integrates with more than 250 different technologies, including AWS,

Starting point is 00:00:32 Kubernetes, Lambda, and Slack. They do it all. Visualizations, APM, and distributed tracing. Datadog unites metrics, traces, and logs all into one platform so that you and your team can get full visibility into your infrastructure and applications. With their rich dashboards, algorithmic alerts, and collaboration tools, Datadog can help your team learn to troubleshoot and optimize modern applications. If you give it a try, they'll send you a free t-shirt. I've got to say I love mine. It's comfortable, and my toddler points at it and yells, dog, every time that I wear it. It's endearing when she does it, and I've been told I need to leave their booth at re:Invent when I do it. To get yours, go to screaminginthecloud.com slash datadog. That's screaminginthecloud.com slash d-a-t-a-d-o-g. Thanks to Datadog for their support of this podcast.

Starting point is 00:01:23 Hello and welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined today by Holly Allen at Slack. That's right. Okay, wonderful. First, thanks for taking the time to join me. Absolutely. And secondly, let's pretend for a second that I've been living under a rock for the last five years. What's a Slack? Slack is a collaborative chat program and it tries really hard to differentiate itself from other chat programs

Starting point is 00:02:01 by being a program where you can do most of your work. And it does that through providing a really rich platform of integrations. So for example, you can make a Jira ticket from a message or you can write Slack bots that automate sections of your workflow. And as a manager, I really appreciate that I can approve vacation time from the Workday integration without leaving Slack. So it's definitely more than just chatting. It's a place where you can do all of your work. Once upon a time, I was volunteer staff for the Freenode IRC network, where we spent an

Starting point is 00:02:35 awful lot of time yelling at people on the internet. We considered it almost multiplayer notepad as far as that collaboration aspect goes. And when Slack and some of its predecessors first came out, there was a sense, to some extent, I'm not sure it's ever really left some of the arcane, angry nerd corners of the internet that some of us still spend time in, that, oh, it's just IRC. It's something that you can put IRC in the cloud and that's all it is, but now you pay someone for it. And I admit, in early days, I fell into that trap myself, where it seemed like, what's the value add on this?

Starting point is 00:03:16 And I have my answer to that, but what do you see as being the big differentiator? That's a great question. I think that because Slack is really built as a collaborative business tool, first and foremost, those integrations that I talked about really make it for me. I know that at a previous, two companies ago, I used to use HipChat. And now for two companies, I've been using Slack. And I was pretty shocked, frankly, at how much better Slack is in terms of getting your day-to-day work done on the platform. I'm not going to disagree with anything that you said,

Starting point is 00:03:48 but the added benefit from my perspective was when I was running operations teams and started to roll out something company-wide, was instead of giving non-technical users in business functions a 20-item list of what to do and midway through have to send someone over to help them with it, I could send them a single link and suddenly there they were. They were part of the conversation. They were participating. And that, for me at least, was an absolutely transformative experience. Absolutely. Running incident process or any technical process in Slack is pretty amazing. It was definitely built from the ground up for those sorts of workflows. So being a service

Starting point is 00:04:24 engineering manager, I get to see that every day. So you're in service engineering. What does that group do? Engineering is one of those areas where across the board, titles and department group names don't tend to translate from one company to another. What is service engineering at Slack? Service engineering at Slack at a different company would be called operations.

Starting point is 00:04:48 We have storage, we have site reliability engineering, we have observability, visibility teams, internal tools, build and release. So all of these functions are within service engineering. I'm sorry, did you say storage? As in there's a giant SAN living around somewhere in San Francisco that has a giant pile of disks somewhere? Absolutely not. Every bit of our storage and compute is up in the cloud. So I mean, MySQL, Vitess, Redis, Kafka running up in the cloud.

Starting point is 00:05:15 Wonderful. Well, we'll get there in a minute or two. But your specific area of service engineering is what exactly? I have storage, site reliability engineering, build and release, and safety engineering under my remit. Most of those make some degree of sense to me, except for safety engineering. What is that? That's something that's new to me. Safety engineering is our attempt to bring everything that's great about chaos engineering, resilience engineering, and really good incident management and postmortem process together to help make Slack engineering more resilient and reliable.

Starting point is 00:05:50 What inspired Slack to call it safety engineering rather than a lot of the, shall we say, trendy terms of art at the moment? For example, chaos engineering, regardless of what it means, is a really cool title. Safety engineer doesn't seem to have that same aggressive, move fast, break all the things type of aspect to it. Absolutely. Well, you know, Slack is supposed to be where work happens, so you can't really be breaking all the things.

Starting point is 00:06:16 The analogy that my boss, Dusty Pierce, uses is really good. It's of a race car, right? You want to be able to go as fast as possible around that racetrack safely and not crash and be able to make a pit stop incredibly fast and have it be as boring as possible. So I think we really take that to heart and say, we can still move really fast and do it in a way that we feel completely in control and that we're still providing a really great product, even as we try to move as fast as possible. So getting back to what you said about not having a giant SAN living

Starting point is 00:06:50 in a data center somewhere, Slack is a heavy AWS user, or at least so the general public can probably surmise, given the fact that you've had executives at various AWS event stages, you've been mentioned in a whole bunch of slides with your logo showing up when Amazon does an event. So either you're an AWS customer or you have an incredible marketing arrangement with AWS where they get to pretend that you are. So are you all in on AWS or are you looking at multiple cloud providers these days

Starting point is 00:07:22 to deliver the service? We are primarily in AWS right now, that is true. But we're working hard on our multi-cloud strategy because Slack is one of those few companies that other software companies really, really need to work. You were mentioning earlier that working together in a technical context in Slack can be really transformative. And we certainly find that ourselves as we run our own incident process in Slack when that's possible.

Starting point is 00:07:53 So when, say, AWS is having a big problem, it's actually really important that Slack be able to work so that everyone else that's using Slack can run their incident and recovery process in Slack. And so in success, that'll be one of our measures. When a big part of AWS is down, when a big availability zone in AWS is down, Slack still works. A common thread through a number of these episodes that I've hosted has been around the idea that as a best practice in isolation, going to multiple cloud providers as a design goal is generally not a great idea. It's something that incurs an awful lot of complexity and there are generally not a terrific number of reasons for people to pursue that path. You just touched on one of the examples that I like to give, which is the idea of PagerDuty.

Starting point is 00:08:45 When something is responsible for waking you up because there's been an outage of a major provider, whatever it is that's waking you up definitionally has to be able to withstand the outage of that provider itself. So there is a narrative where making sure that Slack works despite any given cloud provider, whoever that provider might be, is important. I guess the question I have for you as you go down that path is, are you viewing all of what Slack does, by which I mean not just the communications portion, but also the user onboarding flows, the marketing sites, all of the various integrations? Is that something that there's perceived value to Slack in for being able to migrate those

Starting point is 00:09:24 things seamlessly between providers almost on a whim? Or are you focusing largely on individual specific workflows? That's a great question. Well, first and foremost, our status site, for example, has to be in a different availability zone, a different cloud provider is what we've chosen, a different cloud provider. Well, credit where due as well. Your status site has been unfailingly honest when there's been an issue, as opposed to the entire internet's on fire and I'm looking at a sea of green, like some providers whom I'm too polite to name. AWS. We definitely take our status site very, very seriously to the point where if a major integrator,

Starting point is 00:10:03 for example, Giphy, is down, we will actually tell people on our status site, by the way, you might notice that the slash Giphy command is not working right now. We apologize and hope that it will be up again soon. I simply can't do my work without funny videos of cats doing things in a loop. I mean, who can? But when we talk about our strategy for multi-cloud and those failure scenarios of AWS, for example, having a major outage, then what we're really talking about is being able to connect, send, and receive a message. That is bare bones, right? If URLs aren't unfurling and if Giphy's not working,

Starting point is 00:10:40 and I mean, maybe we have to add emoji reacting into this list because honestly I don't know if I can do my work without emoji reacting either. Exactly, without that you have to talk to people using actual words and possibly even complete sentences. That's something that I generally can't countenance myself. Realistically if an emoji doesn't convey the depths of what I'm trying to tell them, how important could it be?

Starting point is 00:11:02 So as a large customer of AWS, are you able to talk at all about what your customer experience with them has been? And I'm not saying that in a sense of, please, this is an open invitation to throw a cloud provider under the bus. No, we do those off the microphone. But my question about it is, what has your experience been as far as seeing new things come to light, seeing service offerings evolve,

Starting point is 00:11:25 experience about the reliability? I mean, the challenges that the rest of us have with our $12 a year AWS accounts generally tends to be mired in frustration from time to time. Is that something that goes away with scale? Does it get worse? That's a good question. We have a really close relationship with AWS. So I feel like one of the best things about being Slack scale is that we have very, very immediate access to AWS staff, and that means that when we're seeing a problem, we can immediately write to them in our support channels and say, hey, we're seeing these problems in these areas, what's going on? So the big K that everyone is talking about these days is Kubernetes, the container orchestration system that people can neither spell nor properly pronounce, nor in many cases articulate a business case for, yet somehow people are barreling forward into it.

Starting point is 00:12:26 Is Slack currently working with containers? Is it something that you're considering? Is it something that you're potentially pushing off until other things get done? How does a darling of the internet age today, as Slack is, tend to think about this stuff? That's a great question. Right now we are using Terraform and Chef,

Starting point is 00:12:47 and we are actively working on the Kubernetes project to see if getting some of our production workflows into Kubernetes will be really valuable. We surmise, of course, that it will be. Otherwise, we wouldn't be spending the time. But we're running experiments to sort of de-risk this project to make sure that it is going to give us the benefit that we expect. In my work with some of my clients, a common pattern that I see that tends to emerge is there's a certain instinctive desire to keep their account

Starting point is 00:13:18 reps from AWS at arm's length. They don't tend to loop them into strategy planning sessions. They don't tend to tell them when they're starting to see an outage. And it's almost as if they're ignoring a tremendous amount of the value that a close relationship with a cloud provider can lend to the process. From what you're saying, that's not a problem at Slack. How have you been able to get away from a pattern where there's an instinctive distrust of outsiders? Well, I think that Slack is built from the ground up to be a collaborative company. And so that is definitely part of it. I think like most companies, we could do better at the looping them in at the early stages

Starting point is 00:13:57 of a design. Although we're pretty good at looping them in when we see any signs of trouble. Everything's on fire. Is it us or is it you? And the honest answer in some cases at something like Slack scale is, does it matter? Either way, it's going to be a bad day for the internet.

Starting point is 00:14:12 That's correct, yes. Either way, make it work again. So you mentioned a minute ago about when we talked about Kubernetes, the fact that you're viewing this as something of an experiment. I'd like to get a little bit into the idea of Slack's culture at this point. How do you folks view experiments in a scenario where, as you mentioned earlier, you're working with people's

Starting point is 00:14:36 production workloads, where if you start working with the ideas of chaos engineering and intentionally degrading certain aspects of the service, and this has a customer impact, how do you square that circle? Because you're not going to be able to significantly improve safety or reliability without those experiments. But doing them does seem ethically challenging when companies depend upon a service being available to do their work.

Starting point is 00:14:59 I know that when Slack goes down, I'm having a terrible day. Virtually all of my clients are having a terrible day. And Ops Twitter gets very grumpy. Although Ops Twitter also gets very friendly. The last time we had a major outage, we got about half a dozen deliveries of cookies. So that was really lovely. Oh, that's adorable. One of the chaos programs we have is very, very lightweight compared to many other more mature chaos

Starting point is 00:15:26 programs out there, but it is called Disasterpiece Theater. And what we do in Disasterpiece Theater- I'm sorry, did you say Disasterpiece Theater? Disasterpiece Theater. And what we do in Disasterpiece Theater is we create a scenario, which is a real scenario that might happen, for example, an edge PoP going down, and surmise what we think will happen when that happens. Where will the traffic flow, for example? How will the reconnects happen? We will run that experiment first in a non-production environment

Starting point is 00:16:00 to make sure that we aren't about to do something really ridiculous that will ruin people's day. And then when we feel very confident in a very controlled way, we will do it in production. And I have been able to witness about a dozen of these, and we have never caused a production issue in this way. So we get around that with very, very careful planning and doing the experiments that we know will not cause production issues and yet will also teach us something. What you speak about alludes to a somewhat interesting reputation that Slack has in the larger community. And I say interesting because Slack historically hasn't told a whole lot of stories about how it works internally from an engineering context. So people guess and speculate wildly.

Starting point is 00:16:52 And there is an established reputation for Slack having a highly collaborative culture, which I suppose does make sense since you're making a tool that primarily focuses on collaboration. How much of that is accurate and how much of that is just a wonderful marketing story? And as a manager, your primary go-to tool is a horse whip. That's a great question because it was really on my mind when I joined Slack. Being in collaborative environments where we're all trying to get roughly the same thing accomplished and we're all working together towards that goal is really, really important to me personally. And so I did a lot of homework, both in the interviews, but then back channeling. And I'm happy to report

Starting point is 00:17:27 that it's totally true. The reputation is well-deserved. We hire for collaborative, empathetic people, and that makes for a pretty good set of coworkers. I will say that in the brief time I've been sitting here with you,

Starting point is 00:17:40 no one has barged into the conference room and yelled at me to get out. And people at the front desk were extremely nice as well. So fundamentally, yeah, so far the people that I know who work at Slack tend to mirror exactly what you're saying. I haven't seen anyone crying at their desk as I walk on my way over here. That might be on a different floor. So somewhat related to your culture, how does Slack view operations?

Starting point is 00:18:02 You said earlier that in another company, what service engineering does would better be described as operations. And I'm seeing, to some extent, across a large swath of the tech sector, a bit of a tension between an idea of a centralized ops model versus a model where developers are responsible for the code that they write once it enters production. Are you able to comment at all on where Slack is on that spectrum? We are in motion. I would say about a year, year and a half ago, we were firmly in that centralized operations model where there was one group of humans that got almost every page. Those pages were pretty low level and developers were never on call.

Starting point is 00:18:47 And we are in the maybe final stages of the first part of that transformation towards development teams having a bigger list of things that they're responsible for. And the thing that we're calling this here is service ownership. So putting the tools and processes in the hands of the development teams that are writing the software to support it in a more rich way through its whole lifecycle, including getting pages for it. So service engineering in that world becomes the maker of those platforms, the cloud platform you're deploying to the observability tools you're using to make your alerts and figure out how your software is doing in production,

Starting point is 00:19:28 and of course, site reliability engineers who can help you with specialized knowledge and experience and embed with your team to make your service more reliable, more performant, easier to recover in an outage, whatever the problem for your service happens to be. Do you find that that tends to lead to an interesting cultural reaction from a perspective of implementing this? Because historically, even when I was on operations teams and one day people would throw something over the wall and say, congratulations, when this thing breaks,

Starting point is 00:20:01 we're going to be waking you up in the middle of the night and expecting you to fix it. And my response was a poorly articulated version of, what? And I can't imagine that that instinctive knee-jerk reaction of no is something that's gone away since the time I was hands-on keyboard in an engineering sense. How are you finding that being adopted culturally? That's a great question. And with any change like this, you're going to have the narrative that is company-wide, and then you're still going to have pockets of individual reactions. It's happened slowly.

Starting point is 00:20:36 So last fall, engineering teams started going on-call. They were escalation on-call only. So if a human on the operations team found that that development team software was having a problem, then the operations team could page them. And so switching from being on call but not having machines page you to being on call and having machines page you is not nearly as big of a change as not being on call at all and suddenly you've got machine pages. So that's part of it.

Starting point is 00:21:06 Another part is that it has become clear at Slack that the way that software teams can ensure that their software does what they need it to do in production is really to just be with that software through its whole life cycle. And even though no one likes being woken up in the middle of the night and no one likes being told, hey, you're on call this week, the fact is that everyone here does want Slack to succeed and does want Slack to be up. And so they're all willing to do their part. One common observation that I've made historically has been that as an industry, technology is so innovative and so disruptive that we've taken a job that can be done from literally anywhere on the planet and created a land crunch in a single eight square mile block of land that's in the middle of an earthquake zone. So to that end, I'll take that a step further even. Slack is one of those tools that enables people to collaborate remotely. In fact, I have a couple of clients who are pure remote and spend a bulk of their time talking

Starting point is 00:22:10 to one another via Slack. So I guess my question is, why does Slack not encourage remote work? In other words, must be willing to relocate to one of several cities that host a Slack engineering office. When I was director of engineering at 18F, I definitely saw the power of having a remote-first workforce. Almost half of the engineers on that team were working from their home offices in northern Idaho or Wyoming or Texas, wherever it was, definitely not where there were offices. And coming to Slack and of course hearing this joke many times, I can definitely say... I take no credit for it myself. I can definitely say that supporting a remote workforce is something that the entire organization has to bake in. And 18F was able to do it and support everybody and support itself and get good work out. And Slack, for better or worse, right now is not in a position where it's going to put in that investment. And so hiring a large number

Starting point is 00:23:20 of remote workers would not be setting Slack or those workers up for success. They would just be very isolated because most of the day-to-day work still does happen in a combination of Slack, yes, and just standing around in the hallways getting work done between desks, sitting at each other's desks in those old-fashioned co-working kind of situations. And so even, I would say, my colleagues in New York and Dublin and Melbourne have complaints about how San Francisco doesn't stay in as much touch with those offices as we should, even though we have this amazing, groundbreaking tool that is Slack. So this may sound like a bit of an obvious question to you,

Starting point is 00:24:04 but I am curious. What made you decide to want to work at Slack? Someone with your background at 18F, which is widely respected across the aisle, it's something that people generally should look into if they haven't heard of it already. I'll throw a link to it in the show notes. But going from there to Slack seems like an interesting move to make. Because when you're done doing what you did at 18F, the world is very much your oyster and you could go virtually anywhere.

Starting point is 00:24:31 What made you pick Slack out of all those other opportunities? One of the things I've really enjoyed in my technical career is taking this weird set of skills that are understanding how to make computers do things and how to convince people that know how to make computers do things do things as a manager is taking that work to different domains. So I've been in healthcare and movie making and publishing and then yes, at 18F in government and civic tech. I had never worked for an enterprise software company, which sounds weird.

Starting point is 00:25:06 Well, just as a quick question: is Slack an enterprise software company at this point? I mean, my exposure to it was when I was at a tiny startup with 40 people, and it was sort of, oh, they're a small, scrappy startup, just like us. Given what I'm reading in the papers about various rounds of funding that you've raised and the discussions that we've had about some of the things you're viewing, some of the things you're considering from a cloud architecture perspective, it sounds like it may be time for me to update my mental map of Slack's sense of scale.

Starting point is 00:25:34 Yeah, I think most of us who use Slack have that giant bar of Slack teams that we are in, and maybe five or six of those might be professional, and then the rest are social or vaguely professional but mostly not where you are being paid to do something. And so it is easy to think of it still as a scrappy startup and a place for non-work collaborating. But we have some gigantic enterprise customers. Our largest customers are 250,000 users in their Slack teams.

Starting point is 00:26:09 And at that point, you are an enterprise software company, and you also have to have enterprise software reliability and stability and processes. And so Slack definitely sees itself as an enterprise software company. Okay, thank you. I appreciate that. But fundamentally, there are an awful lot of enterprise software companies you could pursue. What specifically about Slack was appealing? Well, first, I had never worked at a company that was doing cloud computing at this scale before.

Starting point is 00:26:40 And so that in itself is teaching me a lot. The second thing is that there are lots of enterprise software companies, and there are several that are working at this scale. But Slack was the one that seemed like it was the kind of company where I could come in, be the kind of worker and manager and leader that I want to be because the culture was going to support that. That was super important to me, and that has definitely been true. When you say that this is exposing you to a scale that you've never seen before, what changes when you get to that level of scale? Services and right down to just EC2 nodes and the networking between them, the reliability of any one of those nodes and any availability zone and any other service in a cloud provider. So the small company is going to have some few dozen EC2 instances,

Starting point is 00:27:39 some networking between them, some storage. That all works great. When you start having about 10,000 servers like Slack does, events and disruptions in those services that are pretty infrequent start to happen a lot more often. And having your system be able to respond and react and repair itself without human intervention from that is really interesting. And also thinking about doing that while providing as much high performance through, for example, our stateful edge cache, Flannel, and keeping things really robust is an interesting problem space that I had never been exposed to. It turns into one of those scenarios where once in a million occurrences start happening. If you ask AWS to spin up 10 instances, in most cases the answer is sure, no problem. Ask them to spin up 10,000 instances

Starting point is 00:28:48 and suddenly you're getting a very weird phone call from people where we don't necessarily have that at the moment. It's the old story of cloud does not scale infinitely. Source: tried to do it. It winds up becoming something that you have to start viewing a little bit differently and what happens if certain capacities aren't available. It was something I never understood either until I started playing around with companies that are at significant points of scale.

Starting point is 00:29:12 But I agree absolutely with what you're saying where you have to change the mindset you bring to this. Do you find that it was what you expected it to be since joining Slack? Yeah, all the teams that are writing software here at Slack really understand the scale from the ground up for their services and bake the kind of resiliency into those services that you need at each of those layers. And in a way that shows that Slack has a really great engineering culture around operating at this scale. One thing that continues to impress me about Slack is how, first, you're hiring an ever-increasing number of people that I know, which generally means either I know the right people or we both

Starting point is 00:29:55 have the exact same failure modes as far as judging people on their character. So either way, at least I'm in good company. But what also continues to encourage me is the nice things people continue to say about same way. I mean, everything is terrible, all infrastructure is on fire, and every company is crap. That's something that you can almost inherently assume on. I don't get the vibe from people at Slack that that is how they see the world. Now, yes, I'm incredibly cynical,

Starting point is 00:30:39 but even the cynical people I know who you folks have let sneak through the interview process tend to approach this from the same way. I don't want to say true believers because that comes across as being very cultish and that's not what I'm trying to get at. But every person I've spoken to here seems to not only agree with the mission that Slack is on its way towards, they understand and approve of the way that Slack is going towards that. And that's something that's special. I don't see that in too many places. I think that's true.

Starting point is 00:31:08 And I think I was a little maybe blind to that because 18F had it as well. And it is very special when you're in an environment like that. People genuinely believe we are going to succeed if we do the right things. This will work. We are building something great. Let's do it together. And it's pretty magical to be a part of and kind of table stakes for a good job, in my opinion. To that end, I'm going to go out on a limb and guess that you folks are hiring. Oh, just a few. We might have a few positions open. We're definitely hiring, always looking for really good engineering talent

Starting point is 00:31:47 and talent in a lot of other departments, although I'm most familiar with our engineering needs. Must be willing to relocate to San Francisco? No, or New York, Melbourne, Dublin, and soon Denver. Wonderful. That's places people might actually want to live at some point. What a thought. One other question. Do you have anything coming up in the near future as far as where people could wind up hearing more about the things you have to say? Should they follow you on Twitter?

Starting point is 00:32:13 Are you doing anything at a conference sense anytime soon, etc.? Yep. I'm Holly J. Allen on Twitter. And I'm giving a talk at QCon in San Francisco in November about this journey from a centralized operations model to an embedded site reliability engineering model. Perfect. Well, be sure to check that out. Holly, thank you so much for taking the time to speak with me today. Thank you. I'm Corey Quinn. This is Screaming in the Cloud. This has been this week's episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com

Starting point is 00:32:50 or wherever fine snark is sold.
