The Changelog: Software Development, Open Source - Scaling all the things at Slack (Interview)

Starting point is 00:00:00 Bandwidth for Changelog is provided by Fastly. Learn more at Fastly.com. We move fast and fix things here at Changelog because of Rollbar. Check them out at Rollbar.com and we're hosted on Linode servers. Head to Linode.com slash Changelog. This episode is brought to you by Airbrake. Airbrake is full stack, real-time error monitoring, get real-time error alerts, plus all the info you need to fix any error fast. And in this segment, I'm talking to Joe Godfrey, CEO of Airbrake, about why getting to the root cause of errors

Starting point is 00:00:32 is so important. Look, Adam, to me, root cause is everything. All software has bugs. We all know that. And when you find a bug or when you can't find a bug, the amount of time that typically gets spent trying to chase around and figure out how to reproduce the problem

Starting point is 00:00:46 and what's the cause of the problem, even like what part of the code kicked it off or what sort of actions drive it. I mean, that's hours and hours of time wasted spent chasing your tail instead of actually fixing the problem, improving the customer experience and getting back to building more features,

Starting point is 00:01:01 which is really what your company is all about. So to me, being able to really understand like what is the root cause of this problem is the key factor to being able to solve that problem and get back to doing what's most important, which is building new features and improving your product. And quite frankly, fixing the customer experience that's broken as long as that bug is out there. All right. Check out Airbrake at airbrake.io slash changelog. Our listeners can try Airbrake for free for 30 days. Plus, you get 50% off your first three months. Try it free today.

Starting point is 00:01:29 Once again, airbrake.io slash changelog. You're listening to the ChangeLog, a podcast featuring the hackers, leaders, and innovators of open source. I'm Adam Stachowiak, Editor-in-Chief of ChangeLog. On today's show, Jared and I are talking to Julia Grace about scaling all the things at Slack. Julia is currently the Senior Director of Infrastructure Engineering at Slack and has been there since 2015. So she's seen Slack during its hyper growth. We talked about Slack's growth and scale challenges, scaling engineering teams, the responsibilities and challenges of being a manager, communicating up and communicating down, quality of service and reliability, and what it takes to build high-performing leadership teams.

Starting point is 00:02:18 So we're here on the leadership front lines. I love that line in your summary for your session at Velocity, Julia. Thank you. We're here on the leadership front lines. And I think, you know, one thing we talk about a lot in engineering software development is scaling, right?

Starting point is 00:02:34 But you often think about scaling software, not so much scaling teams. And, you know, from my point of view, and I'm sure Jared will agree, is like Slack has been in this constant scale motion. Like you've never been able to just kind of like chill out in some sort of infrastructure setup. Like you've always been scaling. And so this conversation is essentially about scaling all the things. What do you think? You know, so when I interviewed at Slack two and

Starting point is 00:03:01 a half years ago, in a few months, it'll be three years. I was told by Cal Henderson, our CTO, he said, you know, we don't have anyone for you to manage, but you're going to hire some people and we'll figure it out. And by the time, the time that had elapsed between when I interviewed and when I started, I think we had hired five more people. And so my first day I was managing a team of about seven engineers, I think at that time, um, seven incredible front end engineers, because we needed someone to manage front end engineers. I, I would never characterize myself as a, a, um, front end engineer. I'm much more of a backend engineer, which means I have huge respect for folks who do front-end work. So I come on board.

Starting point is 00:03:49 I'm really forced into a fascinating situation where I was not the subject matter expert. As I had said earlier, I'm definitely a back-end person. And so I really had to focus on becoming a great manager because when I would, for example, look at pull requests, I didn't know if some of the code that was being written, you know, if those were the right architectural decisions. So I really had to defer to the team and I really had to get really good at asking a lot of questions. I actually did read a lot of code. I learned a lot about JavaScript in that initial period.

Starting point is 00:04:35 And then if you fast forward from the seven front-end engineers two and a half years ago, I now run an organization that didn't even exist 18 months ago, infrastructure, that has 75 people. And so, you know, that's 10x growth in two and a half years. And over the time, every six months, my job would totally change from managing front end engineers to managing both front end and back end, to managing a junior manager, to leading the infrastructure organization, which was small at the time, which then grew from 12 to 50 to now 75. And so it's, I look back two and a half years ago and I, I, I barely recognize the job I used to have because we've grown so much. So we have a very small team here,

Starting point is 00:05:35 Julia, and I've always been on small teams. I've never even been on a team of 70, let alone managed one. So I hear that number and I am just immediately overwhelmed. I start sweating. You know, my hands are getting a little sweaty. Just thinking about the responsibility. And when you first started, I mean, two and a half years ago, three years ago, wasn't very long and you had a team of seven. Do you sleep well at night?

Starting point is 00:05:59 Do you feel like you have the weight of the world on your shoulders? That's just a lot of people too. Yeah, I agree with that. Just a lot of people. So I have learned a lot over that time. I feel as a leader that the only so that I can sleep well at night. The organization is divided in different ways, and I have incredible leaders that, again, I have the privilege of working with who lead some of those sub-organizations. And I really, you know, part of growing fast is learning how to delegate and give things away really rapidly. So there were so many things I worked on in the early days. An example would be hiring processes and how we hire, especially at the time, front-end engineers because that was my jam.

Starting point is 00:07:11 And now I've given that away to somebody who did additional iterations, made it even better. And then they grew and scaled in their role, and they gave it away to someone else. And I say that in a, in a, in a wonderful way where we're always iterating and, and growing and changing and evolving on everything that we do. And I've had to, the really big challenge that I found is you have to learn how to do that for yourself. Because growing, if you want to, in a hyper growth company, you have to grow and evolve with the company. And that is one of the hardest things and having the mindset of like, every day I go to work, and I do things I don't know how to do that I've never done before. And then I get reasonably good at them. And then I hand them off to someone else,

Starting point is 00:08:03 who they've never done it before, but they can pick up where I left off and make it even better. And so there's a lot of, so the hiring process is one example of that sort of thing. Also, how do we communicate as a 75 person organization? How do we propagate knowledge? How do we propagate decisions? And that means both top down, like from myself and my boss, but also bottoms up from the engineering front lines, the critical decisions that we're making, propagating that up to myself and then my boss. time, you really have to accept that. Maybe what you would learn in the book, Who Moved My Cheese, the idea of change, right? Like you cannot be a fearful person of change because it's inevitable. I think it's probably the case in most lives anyways, but even more so in a hyper growth company where you've got to accept change. And if you're the kind of person that can't deal with change that rapidly, maybe it's not for you. Absolutely. I mean, I do think that just like every company isn't for everyone, you know,

Starting point is 00:09:10 there's folks who are attracted to that high velocity of change. And then there are also individuals for very legitimate and understandable reasons where it may not be the right environment. One of the things I always say with my team, especially given we just moved to a new office building, I was running around trying to find all the conference rooms and the various floors, the thing we say now is the only thing constant at Slack is change. And so, but that doesn't necessarily mean volatility. It doesn't mean things are hectic and scary. Instead, it means that we're always trying to learn and iterate and grow and learn how to do things better. And I, myself as a leader, am always trying to figure out what are the things that I'm dropping on the floor? What are the areas where I need to improve?

Starting point is 00:10:08 And a big part of being able to do that is creating a safe and inclusive culture so that people can provide you with feedback. Because the only way that you'll be able to learn and grow really, really rapidly is with really, really excellent high bandwidth feedback from the people below you, your peers in my case, and from upper management. I think that's so true, that point about the only constant is change at Slack as a software company. And I think it can be applied to anybody

Starting point is 00:10:36 who's writing software or running businesses on software. That's the only thing that we know is gonna happen is that things are going to change and that we don't know as much right now as we're going to know later. Right. And so we build our systems and we design things in order that they can change. Right. Malleable as opposed to rigid. And we do as little as we can now because we're going to be smarter later and we can make wiser decisions later. So you're in a senior role now, right? I am. I mean, it kind of,

Starting point is 00:11:06 it depends on how you define senior. Do you have your new title? I do have it in my title. Thus, it must be. Thus, it must be true. The point I'm getting at is like, so if you're a senior now, you were, were you always senior? Is this new for you? And maybe share some points along your path of like, you know, scaling you from someone who wasn't senior, the things you've learned and the things you've had to endure to get to a senior role and some of the responsibilities you hold day to day. Absolutely. So I have definitely not always been in a senior role. Let me tell you, it has been a long and fantastic journey to get here that was not always up and to the right.

Starting point is 00:11:54 My career has taken many different twists and turns and I've tried out product management and I tried founding a company and so I've done all kinds of things and learned so many things along the way. So at Slack, it's funny, when I joined, I was a senior engineering manager. So maybe it comes full circle. And then I transitioned into an engineering director when I started running infrastructure, and I'm now a senior director.

Starting point is 00:12:23 So I got that senior back. And in the beginning, when I started running infrastructure, and I'm now a senior director. So I got that senior back. And, you know, in the beginning, when I was managing that team of seven front-end engineers, I was, and again, not hands-on from a, I was writing code, because as we've talked about, you know, I wasn't the right person to be making the technical decisions. Although, you know, I can understand the technology quite well and quite quickly. But I knew what the team was working on. I knew the challenges. I knew with a very high degree what was coming next for them.

Starting point is 00:13:01 Engineering was much smaller then at Slack. It was less than 100 people. So the group that I was in had about 25 engineers. So I knew what our larger plan was for all those 25 engineers. I would often sit in meetings that we're talking about. And again, as a manager, I do attend a lot of meetings. The goal of often attending a lot of those meetings is to gather information and to also see when people are blocked and how I can help them and how I can also help transmit information throughout the organization. So I would sit in meetings that would be talking about things at the feature level. And as I transitioned to lead infrastructure, one of the things that happened was this was a brand new engineering organization.

Starting point is 00:13:53 So when engineering teams get big enough, you have to subdivide them in some sort of logical way. But always knowing that org structures and how you divide, that that's a very hard problem. And so we had had a logical division there of how we would divide it. And now I was running this new organization. And so I had this really exciting, unique opportunity to figure out, well, what is the mission and what is the vision and what is the strategy for infrastructure. So instead of thinking necessarily about the feature level that I had before and the vision and the larger plan being set by the senior directors and VPs in that previous engineering organization, I was thus in those shoes. So I had to figure out what are the current challenges with our infrastructure? How are we scaling right now?

Starting point is 00:14:45 What's breaking? And how are we going to scale through the next huge jump and growth in our user base? What are the things that are important for us to work on, but not urgent? What are the fires that are burning? So I really had to deeply understand from this infrastructure perspective what was going on. And I had to create a compelling vision that resonated not only with the engineers, but with the senior executives, the CTO, the VP of engineering, even the CEO, Stewart Butterfield. I presented this vision to him as well. So it moved from feature level, again, to all of infrastructure. And now as a senior director, as my boss likes to tell me, my boss is Michael Lopp, who many of you may know on the internet as Rands.

Starting point is 00:15:38 My role is not only to stay involved in infrastructure. I mean, I'm so, I love this team. I feel like it is such an incredible, incredible organization. But to think about all of engineering and the company as well. So engineering is now, when I joined two and a half years ago, was around 100 people. And now I think we're at around 350 people. So thinking at that larger scale, thinking about how we make decisions that impact across all of the organizations and impact other places of the company. So it's all about leveling up the scale at which you're thinking about. And when you do that, you can then have even greater impact. But the challenge, one of the hardest challenges with that is that now you need to influence. And again, you know, I deeply think one of the most profound lessons that I learned in my career

Starting point is 00:16:41 was when I became a product manager and I had to learn how to influence people, the people being the engineers, when I was not their manager and I did not have explicit authority to tell them what to do. And so the higher up that you go in management, your job is all about influence. Ultimately, the engineers in my organization and other organizations, they decide what code they're going to write that day and what code they're not going to write that day. They make all of the decisions. Now, I try to influence those decisions by giving them additional context, by giving them background, by talking about why what they're working on is so important. But at the end of the day, like they decide their destiny,

Starting point is 00:17:26 and I am there to help support and guide them. And the higher you go up in an organization, you have to be able to influence even more people in the organization. And that's incredibly, incredibly difficult to do. That last part about the dream part is something that resonates with me. I've played the role of product manager for a bit. And that's such a truth where to be able to influence somebody, you have to share with them a dream to strive for. And when you don't have that explicit control over their day-to-day code they can write or even manage them to guide them that way, every step of the way, you have to be able to kind of cast some sort of vision or dream for them to follow because

Starting point is 00:18:03 otherwise they're just going to do what they have to do to ship code and keep it simple. Absolutely. You have to inspire and compel folks to be aligned with where you think the organization should go, with the exciting challenges. You need to be able to craft a message that really resonates with the team. This episode is brought to you by DigitalOcean. DigitalOcean is a cloud computing platform built with simplicity at the forefront. So managing infrastructure is easy. Whether you're a business running one single virtual machine or 10,000, DigitalOcean gets out of your way so teams can build, deploy, and scale cloud apps faster and more efficiently. Join the ranks of Docker, GitLab, Slack, HashiCorp,

Starting point is 00:19:05 WeWork, Fastly, and more. Enjoy simple, predictable pricing. Sign up, deploy your app in seconds. Head to do.co slash changelog and our listeners get a free $100 credit to spend in your first 60 days. Try it free. Once again, head to do.co slash changelog.

Starting point is 00:19:48 So, Julia, as I listen to you talk and I'm trying to think about, you know, trying to have takeaways of like what makes a great leader. And I'm always thinking in the context of a developer and like what makes a great developer slash leader or what turns a great developer into a great leader. And I'm thinking about your points about, A, communication. I think that's an obvious one and definitely the most paramount thing. If I think about the best developers I've met, we've interviewed a lot of them on the show, what makes them stand out? A, their ability to communicate, absolutely. Communicate their thoughts, right? So let's set that one aside and say, aside from communication,

Starting point is 00:20:20 the next thing that I think of with great developers is their ability to kind of inhabit an entire system. The more of a system you can keep in your head, the whole thing, right, holistically, the better you are as a developer, I believe and I've found. And so what you were talking about was really, even transitioning from developer to leader or holding both roles, is the ability to speak about that system at different levels, right? To communicate about it, you know, to speak up to people either above you in the organization or to speak down to people below you or in your employ. And that to me sounds like you have to be able to inhabit the system, maybe even at different levels, like conceptually in big picture, small picture.

Starting point is 00:21:12 A, is that the case? And then, okay, so yes. And then B, is that just a, is that a learned skill? Is that just a natural thing? Like, how do you get to be able to do that? I love it. Great, great question. So I am very, very much a systems thinker. You know, before I, the ship sailed on my coding career, you know, three years ago when I stopped writing code day to day, I am always thinking about systems and how systems interact with other systems. Now, the way that I employ that systems thinking now is I'm thinking about systems of people and how other people communicate with other people. So just like different systems have API contracts and they have different protocols with which they talk to each other. Humans are the same way. They have different preferences for how people talk with them.

Starting point is 00:22:14 The vocabulary that they use, that maps really nicely to the different protocols. And so, and I absolutely feel like these are learned skills. In the, oh my goodness, without, you know, revealing my age, in the many, many years, like over 15 years that I've been a post-graduate, I did my graduate work in computer science, and that was a long time ago, and in the time that I've been a professional software developer and now a manager, I have learned these skills. And so I think, I also really deeply believe that anyone can learn anything and that, if given the right environment and the right teachers, people can absolutely rise to the occasion. And we can talk a little bit more about that, but I think this goes to the,

Starting point is 00:23:08 you had started the question around how do I employ those skills now? So I'm always, as I think about the systems of people and I think about the different relationships, I'm also thinking about how can I communicate in a way to those different audiences? And the analogy in software is then like, what language do I need to speak to this system? Or what vocabulary do I need to use to this human? And if I speak too fast, you know, I might need to get rate limited. So there's so many

Starting point is 00:23:46 analogies of how humans interact with some of the systems. And I don't mean to say that in a robotic way, but we all have preferences. And if you can understand someone's preferences, and that's, I think, a really important part of leadership is building that rapport and that connection with people so you can understand their preferences. Because if I start sending requests to a system in the wrong language or malformed requests, you know, they're going to be thrown away or I'm going to get error codes. And so in this world, I need to deeply understand the people just like I need to deeply understand the systems. And so that goes back to what makes, I think, not only

Starting point is 00:24:26 great developers, but also great leaders. And I really, I think it's important to note that the skills that make really senior developers great are often the same skills that make really senior managers great. You can be a leader in an organization if you're a manager, or if you're not a manager, if you're an individual contributor. Leadership comes from everywhere. But really great leaders, whether it be managers or individual contributors, they're really fantastic, as you highlighted, communicators. They know the protocols, the vocabularies, the error codes, the exit statuses. They know all these things. But they then can level up the people around them.

Starting point is 00:25:05 They can grow them by being able to teach them new things. So they can help the, let's say, next generation, the more junior developers, the more junior managers, the more senior folks teach them about the systems, the mental models that they've developed about how the humans interact or how the systems interact. And as those junior folks learn and grow and they tackle problems, they then can refine and grow their own mental model about how these systems and humans interact with one another. It's interesting. I mean, I agree with you specifically on the overlap in skills between a great developer and a great leader or perhaps manager if that's the same person. We do have this, you know, we do have this idea of the Peter Principle, you know, in management

Starting point is 00:25:51 where, you know, people tend to get, and I'll just summarize it, not the exact principle, but the way I think of it is people often get promoted to their level of incompetency, right? Like, they're very good at this thing. And so therefore a promotion comes and moves them into a role that they're not good at. And that's unfortunate at the time because they were really good at the other thing, but they need this new position in order to achieve, you know, a salary raise or something like that. So this happens a lot with developers. Like you're, you're a great developer and all of a sudden now you're a manager of developers. And even though there's overlap in those skills, that doesn't mean you immediately recognize or can apply them. Arguably, software systems are easier to understand than people are, right? Like, we're way more

Starting point is 00:26:35 complicated in many ways. People are non-deterministic. Yeah. So do you have tips and tricks or like thoughts on, you know, people, developers who find themselves in the position of manager or leader? All of a sudden, they feel like they don't have the chops to thrive in that position? Well, I would say I have so many different thoughts. I think, first, I do view management as a different job. Not a better job, not a worse job, but a different job. And so the challenge is, as a developer, do you want to change your job? And if the answer is no, you love your job, you love what you do, and you want to continue growing, then hopefully you can,

Starting point is 00:27:27 whether that be through promotion or through projects. Assume a role where you're able to teach a larger number of people, lead them through like a tech lead position, something along those lines, where like I think it's very, very important that companies have a track where very senior technologists do not have to become managers. Because, again, I do very deeply feel like it is a different job. Sometimes when I talk about what I do every day, which if you want to know the one-sentence version, it is I read and write English documents.

Starting point is 00:28:03 That is what I do all day. You know, a lot of developers would say, well, I don't want to do that. I want to continue to program. And, you know, I do read code occasionally, but those times are fewer and farther between. And so if you want to enter a world where you're spending more time in Google Docs than in Emacs or VI, then potentially management is for you. But in order to make that transition, it's so important that your organization can support you with trainings, with mentorship, with people who can give you feedback so you can learn and grow because you're doing a totally new and different job. At Slack, I've managed many people who tried out management, managing a small team, very senior technologists, technologists who had been programming professionally for 15, 20 years. They wanted to try management. And I do think that it

Starting point is 00:29:07 is important if people do want to try it and their intentions are pure, meaning their intention is not because they want more power, but they want to really foster and grow and help others. We tried it out. And some of those individuals have been exceptional managers. And some of those folks have realized that this is not their calling. And huge hats off to them, because it's so hard, it can be so hard once you embark down a different path to realize that it's not for you. And so some of those individuals have transitioned back to individual contributor roles. And I really want to highlight, it's not a demotion, it's a transition back to a different job, to a job where they're fantastic.

Starting point is 00:29:54 And so, in large part, some of the folks on my management team used to be very senior engineers and some of them transitioned to IC roles, and others have been managers for a decade or more. So I think it's important at a company to also be able to provide the no penalty, so to speak, for transitioning back to the role if you realize that it's not for you. Just as a point of clarification, when you say IC roles, what are you referring to? Oh, I'm sorry. Individual contributor roles. So roles where you're not managing people. So it's very important to emphasize what you said there when you go back to an IC role, that this isn't like a demotion or this isn't a step backwards in your career

Starting point is 00:30:40 because management is a different job, but not necessarily a higher calling, so to speak. Is that what you're trying to say? Yes. It is not a better job. It is not a more powerful job. It is not a more prestigious job. It is only a different job.

Starting point is 00:31:00 Yes. I'm sitting here thinking about this and I'm like, I'm that kind of person where I naturally, by my talents, just gravitate towards manager type roles because I can be in an IC role, an individual contributor, and I will just naturally want to lead. It's just something that comes out. It's not something I'm like, I've got it. It's just my DNA. And if you put me in an individual contributor role, I'll probably be depressed or aloof or not that invested. But you give me an opportunity to influence and change and determine vision and where we're going to go. That's where I thrive. And I imagine there's a lot of developers out there who are like that as well. So how do you keep being a programmer but then also leverage those skills too?

Starting point is 00:31:53 I completely understand. To give you a quick story, I have a daughter and she's three and we go to music class every Saturday. And there's – at times my daughter is very aware of who's following the rules and who's not. And the music teacher said, and so I was very self-conscious about this because she likes to correct other people and she likes to like, you know, stand up there and be in charge. I say that in a positive way. And the teacher said, there's always a supervisor in every class. And so, and I think good for you to acknowledge that this is where you can be most successful and where you're deriving the most value for yourself and for others. It can be often difficult to know, especially I would say early in my career, I didn't know

Starting point is 00:32:41 what were the environments and the situations and the ways where I really was able to thrive. So good for you for knowing that. I think, so this goes back to the notion of even though management and individual contributor, like development, programmer roles, even though those are different jobs, the higher up that you go in both of them, the skills to some degree converge, where instead of you building the features, you're leading, communicating, growing, teaching others. And so instead of me communicating vision and strategy through written English words and presentations and PowerPoint decks, some of the very, very senior principal engineers and architects in my organization, they don't manage people. They do write a lot of documents, English documents, about how we're going to build a super complex, difficult feature.

Starting point is 00:33:44 But then they'll also lead discussions around different approaches. So one of the things that we do here at Slack is we have what's called the software design workshop. And any engineer in the organization, junior or senior, can write up a technical document about how they're going to approach a feature. And then they bring it to the workshop and people opt in, you know, if it's a topic they're interested in, because again, our engineering organization is quite large, but people can come and then we have a discussion, a spirited and I think fantastic discussion about how should we build this? Are there any interesting edge cases? And those discussions, I think it's actually, this is really important.

Starting point is 00:34:23 Those discussions are led by other engineers. They're not led by managers. Like I don't actually know that much anymore about what the edge cases are. But let me tell you, some of the very senior engineers in our organization, you bet they do, because they've been around for a while and they've probably seen the different patterns and they've seen systems fail, and they have really great mental models for our systems. And so they then help facilitate the discussion, which leads to why communication is so important: facilitate discussion, ask questions, but ensure that the engineers presenting, who might be senior or might be junior, have a safe space to present their ideas and walk away feeling

Starting point is 00:35:06 like they've learned something. So I think there's still such a huge need for very senior developers and programmers in organizations. Because let's face it, sometimes those developers and programmers, especially in senior roles, have more credibility than the managers do. Because they're in the trenches on the front line sometimes with the other engineers, and I am no longer in the trenches. And so I ask questions, but I'm not there debugging. If we have some sort of incident, the engineers are doing that. So Julia, you said in the trenches, now you're really speaking our language. These are common idioms and phrases that Adam and I often use. So as a manager, as a leader on the management side, as you said, you don't have that day-to-day in the trenches. You're day-to-day, but you don't

Starting point is 00:36:01 have like the, you're not in the debugger. How do you keep your street cred with your teams? Like how do you stay relevant and not become the pointy-haired boss that is so laughed at in the Dilbert comics? So I ask a lot of questions. Early in my management career, and I see this with a lot of more junior managers, they think their job is to have all the answers, and they think that their job, a la Dilbert, is to tell people what to do. I very much believe that is not my job. I ask a lot of questions because I don't have all the answers. Ideally, there are very few hard decisions, honestly, that I'm making because I've created an environment with my team where they have the context,

Starting point is 00:37:03 they know what we're building, they know why we're building it, they know why it's important, and then they decide how they're going to approach actually building things. I try very hard not to be prescriptive. My job is not to tell people how to do things but to again set the context and let them run free. Because let me tell you, they're going to come up with much more innovative, interesting, creative solutions to things than I'm going to come up with. So as a manager, I manage what is the outcome? What do we want to achieve by building this?

Starting point is 00:37:43 And how do we build it? That is up to y'all. Run free, because my engineers are in the debugger and I am not. So again, coming back to your question, all I do all day is ask questions. So let's say that someone comes to me and they're stuck. And this happens not regularly, but with like a somewhat normal cadence where maybe we're deadlocked on a decision. We don't know what to do. So when the team comes to me and they've decided like, we want Julia to weigh in because we don't

Starting point is 00:38:19 know what to do. So the first thing that I do in all of those discussions is I start asking a lot of questions because I probably don't have all the context. And most of the time, it's almost like I'm rubber duck debugging the team. They come and I just start asking questions. And usually through all the questions that I ask, and I'm not being prescriptive, I'm not telling them what to do, they come to a logical conclusion, and the team has effectively decided how to fix something themselves. And I think that's fantastic because in an ideal world, the team is able to function and make decisions and I manage myself out of a job, meaning they don't need me because they

Starting point is 00:39:07 understand what they're doing and they understand the business requirements and they deeply know the purpose of what we're building. Now, there's also a lot of situations where maybe I ask a lot of questions and it's a hard call and I have to make a decision. And part of what I do as a now senior leader is when I do have to make a decision, I can ask questions rapidly and then be able to make the decision quickly. Because the last thing I would want to do is to block a team from being able to do something. So the credibility comes through asking people and getting them to volunteer what they think the solution should be versus coming in and being that boss

Starting point is 00:40:00 who doesn't know what's going on, but is telling people to do things that those developers are actually diametrically opposed to. It's getting people to talk, really, right? I mean, in a lot of cases, unless you do that, the silence will come in and you'll pontificate rather than say, hey, where should this go? Here are the problems I'm seeing from this meeting or I'm information gathering here. I see this problem there. Here's the collective problem.

Starting point is 00:40:29 How does this impact you and how can we solve that? Is that what you mean by that? Absolutely. The last thing people want to hear me do is get up on a pedestal and give a speech about how they should solve a problem, the implementation details of a problem. It is all about exactly asking questions, getting them to talk, getting them at times to see things from a different perspective.

Starting point is 00:41:02 This episode is brought to you by our friends at GoCD. GoCD is an open source continuous delivery server built by ThoughtWorks. Check them out at GoCD.org or on GitHub at github.com slash GoCD. GoCD provides continuous delivery out of the box with its built-in pipelines, advanced traceability, and value stream visualization. With GoCD, you can easily model, orchestrate, and visualize complex workflows from end to end with no problem. They support Kubernetes and modern infrastructure with elastic on-demand agents and cloud deployments.

Starting point is 00:41:35 To learn more about GoCD, visit gocd.org slash changelog. It's free to use, and they have professional support and enterprise add-ons available from ThoughtWorks. Once again, gocd.org slash changelog. So Julia, you've been scaling up the team at the speed of the business, which as we mentioned earlier in the conversation has been very rapid. And you've gone from seven-ish up to a 70-person team. I'm sure there's plenty of other teams. But your major goal is to keep up, keep the infrastructure up with the demand on the platform and the business. So give us some

Starting point is 00:42:34 insights into exactly the infrastructure of Slack, some of the technical hurdles y'all have been dealing with, maybe some success stories, maybe some bad days. For sure. From a technical perspective, the founders of Slack previously had started Flickr, which was a photo sharing site that they then sold to Yahoo many years ago. So Cal Henderson, who I work with very closely, he's our CTO. During the Flickr days, he and co-founder Serguei, who is actually part of the infrastructure organization (I directly managed Serguei for some time), went through that period at a brilliant consumer web startup of tremendous, tremendous scale. And so when Cal and Serguei and Stewart Butterfield, our CEO, went to start Slack, they knew how to scale PHP. So we have a large PHP monolith with many services that we've split out right now.

Starting point is 00:43:50 We recently hired, or actually all the years blend together now, but we have a chief architect, Keith Adams, and he came from Facebook. And so Facebook was also a really large PHP shop, and they then created the Hack language and use HHVM, the HipHop Virtual Machine. So we're now transitioning to using Hack and HHVM, with fantastic, fantastic performance improvements there as well as typing, as well as many, many of the affordances of modern languages. We tend to, at Slack, use boring technology. And part of the reason for that, and I say boring with so much love, is because we have to know how to operate the technologies that we use at incredible, incredible scale. And so we don't want to be on the bleeding edge because we have to have

Starting point is 00:44:46 incredibly high uptime because we have so many companies, you know, from the NASAs of the world to Capital One, to IBM, to eBay, all of these companies run the backbone of their business on Slack and we can never, ever go down. And so in scaling infrastructure, we build a lot of the services that connect to the monolith. And so some of those services are written in Java, some of them are written in Go. And so I now manage a fantastic team of machine learning and search engineers. We have that office based out of New York. And so they're also experimenting with some Java and Go services that

Starting point is 00:45:25 connect to the monolith. And we do not have a microservices model, but when it makes sense, we split things off the monolith and potentially turn them into external services. So one of the things that we've done in my organization is, you know, in the early days of having services that we've either split off or that were always separate, we didn't deeply understand the SLAs around those services. And so as we've grown, one of the ways in which we've matured is understanding and having performance targets and, again, SLAs for all the services that we're building. And these are not external SLAs. These are SLAs for ourselves, because infrastructure is a horizontal organization, meaning we build, of course, common infrastructure that's used by 300, 350 engineers that are in product engineering,

Starting point is 00:46:28 like building on top of what we've built. Reminds me of a conversation we had a long time ago now. Man, time flies. 2014, we had Sara Golemon, who worked at Facebook at the time. I believe she still does, but perhaps not. Come on the show. Talk all about the PHP language spec, making PHP awesome, the work they were doing with HHVM and Hack. And there's been a whole bunch of engineering efforts by many companies now.

Starting point is 00:46:53 Happy to hear Slack contributing and using PHP and helping make it an awesome language of today. So we use Slack every day, almost all day every day. And you mentioned it always has to be up. And I can't think of a time. I think there was one time when Slack was down. I'm just trying to think if you had any real bad days. Now, another one of our often used services, Twitter, they historically have had many bad days. And we even lovingly think of the fail whale of years past. Twitter actually had some downtime maybe last week, and I noticed the fail whale is gone. It's like a weird

Starting point is 00:47:31 Octocat-looking thing instead. I was like, that's not endearing. Give us the fail whale. But Slack really hasn't had, I mean, Adam, can you think of a time where it's just like, well, Slack's down, I guess we'll email each other? No. No, I don't. The only thing I would really ever notice, and this isn't a dig, is just maybe slower service, not down service, which could be but isn't as bad. Service degradation. Yeah. Or like starting up the app.

Starting point is 00:48:00 It takes 10 seconds versus instant or closer to instant, those kinds of things. Or slow notifications, when you rely on notifications, iOS notifications, and you've already had the conversation. And then finally on your iOS device, you get a notification or two of the conversation you've already had, those kinds of things. I'm sure that they're not quite down, but they're like, it's sort of, you know, not relevant anymore. So how do you deal with the, you know, non-relevant, you know, distributed notifications that should have been closer to real time that are now just not important anymore? See, I think you're highlighting on a really interesting question, which I see the parallels in. Sometimes no internet is better than slow internet, where you want a service to be really quick. You want to ensure that your notifications show up instantaneously.

Starting point is 00:48:57 Imagine if you get a DM from your boss, you want to know. You want to be able to respond if that's your relationship with your boss. And so we think about this a lot. And we think about it especially with respect to how fast we've grown. Over half the messages in Slack are sent outside of the U.S. We have to have an infrastructure that allows Slack to boot instantly everywhere around the world, meaning Houston, Omaha, since Adam, Jared, I know you're out there, but also in Japan, in Asia Pacific. And so we run Slack in the cloud, and we've been cloud from day one. And as part of the infrastructure organization, we've had to build a lot of tooling to understand what our performance is around the world. And also, you know, you were talking about notifications, and especially on mobile, you know, we use the infrastructure,

Starting point is 00:50:11 I believe it's APNS, which is the infrastructure provided by Apple to send notifications on iOS. There's also an analogous system on Android. And so it's one of the very, one of the most difficult things is providing a service that is used 24 hours a day, seven days a week, around the globe, that people need to do their jobs, that has to almost be more reliable than the internet backbone. And so what I mean by that is there are parts of the world where the internet backbone is less reliable, especially in Asia Pacific. Let's say that you get a DM and it doesn't come in fast enough. It seems delayed. Slack seems slow. You don't care that there's DNS issues that are

Starting point is 00:50:58 happening in your part of the world. You need Slack to be fast. And that's what you expect. And so one of the awesome challenges here is figuring out how to provide that level of service when we don't control the racked machines that are, we don't have our own data centers. So how do we do that when we don't have that low, low, low level of control? And the way that we do it and the way that we're figuring it out because we're, again, the scale is just tremendous, is by building software. And that makes me incredibly excited. So we're figuring out how to work with different vendors to build really resilient, fault-tolerant software that can provide you that level of experience. When fundamentally, the underlying infrastructure, the cloud providers that we run on, and then the internet backbone lines that they run on, do not provide the level of uptime that we need.
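
To make the idea of resilient, fault-tolerant software on top of less reliable infrastructure concrete, here is a minimal Python sketch of retrying a request and failing over to another endpoint. It is an illustration under stated assumptions, not Slack's implementation: the function name `call_with_failover`, the exception `AllEndpointsFailed`, the endpoint labels, and the caller-supplied `send` function are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


class AllEndpointsFailed(Exception):
    """Raised when every endpoint has been tried and none responded."""


def call_with_failover(
    endpoints: Sequence[str],
    send: Callable[[str], str],
    attempts_per_endpoint: int = 2,
) -> str:
    """Try each endpoint in order, retrying a few times before moving on.

    `send` is a hypothetical caller-supplied function that performs the
    request and raises ConnectionError when the network misbehaves.
    """
    errors: List[Tuple[str, int, str]] = []
    for endpoint in endpoints:
        for attempt in range(attempts_per_endpoint):
            try:
                return send(endpoint)
            except ConnectionError as exc:
                # Record the failure and keep going instead of giving up.
                errors.append((endpoint, attempt, str(exc)))
    raise AllEndpointsFailed(errors)
```

A caller would pass an ordered list of regions or providers, so that a failure in the first one is absorbed without the user seeing an error.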

Starting point is 00:52:01 It's an interesting perspective to think about that, too. I'll also say this, we're not paying you, you know, so we're obviously not complaining. Do you mean that in terms of we're on the free version of Slack? Right, right. Exactly. So, you know, I think of this as an interesting problem because you have such a unique type of software where you have a lot of people using it for free. And, you know, on your own terms, we're not like ripping you off. It's the way things work. If we were, that'd be a really bad confession right there. By the way, we're not paying you. Surprise. I'm not beating down your door and demanding it from you, but, you know, we talk about uptime or downtime or reliability.

Starting point is 00:52:46 You know, I've never really seen Slack down, but I've seen it be slow or I've seen it be, you know, delayed. And you're right. I don't think like, hey, DNS isn't working properly here in Houston, Texas. I just think Slack is not working right. You know, I blame you, not the DNS, you know, or the other problems in the Internet backbone. Or when S3 went down. Or yeah, or S3 went down or something changed to make things not work right.

Starting point is 00:53:09 So we definitely have had, you know, so we run our business on Slack. Like we're Slack on Slack all day. And when we do have service interruptions, when things are slow, you know, it really heavily impacts our ability to do work. And when you build software, we do everything we can to ensure we have unit testing and load testing. And we have linting and we have tooling. But, I mean, we're all engineers here. We make mistakes. We all wish that we could write perfect code and never deploy

Starting point is 00:53:47 bugs, but like, of course we do. And so we absolutely have had situations where, you know, Slack has gone down for periods of time. And so what we've done is ensure that when things happen, because they do, we're able to detect those problems and recover instantaneously. And so in an ideal world, we release, we accidentally break something, or S3 goes down, or a huge storm in Northern Virginia impacts US East, the Amazon facility out there. And we're able to detect that and reroute the traffic or revert the bug and do that

Starting point is 00:54:37 without you ever noticing. And that's the world that we're moving towards is being able to detect and recover really, really, really rapidly so that you all will never know that anything happened. And you're doing that through relationships. You mentioned talking to vendors. So we talk with our vendors very, very regularly. We also build software that can handle network flakiness in case something does happen with the underlying network. And I feel like that's such a fascinating engineering challenge because it's like trying to understand ways in which something will fail. When you build software, there's often obvious edge cases, and then there's things that happen where you're like, I never could have imagined that that ever would have happened, and now we have another, like, if-else clause to handle that. I think a lot about it. And again, like, as someone who runs an infrastructure organization,

Starting point is 00:55:36 I think a lot about these types of challenges at scale that involve not only vendors but also us building and baking resiliency into our software. Another fascinating thing, at least to me, about Slack is most people have Slack open for 10 hours a day on the desktop. And they're probably not sending messages for 10 hours a day, but they have it passively open. And we open a WebSocket connection, and we're sending incremental diffs, if you will, across that WebSocket. Now, and we're very heavily working on this, the reason that startup time might at times be slow is because we need to send you a whole lot of data across the WebSocket. Now, remember, WebSockets are a bidirectional communication.
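
The pattern described here, a large payload at startup followed by incremental diffs over the open connection, might be sketched in Python as follows. This is a simplified illustration, not Slack's protocol: the `ClientState` class, the message fields (`version`, `unread`, `channel`, `unread_delta`), and the idea of detecting a missed diff by version number are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ClientState:
    """Hypothetical client-side view: a version counter plus unread counts."""
    version: int = 0
    unread: Dict[str, int] = field(default_factory=dict)


def apply_snapshot(state: ClientState, snapshot: dict) -> None:
    """Replace local state wholesale with the full payload sent on connect."""
    state.version = snapshot["version"]
    state.unread = dict(snapshot["unread"])


def apply_diff(state: ClientState, diff: dict) -> bool:
    """Apply one small incremental change.

    Returns False if a diff was skipped, in which case the caller should
    ask for a fresh snapshot rather than guess at the missing change.
    """
    if diff["version"] != state.version + 1:
        return False
    channel = diff["channel"]
    state.unread[channel] = state.unread.get(channel, 0) + diff["unread_delta"]
    state.version = diff["version"]
    return True
```

The snapshot is the expensive part, which is consistent with startup being the slow moment; each diff afterwards touches only what changed.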

Starting point is 00:56:34 So we're sending, like, how many, is Jared in new channels? Has Adam gotten some new DMs? Has someone mentioned Jared or Adam? We're sending all of this information, the state of the world since you last connected across the WebSocket. And then once you're connected, we're able to send you smaller bits of information about things that have changed. Now, one of the things that happens that's particularly precarious is if we see millions of users or tens of millions get knocked offline, like let's say there's a storm. Let's say we deploy a bug, let's say that you know something

Starting point is 00:57:25 the internet backbone in Singapore has a blip, and suddenly all those users are knocked offline. They all immediately, you know, hit refresh, or they wait, and suddenly we've got millions, tens of millions of users trying to reconnect, and those are the things that are really difficult. And so building systems that can handle those, what we call reconnection storms, that's really, really interesting and has been really hard because you really have to build infrastructure

Starting point is 00:57:57 that can handle so much greater than your current load. But that's not just sending the data, it's querying for the data. It is packaging it, getting it to clients, ensuring the clients can parse it efficiently, all of these things. And I think that's such an exciting challenge. I'm sure we can go much, much deeper on these challenges. And I think these are just probably never ending and probably not even the most, maybe fun for you to talk about, but always fun to reveal. Maybe before we want to ask you like one or two questions about your upcoming talk here at Velocity, but I think maybe share what you can just to give listeners kind of a scale of like how many users do you.
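
One widely used way to soften the reconnection storms described above is for clients to wait a randomized, exponentially growing delay before reconnecting, so that millions of clients do not all retry at the same instant. The Python sketch below shows that general technique (exponential backoff with full jitter). The conversation does not say this is what Slack does; the function name `reconnect_delay` and the `base` and `cap` values are assumptions for illustration.

```python
import random
from typing import Optional


def reconnect_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return how many seconds a client should wait before reconnect `attempt`.

    The upper bound doubles with each attempt (base * 2**attempt) up to `cap`,
    and the actual delay is drawn uniformly between 0 and that bound so that
    clients knocked offline together spread their retries out over time.
    """
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)
```

Spreading reconnects this way trades a slightly longer wait for some users against a much lower peak load on the servers that build and send the initial state.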

Starting point is 00:58:36 Is there any public information around like, you know, paid or unpaid users? You can kind of help the audience listen or understand, you know, what the scale you're actually operating at when it comes to a concurrent user base or something like that? Yeah, absolutely. So we have over 9 million weekly active users and over 6 million daily active users. There's over 2 million paid users. And what I think is super cool, so of those paid users, of the 2 million paid, 43% of Fortune 100 companies use Slack. So it's a lot of the companies that you think about, credit cards, Capital One, they're running their business on Slack. Super cool.

Starting point is 00:59:24 Ticketmaster, if you're buying tickets, they're on Slack. But also a lot of fascinating and exciting ones, like NASA, I had talked about them before, doing really, really cool stuff on top of Slack. And of course there's a lot of technology companies, the PayPals, the LinkedIns, Spotify, Pinterest, they're all running their businesses on Slack. So not only do we have rapidly growing numbers of users, but I think the demands on the service in terms of we can never go down are really high. If you think about consumer internet businesses, for example, Twitter, we had talked about them earlier, Facebook, when those services go down,

Starting point is 01:00:10 of course, like that really sucks. And there's clearly a loss for those companies in ad revenue. You know, when Slack goes down, Capital One, the people at Capital One can't do their work. And that's terrible. If Ticketmaster goes down, then potentially they can't process orders. And so the reliability and scalability constraints are real. And I think that's really exciting because I think it means that we've built an incredible product that people love and that people rely on every day to do their job.

Starting point is 01:00:46 And ideally, we're in the background. Like, we just work. So the other, I think, segueing to you had asked, like, a few other numbers. When I first started using Slack, before I ever thought about working at the company, I had started a company. And I installed all the engineering integrations on top of Slack. So I did GitHub, continuous integration, PagerDuty. We also use Zendesk for our customer support tickets. And so we have a really, really active and vibrant developer community that builds on top of the APIs.

Starting point is 01:01:24 And so we've got something like 1,000 apps in our app directory. And the app directory was actually the first big launch that I was part of at Slack, which was super cool because you had to go, like, search for apps. And that's how I, like, in my early Slack days, how I found apps. I would go searching, you know, Google search, and now you can search in the app directory. And there's something like 100, this is, this is, I think, just so cool, because I built apps before I even joined the company, like I built integrations, so I could send data, pipe data from

Starting point is 01:01:57 our systems into Slack, so that if something was going wrong, I hooked up our error servers to Slack, so that I could see the channel light up versus waiting for the email or waiting for the page because I was in Slack all the time. There's something like 155,000 of these weekly active developers building on Slack. So that's a lot of people building on Slack. And I think that's so cool because they're building things that like we never could have imagined in a wonderful way. I think what's interesting is that you've got, I think you said two million-ish paying users, but roughly nine million in a week, right? Yeah. To me, that's like just crazy because of what you're doing for your uptime and that a large majority of your users aren't paying you.

Starting point is 01:02:47 You're feeling guilty, aren't you? I'm just feeling like it's just the world we live in, but it's just like you kind of understand why services charge a higher premium for what they do because it takes a lot to run them. But you've got a large majority using a service for free, but getting, you know, maybe not the same, but a very similar service. We probably get a very similar service to, you know, one of these companies that uses you, that pays you. Yeah. I mean, I think they've done a good job setting forth what seems to be a solid business model with free versus paid. And it seems like everybody's happy that way, at least so far. Right, Julia? So what I love is that all the changes that we make to ensure that the service is better for big customers, every single small customer and free team benefits too.

Starting point is 01:03:39 And so that's, I think that's really exciting because not only can, you know, the people who use Slack for work hopefully have a better experience, but then in the communities that you run, you'll also be able to benefit from all of those things as well. And so I think another, like, active user numbers also, you know, they vary. And as an enterprise software company, you know, we see there's periods of time when user group, like Mondays are really big days for us, for example, because everybody comes back online because they spent the weekend, you know, hopefully chilling out. And so we, those numbers fluctuate based on the calendar year. I think what's super interesting from my perspective was in the early days of Slack, usage would dip over the holidays because people weren't working. They'd take a week off or two weeks off for winter holidays and New Year's. But as we've grown, we see that less because there are more companies using Slack that don't have that dip in the holidays. So the companies that

Starting point is 01:04:55 don't are the ones like credit card processors, for example, or anyone in e-commerce. So as we start to see more and more folks, the quote unquote, like nice break we used to get off the holidays, like that doesn't exist anymore. It's been really, really cool to watch, where we used to be able to say like, you know, only one person is triaging bugs, but that's not the case anymore. So that's been like wonderful, you know, the challenges of success and the challenges of growth. Well, Julia, I want to plug your talk here at Velocity here in a bit. We work closely with O'Reilly, especially around Velocity, Fluent, and OSCON conferences they put on.

Starting point is 01:05:36 And we're always happy to talk with speakers like yourself speaking at this conference. So you're giving a talk called Scaling Yourself During Hypergrowth. And I think we're actually going to title this podcast, scale all the things or scaling all the things. One of the two, um, you know, I'm excited about this,

Starting point is 01:05:52 this talk. We have some team members going to be there. Listeners, if you're checking this out and you're going to go to that conference, you'd like to, we can give you 20% off either a gold, silver, or bronze pass.

Starting point is 01:06:03 Use the code changelog, check the show notes for a link. We'll also include a link to Julia's talk there as well. Maybe you can catch it. If not, maybe it'll be on YouTube. Who knows? But anything you want to share with us in closing to some of the things either in your talk or things we haven't covered that you want to say as we tail out? Thank you both so much for having me. If any of these challenges of scale, of growth,

Starting point is 01:06:29 resonate with any of you listeners out there and you're interested in learning more and working on some of these things, you can find me easily on Twitter. I really, I would love for- I love the Twitter handle, by the way. Oh, thank you. Julia.

Starting point is 01:06:49 You know, a last final story. When I was 18 years old and I went to college, I had to choose my email address, and it had to be more than five characters. And so J-U-L-I-A, how I spell my name, is five characters. It wouldn't work. So, you know, as an 18 year old, I had to come up with this handle, and you know what? Many decades later, it's still around. So any 18 year olds out there, the decisions live with you. And now, of course, you can have five characters or less. You know, the world is a different place, if only. So find me on Twitter.

Starting point is 01:07:31 The Talk. Come to the Talk. Velocity is such a great conference. Huge shout out to the O'Reilly folks who do an incredible, incredible job. I feel so honored to be able to talk about these things, both, I believe, in a keynote and a session. So we can dig deep there. And then it should, I think, be on YouTube later. So come find me.

Starting point is 01:07:49 I'd love to talk more. And I hope you are having a wonderful, delightful experience on all of your slacks. There you go. All your slacks. I'm on many slacks. Julia, thank you so much for your time. It's been a pleasure talking to you

Starting point is 01:08:04 and appreciate you coming on. Thank you both so much. All right. Thank you for tuning in today. If you enjoy the show, share with a friend. Rate us on Apple Podcasts, go on Overcast and favorite it. And of course, thank you to our sponsors, Airbrake, DigitalOcean and GoCD. Also, thanks to Fastly, our bandwidth partner.

Starting point is 01:08:26 Head to fastly.com to learn more. And we move fast and fix things here at Changelog because of Rollbar. Check them out at Rollbar.com. And we're hosted on Linode servers. Head to Linode.com slash Changelog. Check them out. Support this show. The Changelog is hosted by myself, Adam Stachowiak, and Jared Santo. Editing is done by Tim

Starting point is 01:08:42 Smith music is by Breakmaster Cylinder and you can find more shows just like this at ChangeLog.com or wherever you get your podcasts. Thanks for tuning in. We'll see you next week.
