The Changelog: Software Development, Open Source - Segment's transition back to a monorepo (Interview)
Episode Date: August 29, 2018. Adam and Jerod talk with two members of Segment's engineering team: co-founder and CTO Calvin French-Owen, as well as software engineer Alex Noonan, about their journey from monorepo to microservices back to monorepo. 100s of problem children to 1 superstar child.
Transcript
Bandwidth for Changelog is provided by Fastly. Learn more at fastly.com. We move fast and fix
things here at Changelog because of Rollbar. Check them out at rollbar.com and we're hosted
on Linode servers. Head to linode.com slash changelog. This episode is brought to you by
Indeed and I had a really interesting conversation with Darren Nix, the group manager at Indeed
Assessments. And Darren is running a remote
first team that operates like a startup inside Indeed. And you know Indeed, it's a huge company,
lots of resources, solving big problems, lots of big data. And Darren's team is hiring. Take a
listen. Darren, tell me about the big picture problem you're solving at Indeed Assessments. What our team does is we build tools so job seekers can show off their knowledge, skills
and abilities when they're trying to get a job way better than a resume can.
And that lets employers find great hires a lot quicker too and makes the process better
for everybody.
So you're running a remote first team looking to hire pretty aggressively Java engineers,
front end or React engineers, Ruby on Rails engineers, UX designers, business intelligence, and you operate Indeed Assessments like a startup that lives inside Indeed.
Tell me more.
Because we're basically a startup within Indeed, we get to hire folks all around the country,
even if they're not in Austin or San Francisco or Seattle.
And that means we can hire really great engineers who want to be able to work from their home city,
work on really big problems, but solve those problems in a startup-y way.
You know, we host our code on GitHub. We're on Rails and Redis.
We use Postgres and React and we're push on green.
So we deploy six times a day.
So I've seen charts that say like, hey, we deployed 13 times this week.
And I'm like, haha, we deployed like 78 times
because we like to go fast.
And so what we're doing here at Indeed
is finding ways to be able to continue to be startup-y,
but solve really big problems
and help hundreds of millions of people get jobs.
So if helping out your fellow engineers get jobs
sounds like an exciting problem and you
like working on startup-y tools at a really big scale, send us a note, reach out. I actually
interview every single person who comes to join our team. So I'll be meeting with you and I look
forward to hearing from you. So if you're looking to join a remote first team working on really big
problems that will literally impact hundreds of millions of people, head to indeed.jobs/changelog to learn more and take that first step.
Welcome back, this is the Changelog podcast, featuring the hackers, leaders, and innovators of software development. I'm Adam Stachowiak, Editor-in-Chief of Changelog. On today's show,
Jerod and I are talking with two members of Segment's engineering team, co-founder and
CTO Calvin French-Owen, as well as software engineer Alex Noonan about their
journey from monorepo to microservices, back to monorepo,
hundreds of problem children to one superstar child.
So we're here to tell the story of Segment going from monorepo to microservices and back again.
So we've got Alexandra Noonan here and
we've got Calvin, CTO and co-founder here. And so
maybe let's open up since we have, normally we have one or two people,
like one person on the show.
Let's open it up with like kind of who you are a bit.
So Alex, let's start with you.
Yeah, sure.
So I'm Alex and I am a software engineer for Segment.
I joined the engineering team about a year ago.
And before that, I was actually working on Segment's customer success team,
kind of solving tickets and teaching myself how to code so I could eventually move to engineering.
And then before that, I was in school studying math.
And that pretty much brings us to where I am now.
Awesome. And Calvin, you're the co-founder and CTO, is that right?
Yeah, that's correct.
Originally, we started Segment about seven,
a little over seven years ago now.
And at the time, we started at a really different place.
We were building different types of software.
After about a year and a half of trying to find product market fit,
we ended up on this analytics idea.
And we've kind of been building out that infrastructure and that product ever since.
And as I mentioned, we're here to share the story, kind of quite a journey.
And Alex, this is penned by you.
And from what I understand from behind the scenes, there's several people who led this effort.
It was quite a bit of an effort to do so.
Maybe let's open up with kind of the timeframe. I saw this, I think I logged this to our newsfeed
the day of when I saw it,
which was just last month, July 11th.
Is that around the timeframe of this blog post,
or does it kind of go further back than that?
Did it take you several weeks to write this, or kind of give us some timeframe of when this occurred?
Yeah. It actually took me six months to write this post. Rick Branson, it was actually kind
of his idea for the post because I was one of the engineers that was maintaining and trying
to build these microservices. And then I helped with transition and then was also maintaining the monolith after for a bit.
And he kind of came on and helped with transition a bit.
And so he asked,
he thought it'd be a really interesting post for me to write.
And since I was one of the main engineers
that kind of went through the entire experience,
he asked if I'd be interested in writing it in January.
And then I worked on it weekends, nights,
and then got it to
about 60% done, but wasn't totally happy with it. And then Calvin hosted an engineering blog week
where all people did for a week was write a blog post. So I took that week to get it over the line,
which was probably the last week of June. And then I finished it then and was sick of reading it. So we released it.
So this post made quite a splash. We saw it covered on InfoQ, as Adam said, ChangeLog News
logged it. It was shared broadly, probably on Hacker News. I'm not sure if it made Hacker News,
but I'm sure it probably did. Before we get into the actual move back, I mean, the reason why I
think this made a big splash is because anytime you see a trend in software engineering, and then you see kind of the first or maybe a
couple counter-trends, right? Like, this was going a certain direction, and now we are moving away from
the trend. That's interesting to us. And so, as you start off the post saying, you know, microservices
are the architecture du jour, and this is a circumstance wherein the architecture was not working out for Segment.
So Calvin, maybe you can first give everybody kind of the big picture of what segment is and does
and why it was a good fit.
You guys started moving to microservices early on and only recently, maybe six months ago,
maybe more, found out that it wasn't quite a fit
for you guys' team.
So tell us what Segment is,
in terms of what technically it does,
and then why it was a good fit,
at least at the time, for trying microservices.
Sure.
Segment, at its core, is a single API
to collect the data about your users and your customers and take that data,
whether it's from your website, if you're monitoring things like page views or
recording users adding items to their cart, or app interactions. Basically, adding
some code to send that data once into our API, and then letting us help handle
the fan-out and federation of
all that data into over 200 different analytics, email, and marketing tools that you might be
using.
And actually, Segment was kind of born out of our need as developers in the very beginning,
where we were trying to decide between these three analytics tools: Google Analytics, Kissmetrics,
and Mixpanel,
and we couldn't really figure out
what the differences were between them
or why would we want to use one versus another.
So what we did is we took kind of the lazy engineer's way out
and we built this layer of abstraction that sits in front
where you just send the data once in a common format
and say, here's who my users are,
here's what they're doing.
And then we help take care of all the transformations and mapping that are particular to each API.
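To make that idea concrete, here is a minimal TypeScript sketch of the "send the data once in a common format, then fan it out through per-destination adapters" model Calvin describes. The adapter names and payload shapes are illustrative assumptions, not Segment's actual code.

```typescript
// A minimal sketch of the "send once, fan out" idea described above.
// Adapter names and payload shapes are hypothetical, not Segment's actual code.

interface TrackEvent {
  userId: string;
  event: string;
  properties: Record<string, unknown>;
  timestamp: string;
}

// Each destination implements the same small interface: the adapter pattern.
interface DestinationAdapter {
  name: string;
  // Transform the common event into the destination's own payload shape.
  transform(event: TrackEvent): Record<string, unknown>;
}

const googleAnalytics: DestinationAdapter = {
  name: "google-analytics",
  transform: (e) => ({ cid: e.userId, ea: e.event, ...e.properties }),
};

const mixpanel: DestinationAdapter = {
  name: "mixpanel",
  transform: (e) => ({
    event: e.event,
    properties: { distinct_id: e.userId, time: e.timestamp, ...e.properties },
  }),
};

// The customer sends data once; the platform fans it out to every adapter.
function fanOut(event: TrackEvent, destinations: DestinationAdapter[]) {
  return destinations.map((d) => ({ destination: d.name, payload: d.transform(event) }));
}

console.log(
  fanOut(
    { userId: "u1", event: "Item Added", properties: { sku: "abc" }, timestamp: new Date().toISOString() },
    [googleAnalytics, mixpanel],
  ),
);
```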
And looking back on the history of the company,
actually, we started with a very monolithic pattern ourselves.
There was one service, which was a Node app, which basically packaged up our API, the CDN we used to serve JavaScript, and our web app, and it all used the same set of modules, the same single process.
They were just running across multiple EC2 instances. And as we started growing the team and growing the number of developers,
we quickly realized that that single service
wasn't going to hold up
as we basically added more and more people to it
where there are now more and more PRs
happening against the repos
there are more and more deploys happening every day
and we just started running into
a bunch of reliability problems
So to counter that,
and I think this was the heyday of when Node.js was all about these really tiny modules,
the kind of like left pad sort of really just very small bits of code
that could be reused in many places.
That's when we started splitting up our repositories
into different repos and
our services going from this monolith service to a bunch of different microservices.
And I think even today, well, at that point, we had about 15 engineers and we started ending
up with hundreds of different microservices.
And even today, we're probably pretty far on the spectrum of having too many services
per engineer, but we're starting to dial it back in a number of these key areas, which
Alex can talk about.
So one move I've seen a few teams or I've read a few teams make kind of between the
monolith and the microservice is introducing kind of a code-only separation.
I don't know if they call it service oriented architecture or if that's
something else,
cause I'm not up on the lingo,
but this idea of like,
we're going to introduce services into our architecture,
but not necessarily separate them at the network layer.
Was that something that was tried or considered along the way?
Or was it like,
let's, let's just use HTTP everywhere and have these microservices right out of the bat? It's interesting for us because Segment is actually a little bit different than, let's say, your traditional web app, like an Instagram or a Facebook. most of our what we call services could actually be more like workers, where typically each one
will ingest some data that it's reading off of a queue, either Kafka or NSQ. It will do some set
of light transformations or enrichment on that data, or maybe pull some extra data from another
database. And then it will typically republish that data either to a queue or make outbound HTTP requests to a third party API.
And when you think about data pipelines in that way, it actually makes sense that you'd have
kind of many different steps, each with different hops in between them. And if you want to change
kind of one hop or one service, you could do that independently of the rest of the pipeline.
So I think that's more what pushed us
to have these different services,
which, like you said,
we're actually running via separate code bases
because they all did something a little bit different,
but we also would run them in the same infrastructure
and on the same network.
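As a rough illustration of the worker shape described here, consume from a queue, lightly transform or enrich, then republish or call out to a third-party API, a hedged TypeScript sketch follows. The Queue interface is a stand-in for a Kafka or NSQ consumer, and the endpoint and field names are assumptions.

```typescript
// A sketch of the worker shape Calvin describes: read from a queue, lightly
// transform or enrich, then republish or call a third-party API.
// The Queue interface is a stand-in; in practice this would be a Kafka or NSQ consumer.

interface Queue<T> {
  consume(handler: (msg: T) => Promise<void>): void;
  publish(msg: T): Promise<void>;
}

interface Event {
  userId: string;
  event: string;
  properties: Record<string, unknown>;
}

async function runWorker(input: Queue<Event>, output: Queue<Event>, endpoint: string) {
  input.consume(async (msg) => {
    // Light enrichment step (could also pull extra data from a database here).
    const enriched = { ...msg, properties: { ...msg.properties, enrichedAt: Date.now() } };

    // Either republish downstream...
    await output.publish(enriched);

    // ...or make an outbound request to a third-party API (Node 18+ global fetch).
    await fetch(endpoint, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(enriched),
    });
  });
}
```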
The cool thing about Segment from my perspective,
just from a nerdy engineer thought life, is it's basically the adapter pattern for third-party services. Yeah, exactly.
Just like you would do for your database, right? Like, abstract a layer, and that layer is Segment,
and now you only write to Segment, and then it's going to front... Yeah. Google Analytics, Optimizely,
Mixpanel, Kissmetrics, all, you know, probably hundreds of them at this point.
And because of that, it does have a unique architecture
where it's basically at a service level,
it's implementing the adapter pattern.
And so it does break out, I think, mentally very well
because you have your analytics queue,
you have one big queue, I'm assuming,
and then you probably split that out
and have kind of service level queues. And so mentally, I think that would make sense
for microservices. Was that the thought process then?
Yeah, that's a great way of phrasing it.
So Alex, you have in your post kind of a drawing of this queue and description.
And it sounds like there was some coupling and some performance problems that were happening.
Can you tell us more about that?
Yeah. So when Segment was originally in a monolith for these destinations,
one of the benefits of Segment is that we retry events. So for example, say we get 500s
from a partner because they're experiencing a temporary issue. We want to send that event again,
but with our old setup, everything was in one queue.
And that included these retry events as well as live traffic. So if one destination was going down,
for whatever reason, that one single queue would now be flooded with tons of retry events as well as live traffic. And we weren't set up at the time to scale up quickly
enough to be able to handle this increase in load.
And so one destination having issues would affect every single destination,
which was not ideal.
So that was the original motivation for breaking them all up.
So we can have kind of this fault isolation between them all.
So instead of having one queue and multiple destinations,
you would have a queue per destination. And so these individual
queues became individual repos, individual services. Exactly. And so now if whatever
destination is experiencing an issue, only its specific queue would back up and everyone else
would be unaffected. So to me, that sounds like rainbows and unicorns. Like it sounds like you
guys solved it. So that's where it gets to, that's where the plot thickens,
right?
Because that was working for a while and maybe,
maybe we need a bigger picture again.
We understand what segment is.
Somebody give us maybe the team size,
the company size,
maybe the growth metrics of like the engineers and help us understand because
microservices,
these architecture decisions,
they change,
they're wildly subjective
um where and even in just in our last show we were talking about istio and you know i we talked
about microservice a little bit and i was asking the question of like how do you know when to
microservice when not to microservice and it's like really it's like that's like the ultimate
it depends you know which is basically most of software development is it depends.
And so maybe, you know, these case studies are so interesting because they give us data
points by which we can all make decisions better, you know, kind of as an industry individually.
But you can only actually apply the data if you are a subject, right?
If you're a comparable, it's like real estate sales, right?
We need to find comparable houses.
Well, we need to find comparable technical stacks, technical situations in order to say,
okay, this might not work for us.
So help us understand Segment at a macro level, the team, the company size, et cetera.
So Segment today, there's about 80 members of the engineering team.
And overall, the company size is close to 300 people.
When you ask the question about whether to adopt microservices or not, it's really
a case-by-case basis, a decision that's made very particularly for your company.
The way that I like to think about it is about whether you're ready to take on more operational
overhead that comes from running many different services, where maybe each one has its own code
base, it has its own set of monitoring and alerting that you have to be keeping track of. It has its own new deploy process, its own way of managing those services, etc.
And honestly, it's a lot of upfront work to run those sorts of microservices
that I think if we'd started there from day one,
honestly, the company wouldn't have gotten off the ground.
We'd have spent all
our time in terms of tooling and infrastructure,
and we wouldn't have
made any progress on the actual product.
But that
said,
there are a lot of benefits
to having microservices if you
have those systems in place.
For us,
we run everything on top of AWS
and we use Amazon's ECS,
their Elastic Container Service,
to run all of our services and orchestrate them
running in Docker containers.
And for us, we've invested pretty significantly
in building out the tooling around ECS,
around spinning up a new service
that automatically gives you a load balancer.
It gives you auto-scaling.
It gives you the ability to run this Docker image
as long as you built it via CI,
which we've also invested a lot of tooling in.
And I think given that we have that set of primitives,
it's made it so that we have
kind of this proliferation of services
because it's just so easy to do.
And it means that if you want to add
a little piece to the pipeline,
you don't have to make a change
which could potentially break the pipeline for everyone.
You don't have to worry about
adding a single slow component,
which now might buffer
in kind of this critical path,
which is dealing with hundreds of thousands of requests per second.
Instead,
you can think about your little
compartmentalized piece and
how that should perform
and behave.
And so, for us,
I think that
drove a lot of the decision towards moving
towards these really tiny
services where the
surface area was small and compartmentalized and well-maintained, where if you had a single service
that was acting up for some reason, like, let's say it's connecting to a database which starts timing
out and starts sending back connection errors, it doesn't then stall the rest of the pipeline in
terms of delivering data.
And so, like I said, we first adopted this when we were maybe 10 or 15 people,
which looking back on it now, I'd say it was definitely on the early side.
And we had to build a lot of operational excellence in terms of running these services.
I think we were some of the earliest ECS users.
Today, I think we're some of their heaviest users,
running about 16,000 containers total across all of our infrastructure.
We basically had to build that muscle separately
and put in more upfront cost,
which then allowed us to scale a little bit more easily
when it came to building out the pipeline.
That said, it's not without costs.
At this point, we built so many of these little services and so many different code paths
that it's actually difficult for individual developers to keep track of how they connect.
If you make a change to one part of the pipeline, how it affects the rest, that sort of thing.
So there's definitely other downsides that I think are maybe not as talked about as much,
especially if you adopt microservices really early. This episode is brought to you by Linode, our cloud server of choice.
It's so easy to get started.
Head to linode.com slash changelog.
Pick a plan, pick a distro, and pick a location, and in minutes, deploy your Linode cloud server. They have drool-worthy hardware, native SSD cloud
storage, 40 gigabit network, Intel E5 processors, simple, easy control panel, 99.9% uptime guarantee.
We are never down. 24/7 customer support, 10 data centers, three regions. Anywhere in the world, they
got you covered. Head to linode.com slash changelog to get $20 in hosting credit.
That's four months free.
Once again, linode.com slash changelog. So Alex, one of the things that you say in this post is that the touted benefits of microservices
are improved modularity, reducing test burden, better functional composition, environmental
isolation, and development team autonomy.
These are the ones that many of us have heard and talked about and kind of analyzed.
Definitely true.
The opposite, you say, is a monolithic architecture where a large amount of functionality lives
in a single service, which is tested, deployed, and scaled as a single unit.
Now, we know monoliths can be majestic.
They can also be monsters.
But you had switched to microservices for this part of
Segment. And then you said in 2017, you started reaching a tipping point with this core piece of
the product, which is the one that we're talking about. And I love this statement. You said it was as
if you were falling from the microservices tree, hitting every branch on the
way down, which sounds painful to me. So tell us about that. Like when did, as I said
before, it seems like rainbows and unicorns, there seems like a very good fit because of the
infrastructure that y'all have, but it didn't quite work out. And so that's kind of the,
where the plot thickens. Tell us, you know, what those branches on the microservices tree felt like
and what happened there? So when I joined the team, I actually joined at the peak of when it was getting to be a
little bit unbearable.
And one of the first issues that we were running into was all these separate code bases were
becoming extremely difficult to maintain.
So we'd written some shared libraries to help us with some basic HTTP request formatting, error message parsing that all of them used.
But at some point in time, we made a major version update to that library, and we didn't have a good way to test and deploy these hundreds of services.
So we just updated one service, or one repo,
to use the newest version,
and now everybody else is behind.
And that kept happening over time
with our other shared libraries as well.
So now me going into a code base,
I had to be like, okay,
which version of this shared library is it on?
What does this version do
versus some of the newer versions?
And having to keep track of that
was incredibly
time consuming and very confusing.
And it also meant we wouldn't make big updates that
we often needed in these shared libraries, because we were like, there's no way we're
going to test and deploy all these microservices; that would take the entire team and usually
up to a week to just do that.
So that was one of the big issues with it.
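To illustrate the version-drift problem Alex describes, here is a small hypothetical script, not Segment tooling, that reports which version of an assumed shared library ("shared-http" is a made-up name) each service repo pins in its package.json.

```typescript
// A small sketch of the version-drift problem: with hundreds of repos, each service
// ends up pinned to a different major version of a shared library. This hypothetical
// script reads each service's package.json and reports which version of an assumed
// "shared-http" package it depends on.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

function reportVersions(servicesDir: string, libName: string): Map<string, string> {
  const report = new Map<string, string>();
  for (const service of readdirSync(servicesDir)) {
    try {
      const pkg = JSON.parse(readFileSync(join(servicesDir, service, "package.json"), "utf8"));
      report.set(service, pkg.dependencies?.[libName] ?? "not used");
    } catch {
      // Not a Node service, or no package.json; skip it.
    }
  }
  return report;
}

console.log(reportVersions("./services", "shared-http"));
```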
Another was we were actually seeing some serious performance issues.
So now even though all the destinations were broken up into their own queue,
so if one went down, it didn't affect the others.
The issue was they all had radically different load patterns.
And so one of these destinations would be processing hundreds of thousands of events per second, while others would only process a handful per day. And so we tried to, we always
were trying to reduce customization across these services to make them easier to maintain. So we
applied blanket auto-scaling rules to all of them,
but that didn't help with some of the smaller guys because nothing can really handle,
there's no set of auto-scaling rules that can handle a sharp increase in load. And so for the
little guys that were handling a handful per day, and then all of a sudden a customer turns them on
and now they're handling hundreds of events per second, they can't scale up. So we're constantly getting paged to manually go in and scale up these little guys. And the
blanket auto-scaling rules also didn't work because they each had a pretty distinct load
pattern in terms of CPU and memory. So some were much more CPU intensive, while others were more
memory intensive. And so that also didn't help, which again, caused us to have to go in and
manually be scaling these services up.
So we were constantly getting paged because queues were backing up to have to scale these
guys up, which was pretty frustrating.
And like I said, we were literally losing sleep over it.
It was very frustrating.
It sounds like it.
So you mentioned that there's, you had three full-time engineers pretty much spending their
time keeping the system alive.
Is this
what you're referring to, like having to go in and scale things up and down when certain services
wouldn't keep up with the load? Exactly, exactly. So it was difficult
for us to add any new destinations, because we were spending so much time maintaining the old ones,
and then we had a backlog of bugs building up on the old ones. And we just, we couldn't make any headway at all
because the performance issues
and the maintenance burden was so painful
with all these repos and services and queues.
It was getting to be too much.
So Calvin, tell us about this from your perspective,
from a CTO side, when this is going on
and you have a lot of bugs happening,
you have a lot of manual intervention
by your engineers, probably not what you want them spending
their time doing.
Was this something that kind of, like, came on all at once,
or was it kind of a slow trickle
that eventually broke the dam? What did it look like
from your angle? From my perspective,
honestly, I was
working on a lot of these same systems
along with Alex here, so
it's definitely not something that snuck up on us
or felt like it was just a sudden deluge of paging and alerts
and problems that happened.
They sort of grew in intensity over time fairly slowly.
I think at a certain point, we started having a few large customers who would consistently be batching data in ways that were actually disrupting quality of service for other customers.
So you might imagine customer A is sending thousands of requests per second.
Customer B is sending tens and maybe customer C is sending one request per second.
If we're being rate limited
by a destination that we're sending data to,
and we're just reading off of a queue,
if we let those thousand messages in first
and just sort of do like a FIFO,
first in, first out kind of approach,
then we're effectively limiting the amount of data
we can deliver for customers B and C, even though they didn't do anything wrong.
And so we actually took a step back and said, hey, maybe we should rethink both all these
individual services, which are scaling poorly, and we should rethink our entire queuing architecture for this problem of a high failure
scenario, where approximately 10% of messages that are going out will fail for one reason or another,
whether that's an API outage, a rate limit issue, or maybe just an ephemeral network connection.
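A toy TypeScript illustration of the head-of-line blocking Calvin describes: with one shared FIFO, a large sender's backlog starves smaller senders, while round-robining across per-customer queues keeps delivery fair. The numbers and the scheduling policy here are simplifications for illustration, not how Centrifuge actually schedules work.

```typescript
// With one shared FIFO, customer A's thousands of queued messages delay B and C.
// Round-robining across per-customer queues (roughly what per-customer virtual
// queues buy you) keeps small senders from starving. Numbers are made up.

type Message = { customer: string; id: number };

function drainFifo(queue: Message[], budget: number): Message[] {
  return queue.slice(0, budget); // whoever enqueued first wins the whole budget
}

function drainRoundRobin(queues: Map<string, Message[]>, budget: number): Message[] {
  const out: Message[] = [];
  while (out.length < budget) {
    let progressed = false;
    for (const q of queues.values()) {
      const msg = q.shift();
      if (msg) { out.push(msg); progressed = true; }
      if (out.length >= budget) break;
    }
    if (!progressed) break; // all queues empty
  }
  return out;
}

// Customer A floods 1000 messages; B and C send a handful each.
const a = Array.from({ length: 1000 }, (_, i) => ({ customer: "A", id: i }));
const b = Array.from({ length: 5 }, (_, i) => ({ customer: "B", id: i }));
const c = Array.from({ length: 5 }, (_, i) => ({ customer: "C", id: i }));

console.log(drainFifo([...a, ...b, ...c], 10).map((m) => m.customer)); // all "A"
console.log(
  drainRoundRobin(new Map([["A", [...a]], ["B", [...b]], ["C", [...c]]]), 10).map((m) => m.customer),
); // "A", "B", "C" interleaved
```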
And at that point, we introduced this new set of architecture that we called Centrifuge, which was this kind of revolving set of queues. And we realized that the work we wanted to kick off to make sure that our customers are being treated fairly
would actually be much easier
if we had a single service that we were working with.
So why don't we kind of do both projects
in sort of lockstep,
where we transition these integrations to a single service,
which should help a bunch of these different problems
that Alex just talked through,
as well as help the end customer make sure that their data is getting where it needs to go quickly and reliably.
How are you managing time in this?
Because I mean, I'm thinking startup customers, you need to move efficiently.
And Alex, you mentioned that this post took you six months to write.
This is probably you've been on board for a year.
A lot of this takes a lot of time.
How do you manage, you know, maybe from a CTO level and maybe from your perspective,
Alex, as an engineer, how do you dictate architecture and initiate the team to move
forward and still please people and get things right?
Yeah, maybe I can start off first from sort of the more global perspective
and then transition to Alex for her perspective as well.
When we think about Segment's core value proposition,
there are maybe two or three things that we do with customers' data.
First is that we help them collect and organize that data.
So we want to make sure that our ingestion endpoint
is always up, that we're never dropping data,
that we're giving them libraries with a good experience to send that data into our system.
And then the second core tenet is that we are taking that data and making sure it gets to whatever tools it needs to be delivered to in a fast and timely manner, where fast would be something like under 10 seconds. And when we were looking at the current system, we'd kind of juiced it in one way or another and made all these tweaks to it, and it
was still just not working. And when we would see these long delays, we're effectively violating
that second value proposition of Segment. If we're not getting
your data where it needs to go, and it's taking 20-30 minutes to get there, we're not doing our
core job for all of our customers. So for us, it actually felt fairly well aligned to kick off this
set of projects to deliver really what our customers wanted more than new features, what they wanted more than any sort of other new products we could launch.
They just wanted our core product to work amazingly well.
What about your perspective on the engineering side?
How do you be on a team where you have to implement this,
but you're making choices, you know, your shared
libraries, you've got different versions of them, you're scaling your repos, and it
seems like things are okay at first, and then things start to fall down. Jerod mentioned
hitting the branches, and maybe you can go a little further into what they looked like and how that
feels. It's actually a little bit interesting. So when I joined the team, like I said, we were at
kind of a peak of "something has to change."
And I was so brand new to engineering then.
So I kind of thought at that time that this was just how it was.
And I didn't totally see anything wrong with it until after we moved to the monolith and I helped the team transition to everything. And then I, looking back on it, kind of realized how crazy and how much time we
were spending scaling this and just maintaining them and how we couldn't make any headway.
But in the moment, I didn't, I don't know, I didn't see anything wrong with it just because
I was so brand new to engineering and had no experience before. I was like, oh, this is
kind of annoying, but this seems pretty normal. That's interesting. And I was coming from a
perspective, you mentioned that you had previously taught yourself or self-taught developer.
Is that what you said in the opening?
Yes.
Tell us how that feels.
I mean, so one thing that's happened, kind of the metagame on this blog post, which was
made such a splash, is you've gotten a lot of attention.
Like I said, InfoQ, us, you're on the changelog.
You're going to be speaking at conferences about this. As a
self-taught developer, I know that it's probably an obvious kind of yes answer, so maybe
this is a dumb question, but there's so much intimidation out there. I'm also self-taught.
I've been doing it for a very long time now, so I don't feel it as much; I've gotten past a
lot of that stuff. Self-taught in software development, that is; I did go to school for general computery stuff.
But you mentioned you were in school for math,
so related fields,
but definitely a bigger transition than I made.
How does it feel to put yourself out there
and make this post,
which is somewhat counterculture right now,
countertrend?
Very well, by the way, you mentioned it's six months.
Very well thought out, very well reasoned,
and not flamebaity or clickbaity at all in its content. So congratulations on that. But just tell us, I guess, in the metagame sense, like, your feels with the post. I... we had no idea it was going to
be this crazy. We knew we were going to stir the pot a little bit. Um, but we had no idea the
impact it was actually going to have. And I'd always wanted to write a post. So I thought this
would be a cool one. I was just going to write about my experience, um, kind of as an engineer
at Segment. And then it got a crazy amount of attention, and I probably had, I
think, the worst imposter syndrome I've ever had in engineering. But it's been pretty cool. When I
first started engineering, too, I wasn't super involved in kind of the Hacker News or the
community, so I really didn't have an understanding of the impact it was going to have,
though Rick tried to warn me. But now,
actually, with this post, it's kind of helped me get into the community a little bit more, so I've
been doing more reading, like on Hacker News, and listening to podcasts, and it's been pretty cool.
And I think I understand now why it was such a big deal to people, because for me it was just, oh,
I'm just writing about my experience, what happened at Segment and why we did certain things.
But it's been really fun.
Definitely a little scary, lots of imposter syndrome,
but it's been really cool to see.
It makes such an impact in the tech world.
Well, as we say, the best thing to do with imposter syndrome is just to punch it in the face.
You just got to face it.
I love that one.
I like that. I'll remember that.
There you go.
So curious, I mean, besides us,
what's been the overarching response from the community?
Has it been a lot of pushback?
Has there been negativity?
People saying, y'all segment are crazy.
You don't know what you're doing.
Or has it been, wow, this is really interesting.
Maybe we'll consider it.
Because there's just a lot of tension around the whole monolith
versus microservices debate.
Totally.
And microservices have been around long enough now
that we're starting to see people who have run them
for a couple of years or a year.
And so you can't really tell if a thing is like scalable,
long-term net good, or if it's just like a series of trade-offs until you have some real world experiences.
So maybe we're starting to see that.
But curious, what has been your response?
In your eyes, what's the overall response been?
Positive, negative, meh?
Yeah.
So I actually didn't read a single Hacker News comment because... Good for you.
That's how you do it.
Exactly.
Well, I also thought when I released it, there was only going to be 50.
And then I checked later in the day and there was over 700.
Oh, wow.
That's awesome.
700 comments.
Yeah, it was crazy.
But the general feedback I feel, after talking to other media and kind of
some of my friends that are in the industry that saw the post, is that it's actually been relatively
positive, and people were just super curious as to, like, why we did that, why we made this change.
Because kind of, again, as you mentioned, the microservice, I guess, boom happened a few years ago,
so some people are starting to kind of realize that this may have not been the right setup for them.
So it's been, I would say, pretty positive.
And more people are just really curious about why we did it and want to know more.
And it's been really cool to see people wanting to be more educated and understand the details about why we did it.
But I don't know what happened on Hacker News. I heard it was relatively good, but I didn't read a single one.
That's a good, that's a good tactic for Hacker News is have a friend read it for you and then
just kind of summarize. Yeah, pretty good. Exactly. That's basically what I did. Some
people sent me, like, screenshots of really nice comments. I heard there was some negative feedback,
but I've heard that's also pretty typical with Hacker News, so I wasn't too worried
about it, and that didn't seem like the overall feedback. So the Changelog's experience, like,
our show's experience with Hacker News over the years, until recently, has been whenever somebody
posts us and we happen to make it on the homepage, one of our episodes, undoubtedly, like without a miss, somebody would say, this is lame.
Where's the transcript?
I just want to read it like every single time.
Isn't that right?
I'm like, somebody would say that.
I'm just like, can you give us a break?
We're just doing a podcast.
And now we have transcripts.
And so they can't say that anymore.
They don't say that anymore.
We do it for accessibility, but we also kind of do it
to shut up hacker news people.
So it makes the transcripts worth it.
Here's what I'm kind of curious of,
especially Jerod,
you mentioned earlier in our conversation,
which was yesterday.
And these episodes will come out
in different timeframes.
So the Istio episode
should already be out.
If you're listening to this,
that episode should be out
because this one's coming after that.
But in light of that,
I mean, like clearly
there's something happening
in like the service mesh and microservices area.
So this is definitely subjective in terms of like your engineering and your culture.
So it works in places.
And I'm curious because in the pre-call, Alex, you mentioned, and Calvin, I don't think you were on the call yet when we were kind of having some pre-call conversation.
Just, you know, that you're a co-located team.
You have two engineering offices, and
maybe it makes sense where teams are completely, like I can't see Jared, we're on the same
team, but he and I have no conversation with one another.
Maybe that makes sense where microservices make sense and they don't make sense here
because you have co-located offices and your teams can maybe interact more fluidly and
that kind of thing. Where do you
think that breakdown really happened with this? Was it purely technical or is it because of the
way your product and teams operate? At the time, we actually didn't have the Vancouver office.
Okay. So one office.
All engineers, exactly. All the engineers were in San Francisco. And I think it was,
I mean, a mix of both, not a great answer, but part of it was burden on the team.
And our productivity was down.
And another part of it was these performance issues that we wanted to get rid of for customers.
So a combination.
But Calvin, you can probably add a bit more color since you kind of were here for more of the microservice setup than I was.
I'd say for us that it's definitely a combination. And I should also be clear,
by no means are we abandoning microservices across segment. There are actually a lot of
good reasons to use them across many pieces of our infrastructure. Within this one particular case,
we found we had better luck moving over to a single service. So I'd say we're continuing to
make this same trade-off and set of balances. When you talk about service mesh, I think that
is definitely something that we are following fairly closely and are super interested in.
And actually, Alex has started a project working to incorporate Envoy as part of this new future of service mesh within Segment,
which we're monitoring as we move forward.
I think in our case, it was probably a bit of a combination of both.
We had this team of engineers who were trying to wrangle
100-plus code bases
across 100-plus services. And when all of them do a similar thing, that's really just hard to
manage and you have to build a lot of tooling around it. And we figured, well, we'd rather
take the relatively slow rate of changes being made to a single place versus having to manage this many code
bases and this many services.
I think the one other thing that changed here as well, originally we had anticipated that
third parties would be adding a lot of their own code into these integrations.
So you might imagine we support Amplitude and Mixpanel as places that we send data.
We were kind of expecting that we would have engineers from those teams actually making
pull requests, contributing whenever they pushed updates to their APIs.
And in practice, that didn't really turn out to be true.
It ended up being a team here who was working on it.
So we said, well, we thought we'd get these supposed benefits.
We're not seeing those.
Let's move over.
This episode is brought to you by our friends at GoCD.
GoCD is an open source continuous delivery server built by ThoughtWorks.
Check them out at GoCD.org or on GitHub at GitHub.com slash GoCD.
GoCD provides continuous delivery out of the box with its built-in pipelines, advanced traceability, and value stream visualization.
With GoCD, you can easily model, orchestrate, and visualize complex workflows from
end to end with no problem. They support Kubernetes and modern infrastructure with
elastic on-demand agents and cloud deployments. To learn more about GoCD, visit gocd.org slash
changelog. It's free to use, and they have professional support and enterprise add-ons
available from ThoughtWorks. Once again, go cd.org slash changelog.
Calvin, you had mentioned Centrifuge as a core piece of engineering infrastructure that you built as part of this transition. Can both of you help us understand from the point that you decided, okay, and very well noted that this is not all of Segment that has moved, right?
This is a specific section of Segment that Alex's team has moved from microservices back to a single service.
Take us step by step through that.
Once the decision was made, okay, we're going to do this.
I know centrifuge is involved somehow, but please help us all understand very clearly
step by step what it took to get to where you are today and to where you could write
your post saying goodbye to microservices. Back in April of 2017,
we kept hitting these delays with various parts of the pipeline where customers would see their
data being delayed for 20, 30 minutes, while either our current queuing setup would block
up with a single customer's data or particular destinations wouldn't scale appropriately,
as Alex was just talking about.
And at that point, we said, okay,
we need a bigger overhaul
to the way that we actually deliver data outbound,
which should rethink a bunch of the primitives
that we built these individual queues per destination
over the past
two years and should hopefully help us scale for the upcoming three to five years as we 10 or 100x
our volume. And once we kind of acknowledged this was a problem, Rick Branson, who Alex has talked
about a bunch, spearheaded the effort to actually architect the system
that he called Centrifuge.
And Centrifuge effectively replaces
the single queues that we have.
So one queue for Google Analytics
with one queue for Mixpanel,
one queue for Intercom,
with what you can think of as almost being virtualized queues
or individual sets of queues on a per
customer per destination basis. So we might have one queue for Google Analytics, which has all of
Instacart's data, but another one with all of New Relic's data, and maybe another one with
Fender's data. And this system, honestly, we hadn't seen any really good prior art for.
I think network flows are about the closest
that you get to it.
But those give you back pressure
in terms of being able to say,
hey, there's too much data here,
like stop sending from the very TCP source that you have,
which is something that we can't exactly
enforce on our customers.
So with this design in hand for Centrifuge, we started out on what actually turned into
about a nine-month journey where we decided to roll Centrifuge out in production.
And Centrifuge was responsible for all of the message delivery, the retries, and archiving of any data which wasn't delivered.
And then separately, Centrifuge would make requests into this new integrations monoservice, which you could think of as being this intelligent proxy, which would take these raw data in, and depending on where it's going, make multiple requests to a third party endpoint.
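A rough sketch of the "virtualized queues" idea, keyed per customer per destination, is below. This is an illustration of the concept only, with made-up type and key names, not Centrifuge's implementation.

```typescript
// Instead of one queue per destination, messages are bucketed per
// (customer, destination) pair, so one customer's retries can't block another
// customer's deliveries to the same tool. Illustration only, not Segment's code.

type Delivery = { customerId: string; destination: string; payload: unknown };

class VirtualQueues {
  private queues = new Map<string, Delivery[]>();

  private key(d: Delivery): string {
    return `${d.destination}:${d.customerId}`; // e.g. "google-analytics:instacart"
  }

  enqueue(d: Delivery): void {
    const k = this.key(d);
    if (!this.queues.has(k)) this.queues.set(k, []);
    this.queues.get(k)!.push(d);
  }

  // A failing (customer, destination) pair backs up only its own bucket.
  backlogFor(customerId: string, destination: string): number {
    return this.queues.get(`${destination}:${customerId}`)?.length ?? 0;
  }
}
```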
And for the rollout process there, like I said, we spent maybe a month or so designing it.
Then we began to actually consolidate the repo and move it into be a single mono service.
We started building out the bones of Centrifuge for another three or four months or so.
And we started cutting over our first traffic after about a five month period.
Now, when we started cutting over traffic, we had this interesting problem, right?
Where if we're sending traffic via two pipelines, we have to test it end-to-end in whatever destination tool,
if we both just mirror traffic and let them both go,
we'll end up with double counts in Google Analytics or double counts in Mixpanel.
So we actually added a kind of serialization point in the middle
that both the set of microservices would talk to as well as the monolith.
And effectively, it would do kind of a first-write-wins type of scenario, where it creates
some locks in Redis.
And then only one of the messages would succeed through either pipeline.
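To make that dedupe step concrete, here is a minimal sketch of a first-write-wins claim in Redis, assuming the ioredis client and a made-up key scheme and TTL. It illustrates the technique described, not Segment's actual rollout code.

```typescript
// Both the old microservice pipeline and the new monoservice try to claim each
// message in Redis; only the first claimer is allowed to deliver it, so the
// destination never sees a double count. Key names and TTL are assumptions.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

async function claimMessage(messageId: string, pipeline: "legacy" | "centrifuge"): Promise<boolean> {
  // SET ... NX only succeeds if the key does not already exist; the EX TTL
  // keeps the lock table from growing forever.
  const result = await redis.set(`delivery-lock:${messageId}`, pipeline, "EX", 3600, "NX");
  return result === "OK"; // true => this pipeline won and should deliver the message
}

// Usage: whichever pipeline gets here first delivers; the other drops the event.
async function maybeDeliver(messageId: string, deliver: () => Promise<void>) {
  if (await claimMessage(messageId, "centrifuge")) {
    await deliver();
  }
}
```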
And we basically slowly ramped traffic in that manner, always checking the end-to-end
metrics on it, always making sure that no matter which pipeline we were using, the delivery rates
looked perfectly good. And I'm sure, actually, Alex can talk to more of that rollout period,
because it was definitely a little bit rocky in terms of how we rolled out the system.
But about two, three months after that,
we'd fully tested all the scaling,
cut over 100%,
and we're feeling much better
about the system stability.
And looking at it today,
it's actually a very rock solid
and well-utilized piece of infrastructure.
Alex, anything to add there?
As you mentioned, the process to get to 100% was, I think, a little bit longer than we
anticipated.
I remember we'd be in planning meetings at the beginning of the week in Cabell and be
like, okay, what do we need to cut traffic over 100% by like in two weeks?
And we'd always be like, oh, we just need to fix this one performance issue and
then we're good to start cutting over. And then we'd try and cut some over and quickly realize
that there was a lot more performance stuff we needed to tackle. But now that it's all done,
as Calvin mentioned, it's a rock-solid system, and it's really cool and complex. So it was
definitely worth a little bit of that migration pain, but now the system is very stable and can scale much greater. And we've been able to build cool products on top of it, which we couldn't have done before with our microservice architecture, which has also been really exciting to see. One of the biggest pain points with Segment is that customers don't get a lot of visibility into
what happens when they send data to Segment and then when we send it on to a destination.
So a product that I built with one of my other teammates at the time was we built something on
top of Centrifuge to basically collect the responses and counts of metrics, whether an
event they sent was successful to Google
or got rejected and why, and then display that in the UI to users. But with the microservice queue
setup, there wouldn't have been a good way for us to pass that information back and somehow store it
so that we could show that info to users. But with Centrifuge, since Centrifuge kind of is keeping
track of all of this, it knows everything already.
And we just kind of had to flush that data out to a queue
and then store it from there.
And now we have it in the UI.
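A hedged sketch of that delivery-visibility idea: count per-destination outcomes and periodically flush them to a queue for storage, so a UI can show delivery status. The type names and shapes here are assumptions for illustration, not the product's actual data model.

```typescript
// For every attempt, record whether the destination accepted or rejected the
// event and why, then periodically flush aggregated counts to a queue so the UI
// can show per-destination delivery status. Illustrative shapes only.

type Outcome = { destination: string; status: "accepted" | "rejected"; reason?: string };

class DeliveryMetrics {
  private counts = new Map<string, number>();

  record(o: Outcome): void {
    const key = `${o.destination}|${o.status}|${o.reason ?? ""}`;
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  // Called on an interval; publish() would hand the batch to a queue for storage.
  async flush(publish: (batch: Array<{ key: string; count: number }>) => Promise<void>) {
    const batch = [...this.counts.entries()].map(([key, count]) => ({ key, count }));
    this.counts.clear();
    if (batch.length > 0) await publish(batch);
  }
}
```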
And I think we've had radically positive feedback
on that feature that now customers can see,
okay, I sent an event to Segment.
I see it in Segment.
And now I see Segment sent it to Google Analytics
and it was either successful or it failed for whatever reason, which they'd never had that insight into before. They could only see that the event made it to Segment, and then they'd have to go
check Google, see, okay, when their event's not there, they have no idea what happened.
So that was a cool product that Centrifuge allowed us to build.
This is one of my favorite products that we've launched all year and maybe ever.
It just provides essentially a status page
for these hundreds of different downstream tools
in a way that none of them
or many of them do not do natively,
where you can see exactly what is happening
with your data
and whether it's being rejected
or accepted by each API
and how long it takes to get there.
Just unparalleled level of insight.
That was a great post on your blog written by you, Calvin, all about Centrifuge for people
who are interested.
Centrifuge, a reliable system for delivering billions of events per day.
Is this laying out the infrastructure and architecture?
Is this an announcement of some sort of open source project?
What's the status of Centrifuge? Is it public use? Is it private to Segment?
Yeah, currently it is private to Segment, though this post goes into a lot of depth about the
architecture, the choices that we've made, and how it's been to operate it in production.
At some point, I would actually love to open source Centrifuge or at least the bones of it itself, because it seems really useful for anyone who's running a large online web service, particularly if they need to make web hooks out from some data that's inside that service, or they need to send a bunch of data out to many different endpoints, which might be flaky, might fail at any given time. That would be very cool to see that open source.
So this sounds like the project that took nine months, but it sounds like you thought
it was 90% done maybe a few months in and it just stayed at 90% as engineering problems
tend to do.
What you say, Alex, that weekly meeting when you're like, yeah, it's pretty much finished.
Just a couple more weeks, you know?
I'd say so.
So I transitioned onto the Centrifuge team
as they already kind of had an initial prototype. I was still helping maintain some
of our microservices, but when I joined, I felt like every week we were like, all right, we're this
close, we're a month away. And it dragged on for a few months, but as you mentioned, that's pretty
natural, like, for every big migration and engineering undertaking. So what does this imply or inform with other parts of Segment, Calvin?
Is this switch back to a single service something that's very specific to this part of Segment?
Is this something you're now considering for other parts of your product or engineering teams that are still working in the microservices world?
Or is this a one-off that fit this use case, but probably not your other ones?
There are a few other places where we're considering consolidating services.
And I think there's a couple of reasons for that.
One is within the pipeline, there's sort of this natural entropy over time where systems will split up and break apart as people add and tack on new features to them.
In terms of the pipeline itself, I think we want to make sure that it's easy to reason about, it's easy to find what you're looking for. And you can kind of go to a couple
of key places that need to be independent services and understand everything that it's doing.
I think the second piece that we're interested in consolidating around is actually cost.
Obviously, every time you copy data over the network or republish to Kafka or have a system
which is deserializing JSON and then reserializing back up, it's much more expensive.
So in order to keep costs low for all of our customers,
we're interested in
consolidating some there as well.
I'm kind of curious if you can inform
other CTOs that listen to the
show or engineering teams
or engineering managers
on maybe the process, because you mentioned
Centrifuge isn't
a public service yet
or it's not open source or whatever your plans are with it.
And Alex, you mentioned it took you six months to write this post.
Like, I'm curious from a content perspective, like your motivations for these two posts in particular.
Like, was it customer acquisition?
Was it, you know, was it just telling the world how you do things?
Was it idea sharing?
Was it to attract the right kind of talent? Like what are the motivations for being so thorough and so well done with your, with your
engineering blog? I'd say the blog is actually a deeply cultural part of Segment that kind of
goes back to our founding days. Initially, the four of us were just all engineers. We had no users, and we said,
like, oh, how do we get
developers to try out
our tool? How do we get people interested in this?
How do we actually just
start getting our name out there?
And I think
the blog was the first thing that
we turned to as something
where we figured
out how to write interesting content that was effectively stuff that we were already doing that we just wanted to share with the world.
So actually, if you go back through even to some of the very early blog posts, they're constantly documenting either things that we learned or new ideas that we've had or
sets of best practices that we've learned from what we've built.
And as that's grown over time, we've really seen it be impactful on a number of dimensions.
One of those is around customers and brand.
Obviously, Segment is kind of a developer tool. I think in order to have
engineers and developers trust you, I know at least I'm reading other companies' blogs all the
time, following folks on Twitter to understand what's coming next in terms of tech. And
sharing what we're doing internally already out there, I think helps build a lot of
that trust, particularly when it comes to onboarding and setting up Segment for the first time.
Sort of something you're a little bit already familiar with. Then there's definite benefits,
as you mentioned, on hiring as well. A number of teammates who end up joining the team all say, hey, I first saw you through the blog.
That was the place where I first found out about Segment. And then I was able to dig in more and
understand what was really going on at the company. And it gave me more of a window than I would have
had really any other way. So I think there's that too, in addition to just being an amazing way to share when we learn things, either about new parts of architecture or about switching between monoliths and microservices, as Alex has talked about as well. So Alex, first-time blogger on the Segment blog, home run out of the box. You know, what's your experience, you know, with
other team members even? Like, what's your experience with, you know, getting a chance to share some deep
interest, and obviously quite a bit of passion, six months to write it, and you're on the team,
you're obviously doing great work, you're passionate about what you're doing. Like, what is
it for you to share this through the blog? It was a really cool experience. I know a lot of people
at Segment were kind of curious why we'd moved to Centrifuge and invested so much time in this. So I had some engineers that had joined post-Centrifuge ask me about it. And this is the first post I've ever written about anything.
So it was really cool to get to just share my experience
and have it kind of take off,
and knowing that a lot of people have read the post,
and I think a lot of people have actually found value in it,
which has been the coolest part,
that we've had so many people reach out,
interested and curious to learn more, has been really exciting and eye-opening. And to just inspire, I know a lot
of women came up to me after and were really inspired by the fact that I had such a post that
went so crazy on the internet. Cause you don't see a lot of posts from women in engineering
because there aren't many. But that part also made me really happy.
That's interesting.
So one part to inform your counterparts inside of Segment, because you got 65 engineers.
Some of them may be in the know, some may not.
This is a way to inform internally.
One way to also inspire.
And then another way to potentially hire.
We often interface a lot with different engineering teams through just what we do.
And I'm always curious why some of them don't put enough intention into their engineering blog.
So since you do such a great job, I wanted to make sure I asked you that question before we close out because you do a phenomenal job at it.
One, with the writing and then two, just with the design of it.
It's easy to read.
It's easy to browse.
And if you're listening to this, you get my stamp of approval to say, this is a blog you should look at to mirror or to mimic
when trying to do it for your organization.
Thank you.
Did you do that rhyme on purpose to inform, inspire, and to hire?
Was that all?
Did you plan that out, Adam?
I did.
Sorry.
Yeah.
It's a nice touch.
It's a nice turn of phrase.
And with that, you know, that's the show.
My rhymes and the show.
And I love that.
So there you go.
But any closing thoughts from either of you?
Any closing advice for, you know, those looking to your post as, you know, the Bible of information on whether we should go there and back again, and then there again.
Any closing advice for those listening to the show or anything to share as we close out?
I think my one piece would be that it's really all about finding the right fit for your infrastructure and your team. A lot of people
have reached out and been kind of nervous that they're going to make mistakes with their
microservice setup, and were curious to get my opinion on what I thought. And I think, again,
it's all about what is the best for your team at the time. And we're a perfect example of that.
You know, as Calvin mentioned, we started in a monolith because if we'd started with microservices,
there would have been no way for us to get off the ground. And then we moved to
microservices and that was the perfect solution for the time. But then after a couple of years,
it turned out not to be anymore. So I would say don't be afraid to make changes. And
it's about finding the right solution for your team and your infrastructure.
I would echo that 100%. Definitely just don't outsource your thinking.
It's just important to talk about trade-offs on both sides, really, I'd say for any engineering
decision you make, because if you don't explicitly acknowledge them, chances are there's some
con or something that you're giving up that you might not notice later.
Well, Alex and Calvin, thank you so much for taking the time
to walk us through some of the pros,
cons, ins and outs of your journey.
We appreciate your time.
Yeah, thank you.
Of course, thank you.
All right, thank you for tuning into the show today.
Love that you listened to this show.
Do us a favor, if you enjoy the show,
tweet about it, blog about it,
go into your favorite podcast app and favorite it, share it with a friend, tell somebody,
you know how much you love this show and we'll keep doing the same. We'll keep producing awesome
shows for you. I want to thank our sponsors, Indeed, Linode and GoCD. Also thanks to Fastly,
our bandwidth partner. Head to fastly.com to learn more. And we catch our errors before our
users do here at Changelog because of Rollbar.
Check them out at rollbar.com slash changelog.
We're hosted on Linode cloud servers.
Check them out at linode.com slash changelog.
And the hosts for this show were myself, Adam Stachowiak, and Jerod Santo.
The mix and master is by Tim Smith.
Music is by Breakmaster Cylinder.
And you can find more shows just like this at
changelog.com slash podcasts. While you're there, subscribe to Master, get all of our shows in one
feed at the changelog.com slash master. Thanks for tuning in. We'll see you next week.