The Changelog: Software Development, Open Source - Segment's transition back to a monorepo (Interview)
Episode Date: August 29, 2018. Adam and Jerod talk with two members of Segment's engineering team: co-founder and CTO Calvin French-Owen, as well as software engineer Alex Noonan, about their journey from monorepo to microservices back to monorepo. 100s of problem children to 1 superstar child.
Transcript
Bandwidth for Changelog is provided by Fastly. Learn more at fastly.com. We move fast and fix
things here at Changelog because of Rollbar. Check them out at rollbar.com and we're hosted
on Linode servers. Head to linode.com slash changelog. This episode is brought to you by
Indeed and I had a really interesting conversation with Darren Nix, the group manager at Indeed
Assessments. And Darren is running a remote
first team that operates like a startup inside Indeed. And you know Indeed, it's a huge company,
lots of resources, solving big problems, lots of big data. And Darren's team is hiring. Take a
listen. Darren, tell me about the big picture problem you're solving at Indeed Assessments. What our team does is we build tools so job seekers can show off their knowledge, skills
and abilities when they're trying to get a job way better than a resume can.
And that lets employers find great hires a lot quicker too and makes the process better
for everybody.
So you're running a remote first team looking to hire pretty aggressively Java engineers,
front end or React engineers, Ruby on Rails engineers, UX designers, business intelligence, and you operate Indeed Assessments like a startup that lives inside Indeed.
Tell me more.
Because we're basically a startup within Indeed, we get to hire folks all around the country,
even if they're not in Austin or San Francisco or Seattle.
And that means we can hire really great engineers who want to be able to work from their home city,
work on really big problems, but solve those problems in a startup-y way.
You know, we host our code on GitHub. We're on Rails and Redis.
We use Postgres and React and we're push on green.
So we deploy six times a day.
So I've seen charts that say like, hey, we deployed 13 times this week.
And I'm like, haha, we deployed like 78 times
because we like to go fast.
And so what we're doing here at Indeed
is finding ways to be able to continue to be startup-y,
but solve really big problems
and help hundreds of millions of people get jobs.
So if helping out your fellow engineers get jobs
sounds like an exciting problem and you
like working on startup-y tools at a really big scale, send us a note, reach out. I actually
interview every single person who comes to join our team. So I'll be meeting with you and I look
forward to hearing from you. So if you're looking to join a remote first team working on really big
problems that will literally impact hundreds of millions of people, head to indeed.jobs/changelog to learn more and take that first step.
Welcome back, this is the Changelog podcast, featuring the hackers, leaders, and innovators of software development. I'm Adam Stachowiak, Editor-in-Chief of Changelog. On today's show,
Jerod and I are talking with two members of Segment's engineering team, co-founder and
CTO Calvin French-Owen, as well as software engineer Alex Noonan about their
journey from monorepo to microservices, back to monorepo,
hundreds of problem children to one superstar child.
So we're here to tell the story of Segment going from monorepo to microservices and back again.
So we've got Alexandra Noonan here and
we've got Calvin, CTO and co-founder here. And so
maybe let's open up since we have, normally we have one or two people,
like one person on the show.
Let's open it up with like kind of who you are a bit.
So Alex, let's start with you.
Yeah, sure.
So I'm Alex and I am a software engineer for Segment.
I joined the engineering team about a year ago.
And before that, I was actually working on Segment's customer success team,
kind of solving tickets and teaching myself how to code so I could eventually move to engineering.
And then before that, I was in school studying math.
And that pretty much brings us to where I am now.
Awesome. And Calvin, you're the co-founder and CTO, is that right?
Yeah, that's correct.
Originally, we started Segment about seven,
a little over seven years ago now.
And at the time, we started at a really different place.
We were building different types of software.
After about a year and a half of trying to find product market fit,
we ended up on this analytics idea.
And we've kind of been building out that infrastructure and that product ever since.
And as I mentioned, we're here to share the story, kind of quite a journey.
And Alex, this is penned by you.
And from what I understand from behind the scenes, there's several people who led this effort.
It was quite a bit of an effort to do so.
Maybe let's open up with kind of the timeframe. I saw this, I think I logged this to our newsfeed
the day of when I saw it,
which was just last month, July 11th.
Is that around the timeframe of this blog post,
or does it kind of go further back than that?
Did it take you several weeks to write this, or kind of give us some timeframe of when this occurred?
Yeah. It actually took me six months to write this post. Rick Branson, it was actually kind
of his idea for the post because I was one of the engineers that was maintaining and trying
to build these microservices. And then I helped with transition and then was also maintaining the monolith after for a bit.
And he kind of came on and helped with transition a bit.
And so he asked,
he thought it'd be a really interesting post for me to write.
And since I was one of the main engineers
that kind of went through the entire experience,
he asked if I'd be interested in writing it in January.
And then I worked on it weekends, nights,
and then got it to
about 60% done, but wasn't totally happy with it. And then Calvin hosted an engineering blog week
where all people did for a week was write a blog post. So I took that week to get it over the line,
which was probably the last week of June. And then I finished it then and was sick of reading it. So we released it.
So this post made quite a splash. We saw it covered on InfoQ, as Adam said, ChangeLog News
logged it. It was shared broadly, probably on Hacker News. I'm not sure if it made Hacker News,
but I'm sure it probably did. Before we get into the actual move back, I mean, the reason why I
think this made a big splash is because anytime you see a trend in software engineering, and then you see kind of the first or maybe a
couple counter-trends, right? Like, this was going a certain direction, and now we are moving away from
the trend. That's interesting to us. And so, as you start off the post saying, you know, microservices
are the architecture du jour, and this is a circumstance wherein the architecture was not working out for Segment.
So Calvin, maybe you can first give everybody kind of the big picture of what segment is and does
and why it was a good fit.
You guys started moving to microservices early on and only recently, maybe six months ago,
maybe more, found out that it wasn't quite a fit
for you guys' team.
So tell us what Segment is,
in terms of what technically it does,
and then why it was a good fit,
at least at the time, for trying microservices.
Sure.
Segment, at its core, is a single API
to collect the data about your users and your customers and take that data,
whether it's from your website, if you're monitoring things like page views or
recording users adding items to their cart, or app interactions. Basically, adding
some code to send that data once into our API, and then letting us help handle
the fan-out and federation of
all that data into over 200 different analytics, email, and marketing tools that you might be
using.
And actually, Segment was kind of born out of our need as developers in the very beginning,
where we were trying to decide between these three analytics tools: Google Analytics, Kissmetrics,
and Mixpanel,
and we couldn't really figure out
what the differences were between them
or why would we want to use one versus another.
So what we did is we took kind of the lazy engineer's way out
and we built this layer of abstraction that sits in front
where you just send the data once in a common format
and say, here's who my users are,
here's what they're doing.
And then we help take care of all the transformations and mapping that are particular to each API.
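To make that idea concrete, here is a minimal TypeScript sketch of the "send the data once in a common format, then fan it out through per-destination adapters" model Calvin describes. The adapter names and payload shapes are illustrative assumptions, not Segment's actual code.

```typescript
// A minimal sketch of the "send once, fan out" idea described above.
// Adapter names and payload shapes are hypothetical, not Segment's actual code.

interface TrackEvent {
  userId: string;
  event: string;
  properties: Record<string, unknown>;
  timestamp: string;
}

// Each destination implements the same small interface: the adapter pattern.
interface DestinationAdapter {
  name: string;
  // Transform the common event into the destination's own payload shape.
  transform(event: TrackEvent): Record<string, unknown>;
}

const googleAnalytics: DestinationAdapter = {
  name: "google-analytics",
  transform: (e) => ({ cid: e.userId, ea: e.event, ...e.properties }),
};

const mixpanel: DestinationAdapter = {
  name: "mixpanel",
  transform: (e) => ({
    event: e.event,
    properties: { distinct_id: e.userId, time: e.timestamp, ...e.properties },
  }),
};

// The customer sends data once; the platform fans it out to every adapter.
function fanOut(event: TrackEvent, destinations: DestinationAdapter[]) {
  return destinations.map((d) => ({ destination: d.name, payload: d.transform(event) }));
}

console.log(
  fanOut(
    { userId: "u1", event: "Item Added", properties: { sku: "abc" }, timestamp: new Date().toISOString() },
    [googleAnalytics, mixpanel],
  ),
);
```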
And looking back on the history of the company,
actually, we started with a very monolithic pattern ourselves.
There was one service, which was a Node app, which basically packaged up our API, the CDN we used to serve JavaScript, and our web app, and it all used the same set of modules, the same single process.
They were just running across multiple EC2 instances. And as we started growing the team and growing the number of developers,
we quickly realized that that single service
wasn't going to hold up
as we basically added more and more people to it
where there are now more and more PRs
happening against the repos
there are more and more deploys happening every day
and we just started running into
a bunch of reliability problems
So to counter that,
and I think this was the heyday of when Node.js was all about these really tiny modules,
the kind of like left pad sort of really just very small bits of code
that could be reused in many places.
That's when we started splitting up our repositories
into different repos and
our services going from this monolith service to a bunch of different microservices.
And I think even today, well, at that point, we had about 15 engineers and we started ending
up with hundreds of different microservices.
And even today, we're probably pretty far on the spectrum of having too many services
per engineer, but we're starting to dial it back in a number of these key areas, which
Alex can talk about.
So one move I've seen a few teams or I've read a few teams make kind of between the
monolith and the microservice is introducing kind of a code-only separation.
I don't know if they call it service oriented architecture or if that's
something else,
cause I'm not up on the lingo,
but this idea of like,
we're going to introduce services into our architecture,
but not necessarily separate them at the network layer.
Was that something that was tried or considered along the way?
Or was it like,
let's, let's just use HTTP everywhere and have these microservices right out of the bat? It's interesting for us because Segment is actually a little bit different than, let's say, your traditional web app, like an Instagram or a Facebook. most of our what we call services could actually be more like workers, where typically each one
will ingest some data that it's reading off of a queue, either Kafka or NSQ. It will do some set
of light transformations or enrichment on that data, or maybe pull some extra data from another
database. And then it will typically republish that data either to a queue or make outbound HTTP requests to a third party API.
And when you think about data pipelines in that way, it actually makes sense that you'd have
kind of many different steps, each with different hops in between them. And if you want to change
kind of one hop or one service, you could do that independently of the rest of the pipeline.
So I think that's more what pushed us
to have these different services,
which, like you said,
we're actually running via separate code bases
because they all did something a little bit different,
but we also would run them in the same infrastructure
and on the same network.
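As a rough illustration of the worker shape described here, consume from a queue, lightly transform or enrich, then republish or call out to a third-party API, a hedged TypeScript sketch follows. The Queue interface is a stand-in for a Kafka or NSQ consumer, and the endpoint and field names are assumptions.

```typescript
// A sketch of the worker shape Calvin describes: read from a queue, lightly
// transform or enrich, then republish or call a third-party API.
// The Queue interface is a stand-in; in practice this would be a Kafka or NSQ consumer.

interface Queue<T> {
  consume(handler: (msg: T) => Promise<void>): void;
  publish(msg: T): Promise<void>;
}

interface Event {
  userId: string;
  event: string;
  properties: Record<string, unknown>;
}

async function runWorker(input: Queue<Event>, output: Queue<Event>, endpoint: string) {
  input.consume(async (msg) => {
    // Light enrichment step (could also pull extra data from a database here).
    const enriched = { ...msg, properties: { ...msg.properties, enrichedAt: Date.now() } };

    // Either republish downstream...
    await output.publish(enriched);

    // ...or make an outbound request to a third-party API (Node 18+ global fetch).
    await fetch(endpoint, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(enriched),
    });
  });
}
```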
The cool thing about Segment from my perspective,
just from a nerdy engineer thought life, is it's basically the adapter pattern for third-party services. Yeah, exactly.
Just like you would do for your database, right? Like, abstract a layer, and that layer is Segment,
and now you only write to Segment, and then it's going to front... Yeah. Google Analytics, Optimizely,
Mixpanel, Kissmetrics, all, you know, probably hundreds of them at this point.
And because of that, it does have a unique architecture
where it's basically at a service level,
it's implementing the adapter pattern.
And so it does break out, I think, mentally very well
because you have your analytics queue,
you have one big queue, I'm assuming,
and then you probably split that out
and have kind of service level queues. And so mentally, I think that would make sense
for microservices. Was that the thought process then?
Yeah, that's a great way of phrasing it.
So Alex, you have in your post kind of a drawing of this queue and description.
And it sounds like there was some coupling and some performance problems that were happening.
Can you tell us more about that?
Yeah. So when Segment was originally in a monolith for these destinations,
one of the benefits of Segment is that we retry events. So for example, say we get 500s
from a partner because they're experiencing a temporary issue. We want to send that event again,
but with our old setup, everything was in one queue.
And that included these retry events as well as live traffic. So if one destination was going down,
for whatever reason, that one single queue would now be flooded with tons of retry events as well as live traffic. And we weren't set up at the time to scale up quickly
enough to be able to handle this increase in load.
And so one destination having issues would affect every single destination,
which was not ideal.
So that was the original motivation for breaking them all up.
So we can have kind of this fault isolation between them all.
So instead of having one queue and multiple destinations,
you would have a queue per destination. And so these individual
queues became individual repos, individual services. Exactly. And so now if whatever
destination is experiencing an issue, only its specific queue would back up and everyone else
would be unaffected. So to me, that sounds like rainbows and unicorns. Like it sounds like you
guys solved it. So that's where it gets to, that's where the plot thickens,
right?
Because that was working for a while and maybe,
maybe we need a bigger picture again.
We understand what segment is.
Somebody give us maybe the team size,
the company size,
maybe the growth metrics of like the engineers and help us understand because
microservices,
these architecture decisions,
they change,
they're wildly subjective
um where and even in just in our last show we were talking about istio and you know i we talked
about microservice a little bit and i was asking the question of like how do you know when to
microservice when not to microservice and it's like really it's like that's like the ultimate
it depends you know which is basically most of software development is it depends.
And so maybe, you know, these case studies are so interesting because they give us data
points by which we can all make decisions better, you know, kind of as an industry individually.
But you can only actually apply the data if you are a subject, right?
If you're a comparable, it's like real estate sales, right?
We need to find comparable houses.
Well, we need to find comparable technical stacks, technical situations in order to say,
okay, this might not work for us.
So help us understand Segment at a macro level, the team, the company size, et cetera.
So Segment today, there's about 80 members of the engineering team.
And overall, the company size is close to 300 people.
When you ask the question about whether to adopt microservices or not, it's really
a case-by-case basis, a decision that's made very particularly for your company.
The way that I like to think about it is about whether you're ready to take on more operational
overhead that comes from running many different services, where maybe each one has its own code
base, it has its own set of monitoring and alerting that you have to be keeping track of. It has its own new deploy process, its own way of managing those services, etc.
And honestly, it's a lot of upfront work to run those sorts of microservices
that I think if we'd started there from day one,
honestly, the company wouldn't have gotten off the ground.
We'd have spent all
our time in terms of tooling and infrastructure,
and we wouldn't have
made any progress on the actual product.
But that
said,
there are a lot of benefits
to having microservices if you
have those systems in place.
For us,
we run everything on top of AWS
and we use Amazon's ECS,
their Elastic Container Service,
to run all of our services and orchestrate them
running in Docker containers.
And for us, we've invested pretty significantly
in building out the tooling around ECS,
around spinning up a new service
that automatically gives you a load balancer.
It gives you auto-scaling.
It gives you the ability to run this Docker image
as long as you built it via CI,
which we've also invested a lot of tooling in.
And I think given that we have that set of primitives,
it's made it so that we have
kind of this proliferation of services
because it's just so easy to do.
And it means that if you want to add
a little piece to the pipeline,
you don't have to make a change
which could potentially break the pipeline for everyone.
You don't have to worry about
adding a single slow component,
which now might buffer
in kind of this critical path,
which is dealing with hundreds of thousands of requests per second.
Instead,
you can think about your little
compartmentalized piece and
how that should perform
and behave.
And so, for us,
I think that
drove a lot of the decision towards moving
towards these really tiny
services where the
surface area was small and compartmentalized and well-maintained, where if you had a single service
that was acting up for some reason, like, let's say it's connecting to a database which starts timing
out and starts sending back connection errors, it doesn't then stall the rest of the pipeline in
terms of delivering data.
And so, like I said, we first adopted this when we were maybe 10 or 15 people,
which looking back on it now, I'd say it was definitely on the early side.
And we had to build a lot of operational excellence in terms of running these services.
I think we were some of the earliest ECS users.
Today, I think we're some of their heaviest users,
running about 16,000 containers total across all of our infrastructure.
We basically had to build that muscle separately
and put in more upfront cost,
which then allowed us to scale a little bit more easily
when it came to building out the pipeline.
That said, it's not without costs.
At this point, we built so many of these little services and so many different code paths
that it's actually difficult for individual developers to keep track of how they connect.
If you make a change to one part of the pipeline, how it affects the rest, that sort of thing.
So there's definitely other downsides that I think are maybe not as talked about as much,
especially if you adopt microservices really early. This episode is brought to you by Linode, our cloud server of choice.
It's so easy to get started.
Head to linode.com slash changelog.
Pick a plan, pick a distro, and pick a location, and in minutes, deploy your Linode cloud server. They have drool-worthy hardware, native SSD cloud
storage, 40 gigabit network, Intel E5 processors, simple, easy control panel, 99.9% uptime guarantee.
We are never down. 24/7 customer support, 10 data centers, three regions. Anywhere in the world, they
got you covered. Head to linode.com slash changelog to get $20 in hosting credit.
That's four months free.
Once again, linode.com slash changelog. So Alex, one of the things that you say in this post is that the touted benefits of microservices
are improved modularity, reducing test burden, better functional composition, environmental
isolation, and development team autonomy.
These are the ones that many of us have heard and talked about and kind of analyzed.
Definitely true.
The opposite, you say, is a monolithic architecture where a large amount of functionality lives
in a single service, which is tested, deployed, and scaled as a single unit.
Now, we know monoliths can be majestic.
They can also be monsters.
But you had switched to microservices for this part of
Segment. And then you said in 2017, you started reaching a tipping point with this core piece of
the product, which is the one that we're talking about. And I love this statement. You said it was as
if you were falling from the microservices tree, hitting every branch on the
way down, which sounds painful to me. So tell us about that. Like when did, as I said
before, it seems like rainbows and unicorns, there seems like a very good fit because of the
infrastructure that y'all have, but it didn't quite work out. And so that's kind of the,
where the plot thickens. Tell us, you know, what those branches on the microservices tree felt like
and what happened there? So when I joined the team, I actually joined at the peak of when it was getting to be a
little bit unbearable.
And one of the first issues that we were running into was all these separate code bases were
becoming extremely difficult to maintain.
So we'd written some shared libraries to help us with some basic HTTP request formatting, error message parsing that all of them used.
But at some point in time, we made a major version update to that library, and we didn't have a good way to test and deploy these hundreds of services.
So we just updated one service, or one repo,
to use the newest version,
and now everybody else is behind.
And that kept happening over time
with our other shared libraries as well.
So now me going into a code base,
I had to be like, okay,
which version of this shared library is it on?
What does this version do
versus some of the newer versions?
And having to keep track of that
was incredibly
time consuming and very confusing.
And it also meant we wouldn't make big updates that
we often needed in these shared libraries, because we were like, there's no way we're
going to test and deploy all these microservices; that would take the entire team and usually
up to a week to just do that.
So that was one of the big issues with it.
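To illustrate the version-drift problem Alex describes, here is a small hypothetical script, not Segment tooling, that reports which version of an assumed shared library ("shared-http" is a made-up name) each service repo pins in its package.json.

```typescript
// A small sketch of the version-drift problem: with hundreds of repos, each service
// ends up pinned to a different major version of a shared library. This hypothetical
// script reads each service's package.json and reports which version of an assumed
// "shared-http" package it depends on.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

function reportVersions(servicesDir: string, libName: string): Map<string, string> {
  const report = new Map<string, string>();
  for (const service of readdirSync(servicesDir)) {
    try {
      const pkg = JSON.parse(readFileSync(join(servicesDir, service, "package.json"), "utf8"));
      report.set(service, pkg.dependencies?.[libName] ?? "not used");
    } catch {
      // Not a Node service, or no package.json; skip it.
    }
  }
  return report;
}

console.log(reportVersions("./services", "shared-http"));
```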
Another was we were actually seeing some serious performance issues.
So now even though all the destinations were broken up into their own queue,
so if one went down, it didn't affect the others.
The issue was they all had radically different load patterns.
And so one of these destinations would be processing hundreds of thousands of events per second, while others would only process a handful per day. And so we tried to, we always
were trying to reduce customization across these services to make them easier to maintain. So we
applied blanket auto-scaling rules to all of them,
but that didn't help with some of the smaller guys because nothing can really handle,
there's no set of auto-scaling rules that can handle a sharp increase in load. And so for the
little guys that were handling a handful per day, and then all of a sudden a customer turns them on
and now they're handling hundreds of events per second, they can't scale up. So we're constantly getting paged to manually go in and scale up these little guys. And the
blanket auto-scaling rules also didn't work because they each had a pretty distinct load
pattern in terms of CPU and memory. So some were much more CPU intensive, while others were more
memory intensive. And so that also didn't help, which again, caused us to have to go in and
manually be scaling these services up.
So we were constantly getting paged because queues were backing up to have to scale these
guys up, which was pretty frustrating.
And like I said, we were literally losing sleep over it.
It was very frustrating.
It sounds like it.
So you mentioned that there's, you had three full-time engineers pretty much spending their
time keeping the system alive.
Is this
what you're referring to, like having to go in and scale things up and down when certain services
wouldn't keep up with the load? Exactly, exactly. So it was difficult
for us to add any new destinations, because we were spending so much time maintaining the old ones,
and then we had a backlog of bugs building up on the old ones. And we just, we couldn't make any headway at all
because the performance issues
and the maintenance burden was so painful
with all these repos and services and queues.
It was getting to be too much.
So Calvin, tell us about this from your perspective,
from a CTO side, when this is going on
and you have a lot of bugs happening,
you have a lot of manual intervention
by your engineers, probably not what you want them spending
their time doing.
Was this something that kind of, like, came on all at once,
or was it kind of a slow trickle
that eventually broke the dam? What did it look like
from your angle? From my perspective,
honestly, I was
working on a lot of these same systems
along with Alex here, so
it's definitely not something that snuck up on us
or felt like it was just a sudden deluge of paging and alerts
and problems that happened.
They sort of grew in intensity over time fairly slowly.
I think at a certain point, we started having a few large customers who would consistently be batching data in ways that were actually disrupting quality of service for other customers.
So you might imagine customer A is sending thousands of requests per second.
Customer B is sending tens and maybe customer C is sending one request per second.
If we're being rate limited
by a destination that we're sending data to,
and we're just reading off of a queue,
if we let those thousand messages in first
and just sort of do like a FIFO,
first in, first out kind of approach,
then we're effectively limiting the amount of data
we can deliver for customers B and C, even though they didn't do anything wrong.
And so we actually took a step back and said, hey, maybe we should rethink both all these
individual services, which are scaling poorly, and we should rethink our entire queuing architecture for this problem of a high failure
scenario, where approximately 10% of messages that are going out will fail for one reason or another,
whether that's an API outage, a rate limit issue, or maybe just an ephemeral network connection.
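A toy TypeScript illustration of the head-of-line blocking Calvin describes: with one shared FIFO, a large sender's backlog starves smaller senders, while round-robining across per-customer queues keeps delivery fair. The numbers and the scheduling policy here are simplifications for illustration, not how Centrifuge actually schedules work.

```typescript
// With one shared FIFO, customer A's thousands of queued messages delay B and C.
// Round-robining across per-customer queues (roughly what per-customer virtual
// queues buy you) keeps small senders from starving. Numbers are made up.

type Message = { customer: string; id: number };

function drainFifo(queue: Message[], budget: number): Message[] {
  return queue.slice(0, budget); // whoever enqueued first wins the whole budget
}

function drainRoundRobin(queues: Map<string, Message[]>, budget: number): Message[] {
  const out: Message[] = [];
  while (out.length < budget) {
    let progressed = false;
    for (const q of queues.values()) {
      const msg = q.shift();
      if (msg) { out.push(msg); progressed = true; }
      if (out.length >= budget) break;
    }
    if (!progressed) break; // all queues empty
  }
  return out;
}

// Customer A floods 1000 messages; B and C send a handful each.
const a = Array.from({ length: 1000 }, (_, i) => ({ customer: "A", id: i }));
const b = Array.from({ length: 5 }, (_, i) => ({ customer: "B", id: i }));
const c = Array.from({ length: 5 }, (_, i) => ({ customer: "C", id: i }));

console.log(drainFifo([...a, ...b, ...c], 10).map((m) => m.customer)); // all "A"
console.log(
  drainRoundRobin(new Map([["A", [...a]], ["B", [...b]], ["C", [...c]]]), 10).map((m) => m.customer),
); // "A", "B", "C" interleaved
```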
And at that point, we introduced this new set of architecture that we called Centrifuge, which was this kind of revolving set of queues. And we realized that the work we wanted to kick off to make sure that our customers are being treated fairly
would actually be much easier
if we had a single service that we were working with.
So why don't we kind of do both projects
in sort of lockstep,
where we transition these integrations to a single service,
which should help a bunch of these different problems
that Alex just talked through,
as well as help the end customer make sure that their data is getting where it needs to go quickly and reliably.
How are you managing time in this?
Because I mean, I'm thinking startup customers, you need to move efficiently.
And Alex, you mentioned that this post took you six months to write.
This is probably you've been on board for a year.
A lot of this takes a lot of time.
How do you manage, you know, maybe from a CTO level and maybe from your perspective,
Alex, as an engineer, how do you dictate architecture and initiate the team to move
forward and still please people and get things right?
Yeah, maybe I can start off first from sort of the more global perspective
and then transition to Alex for her perspective as well.
When we think about Segment's core value proposition,
there are maybe two or three things that we do with customers' data.
First is that we help them collect and organize that data.
So we want to make sure that our ingestion endpoint
is always up, that we're never dropping data,
that we're giving them libraries with a good experience to send that data into our system.
And then the second core tenet is that we are taking that data and making sure it gets to whatever tools it needs to be delivered to in a fast and timely manner, where fast would be something like under 10 seconds. And when we were looking at the current system, we'd kind of juiced it in one way or another and made all these tweaks to it, and it
was still just not working. And when we would see these long delays, we're effectively violating
that second value proposition of Segment. If we're not getting
your data where it needs to go, and it's taking 20-30 minutes to get there, we're not doing our
core job for all of our customers. So for us, it actually felt fairly well aligned to kick off this
set of projects to deliver really what our customers wanted more than new features, what they wanted more than any sort of other new products we could launch.
They just wanted our core product to work amazingly well.
What about your perspective on the engineering side?
How do you be on a team where you have to implement this,
but you're making choices, you know, your shared
libraries, you've got different versions of them, you're scaling your repos, and it
seems like things are okay at first, and then things start to fall down. Jerod mentioned
hitting the branches, and maybe you can go a little further into what they looked like and how that
feels. It's actually a little bit interesting. So when I joined the team, like I said, we were at
kind of a peak of "something has to change."
And I was so brand new to engineering then.
So I kind of thought at that time that this was just how it was.
And I didn't totally see anything wrong with it until after we moved to the monolith and I helped the team transition to everything. And then I, looking back on it, kind of realized how crazy and how much time we
were spending scaling this and just maintaining them and how we couldn't make any headway.
But in the moment, I didn't, I don't know, I didn't see anything wrong with it just because
I was so brand new to engineering and had no experience before. I was like, oh, this is
kind of annoying, but this seems pretty normal. That's interesting. And I was coming from a
perspective, you mentioned that you had previously taught yourself or self-taught developer.
Is that what you said in the opening?
Yes.
Tell us how that feels.
I mean, so one thing that's happened, kind of the metagame on this blog post, which was
made such a splash, is you've gotten a lot of attention.
Like I said, InfoQ, us, you're on the changelog.
You're going to be speaking at conferences about this. As a
self-taught developer, I know that it's probably an obvious kind of yes answer, so maybe
this is a dumb question, but there's so much intimidation out there. I'm also self-taught.
I've been doing it for a very long time now, so I don't feel it as much; I've gotten past a
lot of that stuff. Self-taught in software development, that is; I did go to school for general computery stuff.
But you mentioned you were in school for math,
so related fields,
but definitely a bigger transition than I made.
How does it feel to put yourself out there
and make this post,
which is somewhat counterculture right now,
countertrend?
Very well, by the way, you mentioned it's six months.
Very well thought out, very well reasoned,
and not flamebaity or clickbaity at all in its content. So congratulations on that. But just tell us, I guess, in the metagame sense, like, your feels with the post. I... we had no idea it was going to
be this crazy. We knew we were going to stir the pot a little bit. Um, but we had no idea the
impact it was actually going to have. And I'd always wanted to write a post. So I thought this
would be a cool one. I was just going to write about my experience, um, kind of as an engineer
at Segment. And then it got a crazy amount of attention, and I probably had, I
think, the worst imposter syndrome I've ever had in engineering. But it's been pretty cool. When I
first started engineering, too, I wasn't super involved in kind of the Hacker News or the
community, so I really didn't have an understanding of the impact it was going to have,
though Rick tried to warn me. But now,
actually, with this post, it's kind of helped me get into the community a little bit more, so I've
been doing more reading, like on Hacker News, and listening to podcasts, and it's been pretty cool.
And I think I understand now why it was such a big deal to people, because for me it was just, oh,
I'm just writing about my experience, what happened at Segment and why we did certain things.
But it's been really fun.
Definitely a little scary, lots of imposter syndrome,
but it's been really cool to see.
It makes such an impact in the tech world.
Well, as we say, the best thing to do with imposter syndrome is just to punch it in the face.
You just got to face it.
I love that one.
I like that. I'll remember that.
There you go.
So curious, I mean, besides us,
what's been the overarching response from the community?
Has it been a lot of pushback?
Has there been negativity?
People saying, y'all segment are crazy.
You don't know what you're doing.
Or has it been, wow, this is really interesting.
Maybe we'll consider it.
Because there's just a lot of tension around the whole monolith
versus microservices debate.
Totally.
And microservices have been around long enough now
that we're starting to see people who have run them
for a couple of years or a year.
And so you can't really tell if a thing is like scalable,
long-term net good, or if it's just like a series of trade-offs until you have some real world experiences.
So maybe we're starting to see that.
But curious, what has been your response?
In your eyes, what's the overall response been?
Positive, negative, meh?
Yeah.
So I actually didn't read a single Hacker News comment because... Good for you.
That's how you do it.
Exactly.
Well, I also thought when I released it, there was only going to be 50.
And then I checked later in the day and there was over 700.
Oh, wow.
That's awesome.
700 comments.
Yeah, it was crazy.
But the general feedback I feel, after talking to other media and kind of
some of my friends that are in the industry that saw the post, is that it's actually been relatively
positive, and people were just super curious as to, like, why we did that, why we made this change.
Because kind of, again, as you mentioned, the microservice, I guess, boom happened a few years ago,
so some people are starting to kind of realize that this may have not been the right setup for them.
So it's been, I would say, pretty positive.
And more people are just really curious about why we did it and want to know more.
And it's been really cool to see people wanting to be more educated and understand the details about why we did it.
But I don't know what happened on Hacker News. I heard it was relatively good, but I didn't read a single one.
That's a good, that's a good tactic for Hacker News is have a friend read it for you and then
just kind of summarize. Yeah, pretty good. Exactly. That's basically what I did. Some
people sent me, like, screenshots of really nice comments. I heard there was some negative feedback,
but I've heard that's also pretty typical with Hacker News, so I wasn't too worried
about it, and that didn't seem like the overall feedback. So the Changelog's experience, like,
our show's experience with Hacker News over the years, until recently, has been whenever somebody
posts us and we happen to make it on the homepage, one of our episodes, undoubtedly, like without a miss, somebody would say, this is lame.
Where's the transcript?
I just want to read it like every single time.
Isn't that right?
I'm like, somebody would say that.
I'm just like, can you give us a break?
We're just doing a podcast.
And now we have transcripts.
And so they can't say that anymore.
They don't say that anymore.
We do it for accessibility, but we also kind of do it
to shut up hacker news people.
So it makes the transcripts worth it.
Here's what I'm kind of curious of,
especially Jerod,
you mentioned earlier in our conversation,
which was yesterday.
And these episodes will come out
in different timeframes.
So the Istio episode
should already be out.
If you're listening to this,
that episode should be out
because this one's coming after that.
But in light of that,
I mean, like clearly
there's something happening
in like the service mesh and microservices area.
So this is definitely subjective in terms of like your engineering and your culture.
So it works in places.
And I'm curious because in the pre-call, Alex, you mentioned, and Calvin, I don't think you were on the call yet when we were kind of having some pre-call conversation.
Just, you know, that you're a co-located team.
You have two engineering offices, and
maybe it makes sense where teams are completely, like I can't see Jared, we're on the same
team, but he and I have no conversation with one another.
Maybe that makes sense where microservices make sense and they don't make sense here
because you have co-located offices and your teams can maybe interact more fluidly and
that kind of thing. Where do you
think that breakdown really happened with this? Was it purely technical or is it because of the
way your product and teams operate? At the time, we actually didn't have the Vancouver office.
Okay. So one office.
All engineers, exactly. All the engineers were in San Francisco. And I think it was,
I mean, a mix of both, not a great answer, but part of it was burden on the team.
And our productivity was down.
And another part of it was these performance issues that we wanted to get rid of for customers.
So a combination.
But Calvin, you can probably add a bit more color since you kind of were here for more of the microservice setup than I was.
I'd say for us that it's definitely a combination. And I should also be clear,
by no means are we abandoning microservices across segment. There are actually a lot of
good reasons to use them across many pieces of our infrastructure. Within this one particular case,
we found we had better luck moving over to a single service. So I'd say we're continuing to
make this same trade-off and set of balances. When you talk about service mesh, I think that
is definitely something that we are following fairly closely and are super interested in.
And actually, Alex has started a project working to incorporate Envoy as part of this new future of service mesh within Segment,
which we're monitoring as we move forward.
I think in our case, it was probably a bit of a combination of both.
We had this team of engineers who were trying to wrangle
100-plus code bases
across 100-plus services. And when all of them do a similar thing, that's really just hard to
manage and you have to build a lot of tooling around it. And we figured, well, we'd rather
take the relatively slow rate of changes being made to a single place versus having to manage this many code
bases and this many services.
I think the one other thing that changed here as well, originally we had anticipated that
third parties would be adding a lot of their own code into these integrations.
So you might imagine we support Amplitude and Mixpanel as places that we send data.
We were kind of expecting that we would have engineers from those teams actually making
pull requests, contributing whenever they pushed updates to their APIs.
And in practice, that didn't really turn out to be true.
It ended up being a team here who was working on it.
So we said, well, we thought we'd get these supposed benefits.
We're not seeing those.
Let's move over.
This episode is brought to you by our friends at GoCD.
GoCD is an open source continuous delivery server built by ThoughtWorks.
Check them out at GoCD.org or on GitHub at GitHub.com slash GoCD.
GoCD provides continuous delivery out of the box with its built-in pipelines, advanced traceability, and value stream visualization.
With GoCD, you can easily model, orchestrate, and visualize complex workflows from
end to end with no problem. They support Kubernetes and modern infrastructure with
elastic on-demand agents and cloud deployments. To learn more about GoCD, visit gocd.org slash
changelog. It's free to use, and they have professional support and enterprise add-ons
available from ThoughtWorks. Once again, go cd.org slash changelog.
Calvin, you had mentioned Centrifuge as a core piece of engineering infrastructure that you built as part of this transition. Can both of you help us understand from the point that you decided, okay, and very well noted that this is not all of Segment that has moved, right?
This is a specific section of Segment that Alex's team has moved from microservices back to a single service.
Take us step by step through that.
Once the decision was made, okay, we're going to do this.
I know centrifuge is involved somehow, but please help us all understand very clearly
step by step what it took to get to where you are today and to where you could write
your post saying goodbye to microservices. Back in April of 2017,
we kept hitting these delays with various parts of the pipeline where customers would see their
data being delayed for 20, 30 minutes, while either our current queuing setup would block
up with a single customer's data or particular destinations wouldn't scale appropriately,
as Alex was just talking about.
And at that point, we said, okay,
we need a bigger overhaul
to the way that we actually deliver data outbound,
which should rethink a bunch of the primitives
that we built these individual queues per destination
over the past
two years and should hopefully help us scale for the upcoming three to five years as we 10 or 100x
our volume. And once we kind of acknowledged this was a problem, Rick Branson, who Alex has talked
about a bunch, spearheaded the effort to actually architect the system
that he called Centrifuge.
And Centrifuge effectively replaces
the single queues that we have.
So one queue for Google Analytics
with one queue for Mixpanel,
one queue for Intercom,
with what you can think of as almost being virtualized queues
or individual sets of queues on a per
customer per destination basis. So we might have one queue for Google Analytics, which has all of
Instacart's data, but another one with all of New Relic's data, and maybe another one with
Fender's data. And this system, honestly, we hadn't seen any really good prior art for.
I think network flows are about the closest
that you get to it.
But those give you back pressure
in terms of being able to say,
hey, there's too much data here,
like stop sending from the very TCP source that you have,
which is something that we can't exactly
enforce on our customers.
So with this design in hand for Centrifuge, we started out on what actually turned into
about a nine-month journey where we decided to roll Centrifuge out in production.
And Centrifuge was responsible for all of the message delivery, the retries, and archiving of any data which wasn't delivered.
And then separately, Centrifuge would make requests into this new integrations monoservice, which you could think of as being this intelligent proxy, which would take these raw data in, and depending on where it's going, make multiple requests to a third party endpoint.
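A rough sketch of the "virtualized queues" idea, keyed per customer per destination, is below. This is an illustration of the concept only, with made-up type and key names, not Centrifuge's implementation.

```typescript
// Instead of one queue per destination, messages are bucketed per
// (customer, destination) pair, so one customer's retries can't block another
// customer's deliveries to the same tool. Illustration only, not Segment's code.

type Delivery = { customerId: string; destination: string; payload: unknown };

class VirtualQueues {
  private queues = new Map<string, Delivery[]>();

  private key(d: Delivery): string {
    return `${d.destination}:${d.customerId}`; // e.g. "google-analytics:instacart"
  }

  enqueue(d: Delivery): void {
    const k = this.key(d);
    if (!this.queues.has(k)) this.queues.set(k, []);
    this.queues.get(k)!.push(d);
  }

  // A failing (customer, destination) pair backs up only its own bucket.
  backlogFor(customerId: string, destination: string): number {
    return this.queues.get(`${destination}:${customerId}`)?.length ?? 0;
  }
}
```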
And for the rollout process there, like I said, we spent maybe a month or so designing it.
Then we began to actually consolidate the repo and move it into be a single mono service.
We started building out the bones of Centrifuge for another three or four months or so.
And we started cutting over our first traffic after about a five month period.
Now, when we started cutting over traffic, we had this interesting problem, right?
Where if we're sending traffic via two pipelines, we have to test it end-to-end in whatever destination tool,
if we both just mirror traffic and let them both go,
we'll end up with double counts in Google Analytics or double counts in Mixpanel.
So we actually added a kind of serialization point in the middle
that both the set of microservices would talk to as well as the monolith.
And effectively, it would do kind of a first-write-wins type of scenario, where it creates
some locks in Redis.
And then only one of the messages would succeed through either pipeline.
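To make that dedupe step concrete, here is a minimal sketch of a first-write-wins claim in Redis, assuming the ioredis client and a made-up key scheme and TTL. It illustrates the technique described, not Segment's actual rollout code.

```typescript
// Both the old microservice pipeline and the new monoservice try to claim each
// message in Redis; only the first claimer is allowed to deliver it, so the
// destination never sees a double count. Key names and TTL are assumptions.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

async function claimMessage(messageId: string, pipeline: "legacy" | "centrifuge"): Promise<boolean> {
  // SET ... NX only succeeds if the key does not already exist; the EX TTL
  // keeps the lock table from growing forever.
  const result = await redis.set(`delivery-lock:${messageId}`, pipeline, "EX", 3600, "NX");
  return result === "OK"; // true => this pipeline won and should deliver the message
}

// Usage: whichever pipeline gets here first delivers; the other drops the event.
async function maybeDeliver(messageId: string, deliver: () => Promise<void>) {
  if (await claimMessage(messageId, "centrifuge")) {
    await deliver();
  }
}
```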
And we basically slowly ramped traffic in that manner, always checking the end-to-end
metrics on it, always making sure that no matter which pipeline we were using, the delivery rates
looked perfectly good. And I'm sure, actually, Alex can talk to more of that rollout period,
because it was definitely a little bit rocky in terms of how we rolled out the system.
But about two, three months after that,
we'd fully tested all the scaling,
cut over 100%,
and we're feeling much better
about the system stability.
And looking at it today,
it's actually a very rock solid
and well-utilized piece of infrastructure.
Alex, anything to add there?
As you mentioned, the process to get to 100% was, I think, a little bit longer than we
anticipated.
I remember we'd be in planning meetings at the beginning of the week in Cabell and be
like, okay, what do we need to cut traffic over 100% by like in two weeks?
And we'd always be like, oh, we just need to fix this one performance issue and
then we're good to start cutting over. And then we'd try and cut some over and quickly realize
that there was a lot more performance stuff we needed to tackle. But now that it's all done,
as Calvin mentioned, it's a rock-solid system, and it's really cool and complex. So it was
definitely worth a little bit of that migration pain, but now the system is very stable and can scale much greater. And we've been able to build cool products on top of it, which we couldn't have done before with our microservice architecture, which has also been really exciting to see. One of the biggest pain points with Segment is that customers don't get a lot of visibility into
what happens when they send data to Segment and then when we send it on to a destination.
So a product that I built with one of my other teammates at the time was we built something on
top of Centrifuge to basically collect the responses and counts of metrics, whether an
event they sent was successful to Google
or got rejected and why, and then display that in the UI to users. But with the microservice queue
setup, there wouldn't have been a good way for us to pass that information back and somehow store it
so that we could show that info to users. But with Centrifuge, since Centrifuge kind of is keeping
track of all of this, it knows everything already.
And we just kind of had to flush that data out to a queue
and then store it from there.
And now we have it in the UI.
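A hedged sketch of that delivery-visibility idea: count per-destination outcomes and periodically flush them to a queue for storage, so a UI can show delivery status. The type names and shapes here are assumptions for illustration, not the product's actual data model.

```typescript
// For every attempt, record whether the destination accepted or rejected the
// event and why, then periodically flush aggregated counts to a queue so the UI
// can show per-destination delivery status. Illustrative shapes only.

type Outcome = { destination: string; status: "accepted" | "rejected"; reason?: string };

class DeliveryMetrics {
  private counts = new Map<string, number>();

  record(o: Outcome): void {
    const key = `${o.destination}|${o.status}|${o.reason ?? ""}`;
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  // Called on an interval; publish() would hand the batch to a queue for storage.
  async flush(publish: (batch: Array<{ key: string; count: number }>) => Promise<void>) {
    const batch = [...this.counts.entries()].map(([key, count]) => ({ key, count }));
    this.counts.clear();
    if (batch.length > 0) await publish(batch);
  }
}
```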
And I think we've had radically positive feedback
on that feature that now customers can see,
okay, I sent an event to Segment.
I see it in Segment.
And now I see Segment sent it to Google Analytics
and it was either successful or it failed for whatever reason, which they'd never had that insight into before. They could only see that the event made it to Segment, and then they'd have to go
check Google, see, okay, when their event's not there, they have no idea what happened.
So that was a cool product that Centrifuge allowed us to build.
This is one of my favorite products that we've launched all year and maybe ever.
It just provides essentially a status page
for these hundreds of different downstream tools
in a way that none of them
or many of them do not do natively,
where you can see exactly what is happening
with your data
and whether it's being rejected
or accepted by each API
and how long it takes to get there.
Just unparalleled level of insight.
That was a great post on your blog written by you, Calvin, all about Centrifuge for people
who are interested.
Centrifuge, a reliable system for delivering billions of events per day.
Is this laying out the infrastructure and architecture?
Is this an announcement of some sort of open source project?
What's the status of Centrifuge? Is it public use? Is it private to Segment?
Yeah, currently it is private to Segment, though this post goes into a lot of depth about the
architecture, the choices that we've made, and how it's been to operate it in production.
At some point, I would actually love to open source Centrifuge or at least the bones of it itself, because it seems really useful for anyone who's running a large online web service, particularly if they need to make web hooks out from some data that's inside that service, or they need to send a bunch of data out to many different endpoints, which might be flaky, might fail at any given time. That would be very cool to see that open source.
So this sounds like the project that took nine months, but it sounds like you thought
it was 90% done maybe a few months in and it just stayed at 90% as engineering problems
tend to do.
What you say, Alex, that weekly meeting when you're like, yeah, it's pretty much finished.
Just a couple more weeks, you know?
I'd say so.
So I transitioned onto the Centrifuge team
as they already kind of had an initial prototype. I was still helping maintain some
of our microservices, but when I joined, I felt like every week we were like, all right, we're this
close, we're a month away. And it dragged on for a few months, but as you mentioned, that's pretty
natural, like, for every big migration and engineering undertaking. So what does this imply or inform with other parts of Segment, Calvin?
Is this switch back to a single service something that's very specific to this part of Segment?
Is this something you're now considering for other parts of your product or engineering teams that are still working in the microservices world?
Or is this a one-off that fit this use case, but probably not your other ones?
There are a few other places where we're considering consolidating services.
And I think there's a couple of reasons for that.
One is within the pipeline, there's sort of this natural entropy over time where systems will split up and break apart as people add and tack on new features to them.
In terms of the pipeline itself, I think we want to make sure that it's easy to reason about, it's easy to find what you're looking for. And you can kind of go to a couple
of key places that need to be independent services and understand everything that it's doing.
I think the second piece that we're interested in consolidating around is actually cost.
Obviously, every time you copy data over the network or republish to Kafka or have a system
which is deserializing JSON and then reserializing back up, it's much more expensive.
So in order to keep costs low for all of our customers,
we're interested in
consolidating some there as well.
I'm kind of curious if you can inform
other CTOs that listen to the
show or engineering teams
or engineering managers
on maybe the process, because you mentioned
Centrifuge isn't
a public service yet
or it's not open source or whatever your plans are with it.
And Alex, you mentioned it took you six months to write this post.
Like, I'm curious from a content perspective, like your motivations for these two posts in particular.
Like, was it customer acquisition?
Was it, you know, was it just telling the world how you do things?
Was it idea sharing?
Was it to attract the right kind of talent? Like what are the motivations for being so thorough and so well done with your, with your
engineering blog? I'd say the blog is actually a deeply cultural part of Segment that kind of
goes back to our founding days. Initially, the four of us were just all engineers. We had no users, and we said,
like, oh, how do we get
developers to try out
our tool? How do we get people interested in this?
How do we actually just
start getting our name out there?
And I think
the blog was the first thing that
we turned to as something
where we figured
out how to write interesting content that was effectively stuff that we were already doing that we just wanted to share with the world.
So actually, if you go back through even to some of the very early blog posts, they're constantly documenting either things that we learned or new ideas that we've had or
sets of best practices that we've learned from what we've built.
And as that's grown over time, we've really seen it be impactful on a number of dimensions.
One of those is around customers and brand.
Obviously, Segment is kind of a developer tool. I think in order to have
engineers and developers trust you, I know at least I'm reading other companies' blogs all the
time, following folks on Twitter to understand what's coming next in terms of tech. And
sharing what we're doing internally already out there, I think helps build a lot of
that trust, particularly when it comes to onboarding and setting up Segment for the first time.
Sort of something you're a little bit already familiar with. Then there's definite benefits,
as you mentioned, on hiring as well. A number of teammates who end up joining the team all say, hey, I first saw you through the blog.
That was the place where I first found out about Segment. And then I was able to dig in more and
understand what was really going on at the company. And it gave me more of a window than I would have
had really any other way. So I think there's that too, in addition to just being an amazing way to share when we learn things, either about new parts of architecture or about switching between monoliths and microservices, as Alex has talked about as well. So Alex, first-time blogger on the Segment blog, home run out of the box. You know, what's your experience, you know, with
other team members even? Like, what's your experience with, you know, getting a chance to share some deep
interest, and obviously quite a bit of passion, six months to write it, and you're on the team,
you're obviously doing great work, you're passionate about what you're doing. Like, what is
it for you to share this through the blog? It was a really cool experience. I know a lot of people
at Segment were kind of curious why we'd moved to Centrifuge and invested so much time in this. So I had some engineers that had joined post-Centrifuge ask me about it. And this is the first post I've ever written about anything.
So it was really cool to get to just share my experience
and have it kind of take off,
and knowing that a lot of people have read the post,
and I think a lot of people have actually found value in it,
which has been the coolest part,
that we've had so many people reach out,
interested and curious to learn more, has been really exciting and eye-opening. And to just inspire, I know a lot
of women came up to me after and were really inspired by the fact that I had such a post that
went so crazy on the internet. Cause you don't see a lot of posts from women in engineering
because there aren't many. But that part also made me really happy.
That's interesting.
So one part to inform your counterparts inside of Segment, because you got 65 engineers.
Some of them may be in the know, some may not.
This is a way to inform internally.
One way to also inspire.
And then another way to potentially hire.
We often interface a lot with different engineering teams through just what we do.
And I'm always curious why some of them don't put enough intention into their engineering blog.
So since you do such a great job, I wanted to make sure I asked you that question before we close out because you do a phenomenal job at it.
One, with the writing and then two, just with the design of it.
It's easy to read.
It's easy to browse.
And if you're listening to this, you get my stamp of approval to say, this is a blog you should look at to mirror or to mimic
when trying to do it for your organization.
Thank you.
Did you do that rhyme on purpose to inform, inspire, and to hire?
Was that all?
Did you plan that out, Adam?
I did.
Sorry.
Yeah.
It's a nice touch.
It's a nice turn of phrase.
And with that, you know, that's the show.
My rhymes and the show.
And I love that.
So there you go.
But any closing thoughts from either of you?
Any closing advice for, you know, those looking to your post as, you know, the Bible of information on whether we should go there and back again, and then there again.
Any closing advice for those listening to the show or anything to share as we close out?
I think my one piece would be that it's really all about finding the right fit for your infrastructure and your team. A lot of people
have reached out and been kind of nervous that they're going to make mistakes with their
microservice setup, and were curious to get my opinion on what I thought. And I think, again,
it's all about what is the best for your team at the time. And we're a perfect example of that.
You know, as Calvin mentioned, we started in a monolith because if we'd started with microservices,
there would have been no way for us to get off the ground. And then we moved to
microservices and that was the perfect solution for the time. But then after a couple of years,
it turned out not to be anymore. So I would say don't be afraid to make changes. And
it's about finding the right solution for your team and your infrastructure.
I would echo that 100%. Definitely just don't outsource your thinking.
It's just important to talk about trade-offs on both sides, really, I'd say for any engineering
decision you make, because if you don't explicitly acknowledge them, chances are there's some
con or something that you're giving up that you might not notice later.
Well, Alex and Calvin, thank you so much for taking the time
to walk us through some of the pros,
cons, ins and outs of your journey.
We appreciate your time.
Yeah, thank you.
Of course, thank you.
All right, thank you for tuning into the show today.
Love that you listened to this show.
Do us a favor, if you enjoy the show,
tweet about it, blog about it,
go into your favorite podcast app and favorite it, share it with a friend, tell somebody,
you know how much you love this show and we'll keep doing the same. We'll keep producing awesome
shows for you. I want to thank our sponsors, Indeed, Linode and GoCD. Also thanks to Fastly,
our bandwidth partner. Head to fastly.com to learn more. And we catch our errors before our
users do here at Changelog because of Rollbar.
Check them out at rollbar.com slash changelog.
We're hosted on Linode cloud servers.
Check them out at linode.com slash changelog.
And the hosts for this show were myself, Adam Stachowiak, and Jerod Santo.
The mix and master is by Tim Smith.
Music is by Breakmaster Cylinder.
And you can find more shows just like this at
changelog.com slash podcasts. While you're there, subscribe to Master, get all of our shows in one
feed at the changelog.com slash master. Thanks for tuning in. We'll see you next week.