The Data Stack Show - 89: Solving Microservice Orchestration Issues at Netflix with Viren Baraiya of Orkes
Episode Date: June 1, 2022

Highlights from this week's conversation include:

Viren's background and career journey (2:23)
Engineering challenges in Netflix's transition (6:05)
How Conductor changed the process (9:30)
Building a lot more microservices (16:04)
Open sourcing Conductor (17:38)
Defining "orchestration" (22:05)
Using an orchestrator written in Java (31:04)
Building a cloud service around microservices (34:59)
Differentiating product experiences (37:17)
Orchestration platforms in new environments (42:15)
Advice for those early on in their career (46:10)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome to the Data Stack Show. We are talking with a fascinating guest today, another guest from Netflix, actually. We talked to someone
from Netflix actually early in the life of the show and had a great conversation. And
today we're going to talk with Viren. He actually created a technology in Netflix, open sourced
it there, and then came back to commercialize it later in his career, which is a fascinating journey.
And it's in the orchestration space, which is super interesting.
And we haven't talked about it a ton on the show, Kostas.
I know you have technical questions.
My question is going to be, you know, orchestration tooling is not necessarily something that's new.
So I want to know what specific conditions at Netflix were sort of the catalyst for actually building a brand new orchestration tool?
Because that's going to be really interesting to hear, especially from the Netflix perspective.
What problems are they facing? Where were they at as a company, et cetera?
So, yeah, that's what I'm going to ask about. How about you?
I think it's a great opportunity to get a little bit deeper into the definition of what orchestration is,
because orchestration means many different things for different people in software engineering.
And I think this is something that's going to be very useful for our audience that's primarily data engineers to hear about. So hopefully we're going to spend a good amount of time
talking about the different flavors of orchestration out there
and when and how we use them.
Absolutely. Well, let's dig in and talk with Viren.
Let's do it.
Viren, welcome to the Data Stack Show.
We're so excited to chat today.
Thank you. Thank you for having me here.
All right. So we always start by just getting a brief background on you.
So could you tell us, you know, where did you start your career?
How did you get into sort of data and engineering?
And then what led you to starting Orkes?
Sure. Yeah. I'll keep it short, but essentially, you know, I spent the early days of my career working for firms on Wall Street, lastly at Goldman Sachs. And, you know, one thing that is typically the case with all the Wall Street firms is that data is their kind of secret sauce, right? Especially in today's world.
At some point in time I had an itch to go a little bit more technical, so, you know, I went on to work at Netflix, which was the early days of Netflix in terms of their pivot from being a pure streaming player, and number one at that point in time, to becoming a studio. And, you know, I got to work with some really brilliant engineers there and thought there might be an opportunity to scale myself out further, so I spent some time at Google afterwards, dealing with a couple of developer products, Firebase and Google Play, to be more precise.
And then, you know, one thing that I had done while I was at Netflix was build out this orchestration platform called Conductor and open-source it.
And we had seen a great momentum
in the open-source community
and even from the timing perspective
and felt it was the right time.
So, you know, I decided to take the plunge and start building out a cloud-hosted version of Conductor, and started Orkes with a bunch of my colleagues from Netflix.
And yeah, here we are.
It's been about a three-, four-month journey now.
Awesome.
Well, congratulations.
You're sort of just starting out, but that's super exciting.
Okay, I have to ask this question.
What was it like going from Wall Street to Netflix?
I mean, was there like both, you know, just from a data standpoint,
but also a cultural standpoint, it seems like that would be a huge shift.
Yeah, absolutely.
Like if you think about engineering practices, for example, in Wall Street, right?
And Goldman Sachs, to be honest, prides itself on being a very forward-thinking, very tech-oriented firm on Wall Street.
And they rely a lot more on open source compared to anybody else.
So in some ways, like, you know, engineering wise kind of the tech stack and everything was similar.
But how you think about kind of building things is very different. When you think about
companies like Netflix for example or any
tech companies,
the pace at which the innovation
happens is very different. It's very rapid
because here it's all about you have to be
always innovating for the future,
not for the current problem. So that was
one thing. And secondly, in terms of just
the cultural aspect of it, if you think
about it, tech companies tend to be a lot more open to new ideas taking
bold risk when it comes to technical investment and they and then you you essentially hire the
best engineers and let them do their best as opposed to kind of manage them from top down
so i think in terms of being able to do things there's a lot more freedom i would say and also
kind of the problem side,
like you are no longer in the second or third line
when it comes to working with the customers.
And in Wall Street, you rarely work with the customers directly; you work with them directly only at times, depending upon the team.
And you can see, and more importantly,
when your family or friends ask, what do you do?
You can tell them, I work for Netflix.
And if you watch this, this is what I did.
Yeah.
Yeah, that's a lot easier at dinner parties and cocktail parties.
Absolutely, absolutely. Yes.
Very cool. Thanks for sharing some of that insight. Okay, so let's go back to Netflix.
What a fascinating time to be there, when Netflix is going, you know, sort of from a
content aggregator and distributor, you know, say, to being like a producer. Those are very different kinds of
companies. And you said Conductor was kind of born out of that transition. What were the
challenges that you faced as an engineering team that, you know, sort of were represented in that
transition?
So I think, basically, you know, I joined a team, and our mission was to kind of build out the platform team
to support the entire content engineering
and studio engineering organization at Netflix.
And Netflix, as you know, right,
like has historically invested very heavily into microservices,
almost they championed microservices, right?
And one side effect of microservices, as you could see, is that, you know, you end up with so many of these little services, each with a single responsibility, right?
But now, you know,
as your business processes get more complex
and this started to become especially true
in the studio world,
where, you know,
you are not only dealing with the data that you own
and the teams that you work with,
but also external partners,
external teams,
really long-running processes. To give you an example, it could take months before a show is completed, in terms of its entire production process. And, you know, you are managing these long-running workflows all over the place. And this was one of the needs that we had. And, you know, as they say, sometimes it is not because you thought of a cool idea, but rather there was a problem that was hitting you directly, right? And I was responsible for it. I mean, the traditional way of solving this problem would be, you know, you just end up building an enterprise service bus or a pub-sub system like SQS, and build everything on top of it. And that's exactly what we were doing. And what we realized was that it worked well when your processes were very, very well defined and were simple enough. Now, there were two things that were happening, right? One is that the number of processes was exploding. The second thing was that Netflix is not a traditional Hollywood company, right? It's a tech company.
And they think about problems in a very different
way. You also want to experiment
with processes and see what works, what doesn't work.
Which means you want to be able to rapidly change
things and test it out, saying whether this works
or not, right? And so that agility
was another situation, right? Like one
thing that we absolutely did not like
at Netflix was building monoliths.
But what we realized was that
we were building distributed monoliths
because now the code was
there all over the place. And one
change meant I would go and talk to
100 engineers, beg them
to prioritize the change. And if a
product manager wanted to change something, you know, you
would go and talk to 100 engineers to figure out how the
process works. And this is where
we thought like, you know, this is not going to scale.
And we had to build something.
So that's where, basically, you know, we started thinking about Conductor.
We started with a very simple use case and it evolved very organically over the period
of time.
Can you give us an example of just one of those simple use cases and how you sort of solved that across, like, some specific microservices?
Absolutely.
So, like, the very first use case was very simple. It was basically, you know, you have a bunch of images that you have received from a marketing agency, or, you know, that you have used ML to produce from the video files.
We wanted to encode them in different formats, one for browser, for iOS, for Android, for
TV, and then deploy to a CDN.
Very simple, right?
You take an image, encode, deploy,
and then you test and see, you know, what works,
what doesn't work, and that's the format.
Does the PNG work better on iOS versus Android? If not, you do the same thing again, the entire process.
And it looked very straightforward and simple application
that we thought would be a very good test.
And that was the very first use case
that we actually built Conductor for.
And so how did Conductor change?
What was the process before?
And then how did Conductor change it?
So if you think about it, the original process was something like this:
there'll be an application
that is responsible for publishing images.
So, you know, now the person
who is building the application
is not necessarily the audio engineer
or video engineer, right?
Or the image or the, you know,
the engineer working with the images.
So now it's a different team.
They have a microservice.
So you call a microservice to say,
you know, give me the encodes in PNG format.
Then you wait.
You know, we relied very heavily on Spot resources on AWS, and Netflix has done some fantastic work there. So it could take some time. You wait for it. It would complete.
Then you go and deploy.
What if your deployment fails? You retry. And then this thing works. But then your product manager comes and says, hey, what we realized is that this particular format does not work very well, because of latency issues, maybe quality issues. So you go and ask that engineer who works on the encoding team to say,
hey, what's the API endpoint that I can use to encode in a different format?
And you need to do this, right?
Like it's a very intensive process.
And sometimes we were changing this multiple times during a week, right,
to see what works, what doesn't work,
getting feedback from the users, from A/B tests and whatnot.
Or now I want to deploy 20 images instead of 10 images
because I have more A-B tests to run. So those were kind of starting to become a little bit
unmanageable. And this is a very simple example. But if you put something in between
to say, OK, now depending also upon the country where we are going to launch the show, I want to have a different format, because some countries will probably need a lower-bandwidth image.
It starts to get very complicated.
And as you could imagine, right?
So interesting.
So in the new world, Conductor sits on top
and sort of interacts with all of the various microservices
to streamline that process.
Absolutely.
So like in the new world,
essentially what happens is that
instead of writing all the code,
what you're saying is that I have a microservice that can do image encoding, and I have got 10 different ones, each responsible for a different kind of format.
And then, as an engineer, you basically work with your product manager and say, okay, what's the flow that we want to see, right?
And it's like, okay, if the country code is this, these are the sort of images that we want to produce.
This is the CDN location that we want to deploy to.
You actually build out a diagram, a graph of what the whole thing
is going to look like. And when a new show
is ready to be published,
you call it. It does everything. You want
to change something, you go back and update the workflow
because the microservices are there. They are not
changing as much. It's just the flow that you
are tweaking and fine-tuning
and optimizing for.
And now, as an engineer, at some point in time, you can give the whole thing to the product manager and say, like, you know, why don't you just try it out if you want, separately, and if you find something that is missing, I can build a microservice and then you can plug it in, right? And it becomes a lot more of a tight coupling, or rather a tight conversation, between the engineer and the PMs.
The expectation is not
that the PM is going to go
and manage these things.
Engineers are still responsible
for building these things.
But, you know, your work gets simplified, right?
And now you don't worry about, oh, I have to put in retry logic. That's taken care of by Conductor. Conductor will take care of retries. You just write, or let me put it this way, you write for the best-case scenario, and Conductor takes care of all the edge cases, the failures, the retries, and everything else.
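To make that concrete, here is a rough sketch of what a flow like the image-encoding one could look like as a Conductor workflow definition, written here as a Python dict. The task names, parameters, and switch syntax are illustrative assumptions based on the general shape of Conductor's JSON workflow schema, not the actual Netflix workflow.

```python
# A hypothetical sketch of a Conductor-style workflow definition for the
# encode-and-deploy flow described above. Field names approximate Conductor's
# JSON schema; task names and parameters are made up for illustration.
image_publish_workflow = {
    "name": "publish_show_artwork",
    "version": 1,
    "tasks": [
        {
            # Branch on country: some regions get lower-bandwidth images.
            "name": "pick_format",
            "taskReferenceName": "pick_format_ref",
            "type": "SWITCH",
            "evaluatorType": "value-param",
            "expression": "countryCode",
            "inputParameters": {"countryCode": "${workflow.input.countryCode}"},
            "decisionCases": {
                "IN": [{
                    "name": "encode_image",  # existing encoding microservice
                    "taskReferenceName": "encode_low_ref",
                    "type": "SIMPLE",
                    "inputParameters": {"format": "jpeg_low_bandwidth"},
                }],
            },
            "defaultCase": [{
                "name": "encode_image",
                "taskReferenceName": "encode_png_ref",
                "type": "SIMPLE",
                "inputParameters": {"format": "png"},
            }],
        },
        {
            # Deploy the encoded asset to the CDN. Retries on failure come
            # from the task configuration, not from application code.
            "name": "deploy_to_cdn",
            "taskReferenceName": "deploy_ref",
            "type": "SIMPLE",
            "inputParameters": {"asset": "${encode_png_ref.output.uri}"},
        },
    ],
}
```

The point of the sketch is the division of labor Viren describes: the microservices stay put, the product conversation happens in this definition, and changing a format or adding a country is an edit here rather than a code change across ten services.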
Fascinating.
Okay, I have one more question
because I know that Kostas's mind is probably exploding with questions.
And this may sound funny,
but was it an immediate success in terms of adoption
or did you have to,
because sometimes with those things,
it's like changing,
even if technology makes things better, like adoption can be difficult or like, oh, we don't necessarily want to change the way that we do this.
Or like, it's actually work to migrate our whole thing.
Like, how did that happen culturally inside of Netflix?
Actually, that's a very interesting question, because Netflix famously has the culture of freedom and responsibility, which means, you know, you don't have mandates to say, hey, you use this framework.
Right. You know what frameworks are there and you choose
what you want to use. Sure, yeah, like self-serve
with all these options. Exactly, yeah.
And nobody's going to tell you why did you
choose this, right? It's up to you
to decide to get the job done.
So that becomes a very challenging thing,
that you can't just build something and get
a VP or director to go and send out
an email to everyone saying, we built this fantastic new framework, everybody must use it. Not going to happen. So that required us to go and talk to everyone,
right? So one approach that we took right
from the beginning was that we, ourselves
were developers, right? So we understood the pain points developers were having.
So we built it very much
like a democratized version, right?
We did not have a product manager.
We made a very conscious decision that we don't want to have a PM kind of shepherding the product, but rather let's talk to engineers.
What do they want?
So like every feature that we built was out of a necessity and a recommendation or a need from another engineer.
And that was one thing.
Second thing was we kept it very agile in terms of its development rather than trying to think about we have to build a very perfect system from the get-go,
we made it functional and we built the resiliency and everything kind of along the way
as we were testing with the internal users.
That was another way we kind of tried to evangelize within Netflix itself.
And of course, like, you know, there were always skeptics or people who wanted a different way, right? And we tried to keep it as open as possible. That was one thing. So the side effect of that is, if you look at the current repository also, it's a very flexible system. It's pretty much plug and play, like it's a Lego block, right, in some ways. And that was one of the reasons why it turned out like that, because, you know, we wanted to be able to satisfy as many people as possible. In some ways, you can think about it, right, that it increases the complexity and effort, but the advantage there was that, you know, everybody felt that they had a stake in the game. Most of the point was to get them invested in the
product and then you have someone who is happy.
Super interesting.
Okay, I lied.
I have just one more question.
I promise, Kostas.
Because it's always fun to hear about how these projects sort of form inside of a company
like Netflix with such an interesting culture.
I know that microservices were a catalyst for building Conductor to sort of make them easier to interact
with. But it wouldn't surprise me if actually microservices proliferated at a higher rate
after people started using Conductor because it was easier to manage a lot more. Did you see
people building a lot more microservices? Absolutely. Because see, what happens is that if you don't
have something like Conductor, then
you tend to take the shortest path, especially when you are under time pressure
or you have to deliver things.
If you were to build something with a complicated business flow, building five microservices, writing orchestration logic on top of it, and making all of these things work is more time and effort versus putting everything into a monolithic block and getting it out.
So in some ways, Conductor
kind of encouraged people that like, you know, break it
down because now you have another
side effect of it, right? The moment you break it down,
you have a lot more composability
to be able to change flows and everything.
So that was one thing that really
kind of inspired people to do that. The other thing
was that we built two critical aspects that everyone wants, right?
Traceability and controllability, right?
Like you can actually trace the entire execution graph visually and see the graph.
That just turned out to be a sleeper hit for us.
Like I had never thought that this is going to be the killer feature.
We thought the killer feature was going to be the distributed scale, but no. The killer feature was that UI. People loved it, because they could see exactly where things went wrong.
Because that's the problem you face otherwise: you have to go and look at the logs everywhere and see what's going on. Here, I have a UI, I just click on it and say, this is what's wrong. Go fix the code, retry, and yeah, it works. So some of it was like that also. That just encouraged it. Now, if you use Conductor, you get these features that you otherwise wouldn't get, and that encouraged people to write more microservices.
So, Viren, I have a question about the open source side of the project. How soon after you developed it did you open source the project?
I think it was about six to eight months journey.
Like, we took it to the place where it had enough features that, you know, it did not look like a toy project.
We also had to decouple everything from the internal, non-open-source side of the world.
And we wanted to put together some amount of governance process also, right?
Like my team,
we did not have any open source product
that we were managing ourselves.
So we had to kind of figure out
learning from other teams in Netflix, right?
So yeah, overall,
it took us about three quarters.
And from the day we decided that we would open source it, it took about a couple of months to get everything ready, right? Legal reviews, patent reviews, and all of these things.
Yeah, makes sense.
And how long did it take after you open sourced the project to start getting engagement and creating a community of adopters of Conductor out there?
I would say, you know, what I've seen is typically, you know, you have this initial bump, right?
Like open source people are excited.
They want to try it out. So there was
this initial bump. Then it starts
to taper off because, you know, there's
nothing new there. And it kind
of stayed there until very
recently, I would say. So what was
happening was that, so and
what we had done was that like we were doing
meetups at Netflix about Conductor
and also we're talking about Conductor and other
meetups and everything. So, you know,
as we kind of talked to people, it started to kind of
grow the momentum. The other thing was
like, you know, if the community is always just the consumer of the product, the community does not grow well, right?
Like, we kind of also
made it much easier for people to contribute
back to Conductor and
once people started contributing
back, right, it started to grow
further. Because now, again,
they have a stake in the game.
They have the ownership in the product itself
because they have contributed.
Of course. Makes a lot of sense.
And do you have any
use cases
that came out of the open source
community that surprised you?
Yes, absolutely. So one use case that, if you had asked me without my having learned about it, I would never have thought of in my wildest dreams, was security.
Oh, okay.
People using Conductor
to orchestrate the security flows,
things like threat detection.
For example, let's say you upload a file to an S3 bucket.
Typically, you want to run some processes and checks to ensure that you didn't upload a secret by mistake, or on purpose, or whatever, right?
Or there is not a virus.
And you are going to run a bunch of workflows around it, right?
Some automated, some manual.
And this is all done by folks who are into the security space,
not necessarily writing microservices,
but Conductor turned out to be a very good use case for them. So this is one
thing that surprised me, that
there's a strong use case here that I had not
thought about. Yeah, that's
so fascinating. I would never
think about it.
Exactly.
But then the more I think about it, these are long-running flows, right? It might take some time to scan an object.
If people are putting thousands of objects in S3 bucket, for example, right,
it may not get a real-time treatment.
So, you know, you have to have a backlog.
And if you find something, maybe somebody has to do manual intervention to verify, right?
Then you have a human process involved in it.
You send an alert or something, wait for someone to reply.
So with all these flows, it becomes a pretty good use case.
That's what I realized.
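As a rough illustration of the kind of security flow described here, the sketch below uses Conductor-style fork/join and human tasks, again as a Python dict approximating the JSON schema; the scanner task names are hypothetical.

```python
# A hypothetical sketch of an S3-upload scanning flow: run automated checks
# in parallel, then pause for manual review. Field names approximate
# Conductor's schema; the scanners themselves are imaginary.
scan_upload_workflow = {
    "name": "scan_s3_upload",
    "version": 1,
    "tasks": [
        {
            # Run the secret scan and the virus scan in parallel.
            "name": "fork_scans",
            "taskReferenceName": "fork_scans_ref",
            "type": "FORK_JOIN",
            "forkTasks": [
                [{"name": "scan_for_secrets",
                  "taskReferenceName": "secrets_ref", "type": "SIMPLE"}],
                [{"name": "scan_for_viruses",
                  "taskReferenceName": "viruses_ref", "type": "SIMPLE"}],
            ],
        },
        {
            "name": "join_scans",
            "taskReferenceName": "join_ref",
            "type": "JOIN",
            "joinOn": ["secrets_ref", "viruses_ref"],
        },
        {
            # If something was flagged, a human verifies. Because the
            # workflow is durable, waiting hours or days here is fine.
            "name": "manual_review",
            "taskReferenceName": "review_ref",
            "type": "HUMAN",
        },
    ],
}
```

The long wait on the human step is exactly the backlog-and-manual-intervention pattern Viren mentions: the orchestrator holds the state while an alert goes out and someone responds.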
Yeah, makes sense.
Okay, I'll probably get back to more open source
and company also related questions a little bit later.
But I'd like to discuss with you
about what orchestration means.
It's a term that is being used a lot in software engineering,
and not necessarily in every discipline the same way.
We have orchestration that data engineers are talking about.
We have orchestration that has to do with microservices.
We have workflow orchestration.
Then we have orchestration on Kubernetes,
and probably like many other types of orchestration out there.
Can you help us define, let's say, a taxonomy
around orchestration tools out there
and understand better the differences
between the different tools and the use cases?
Yeah, absolutely.
I mean, as you say, right, orchestration is an overloaded term, right?
Like it has different meanings to the people and use cases, right?
And, you know, having spent some time in this space, what I realized is that, essentially, if you look at it from the persona, like who is looking at the word orchestration, there's a different meaning to it, right? And if you go top to bottom in a company, so if you look at people on the business side, right, business analysts, product managers, who are dealing with the business processes that are high level, for them, essentially, when they think about orchestration, they are looking at how the various business components are getting orchestrated. So in an e-commerce company, this might be how am I orchestrating between my payment
and shipping and delivery and tracking systems or fulfillment services and things like that.
But again, this is at a business level.
And again, when you think about measuring the output of the orchestration, SLAs, any other key metrics that might be defined, right, they look at it from that perspective also, right?
Like the time it takes to complete certain activities, mean time between failures, and things like that, right?
How often they fail and where are the optimizations that they can make based on those data points, right?
They are not the ones who are actually building the systems.
Then that typically goes to the backend engineer.
And when they describe the same flow to the backend engineer,
essentially for them,
you could think about this individual things as kind of either microservices or other services, which, you know, for the lack of a better word, right?
We can just call it microservices.
And for them, when they think about orchestration, right?
It's about, I have a bunch of microservices and I have to build a flow around that.
How do I build that?
But now what I am looking at as an output from this is how do I handle
certain things? Like, for example, you know, in a distributed world,
those services are going to fail. Services are going to, you know,
have a different SLAs. How do I handle failures, retries,
different SLAs across them?
I want to be able to run some things in parallel so I can optimize, you know, the time it takes to complete the entire process, or resources and everything. How do I achieve that? And if I'm doing some things in parallel, how do I wait for them to complete? Because, you know, you can't just always do that otherwise, right? So that's how a backend engineer is thinking about orchestration. Then if you take it down one level,
in terms of the platform side of it, right? And if you zoom into, say, an individual microservice, this microservice is typically getting deployed onto a container these days, or a VM, or even a bare-metal machine somewhere. But, you know, you are not deploying one thing, right? You are deploying a whole bunch of things. And typically you don't only deploy a service, you deploy a service with, sometimes, at least in the initial phase, some more semantics around it, right? Like the networking configuration, databases, and everything. And that starts to get into more of the continuous deployment side of it, which is where container orchestration, for example, has become very mainstream, with Kubernetes, and Argo is another one there, for example. Where essentially it doesn't matter what you are doing: it's your piece of code that you are deploying, and it's scaling out and scaling down, and that's what you are focusing on. That's another level, container orchestration, that is happening.
is happening. And just to go back to the backend engineer, right?
Like there are different flavors of backend engineers also, right? You have backend engineers that are working with product managers to build an application. You also have data engineers who are dealing with massive amounts of data and orchestrating it, right? This is where things like Airflow, and, as we were seeing at Netflix, Conductor, are being used for a similar purpose, where you have data sitting in different places and you are essentially orchestrating that, right?
In a batch world, right? Like, you know, you are processing data, aggregating that, putting it into a database, maybe training some machine learning model, making inferences, putting it into a database. The whole thing is basically a flow. An offline flow is very well orchestrated through something like Conductor or Airflow and similar systems.
And a slight variation of this is kind of real-time data platforms where you still have flows.
If you think about, let's say, I click on a button in my phone or a website,
and you are sending out a signal, an analytics data point back to the server.
Now, this has to go through kind of a certain journey.
Like you are waiting for it to do a streaming aggregation, but once it is aggregated, it goes through
maybe a couple of other systems where either it is being used to
do either further aggregation, get it into
more of an analytics store, or maybe you are doing kind of real-time
model training through machine learning, right? So that's
kind of another flavor where there's no start or end of a workflow.
It's continuously running pipeline, but you have a complicated flow that is built out.
I think Kafka or Confluent has some tooling around that, but I think that seems to be
right now still a very wide open space.
I would say it's still an unsolved problem.
Yeah, makes sense.
So just to give an example, because our audience is primarily people that are working with data, and they are data engineers: what's the difference between a system like Airflow and Conductor? And why wouldn't I use, let's say, Conductor, and why would I prefer Airflow in order to orchestrate my pipelines?
Absolutely. So if you think about Airflow, right, from its genesis and the kind of problem it solves,
it's mostly about data, right? Like typically, you know, you have
data sitting in different buckets or databases like Hive
and you are processing, right? And these are typically batch jobs that you run
on an hourly basis, maybe twice a day,
three times a day, or daily, and things like that,
and runs through the data pipeline.
The other important part of a data pipeline
also is kind of the dependency management, right?
Like, you have to run in a specific sequence
because, you know, your data at a given step
depends upon the previous step.
Also, re-running in the context of data is very different from re-running in a microservices world, right? When you re-run some data, you are essentially running data for that particular date or time frame, and you're only processing that data alone; you're never processing the latest data. So those are the kind of high-level use cases that I've seen Airflow being used for, and it does them well, right? Also, if you think from
the users of Airflow, right?
These are mostly people dealing with data and
the language of the choice today for that
is Python. So, you know, Airflow DAGs
are written in Python and they tend to be simple in nature,
right? Like you have a sequence of things
that you do, sometimes you fork
and you are done with it, right?
These pipelines are very stable, very fixed.
You don't change them every day.
You don't do A-B testing of this pipeline.
It doesn't make sense to do that, right?
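For contrast, a pipeline of the fixed, scheduled, batch-oriented shape Viren describes looks something like this as an Airflow DAG; the extract, aggregate, and load functions are placeholders.

```python
# A minimal Airflow DAG of the stable, scheduled kind described above.
# The task bodies are placeholders for real batch steps.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...  # pull the day's partition from the source


def aggregate():
    ...  # batch aggregation over that partition only, never the latest data


def load():
    ...  # write results to the warehouse


with DAG(
    dag_id="daily_events_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",  # a handful of runs a day, not millions
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_aggregate = PythonOperator(task_id="aggregate", python_callable=aggregate)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # A fixed, linear dependency chain: each step depends on the previous one.
    t_extract >> t_aggregate >> t_load
```

Re-running this pipeline means re-running it for a specific execution date, which is the data-specific notion of retry Viren contrasts with the microservices one.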
Conductor kind of goes into the next spectrum.
It's more about flows which could be running
for a very long period of time, say months at the end,
or a very short one where you complete the entire flow
in a few hundred milliseconds,
and everything in between, right?
But instead of running a few executions a day, you could be running a few hundred million executions per day, or even a billion executions per day, depending upon your use case. So the scale side of it is very different. At the same time, a typical workflow is not operating on petabytes of data. A step in the workflow is typically dealing with a finite set of data.
And sometimes you do. Like, for example, one use case that we had at Netflix was processing the video files, and a raw video file could be a petabyte in size.
Yeah.
In that case, you know, you have to be processing for a longer period of time.
The other thing is that Conductor is very general purpose, and it is meant for pretty much the entire spectrum of audiences. So it's very much language agnostic. You know, we had workflows where one step was written in C++, another in Python, and a third one in Java, and so forth. So it allows you to kind of mix and match depending upon the owner of the step in the process. So Conductor becomes very useful in these kinds of scenarios where, you know, you have a very heterogeneous environment, and the scale is another thing.
That's very interesting.
And as you were talking about Airflow and Python, a question came up for me.
So, Conductor is written in Java, right?
Yep.
How is it, I mean, you gave an answer, but I would like to hear a little bit more about this. How is it, for example, for a team that is primarily using Golang to create microservices to employ an orchestrator that's written in Java? Because probably you're not going to have a team there that knows Java, right?
Yep.
How does this work? And which team is usually responsible, like the platform team? Who is responsible for managing, deploying, and taking care of the orchestrator?
So let me answer the first question first. I think the way Conductor works essentially
is it exposes its API through HTTP and GRPC,
and that's how it becomes kind of language agnostic, right?
So let's say if you are a Golang shop,
you are writing your microservices in Golang,
and you are building your orchestration flow in Conductor.
Conductor also provides client API.
So there are two parts to Conductor.
You have a client or SDK and the server side.
Server side is in Java.
SDKs are written in different languages. I think there are three right now: Java, Python, and Golang. So, you know, you use the SDK in that particular language to interact with Conductor. And gRPC is great where, you know, if you want bindings for Rust, for example, you can do that using the gRPC compiler.
so that's kind of how it works today.
And that's why it is language agnostic, because the entire model is that way.
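Concretely, the language-agnostic model means a worker in any language only needs HTTP: poll for a task of its type, do the work, post the result back. Here is a rough Python sketch of that loop; the endpoint paths follow Conductor's REST API but should be treated as approximate rather than authoritative.

```python
# A rough sketch of a polling worker talking to a Conductor server over
# plain HTTP. Endpoint paths approximate Conductor's REST API; the server
# URL and task type are assumptions for illustration.
import time

import requests

SERVER = "http://localhost:8080/api"  # assumed local Conductor server


def run_worker(task_type: str, worker_id: str) -> None:
    while True:
        # Ask the server for a pending task of this type.
        resp = requests.get(
            f"{SERVER}/tasks/poll/{task_type}",
            params={"workerid": worker_id},
        )
        if resp.status_code != 200 or not resp.text:
            time.sleep(1)  # nothing queued; back off briefly
            continue
        task = resp.json()

        output = {"encodedUri": "s3://bucket/out.png"}  # placeholder work

        # Report the result; the server drives retries and the next steps.
        requests.post(f"{SERVER}/tasks", json={
            "taskId": task["taskId"],
            "workflowInstanceId": task["workflowInstanceId"],
            "workerId": worker_id,
            "status": "COMPLETED",
            "outputData": output,
        })


if __name__ == "__main__":
    run_worker("encode_image", "worker-1")
```

The same loop could be written in Go, Java, or C++ against the same endpoints, which is what lets each step in a workflow be owned by a team in a different language.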
The second part, who runs Conductor, that's a very interesting question.
I think I have seen kind of both sides of it,
in the sense that there's a platform team that is responsible for running Conductor. This was exactly what we were doing at Netflix, and my team was responsible for managing it as a platform for
all the teams. But that was a model
at Netflix. You have a platform team and this
tends to be a lot more common in tech companies
where you have a platform
team responsible for all the components and
everybody else uses that. We have
also seen the other side where you
have business teams that own the entire stack by themselves, and then they are responsible for running Conductor on their side. So I think it in some ways goes back to the culture of the company, right? How they are formed, and, you know, what's their kind of usage model for maintaining the products, how that works.
Yep, yep. That's interesting, and probably not solved yet.
As you said, it also has to do a lot
with the culture.
I hear a lot about
platform teams, but
it doesn't mean that every company
has a platform team.
You can't just wake up one day and be like, let's have a platform team now.
I mean, there are two
challenges. Building a good platform team is not easy.
Hiring for the platform team is even more difficult. Like, hiring engineers is difficult; now we're talking about hiring platform engineers. That makes it exponentially harder to, you know, build a team, right?
Yeah, yeah.
I think, in reality, like, what really works well
is that, like, you know, if you treat your platform team
as a mini cloud team in your organization, right?
So today, for example, if I
want to use an RDS database
from AWS,
I can go to the console, provision one for myself, and start using it. And AWS
takes care of everything else for me, right?
Provisioning, backup, restores, and
everything. So if you end up building
a platform team that can get to that stage
where, let's say, any product
for that matter, not just conductor, that they are
able to offer in a self-service mode
and they focus on building that platform out of
it, right? But then again, cloud
companies are offering more and more of this thing, right?
So, you know, the line between the
internal platform team and a cloud
provider becomes
thinner and thinner day by day, right?
100%.
Alright, so let's go back to building a company. So you open sourced the project, you started having some traction out there, and at some point you decided to build a company around the core technology. My first question is: when we are talking about an orchestrator that is going to be interacting with microservices, and, as you said, there are use cases where you might want to run millions of interactions and the latency should be super low, how do you build a cloud service around that? How do you make sure that the microservices that the company is building are, let's say, sharing the same resources or the same networks, and all the stuff that's needed there to make sure that the latency remains as low as possible?
Yeah, I mean, I think the key to
that is essentially your deployment model, right?
Like how are you deploying those things?
Like, you know, the lower the latency you want, the more co-located you want to be.
So essentially what we have done is, we have built out two different models of deployment. One deployment is where you have Conductor running in a separate VPC and your microservices in a separate VPC, with VPC peering that allows them to communicate with each other.
And you try to kind of keep the affinity between availability zones.
So, you know, your network does not go through very heavy kind of hops.
The second model, where you want really low latency, is sometimes to deploy Conductor inside your own network, as close to the microservices as possible, reducing network hops, because now we are talking about a few tens of milliseconds of latency difference. Or you can even embed it. The beautiful thing about Conductor is that, you know, it can run in a cloud environment and handle millions of flows, if not billions, every day, or you can also embed it, where now you are running with a very low memory footprint, pretty much running in the customer's edge environment, right? Like small deployments.
So, you know, you essentially have to kind of make those set of decisions to figure out what are your requirements and how you deploy that.
And to be honest, like this was something that we had to kind of think it through
and figure out exactly how this is going to work and come up with a solution there.
But this always is an interesting challenge to solve.
So how is the product experience different between the
two deployment models? And the reason I'm asking is because
Eric is probably aware of that because, with Rudderstack, there were multiple different deployment models, although for different reasons; there it wasn't so much the performance, it was in many cases to do with, let's say, compliance. But building a product that has a consistent experience regardless of the deployment model is super, super hard. So how do you approach this problem?
I would say, when I think about the end users
of the system, I would say there are two groups. There are the engineers who are actually using the product to build the applications, right? For them, there should be no difference. Like, you know, they are still dealing with the same
set of APIs and same set of constructs and everything. You know,
if they were to go to the UI, you have a URL where you go and look at your workflows
and manage everything, right? So that experience must be
consistent no matter how things are deployed. The second set where
actually it matters is the people
who are actually responsible for the operational aspect of it, and this could be a platform team, DevOps, SREs. And this is where I think the key difference comes in, right? And it's very similar to running a relational database in a VM that you have provisioned yourself versus running something like RDS, where a fully hosted service gives you an experience where essentially you don't even need to think about that.
You know,
things are taken care of for you, right?
Like it scales automatically for you.
You don't have to worry about backups,
what kind of database I should be using
if I need to get this performance
or whatever or not.
You specify how much capacity
that you want to run with
and system kind of scales for you, right?
The other option,
essentially when you are running in your environment, right? You are also making those
decisions by yourself now that, you know, how big the instance should be, where
should my backups be? And if something goes wrong, how do I restore from the backup?
And I'm also responsible for my costs now, right? Like I can't just run
a thousand node cluster without having that show up on my annual or monthly
billing from AWS or
GCP. So I think
for those people, I think the experience
becomes slightly different.
I think ultimately the goal is still
to be kind of make it easy
in terms of the UI
interactions like the console side of
the world, right, where if you are
dealing with the conductor console in
the cloud to say, you know, provision me this cluster,
this is where my backups are,
and restore and everything,
it should be as frictionless as possible.
As opposed to saying: oh, here's a backup, you download it, run this command, just make it available yourself.
So that part is, I think,
where the challenges are.
Yeah, makes a lot of sense.
And, I mean, you were going through your journey and you talked about the differences between working in the financial sector and then going to a B2C company like Netflix. But if we take it from Netflix until today: you were at Netflix, working with your customers being inside Netflix, obviously. Then you open sourced the project, and suddenly you had a much more open, let's say, platform to experience, because people out there started using it and giving feedback. And now you did another step forward and you started the company. So how does it feel, and what's the difference between these steps that you had to go through?
I think, yeah, I think there are some
things which are common. Like, for example, you still care about the community; you are still working with the community, trying to build the community and grow the community. That part does not change much. Product,
in some sense also, you have the same amount of focus whether you are internal or external.
One key difference here is that internally,
typically, sometimes you have some other pressures in terms of
I need this feature because we have this thing that is coming up. So you prioritize.
As a company, your prioritization has a different kind of way of thinking about it, right?
It could be depending upon the customer pipeline, the features, and things like that.
The second thing is that as you kind of build a company, right, like you cannot just think about product alone.
Now you have to think about everything around it, right?
Like about the company, your investors, your customers,
and especially in a startup environment, right?
You are the engineer, you are the customer support person,
you are the marketing person, you are the revenue officer.
You are wearing pretty much all the hats, right?
So, you know, you are doing probably 10x amount of work.
And you also have to make money, which is also like an important...
Absolutely.
Absolutely.
All right.
That's awesome.
One last question for me,
and then I'll give the microphone to Eric,
which is like a little bit of like a technical question,
but we talked about orchestration
and we're talking about like orchestration
of microservices, right?
There is like another kind of,
let's say like computation platform or model
that's becoming more and more popular lately,
which has to do with edge computing,
where you have, let's say, these functions that are pushed to the edge
and executed from there and all that stuff.
Do you see any kind of opportunities there for orchestration platforms to work in such environments?
And if yes, how?
Actually, that's an excellent question.
And to be honest, there is a huge opportunity there.
And the reason is this, right?
What is happening is that, and this is again, like, you know, my interpretation, right?
So I could be a hundred miles off from the reality,
but hardware has become a lot more powerful, right?
So there's a lot more opportunity to push a lot of processing closer to where
the customers are. And this could be, for example, in the embedded devices,
right? Where like, you know, you are not running in cloud,
the whole thing runs on a customer environment,
like sometimes on premise, for example, right? And
If it does one thing, that's fine. But usually, again, you have multiple things that you are coordinating and orchestrating against, right? And the concerns around reliability, fault tolerance, retriability, failure handling, they do not go away just because you push it to the customer environment. But it also now means that you have much less visibility and control over this environment.
So you want this to be even more
reliable and be able to handle more
failures compared to anything else. So in
some sense, that's a huge opportunity.
And at the same time, there are some constraints,
right? Like, even though
hardware has become powerful, you
are still constrained with the memory, for example, right?
You can't load a lot of components, because it's also running other things, right?
It's not just doing orchestration. But to be honest, we have seen
use cases for Conductor in this space, and there are some customers
using it in that particular area. Oh, that's very interesting.
Do you feel like there are also changes that
need to happen in how, let's say, an orchestrator
is architected in order to work more efficiently with the edge computing environment? Or are we fine with how Conductor was designed and implemented so far?
I mean,
no, it needs
some changes. For example,
a few things, right?
Like, if you're running in the cloud, right,
you can have a Cassandra and Elasticsearch and Redis
and a few other components working together, right?
And that's completely fine because, you know,
you have all these things at your disposal
and you can orchestrate that.
The moment you put it in the edge environment,
you want a lot more self-contained systems, right?
So, you know, you are almost going back to the drawing board to see, you know, what are the bare minimum components that you need, what can be run in an embedded mode, find the alternative, and plug it in there, right? One advantage that we had with Conductor was that, because it was designed as a modular system from the beginning, it just made it possible to say, okay, we cannot use Elasticsearch because it's just too expensive or not possible to run in an edge environment.
It should be replaced with this embedded
database, right? And you would
implement those interfaces and get it done.
That was an advantage, but as you say,
it requires changes, right?
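As an illustration of the modular design Viren credits here, the sketch below shows the general pattern in Python; Conductor's actual plug-in interfaces are Java, so the names and shapes here are invented for the example.

```python
# An invented illustration of the pluggable-backend pattern described above:
# the engine depends on a small interface, so an edge deployment can swap a
# heavy component like Elasticsearch for an embedded store.
from typing import Protocol


class IndexBackend(Protocol):
    """The minimal indexing surface the engine programs against."""

    def index_workflow(self, workflow_id: str, doc: dict) -> None: ...
    def search(self, query: str) -> list[dict]: ...


class ElasticsearchBackend:
    """Full-featured backend for cloud deployments."""

    def index_workflow(self, workflow_id: str, doc: dict) -> None:
        ...  # write to an Elasticsearch cluster

    def search(self, query: str) -> list[dict]:
        ...  # query the cluster


class EmbeddedBackend:
    """Small-footprint, in-process backend for edge deployments."""

    def __init__(self) -> None:
        self._docs: dict[str, dict] = {}

    def index_workflow(self, workflow_id: str, doc: dict) -> None:
        self._docs[workflow_id] = doc

    def search(self, query: str) -> list[dict]:
        # Naive substring match; enough for a constrained edge environment.
        return [d for d in self._docs.values() if query in str(d)]
```

Implement the interface, plug in the lighter backend, and the rest of the engine does not change, which is the point Viren makes about going back to the drawing board for bare-minimum components.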
Yeah, awesome. Yeah, that's very interesting.
Hopefully we'll have the opportunity in the future to talk
more about that stuff. Eric, all yours.
This has been such a fun conversation.
Unfortunately, we're really close to time here, so we only have time for one question,
although we know that I always lie about only having one question. But, you know, I think in
many ways, a lot of us who work in the tech industry, you know, sort of being at a company
like Netflix, being instrumental in building a technology that sort of solves a
major problem, and then goes on to be open source. And then I think, you know, for some of us who are,
who are, you know, entrepreneurial in nature, like actually starting a company on that, I mean,
A, congratulations, that's, that's really just an incredible journey. But B, I think, just it's sort
of an aspirational story for a lot of us, right?
Do you have any advice, you know, for people who say, I mean, that's sort of like the pinnacle of, you know, the experience of being involved in engineering and cool open source projects and solving problems.
Like, I would just love for you to talk to maybe some of our listeners who are like early or mid in their career
and give them some advice that you learned along the way.
Yeah, sure. I mean, I'm still early in my journey, so we'll see how that ends up being. But here's what my thought process was: you can keep doing the same thing and keep polishing. You can go from Netflix to Google to, you know, somewhere else, like Meta, for example, and keep doing those things, right? But in the end, the way I think about it is that unless your career progression gives you a kind of step function, it's not worthwhile, and you ought to look for those step functions, right? And, you know, that could be, for example, learning new technology, coming up with some new frameworks, evangelizing those things, and what's the next thing after that, right? Maybe it's to prioritize that and see how it works. And it's a very different kind of experience, right? Like, there's one thing about building a product where, you know, you are dealing with your IDE and compiler and breaking your head over bugs, and it's a very different ball game as compared to that,
in terms of how you go about raising money, right? Actually, even before that, if you start a company by yourself, you have to also find a co-founder. So first you have to
convince your potential co-founder
to say, hey, this is a great idea.
Once you convince them, you have to go
and find an investor, especially in the enterprise
world, right? Like you can't go swap,
you need some outside investment.
How do you kind of show the value that
what you are building makes sense? You have the right
skill sets and pedigree to kind of go and build this out.
So that's kind of the, like, you know the story,
but how do you tell the story in a compelling way, right?
That's the other part.
And then finally, once you have that, right,
how do you kind of go around building this out?
Like, where are you going to hire people, right?
How are you going to scale and all of those things, right?
And how are you going to find customers?
What's the go-to-market strategy and how do you actually implement?
It starts to kind of get into that, right?
So it's a very rewarding experience.
Like to me, when I was thinking about it,
what I realized was that no matter what's the outcome,
I'm going to come out on top.
Like that learning is going to be valuable.
And either way, it's going to be super useful in the career.
Yeah, I think that's such good advice for all of us, knowing that
no matter the outcome if you learn
that's the ultimate
progress so thank you for that
and thank you so much for
really helping educate us on orchestration
sort of as a category
and all of the differences there
we had a great time on the show.
Thanks for joining us.
Yeah.
Thank you so much.
Thanks for all the insightful questions.
Yeah, it was really fun.
I don't know if I have a really insightful technical or sort of data-related takeaway
from this show, so forgive me.
But I just think it's really interesting to think about working on infrastructure at Netflix
while they're transforming the company from being a sort of content, you know, distributor to being
a content producer. And it was actually fascinating to hear about that problem described through the
lens of microservices, right? I mean, you wouldn't think about, you know,
in like a Harvard Business Review case study of like Netflix's pivot from,
you know, distributor to studio,
like they're not going to talk about microservices,
but that actually was a real pain point as they were making the transition.
And so I just really appreciated that perspective.
You know, you wouldn't really hear about that particular specific flavor of technical challenge in the process of a transition like that.
So it was really fun to get to get an insight there.
Yeah, and it seems like Netflix is one of these companies that are really fueling the next wave of innovation right now.
I mean, there are a couple of different products and companies that are actually coming from Netflix, which is great.
And it's like super interesting to see all these people, how they were together in Netflix
and now they're out there in the market and building companies and creating new products.
So they definitely did something right.
And I guess the Harvard Business Review
should look into it at some point.
But outside of this
and all the very interesting conversations
that we had for the technical details of orchestration,
I think one thing that I'll keep
and I would really love to learn more about, is edge computing and orchestration, which is still something that's early for this kind of technology.
But I think we are going to be hearing more and more about that like in the future.
So that's another thing that I'm keeping from the conversation we had.
For sure.
And if anyone is listening who is with the Harvard Business Review,
it's a little bit abnormal, but we're happy to do a cover story if you're interested. So
definitely hit us up and reach out to Brooks if you want to talk about that.
Lots of great shows coming up. We will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.