The Data Stack Show - 89: Solving Microservice Orchestration Issues at Netflix with Viren Baraiya of Orkes

Episode Date: June 1, 2022

Highlights from this week’s conversation include:

Viren’s background and career journey (2:23)
Engineering challenges in Netflix transitions (6:05)
How Conductor changed the process (9:30)
Building a lot more microservices (16:04)
Open sourcing Conductor (17:38)
Defining “orchestration” (22:05)
Using an orchestrator written in Java (31:04)
Building a cloud service around microservices (34:59)
Differentiating product experiences (37:17)
Orchestration platforms in new environments (42:15)
Advice for those early on in their career (46:10)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show. We are talking with a fascinating guest today, another guest from Netflix, actually. We talked to someone from Netflix early in the life of the show and had a great conversation. And today we're going to talk with Viren. He actually created a technology at Netflix, open sourced
Starting point is 00:00:44 it there, and then came back to commercialize it later in his career, which is a fascinating journey. And it's in the orchestration space, which is super interesting. And we haven't talked a ton about on the show, Costas. I know you have technical questions. My question is going to be, you know, orchestration tooling is not necessarily something that's new. So I want to know what specific conditions at Netflix were sort of the catalyst for actually building a brand new orchestration tool? Because that's going to be really interesting to hear, especially from the Netflix perspective. What problems are they facing? Where were they at as a company, et cetera?
Starting point is 00:01:23 So, yeah, that's what I'm going to ask about. How about you? I think it's a great opportunity to get a little bit deeper into the definition of what orchestration is, because orchestration means many different things for different people in software engineering. And I think this is something that's going to be very useful for our audience that's primarily data engineers to hear about. So hopefully we're going to spend a good amount of time talking about the different flavors of orchestration out there and when and how we use them. Absolutely. Well, let's dig in and talk with Viren. Let's do it.
Starting point is 00:01:59 Viren, welcome to the Data Stack Show. We're so excited to chat today. Thank you. Thank you for having me here, Eric. All right. So we always start by just getting a brief background on you. So could you tell us, you know, where did you start your career? How did you get into sort of data and engineering? And then what led you to starting Orkes?
Starting point is 00:02:22 Sure. Yeah. So I'll keep it short, but essentially, you know, I spent the early days of my career working for firms on Wall Street, lastly at Goldman Sachs. And, you know, one thing that is typically the case with all the Wall Street firms is that data is their secret sauce, right? Especially in today's world. At some point of time I had an itch to go a little bit more technical, so, you know, I went on to work at Netflix, which was the early days of Netflix in terms of their, you know, pivot from being a pure streaming player, and number
Starting point is 00:02:52 one at that point of time, to becoming a studio. And, you know, I got to work with some really brilliant engineers there and thought like, you know, there might be an opportunity to kind of scale myself out further. You know, I spent some time at Google afterwards, dealing with a couple of developer products, Firebase and Google Play, to be more precise.
Starting point is 00:03:08 And then, you know, one thing that I had done while I was at Netflix was, I had built out this orchestration platform called Conductor and open-sourced it. And we had seen great momentum in the open-source community, and even from the timing perspective it felt like it was the right time.
Starting point is 00:03:23 So, you know, I decided to kind of, you know, take the plunge, you know, and start building out a cloud-hosted version of Conductor. And I started Orkes with a bunch of my colleagues from Netflix. And yeah, here we are. It's been an almost three-, four-month-old journey now. Awesome. Well, congratulations. You're sort of just starting out, but that's super exciting. Okay, I have to ask this question.
Starting point is 00:03:45 What was it like going from Wall Street to Netflix? I mean, was there, like, both, you know, just from a data standpoint, but also a cultural standpoint, it seems like that would be a huge shift. Yeah, absolutely. Like if you think about engineering practices, for example, in Wall Street, right? And Goldman Sachs, to be honest, right, like prides itself on being a very forward-thinking, very tech-oriented firm in the Wall Street. And they rely a lot more on open source compared to anybody else.
Starting point is 00:04:15 So in some ways, like, you know, engineering wise kind of the tech stack and everything was similar. But how you think about kind of building things is very different. When you think about companies like Netflix for example or any tech companies, the pace at which the innovation happens is very different. It's very rapid because here it's all about you have to be always innovating for the future,
Starting point is 00:04:38 not for the current problem. So that was one thing. And secondly, in terms of just the cultural aspect of it, if you think about it, tech companies tend to be a lot more open to new ideas taking bold risk when it comes to technical investment and they and then you you essentially hire the best engineers and let them do their best as opposed to kind of manage them from top down so i think in terms of being able to do things there's a lot more freedom i would say and also kind of the problem side,
Starting point is 00:05:05 like you are no longer in the second or third line when it comes to working with the customers. In Wall Street, you never work with the customers directly. Here, you directly work with the customers at times, depending upon the team. And you can see the impact. And more importantly, when your family or friends ask, what do you do?
Starting point is 00:05:20 You can tell them, I work for Netflix. And if you do this, this is what I did. Yeah. Yeah, that's a lot easier at, like, dinner parties and cocktail parties. Absolutely. Absolutely. Yes. Very cool. Thanks for sharing some of that insight. Okay, so let's go back to Netflix. What a fascinating time to be there, when Netflix is going, you know, sort of from a content aggregator and distributor, say, to being like a producer. Those are very different kinds of
Starting point is 00:05:50 companies. And you said Conductor was kind of born out of that transition. What were the challenges that you faced as an engineering team that, you know, sort of were represented in that transition? So I think basically, you know, I joined a team whose mission was to kind of build out the platform team to support the entire content engineering and studio engineering organization at Netflix. And Netflix, as you know, right, like has historically invested very heavily into microservices,
Starting point is 00:06:19 almost they championed microservices, right? And one side effect of microservices, as you could see, is that, like, you know, you end up with so many of these little services, each with a single responsibility principle, right? But now, you know, as your business processes get more complex
Starting point is 00:06:34 and this started to become especially true in the studio world, where, you know, you are not only dealing with the data that you own and the teams that you work with, but also external partners, external teams, really long-running processes. Like to give you an example, right? Like before us, you know,
Starting point is 00:06:50 it could take months before a show is completed, right? Like in terms of its entire production process. And, you know, you are managing these long-running workflows all over the place. And this was one of the needs that we wanted to address. And, you know, as they say, right, sometimes it is not because you thought of a cool idea, but rather there was a problem that was hitting you directly, right? I mean, the traditional way of solving this problem would be, like, you know, you just end up building an enterprise service bus or a pub-sub system like SQS, right, and build everything on top of it. And that's exactly what we were doing. And what we realized was that it worked well when your
Starting point is 00:07:29 processes were very, very well defined and were kind of simple enough. Now, there were two things that were happening, right? One is that the number of processes was exploding. The second thing was that Netflix is not a traditional Hollywood company, right? It's a tech company. And they think about problems in a very different way. You also want to experiment with processes and see what works, what doesn't work. Which means you want to be able to rapidly change things and test it out, saying whether this works
Starting point is 00:07:55 or not, right? And so that agility was another consideration, right? Like one thing that we absolutely did not like at Netflix was building monoliths. But what we realized was that we were building distributed monoliths, because now the code was there all over the place. And one
Starting point is 00:08:12 change meant I would go and talk to 100 engineers, beg them to prioritize the change. And if a product manager wanted to change something, you know, you would go and talk to 100 engineers to figure out how the process works. And this is where we thought like, you know, this is not going to scale. And we had to build something.
Starting point is 00:08:28 So that's where, basically, you know, we started thinking about Conductor. We started with a very simple use case and it evolved very organically over the period of time. Can you give us an example of just one of those simple use cases and how you sort of solved that across, like, some specific microservices? Absolutely. So, like, the very first use case, right? It was very simple. Like it was basically, you know, you have a bunch of images that, you know, you have received
Starting point is 00:08:50 from a marketing agency or, you know, that you have used ML to produce from the video files. We wanted to encode them in different formats, one for browser, for iOS, for Android, for TV, and then deploy to a CDN. Very simple, right? You take an image, encode, deploy, and then you test and see, you know, what works, what doesn't work, and in what format. Does PNG work better on iOS versus Android?
Starting point is 00:09:13 If not, you do the same thing, the entire process, again. And it looked like a very straightforward and simple application that we thought would be a very good test. And that was the very first use case that we actually built Conductor for. And so how did Conductor change that? What was the process before? And then how did Conductor change it?
Starting point is 00:09:30 So if you think about it, the original process was something like this: there'll be an application that is responsible for publishing images. So, you know, now the person who is building the application is not necessarily the audio engineer or video engineer, right?
Starting point is 00:09:42 Or, you know, the engineer working with the images. So now it's a different team. They have a microservice. So you call a microservice to say, you know, give me the encodes in PNG format. Then you wait. You know, we relied very heavily on spot
Starting point is 00:09:55 resources on AWS, and Netflix has done some fantastic work there. So it could take some time. You wait for it. It would complete. Then you go and deploy. What if your deployment fails? You retry. And then this thing works. But then your product manager comes and says, like, hey, you know, what we realized is that this particular format does not work very well, maybe because of latency issues or, you know, quality issues. Or you go and ask that engineer who works on the encoding team to say, hey, what's the API endpoint that I can use to encode in a different format?
Starting point is 00:10:29 And you need to do this, right? Like it's a very intensive process. And sometimes we were changing this multiple times during a week, right, to see what works, what doesn't work, getting feedback from the users from A/B tests and whatnot. Or now I want to deploy 20 images instead of 10 images because I have more A/B tests to run. So those were kind of starting to become a little bit unmanageable. And this is a very simple example. But if you put something in between
Starting point is 00:10:52 to say, OK, depending upon the country now also, where we are going to launch the show, I want to have a different format, because some countries probably need a lower-bandwidth image, it starts to get very complicated, as you could imagine, right? So interesting. So in the new world, Conductor sits on top and sort of interacts with all of the various microservices
Starting point is 00:11:17 to streamline that process. Absolutely. So like in the new world, essentially what happens is that instead of writing all the code, what you're saying is that I have a microservice that can do image encoding, and I have got 10 different ones of them, where each one is responsible for a different kind of format. And then as an engineer, you basically work with your product manager and say, okay, what's the flow that we want to see, right? And it's like, okay, if the country code is this, these are the sort of images that we want to produce.
Starting point is 00:11:40 This is the CDN location that we want to deploy to. You actually build out a diagram, a graph, of what the whole thing is going to look like. And when a new show is ready to be published, you call it. It does everything. You want to change something, you go back and update the workflow, because the microservices are there. They are not changing as much. It's just the flow that you
Starting point is 00:11:57 are tweaking and fine-tuning and optimizing for. And now, as an engineer also, at some point in time, you can give it the whole thing to the product manager and say, like, you know,
Starting point is 00:12:07 why don't you just try it out if you want, separately, and if you find something that is missing, I can build a microservice and then you can plug it in, right?
Starting point is 00:12:15 And it becomes a lot more tight coupling or rather tight conversation between the engineer and the PMs. The expectation is not that the PM is going to go and manage these things.
Starting point is 00:12:23 Engineers are still responsible for building these things. But, you know, your work gets simplified, right? And now you don't worry about, oh, I have to put in retry logic. That's taken care of by Conductor. Conductor will take care of retries. Let me put it this way: you write for the best-case scenarios, and Conductor takes care of all the edge cases,
Starting point is 00:12:38 the failures, the retries, and everything else. Fascinating. Okay, I have one more question, because I know that Kostas's mind is probably exploding with questions.
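[Editor's note: the division of labor described here — worker code implements only the happy path, while the orchestration layer owns retries and failure handling — can be sketched in a few lines of Python. This is purely illustrative; the function and worker names are invented, and it is not Conductor's actual API.]

```python
import time

def run_with_retries(task, *args, max_attempts=3, backoff_s=0.0):
    """Orchestrator-side retry policy wrapped around a best-effort worker.

    `task` is any callable that raises on failure. In Conductor terms
    (loosely), a failed attempt would surface as a failed task status and
    the server would reschedule it; here we simply loop in-process.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(*args)
        except Exception as exc:
            last_error = exc
            time.sleep(backoff_s * attempt)  # simple linear backoff
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

# The worker itself only encodes the happy path; transient failures
# (say, a reclaimed spot instance) are the orchestrator's problem.
def encode_image(fmt, _state={"calls": 0}):
    _state["calls"] += 1
    if _state["calls"] < 3:  # simulate two transient failures
        raise IOError("spot instance reclaimed")
    return f"encoded-{fmt}"

print(run_with_retries(encode_image, "png"))  # succeeds on the third attempt
```

The point is not the loop itself but where it lives: once retry policy moves out of every service and into the orchestration layer, the workers stay small and the flow stays easy to change.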
Starting point is 00:12:53 And this may sound funny, but was it an immediate success in terms of adoption or did you have to, because sometimes with those things, it's like changing, even if technology makes things better, like adoption can be difficult or like, oh, we don't necessarily want to change the way that we do this. Or like, it's actually work to migrate our whole thing. Like, how did that happen culturally inside of Netflix?
Starting point is 00:13:16 Actually, that's a very interesting question, because Netflix famously has the culture of freedom and responsibility, which means, you know, you don't have mandates to say, hey, you use this framework. Right. You know what frameworks are there and you choose what you want to use. Sure, yeah, like self-serve with all these options. Exactly, yeah. And nobody's going to tell you why did you choose this, right? It's up to you to decide to get the job done. So that becomes a very challenging thing,
Starting point is 00:13:39 that you can't just build something and get a VP or director to go and send out an email to everyone saying, we built this fantastic new framework, everybody must use it. Not going to happen. So it needed us to go and talk to everyone, right? So one approach that we took right from the beginning was that we ourselves
Starting point is 00:13:56 were developers, right? So we understand the pain points developers are having. So we built it very much like a democratized version, right? We did not have a product manager. We made a very conscious decision that we don't want to have a PM kind of shepherding the product, but rather let's talk to engineers. What do they want? So, like, every feature that we built was out of a necessity and a recommendation or a need from another engineer.
Starting point is 00:14:19 And that was one thing. The second thing was, we kept it very agile in terms of its development. Rather than trying to think we have to build a very perfect system from the get-go, we made it functional, and we built the resiliency and everything kind of along the way as we were testing with the internal users. That was another way we kind of tried to evangelize within Netflix itself. And of course, like, you know, there were always kind of skeptics, or people who wanted a different way, right? And we tried to keep it as open as possible. That was one thing. So the side effect of that is, like, if you look at the current repository also, it's a very flexible system.
Starting point is 00:14:56 It's pretty much plug and play. You can plug and play like it's a Lego block, right, in some ways. And that was one of the reasons why it turned out to be like that, because, you know, we wanted to be able to satisfy as much as possible. In some ways you can think about it, right, that it increases the complexity and effort, but the advantage there was that, you know, everybody felt that they had a stake in the game, right? Most of the thing was to get them invested in the product, and then you have someone who is happy. Super interesting. Okay, I lied.
Starting point is 00:15:27 I have just one more question. I promise, Kostas. Because it's always fun to hear about how these projects sort of form inside of a company like Netflix with such an interesting culture. I know that microservices were a catalyst for building Conductor to sort of make them easier to interact with. But it wouldn't surprise me if actually microservices proliferated at a higher rate after people started using Conductor because it was easier to manage a lot more. Did you see people building a lot more microservices? Absolutely. Because see, what happens is that if you don't
Starting point is 00:16:06 have something like Conductor, then you tend to take the shortest path, especially when you are under time pressure or you have to deliver things. If you were to build something with a complicated business flow, building five microservices, writing orchestration logic on top of it, and
Starting point is 00:16:22 making all of these things work is more time and effort, versus putting everything into a monolithic block and getting it out. So in some ways, Conductor kind of encouraged people to, like, you know, break it down, because now you have another side effect of it, right? The moment you break it down, you have a lot more composability
Starting point is 00:16:37 to be able to change flows and everything. So that was one thing that really kind of inspired people to do that. The other thing was that we built two critical aspects that everyone wants, right? Traceability and controllability, right? Like you can actually trace the entire execution graph visually and see the graph. That just turned out to be a sleeper hit for us. Like I had never thought that this is going to be the killer feature.
Starting point is 00:17:03 We thought the killer feature was going to be distributed scalability, but no. The killer feature was that UI. People loved it, because I could see exactly where things went wrong. Because that's the problem you face. You would want to go and look at the logs everywhere and see what's going on. Here, I have a UI, and I just click on it and say, this is what's wrong. Go fix the code,
Starting point is 00:17:18 retry, and yeah, it works. So some of it was like that also. That just encouraged people. Now if you use Conductor, you get these features that you otherwise wouldn't get, right? And that encouraged people to write more microservices. So, Viren, I have a question about the open source side of the project. How soon after you developed it did you open source the project? I think it was about a six-to-eight-months journey. Like, we took it to the place where it had enough features that, you know, it did not look like a toy project.
Starting point is 00:18:03 We also had to decouple everything from the internal, non-open-source side of the world. And we wanted to put together some amount of governance process also, right? Like, my team, we did not have any open source product that we were managing ourselves. So we had to kind of figure it out, learning from other teams in Netflix, right? So yeah, overall,
Starting point is 00:18:15 it took us about three quarters. And from the day we decided that we were going to open source it, it took about a couple of months to get everything ready, right? Legal reviews, patent reviews, and all of these things. Yeah, makes sense. And how long did it take after you open sourced the project to start getting engagement and creating a community of adopters of Conductor out there?
Starting point is 00:18:39 I would say, you know, what I've seen is typically, you know, you have this initial bump, right? Like open source people are excited. They want to try it out. So there was this initial bump. Then it starts to taper off because, you know, there's nothing new there. And it kind of stayed there until very recently, I would say. So what was
Starting point is 00:18:56 happening was that, what we had done was, like, we were doing meetups at Netflix about Conductor, and we were also talking about Conductor at other meetups and everything. So, you know, as we kind of talked to people, it started to kind of grow the momentum. The other thing was, like, you know, if the community
Starting point is 00:19:12 is always the consumer of the product, right, it does not grow the community well, right? Like, we kind of also made it much easier for people to contribute back to Conductor and once people started contributing back, right, it started to grow further. Because now, again,
Starting point is 00:19:27 they have a stake in the game. They have the ownership in the product itself because they have contributed. Of course. Makes a lot of sense. And do you have any use cases that came out of the open source community that surprised you?
Starting point is 00:19:44 Yes, absolutely. So one use case that, if you ask me today, right, without learning about this, I would not have ever thought about in my wildest dreams, right, was security. Oh, okay. People using Conductor
Starting point is 00:19:56 to orchestrate the security flows, things like threat detection. Take, for example, let's say you upload a file in an S3 bucket. Typically, you want to run some processes and checks to ensure that you didn't upload a secret, by mistake or on purpose or whatever, right?
Starting point is 00:20:11 Or that there is not a virus. And you are going to run a bunch of workflows around it, right? Some automated, some manual. And this is all done by folks who are into the security space, not necessarily writing microservices, but Conductor turned out to be a very good fit for them. So this is one thing that surprised me, that
Starting point is 00:20:29 there's a strong use case here that I had not thought about. Yeah, that's so fascinating. I would never think about it. Exactly. But the more I think about it, these are long-running flows, right? It might take some time to scan an object.
Starting point is 00:20:45 If people are putting thousands of objects in an S3 bucket, for example, right, it may not get real-time treatment. So, you know, you have to have a backlog. And if you find something, maybe somebody has to do a manual intervention to verify, right? Then you have a human process involved in it. You send an alert or something, wait for someone to reply. So all of this flow together becomes a pretty good use case. That's what I realized.
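[Editor's note: the kind of security flow sketched here — run automated scans on a new object, and route anything suspicious to a human — is easy to picture as a small pipeline. The checks below are toy stand-ins; all names are invented, and nothing here is a real Conductor workflow definition.]

```python
def scan_for_secrets(obj):
    # Toy check: flag anything that looks like an embedded credential.
    return "AWS_SECRET" in obj["body"]

def scan_for_virus(obj):
    # Toy check: flag the standard EICAR antivirus test marker.
    return "EICAR" in obj["body"]

def request_manual_review(obj, findings):
    # In a real flow this step would page a human and block until they
    # respond; here we just record that a review is pending.
    return {"object": obj["key"], "status": "pending_review", "findings": findings}

def security_flow(obj):
    """Run the automated checks, then branch to a manual step if needed."""
    findings = []
    if scan_for_secrets(obj):
        findings.append("secret")
    if scan_for_virus(obj):
        findings.append("virus")
    if findings:
        return request_manual_review(obj, findings)
    return {"object": obj["key"], "status": "clean", "findings": []}

print(security_flow({"key": "report.txt", "body": "quarterly numbers"}))
print(security_flow({"key": "oops.env", "body": "AWS_SECRET=abc123"}))
```

What makes this a natural orchestration use case is exactly what is said above: the steps are long-running, some are manual, and the flow has to wait on humans — all things you would rather not hand-roll inside every service.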
Starting point is 00:21:06 Yeah, makes sense. Okay, I'll probably get back to more open source and company-related questions a little bit later. But I'd like to discuss with you what orchestration means. It's a term that is being used a lot in software engineering, and not necessarily in every discipline the same way. We have the orchestration that data engineers are talking about.
Starting point is 00:21:37 We have orchestration that has to do with microservices. We have workflow orchestration. Then we have orchestration on Kubernetes, and probably like many other types of orchestration out there. Can you help us define, let's say, a taxonomy around orchestration tools out there and understand better the differences between the different tools and the use cases?
Starting point is 00:22:03 Yeah, absolutely. I mean, as you say, right, orchestration is an overloaded term, right? Like it has different meanings to the people and use cases, right? And, you know, having spent some time in this space, right, like what I realized is that, you know, essentially, if you look at it from the persona, right — like, who is looking at the word orchestration — there's kind of a different meaning to it, right?
Starting point is 00:22:23 And, like, if you go top to bottom in a company: if you look at people, let's say, on the business side, right — business analysts, product managers — who are dealing with the business processes at a high level, right? For them, essentially, when they think about orchestration, they are looking at how the various, you know, business components are getting orchestrated. So in an e-commerce company, this might be how am I orchestrating between my payment and shipping and delivery and tracking systems, or fulfillment services, and things like that. But again, this is at a business level. And again, when you think about also measuring the output of the orchestration — SLAs, any other key metric that they might be defining, right — they look at it from that perspective also, right?
Starting point is 00:23:09 Like the time it takes to complete certain activities, mean time between failures, and things like that, right? How often they fail, and where are the optimizations that they can make based on those data points, right? They are not the ones who are actually building the systems. That typically goes to the backend engineer. And when they describe the same flow to the backend engineer, essentially, for them, you could think about these individual things as kind of either microservices or other services, which, you know, for the lack of a better word, right, we can just call microservices.
Starting point is 00:23:39 And for them, when they think about orchestration, right, it's about: I have a bunch of microservices and I have to build a flow around that. How do I build that? But now, what I am looking at as an output from this is how do I handle certain things. Like, for example, you know, in a distributed world, those services are going to fail. Services are going to, you know, have different SLAs. How do I handle failures, retries, different SLAs across them?
Starting point is 00:24:04 I want to be able to run some things in parallel so I can optimize, you know, the time it takes to complete the entire process, or the resources and everything. How do I achieve that? And if I'm doing some things in parallel, how do I wait for them to complete? Because, you know, you can't just always do that otherwise, right? So that's how a backend engineer is thinking about orchestration. Then, if you take it down one level, right, in terms of the platform side of it, and you zoom into, let's say, an individual microservice: this microservice is typically getting deployed onto a container these days, right, or a VM, or even a bare-metal machine somewhere. But, you know, you are not deploying one thing, right? You are deploying kind of a whole bunch of things. And typically
Starting point is 00:24:48 you don't only deploy a service; you deploy a service with, sometimes, at least in the initial phase, right, some more semantics around it, right? Like the networking configuration, databases, and everything. And that starts to get into more of the continuous deployment side of it, right? Which is where container orchestration, for example, has become very mainstream, with Kubernetes, and Argo is another one, for example, there, right? Where essentially it doesn't matter what you are doing; it's your piece of code that you are deploying, and it's scaling out and scaling down, right? And that's what you are focusing on. That's another level of orchestration — container orchestration — that
Starting point is 00:25:20 is happening. And just to go back to the backend engineer, right? Like there are different flavors of backend engineers also, right? You have backend engineers that are working with product managers to build an application.
Starting point is 00:25:30 You also have data engineers who are dealing with the massive amount of data and orchestrating, right? This is where things like Airflow,
Starting point is 00:25:39 we are seeing that, at Netflix, Conductor is being used for a similar purpose, where you have data sitting in different places and you are essentially
Starting point is 00:25:46 orchestrating that, right? In a batch world, right? Like, you know, you are processing data, aggregating that, putting it into database, maybe training some machine learning model,
Starting point is 00:25:54 making inferences, putting it into a database. The whole thing is basically a flow. An offline flow is very well orchestrated through something like Conductor or Airflow
Starting point is 00:26:03 and similar systems. And a slight variation of this is kind of real-time data platforms where you still have flows. If you think about, let's say, I click on a button in my phone or a website, and you are sending out a signal, an analytics data point back to the server. Now, this has to go through kind of a certain journey. Like you are waiting for it to do a streaming aggregation, but once it is aggregated, it goes through maybe a couple of other systems where either it is being used to do either further aggregation, get it into
Starting point is 00:26:35 more of an analytics store, or maybe you are doing kind of real-time model training through machine learning, right? So that's kind of another flavor where there's no start or end of a workflow. It's continuously running pipeline, but you have a complicated flow that is built out. I think Kafka or Confluent has some tooling around that, but I think that seems to be right now still a very wide open space. I would say it's still an unsolved problem. Yeah, makes sense.
Starting point is 00:27:02 So just to give like an example, because we have like our audiences, like primarily like people that are working with data and they are data engineers. What's the difference between like a system like Airflow and Conductor? And why I wouldn't use,
Starting point is 00:27:20 let's say Conductor and I would prefer like Airflow in order to orchestrate my pipelines. Absolutely. So like if you think about Airflow, right, from its genesis and the kind of music it solves, it's mostly about data, right? Like typically, you know, you have data sitting in different buckets or databases like Hive and you are processing, right? And these are typically batch jobs that you run on an hourly basis, maybe twice a day,
Starting point is 00:27:46 three times a day, or daily, and things like that, and runs through the data pipeline. The other important part of a data pipeline also is kind of the dependency management, right? Like, you have to run in a specific sequence because, you know, your data at a given step depends upon the previous step. Also, re-reliability in the context of data is very different from a
Starting point is 00:28:06 real liability in a microservices world right when you think about re-running some data you are essentially running data for that particular date or a time frame right and you're only processing that data alone you're never posting the latest data so that's that's kind of high level use cases that i've seen airflow being used for and it does well, right? Also, if you think from the users of Airflow, right? These are mostly people dealing with data and the language of the choice today for that is Python. So, you know, Airflow DAGs
Starting point is 00:28:33 are written in Python and they tend to be simple in nature, right? Like you have sequence of things that you do, sometimes you fork and you are done with it, right? These pipelines are very stable, very fixed. You don't change them every day. You don't do A-B testing of this pipeline. It doesn't make sense to with it, right? These pipelines are very stable, very fixed. You don't change them every day. You don't do A-B testing of this pipeline. It doesn't make sense to do that, right?
Starting point is 00:28:49 Connector kind of goes into the next stream. It's more about flows which could be running for a very long period of time, say months at the end, or a very short one where you complete the entire flow in a few hundred milliseconds, and everything in between, right? But instead of running a few executions a day, you could be running few hundreds of millions of executions per day or even a billion executions per day depending upon your
Starting point is 00:29:12 use case. So the scale side of it is very different. At the same time, a typical workflow is operating on not petabytes of data. A step in the workflow is typically dealing with a finite set of data. You know, a step in the workflow is typically dealing with a finite set of data. And sometimes you do, like for example, one use case that we had on Netflix
Starting point is 00:29:30 was the processing the video files and a raw video file could be petabyte in size. Yeah. In that case, you know, you have to be processing for a longer period of time. The other thing is that
Starting point is 00:29:41 connector is very general purpose and it is meant for pretty much the entire spectrum of audiences right so it's very much language agnostic
Starting point is 00:29:49 you know we had workflows where one step was written in C++ another in Python and third one in Java and so forth so it allows you
Starting point is 00:29:56 to kind of mix and match depending upon the owner of the step in the process so connector becomes very useful in this kind of scenarios where, you know,
Starting point is 00:30:06 you have a very heterogeneous environment and the scale is another thing. That's very interesting. And it's like something that, as you were talking about Airflow and Python, like I wanted, like, it came up like as a question to me. So, Conductor is written in Java, right? Yep. How is it is I mean you gave an answer but
Starting point is 00:30:29 I would like to hear a little bit more about this like how is it like for example for a team that is primarily using Golang for example to create microservices to employ like an orchestrator that's written like in Java because probably you're not going to have
Starting point is 00:30:46 a team there that knows java right yep how does this work and which team is usually like responsible for like the platform team like is it who is responsible for managing deploying and taking care of the orchestrator so let me ask you the first question right so like i think the way connector works essentially is it exposes its API through HTTP and GRPC, and that's how it becomes kind of language agnostic, right? So let's say if you are a Golang shop, you are writing your microservices in Golang,
Starting point is 00:31:15 and you are building your orchestration flow in Conductor. Conductor also provides client API. So there are two parts to Conductor. You have a client or SDK and the server side. Server side is in Java. SDKs are written in different languages. parts to compile your client for sdk and the server side server side is in java sdks are written in different languages so i think there are three right now java python and golang so you know you use that sdk in that particular language to interact with conductor and grpc is great where
Starting point is 00:31:37 you know if you want to bindings for rust for example right you can do that using grpc compiler so that's kind of how it works today. And that's why it is language agnostic, because the entire model is that way. The second part, who runs the network, it's a very interesting question. I think I have seen kind of both sides of it, in the sense that where there's a platform team that is responsible for running conductor. This was exactly what we were doing at NetRace, and my team was responsible for managing as a platform for that is responsible for running conductor. This was exactly what we were doing at Netflix and my team was responsible
Starting point is 00:32:05 for managing as a platform for all the teams. But that was a model at Netflix. You have a platform team and this tends to be a lot more common in tech companies where you have a platform team responsible for all the components and everybody else uses that. We have also seen the other side where you
Starting point is 00:32:22 have business teams that owns the entire stack by themselves and then they are responsible for running conductor on their side like so i think it in some ways goes back to the culture of the company right how they are formed and you know what's their kind of usage model for all the maintaining the products how that works yep yep that's that's interesting and probably not solved yet. As you said, it also has to do a lot with the culture. I hear a lot about
Starting point is 00:32:51 platform teams, but it doesn't mean that every company has a platform team. You can just wake up one day and be like, let's have a platform team now. I mean, there are two challenges. Building a good platform team is not easy. Hiring the platform is even more difficult.
Starting point is 00:33:09 Like, hiring engineers are difficult. Now we're talking about hiring platform engineers. Like, it's made it exponentially harder to, you know, build a team, right? Yeah, yeah. I think, in reality, like, what really works well is that, like, you know, if you treat your platform team as a mini cloud team in your organization, right?
Starting point is 00:33:24 So today, for example, if I want to use an RDS database from AWS, I can go to console provision one for myself and start using it. And AWS takes care of everything else for me, right? Provisioning, backup, restores, and everything. So if you end up building
Starting point is 00:33:39 a platform team that can get to that stage where, let's say, any product for that matter, not just conductor, that they are able to offer in a self-service mode and they focus on building that platform out of it, right? But then again, cloud companies are offering more and more of this thing, right? So, you know, the line between the
Starting point is 00:33:55 internal platform team and a cloud provider becomes thinner and thinner day by day, right? 100%. Alright, so let's go back to building a company right so you open source the project you started having like some traction out there
Starting point is 00:34:12 and at some point you decide like to build a company around the core technology my first question is when we are talking about an orchestrator who is going to be interacting with microservices, and as you said, there are use cases there
Starting point is 00:34:30 where you might want to run millions of interactions and the latency should be super low, right? How do you build a cloud service around that? And you make sure that the microservices themselves that the company is building are like sharing the same resources let's say or the same networks or like
Starting point is 00:34:53 all the stuff that's needed there to make sure that the latency like remains as low as possible. Yeah, I mean, I think the key to that is essentially your deployment model, right? Like how are you deploying those things? Like, you know, lower the latency you want, right? You want to be as much co-located as possible. So like essentially what we have done is like,
Starting point is 00:35:11 we essentially have built out like two different models of deployments where one deployment is where you are, you have kind of connector running in a separate VPC and your microservices in a separate VPC and your VPC peering that allows you to kind of communicate with each other. And you try to kind of keep the affinity between availability zones. So, you know, your network does not go through very heavy kind of hops.
Starting point is 00:35:35 The second model where you want really low latency, right, is sometimes to kind of deploy conductor inside your own network as close to the microservices as possible reducing kind of network because now we are talking about in a few tens of millisecond latency differences right as possible or even embed like the beautiful thing about connector is that you know it can run in a cloud environment handle billions of flows if not billions every day or you can also embed it inside where now you are running with a very low memory footprint, pretty much running in the customer's edge environment, right?
Starting point is 00:36:11 Like small deployments. So, you know, you essentially have to kind of make those set of decisions to figure out what are your requirements and how you deploy that. And to be honest, like this was something that we had to kind of think it through and figure out exactly how this is going to work and come up with a solution there. But this always is an interesting challenge to solve. So how is the product experience different between the two deployment models? And the reason I'm asking is because Eric is probably aware of
Starting point is 00:36:44 that because of Rudderstack there were multiple different deployment models although for different reasons there it wasn't that much like the performance it was like in many cases had to do with like let's say compliance but building like
Starting point is 00:37:00 a product that has like consistent experience regardless of the deployment model like like super super hard so how do you approach this problem i would say like when i think about the end users of the system right like i would say there are two groups right so there are the engineers who are actually using the product to build the applications, right? For them, there should be no difference. Like, you know, they are still dealing with the same set of APIs and same set of constructs and everything. You know, if they were to go to UI, you have URL where you go to UI and look at your workflows
Starting point is 00:37:35 and manage everything, right? So that experience must be consistent no matter how things are deployed. The second set where actually it matters is the people who are actually responsible for the operational aspect of it and this could be a platform team devops sres and this is where i think the key difference comes right and it's very similar to running a relational database in a in a vm that you have provisioned yourself which is running something like rds right yes where a fully hosted service gives you an experience where essentially you don't even need that thing.
Starting point is 00:38:06 You know, things are taken care of for you, right? Like it scales automatically for you. You don't have to worry about backups, what kind of database I should be using if I need to get this performance or whatever or not. You specify how much capacity
Starting point is 00:38:19 that you want to run with and system kind of scales for you, right? The other option, essentially when you are running in your environment, right? You are also making those decisions by yourself now that, you know, how big the instance should be, where should my backups be? And if something goes wrong, how do I restore from the backup? And I'm also responsible for my costs now, right? Like I can't just run a thousand node cluster without having that show up on my annual or monthly
Starting point is 00:38:43 billing from AWS or GCP. So I think for those people, I think the experience becomes slightly different. I think ultimately the goal is still to be kind of make it easy in terms of the UI interactions like the console side of
Starting point is 00:39:00 the world, right, where if you are dealing with the conductor console in the cloud to say, you know, provision me this cluster, this is where my backups are, and restore and everything, it should be as frictionless as possible. What's it say? Oh, here's a backup, you download it, run this command,
Starting point is 00:39:15 just make it available. So that part is, I think, where the challenges are. Yeah, makes a lot of sense. And how, I mean, you started, you were going through like your journey and you talked like about the differences between like working in the financial sector and then going like to a B2C company like Netflix. But if we take it like from Netflix until today, like you were at Netflix working like
Starting point is 00:39:44 with your customers being like inside Netflix obviously then you open shorts the project and suddenly like you had like a much more open let's say platform to experience there because people were like stuck using and giving feedback and now you did another step forward and you started the company so how does it feel and like what's the difference between like these steps that you have to go through i think yeah i think there are some things which are common like for example you still care about the community you are still kind of working with the community trying to build the community and grow the community. That part does not change much. Product, in some sense also, you have the same amount of focus whether you are internal or external.
Starting point is 00:40:32 One key difference here is that internally, typically, sometimes you have some other pressures in terms of I need this feature because we have this thing that is coming up. So you prioritize. As a company, your prioritization has a different kind of way of thinking about it, right? It could be depending upon the customer pipeline, the features, and things like that. The second thing is that as you kind of build a company, right, like you cannot just think about product alone. Now, go sort of think about everything around it, right? Like about the company, your investors, your customers,
Starting point is 00:41:07 and especially in a startup environment, right? You are the engineer, you are the customer support person, you are the marketing person, you are the revenue officer. You are playing pretty much all the heads, right? So, you know, you are doing probably 10x amount of work. And you also have to make money, which is also like an important... Absolutely. Absolutely.
Starting point is 00:41:29 All right. That's awesome. One last question for me, and then I'll give the microphone to Eric, which is like a little bit of like a technical question, but we talked about orchestration and we're talking about like orchestration of microservices, right?
Starting point is 00:41:42 There is like another kind of, let's say like computation platform or model that's becoming more and more popular lately, which has to do with edge computing, where you have, let's say, these functions that are pushed to the edge and executed from there and all that stuff. Do you see any kind of opportunities there for orchestration platforms to work in such environments? And if yes, how?
Starting point is 00:42:12 Actually, that's an excellent question. And to be honest, there is a huge opportunity there. And the reason is this, right? That what is happening is that as kind of, and this is again, like, you know, my interpretation, right? So I could be a hundred miles off from the reality, but hardware has become a lot more powerful, right? So there's a lot more opportunity to push a lot of processing closer to where the customers are. And this could be, for example, in the embedded devices,
Starting point is 00:42:38 right? Where like, you know, you are not running in cloud, the whole thing runs on a customer environment, like sometimes on premise, for example, right? And if it does one thing, that's fine. But usually, again, you have multiple things that you are coordinating and officiating against, right? And
Starting point is 00:42:54 the concerns around reliability, fault tolerance, detriability, failure ending, they do not go away, right? It's because you push it to customer environment. But it also now is sharing that you have much less visibility and control over this environment. So you want this to be even more reliable and be able to handle more
Starting point is 00:43:12 failures compared to anything else. So in some sense, that's a huge opportunity. And at the same time, there are some constraints, right? Like, even though hardware has become powerful, you are still constrained with the memory, for example, right? Put a lot of components that you can load because it's also running other things, right? It's not just doing orchestration. But to be honest, we have seen
Starting point is 00:43:31 use cases for Conductor in this space, and there are some customers using it in that particular area. Oh, that's very interesting. Do you feel like there are also changes that need to happen in how, let's say, an orchestrator is architected in order to be more efficiently working with the edge computing environment? Or we are fine
Starting point is 00:43:54 with how Conductor was designed and implemented so far? I mean, no, it needs some changes. For example, a few things, right? Like, you don't want, like, if you're running a cloud, right, you can have a Cassandra and Elasticsearch and Redis
Starting point is 00:44:11 and a few other components working together, right? And that's completely fine because, you know, you have all these things at your disposal and you can orchestrate that. The moment you put it in the edge environment, you want a lot more self-contained systems, right? So, you know, you are kind of almost kind of going back to drawing board and see, you know, what are the bare minimum components that you want a lot more self-contained systems all right so you know you are kind of almost kind of going back um to drawing board and see you know what are the bare minimum components that you want
Starting point is 00:44:28 what can be run as an embedded mode find the alternative and plug it plug in there right one advantage that we had with connector was that because it was designed as a modular system from the beginning it just made it possible to say okay we cannot use elastic search because it's just too expensive or not possible to run in an edge environment. It should be replaced with this embedded database, right? And you would implement those interfaces and get it done. That was an advantage, but as you say,
Starting point is 00:44:53 it requires changes, right? Yeah, awesome. Yeah, that's very interesting. Hopefully we'll have the opportunity in the future to talk more about that stuff. Eric, all yours. This has been such a fun conversation. Unfortunately, we're really close to time here, so we only have time for one question, although we know that I always lie about only having one question. But, you know, I think in many ways, a lot of us who work in the tech industry, you know, sort of being at a company
Starting point is 00:45:20 like Netflix, being instrumental in building a technology that sort of solves a major problem, and then goes on to be open source. And then I think, you know, for some of us who are, who are, you know, entrepreneurial in nature, like actually starting a company on that, I mean, A, congratulations, that's, that's really just an incredible journey. But B, I think, just it's sort of an aspirational story for a lot of us, right? Do you have any advice, you know, for people who say, I mean, that's sort of like the pinnacle of, you know, the experience of being involved in engineering and cool open source projects and solving problems. Like, I would just love for you to talk to maybe some of our listeners who are like early or mid in their career and give them some advice that you learned along the way.
Starting point is 00:46:04 Yeah, sure. I i mean i'm still early in my journey so we'll see you know how that kind of ends up being but like here's what my thought process was right like that you know you can keep doing the same thing and keep polishing like you know you can go from netflix to google to you know somewhere else like meta for example and keep uh doing those things right but in the end the way i think about it is that like you know unless your career progression gives you a kind of a step function right it's not worthwhile and you ought to look for those step functions right so yeah and you know that could be for example learning new technology you know coming up with
Starting point is 00:46:38 some new frameworks evangelizing those things and what's the next thing after that right like it's maybe to prioritize that right and see how it works and it's the next thing after that right like it's maybe to prioritize that right and see how it works and it's a very different kind of a experience right like there's one thing about building a product where you know you are dealing with your id and compiler and breaking your head with you know bugs and so many different ball game as compared to that in terms of how do you go about raising money right because now you don't know actually even before that right you would start a company by yourself, you have to also find a co-founder. So first you have to
Starting point is 00:47:08 convince your potential co-founder to say, hey, this is a great idea. Once you convince them, you have to go and find an investor, especially in the enterprise world, right? Like you can't go swap, you need some outside investment. How do you kind of show the value that what you are building makes sense? You have the right
Starting point is 00:47:24 skill sets and pedigree to kind of go and build this out. So that's kind of the, like, you know the story, but how do you tell the story in a compelling way, right? That's the other part. And then finally, once you have that, right, how do you kind of go around building this out? Like, where are you going to hire people, right? How are you going to scale and all of those things, right?
Starting point is 00:47:43 And how are you going to find customers? What's the go-to-market strategy and how do you actually implement? It starts to kind of get into that, right? So it's a very rewarding experience. Like to me, when I was thinking about it, what I realized was that no matter what's the outcome, I'm going to come out on top. Like that learning is going to be valuable.
Starting point is 00:48:01 And either way, it's going to be super useful in the career yeah that's I think that's such good advice for all of us knowing that no matter the outcome if you learn that's the ultimate progress so thank you for that and thank you so much for really helping educate us on orchestration
Starting point is 00:48:20 sort of as a category and all of the differences there we had a great time on the show. Thanks for joining us. Yeah. Thank you so much. Thanks for all the insightful questions. Yeah, it was really fun.
Starting point is 00:48:30 I don't know if I have a really insightful technical or sort of data-related takeaway from this show, so forgive me. But I just think it's really interesting to think about working on an infrastructure or on infrastructure at Netflix while they're transforming the company from being a sort of content, you know, distributor to being a content producer. And it was actually fascinating to hear about that problem described through the lens of microservices, right? I mean, you wouldn't think about, you know, in like a Harvard Business Review case study of like Netflix's pivot from, you know, distributor to studio,
Starting point is 00:49:13 like they're not going to talk about microservices, but that actually was a real pain point as they were making the transition. And so I just really appreciated that perspective. You know, you wouldn't really hear about that particular specific flavor of technical challenge in the process of a transition like that. So it was really fun to get to get an insight there. Yeah, and it seems like Netflix is one of these companies that they are really fueling the next wave of innovation right now. I mean, there are like a couple of different products and companies. They are actually coming from Netflix, which is great.
Starting point is 00:49:51 And it's like super interesting to see all these people, how they were together in Netflix and now they're out there in the market and building companies and creating new products. So they definitely did something right. And I guess the Harvard Business Review should look into it at some point. But outside of this and all the very interesting conversations that we had for the technical details of orchestration,
Starting point is 00:50:21 I think one thing that I'll keep and I would really love to learn more about is edge computing and orchestration. I think one thing that I'll keep and I would really love to learn more about is like edge computing and orchestration, which is still like something it's early for this kind of technologies. But I think we are going to be hearing more and more about that like in the future. So that's another thing that I'm keeping from the conversation we had. For sure. And if anyone is listening who is with the Harvard Business Review,
Starting point is 00:50:45 it's a little bit abnormal, but we're happy to do a cover story if you're interested. So definitely hit us up and reach out to Brooks if you want to talk about that. Lots of great shows coming up. We will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers.
Starting point is 00:51:24 Learn how to build a CDP on your data warehouse at rudderstack.com.
