Software at Scale - Software at Scale 6 - Distributed Systems with Indranil Gupta
Episode Date: January 16, 2021

Indranil Gupta (Indy) is a Professor of Computer Science at the University of Illinois, Urbana-Champaign. He leads the DPRG (Distributed Protocols Research Group) and runs popular Cloud Computing MOOCs on Coursera. His work has inspired software that runs in many production services, like Serf in Nomad by HashiCorp, and Uber's Ringpop.

Apple Podcasts | Spotify | Google Podcasts

We discussed how academia drives progress in distributed systems, a bit of blockchains, distributed systems and machine learning, doing quality research, working as a visiting researcher in industry, cluster schedulers, and how to choose between going into industry vs. academia.

Highlights

0:00 - Going into academia. "Make a list of pros and cons, throw away that list, and go with your gut"
6:30 - The acceptance of distributed systems in the early 2000s vs. today
7:30 - The emergence of blockchain and how the world treats it. "Re-looking at the wheel vs. re-inventing the wheel"
12:30 - Differences in computer science research in industry and academia, especially with distributed systems
21:00 - The inspiration for solving the reconfiguration problem came from open source bug reports. "Directions that seem daunting for folks in industry since the solutions are unclear can be tackled by academia." Making progress towards solving the online reconfiguration problem. The similarity of research to intern projects
25:30 - How to pick an area to do research in
31:00 - Writing a good paper is "90% good communication and 90% good ideas"
31:20 - SWIM - a paper that's had an outsized industry impact due to good ideas that were well explained
38:00 - "What are you excited about in distributed systems today?" - Distributed Systems + X: machine learning, agriculture, etc. For example, dealing with malicious workers in distributed machine learning
47:00 - Training and inference with new machine learning systems like GPT-3 becomes a distributed systems problem. Partitioning of data in TensorFlow
56:00 - How can industry and academia collaborate so that industry produces more research?
66:00 - What Indy learned by working in industry for a year (at Google). Borg at Google (a predecessor to Kubernetes). Omega + Mesos
73:00 - Advice for a student evaluating a choice between academia and industry

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Thanks folks for joining the Software at Scale podcast.
I have Professor Indranil Gupta, also known as Indy, on the show today.
And Indy is a professor of computer science at the University
of Illinois at Urbana-Champaign. And he leads the DPRG group, which is the Distributed Protocols
Research Group. And the BOCCE Cloud Center. What does BOCCE stand for?
The BO stands for blue and orange. And that was a cloud center that I started several years ago.
It's a bit defunct right now.
So there's more efforts right now, including a new center called Just Infrastructures that I'm working on right now.
Okay.
And blue and orange is the color of the University of Illinois.
And that's where I studied.
And I had a class with Professor Indy four years ago. And that's where my interest in distributed systems definitely increased, especially through the projects that we had.
So thanks for being a guest on the show today.
Glad to be here.
Yeah.
And so you do a bunch of different things.
You teach, but you also are, of course, a researcher.
And you also run an online MOOC on cloud
computing and distributed systems. And just in general, you've been in academia
for a long time, right? You did your bachelor's and then went straight for a PhD,
and that's where you worked on distributed systems. How did you make that decision of
going into academia versus going into industry, especially in a field like distributed systems, where it seems like a lot of companies, at least initially, like Google, Facebook, not Facebook at the time, but definitely Google were working on large systems.
How did you make that tradeoff?
How do you think about that?
So there are two parts to my answer. First off, when I was an undergrad
in IIT Madras in India, I decided to take up research in my junior year, just on a lark,
as something that I just wanted to try. And pretty soon, you know, I fell in love with just the creative aspect of research,
being able to come up with ideas and realize them in practice,
see them working in front of my eyes.
And also just the slightly competitive nature that is inherent in research
where we try to do better than what has already been done before.
So that's when I got into research and that made it fairly, I would say, straightforward
for me to apply for a PhD. When I applied to grad school, I had offers for
direct PhD programs and then these MS slash PhD programs.
And for me, the decision was obvious.
The direct PhD was what I wanted to go for.
And so that's the first part of my answer to your question.
The area of distributed systems is what I worked on from the very beginning.
And what excited me about distributed systems as a whole is the fact that you can have the same or similar code working
at different processes, different machines, different nodes in your system. And when you
have so many different copies of the same machine, what we sometimes call a replicated state
machine, right? You replicate the same state machine at multiple nodes, multiple processes.
The behavior of the overall system would be something unexpected. It would be what we call emergent behavior.
That was kind of exciting to me.
Just the feel of that was exciting to me when I was a very young researcher.
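The replicated state machine idea Indy describes can be sketched in a few lines of Python. This is a toy illustration, with an invented class and invented commands rather than any real system's code: the same deterministic state machine, fed the same ordered log of commands, ends up in the same state at every replica.

```python
# Toy replicated state machine: replicas applying the same deterministic
# commands in the same order converge to identical state.

class Replica:
    def __init__(self):
        self.state = {}  # the state machine's state: a simple key/value map

    def apply(self, command):
        # Commands must be deterministic for replication to work.
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "incr":
            self.state[key] = self.state.get(key, 0) + value

# The same ordered log of commands, applied at three replicas...
log = [("set", "x", 1), ("incr", "x", 4), ("set", "y", 7)]
replicas = [Replica() for _ in range(3)]
for r in replicas:
    for cmd in log:
        r.apply(cmd)

# ...yields identical state everywhere.
assert all(r.state == {"x": 5, "y": 7} for r in replicas)
```

The "emergent behavior" Indy mentions is what you see once real protocols sit on top of this: no single replica decides the overall system behavior, yet the cluster as a whole behaves in ways none of the identical parts do alone.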
Then fast forward to when I was finishing my PhD and applying to jobs, my mind was not made
up on whether I would go to industry or a research lab or become a professor.
And I applied to a bunch of companies and a bunch of universities as well.
It was during the process of interviewing that my mind kind of got made up in terms
of where I wanted to end up at. And that's why I stayed in research and stayed in academia.
So in the interview process, so what was it like, concretely, from these interviews that
you learned the kind of work that companies were doing at the time?
So yeah, so I think, for me, the trade-off was, and I've seen this many times as I've worked in companies,
like you mentioned, at Google as well, where I was on sabbatical a few years ago.
I've seen the same trade-offs in companies.
So it's not an apples and oranges comparison, really.
There are a lot of positives that working in a company provides to you
and some downsides. And then similarly in academia, there are a lot of positives and
then there are some downsides. So it's just a question of evaluating personally for whoever
is making the decision, which ones will... You write down the pros and cons. This is
advice I give to all the students who are
considering multiple job offers. Write down your list of pros and cons for your options. Think
about them very deeply and then throw away your list and go with where your heart tells you or
your gut tells you to go to because you want to be happy no matter where you go. That's kind of
what I did. And for me, you know, the biggest thing was that in academia, essentially, you do not have a boss
and you're free to roam the fields wherever your heart takes you.
And I just couldn't get away from that, just that freedom, the academic freedom of going
in whatever direction you want to go to.
Some of that freedom does exist in companies, especially in big companies and a few smaller companies where you're higher
up in the hierarchy, but largely you don't have the freedom that you do in academia anywhere
else.
Yeah.
So a lot of people go for entrepreneurship and you decided to go for professorship in
that process.
Yeah.
I mean, back in the early 2000s, and I'm old enough now to say that phrase back in the
old days.
Back in the early 2000s, distributed systems was still a very niche field.
Today, distributed systems is all around us.
But back then, when we would publish papers on topics
like consensus and we'd go to conferences, researchers who did core systems, operating
systems research, they would look at us and they would say, oh, there go those guys who
do this consensus or this strange kind of group communication research.
Fast forward 17 years or two decades, and distributed systems are all around
us.
Blockchain?
Well, that's consensus.
IoT?
Well, you're dealing with distributed systems.
You're talking about edge or the cloud?
Well, distributed systems.
So you cannot get away from distributed systems now.
It's really become a lot more mainstream.
So entrepreneurship back in those days in areas around distributed systems was much
harder than it is nowadays.
Yeah.
As somebody who's been in the field for a while, when you first saw the emergence of
a blockchain and a Bitcoin, what was your thought process?
Because I'm sure that would have been such a strange application of everything that you've
learned and studied for a really long time.
Yeah. I was not surprised by the emergence of it, personally, because maybe it's because of
how excited I am by the distributed systems area. I know that these ideas will eventually make it
into practice. I was not particularly surprised by the implementation of it either. The way blockchains
implement consensus is slightly different from the ways in which traditional consensus algorithms,
like Lamport or others who have worked on it for many decades, how they're implemented. It's just
a different implementation. What did surprise me was the surprise with which industry and the world in general would treat the area of blockchain.
It's like, oh, you know, consensus, this new thing, new kid on the block.
And a lot of us researchers who have worked on it for years are like, yeah, it's not really a new kid.
It's an old kid, but it's a pretty mature kid.
And I see that a lot, not just on top of consensus, but this kind of, I don't want
to say reinventing the wheel, but just re-looking at the wheel, which happens a lot in industry,
which I think is a very good thing.
You know, reinventing the wheel is a bad thing, but re-looking at the wheel is a good thing
because new generations and new workloads need new systems and old systems don't always
work.
Okay. So what are some examples of that? So I'm thinking with Bitcoin,
it's a different implementation of a consensus algorithm where the problem space is different,
where you treat every single actor as potentially malicious, versus with traditional systems, where most
of the actors are fine, but sometimes a node behaves badly because its system
clock is off or something like that, and you want to handle that.
What are some other things that you think the industry is re-looking at?
Yeah.
So a great example is databases and key value stores, right?
So relational databases have existed for a long, long time, and they've
provided very strong properties.
At the same time, in the late 2000s or so, folks in industry started realizing that some
of these databases are much slower and have way more features than they actually need.
And that's partly what led to the development of key value stores and NoSQL storage systems, because
they were faster, nimbler, and easier to change.
Of course, what's happened over time is that people have re-looked at even key value stores,
and they've looked at, oh, do we really need such weak guarantees?
Don't we really need transactional properties?
And so they've actually augmented key value stores
and brought them back into the transactional world.
And today, a lot of the databases that run,
even these new generation databases that run
actually support transactions
at high throughput, low latency.
So the re-looking at relational databases in the beginning
and then re-looking at key
value stores again actually led us to come full circle, but we didn't come back to where
we were.
We came back to a better place than where we were.
So, I think that's one of the advantages of not just academia, but even industry, re-looking
at things.
Yeah.
I think I can share a story from Dropbox, which is basically exactly this,
right?
We started with a MySQL database.
The database was pretty much running out of space, and we needed to figure out some way
of sharding it.
And on top of that, from what I've heard, every time a product engineer wanted
to add a new feature, they would have to run a really expensive migration
on this database with all of user data.
So somebody came up with this brilliant idea.
Why don't we come up with a client-side library
that just stores JSON blobs in the MySQL database?
So that way, every single migration
is just adding another field to this JSON blob.
And that's how we came up with our graph database,
EdgeStore, which is kind of inspired by Facebook's design. And then we could easily put that onto more
and more databases, and we didn't have to think about any one of them. But of course, eventually,
we needed transactions, and we needed cross-shard transactions, and then we rebuilt that.
And now we're thinking, can we come up with something that's from the ground up? Or can we just use something like Cloud Spanner or something like that?
Because there's so many options out there.
So, yeah, that's definitely what we've seen.
And I think it's definitely a good thing that industry is continuously re-looking at how to make developers more productive and how to have faster databases and all that.
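The schema-less pattern in this story can be sketched roughly as follows, using Python's sqlite3 as a stand-in for MySQL. The table layout and helper names here are invented for illustration; this is not Dropbox's actual EdgeStore schema. The point is that adding a new product field becomes a client-side change instead of an expensive ALTER TABLE migration over all user data.

```python
# Sketch of the "JSON blobs in a relational database" pattern:
# each row stores an opaque JSON document, so new fields need no
# schema migration. sqlite3 stands in for MySQL here.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, data TEXT)")

def put(obj_id, fields):
    conn.execute(
        "INSERT OR REPLACE INTO objects VALUES (?, ?)",
        (obj_id, json.dumps(fields)),
    )

def get(obj_id):
    row = conn.execute(
        "SELECT data FROM objects WHERE id = ?", (obj_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None

put(1, {"name": "report.pdf"})
# "Migration": a new feature just writes an extra key into the blob.
doc = get(1)
doc["shared"] = True
put(1, doc)

assert get(1) == {"name": "report.pdf", "shared": True}
```

The trade-off, as the conversation notes, is that you give up the database's own transactional and relational guarantees over those fields, which is exactly what pushes systems like this back toward supporting transactions later.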
How would you give advice to somebody
who's evaluating whether they want to go into
academia or industry?
Because both are doing meaningful work.
And if you browse subreddits today,
they'll say, oh, you don't need a PhD in CS.
You can get a job without that.
And I think
that's true, but what are we missing? What are the software engineers missing? What's your perspective?
Yeah. So I think folks who say that you don't need a PhD to do research are absolutely right.
Anyone can do research, really. So I see the difference, let me take a step back.
So I see the difference between what industry does
in terms of research and what academia does
in terms of research as being similar activities,
but with different goals.
In industry, one does research
so that one can improve the company's systems.
One can get better throughput, better latency,
all that stuff. So it's tied to whatever the end goals of that particular company are, whatever
benefits the shareholders of the company if it's a public company, or whatever will get them
towards more funding and an IPO if it's a private company or a startup in stealth mode.
What I described is kind of similar to what we do in academia too.
You know, we build systems
so that we are better than previous systems
with a key difference that we are doing it
for the purpose of science,
which means that the end goal of our effort in academia
is not just to build the system and then that's it,
but rather it's to go a little bit underneath
and understand the intellectual problems underneath this, the generic problems
underneath this, and the intellectual
contributions from this activity, from this project that we have, which may be useful
in other scenarios.
And this means that evaluation, experimental evaluation, both mathematical as well as real workload
evaluation on real clusters is really, really critical in academic research.
Because we want to evaluate, we want to push our systems to the extreme and see how far
they can go in terms of solving the problem that we want to solve, or perhaps even addressing
other things.
So, I think that's a very key difference.
Some industry groups, and especially research labs, do do science. But when the upper administration
in a company is looking at a research lab, they don't necessarily see the science as the first
class product from a research group. They see monetary value.
They also see visibility in the community,
which of course eventually translates to monetary value.
As scientists in academia, however,
we are funded largely by government funds
at least here in the US.
And so we are essentially producing results for, not just for humanity, but also for the
US taxpayer, which translates to ideas that perhaps can be used by other companies as
well.
So I think that's a key difference here.
So the way that translates is in a company, in a big company like Google, for instance, or Apple or Facebook,
when you write code, the researchers or the developers try to adhere to style guides.
There's a lot of peer reviewing.
In academia, less so.
Style guides are rarely heard of.
Peer reviewing is also fairly rare, though it's picking up nowadays.
My students sometimes ask me about this.
Why don't we do peer reviewing?
Why don't we have a style guide?
And the reason is,
one of the big reasons is
the more rules you impose on writing code,
the more it hinders creativity a little bit.
You want to build a quick prototype system
and then quickly pivot in an agile and nimble manner
without having to turn a big wheel.
You want to turn a small wheel in order to pivot a little bit.
So that's part of the scientific exploration of ideas
and the space of ideas,
which makes academia a lot more exciting.
It's a lot more solution-driven than industry.
So that's one of the big, one of the differences,
one of the key differences I see.
And there are, of course, others as well.
Yeah, it's a very libertarian mindset towards developing new solutions.
Yeah, but distributed systems is also an interesting space
where a lot of the large big data problems
are inside corporations.
So I've seen a lot of the research that your group does, you embed inside another company
like LinkedIn, and then you publish a paper together.
And then you kind of get best of both worlds.
Yeah.
Yeah.
I think that's right, that compared to 20 years ago, in the field of distributed systems, at least a lot more research today and a lot more publishable research happens in companies.
Not just big companies, but also, you know, smaller companies. If you look at any of these systems conferences, there are a lot of publications by relatively smaller companies. And the publications are on their industry-scale systems, but they're also well
evaluated and they adhere to all the requirements that we have of a research paper.
And part of what's driving that is just the scale of the data and the workloads at these companies, and of course the diversity of the workloads as well.
That's partly it.
And overall, I think that's a healthy thing
for both industry and academia to be involved
in similar kinds of research
and overlapping sub areas of research.
It wasn't always this way.
Earlier, we'd have companies that would do their own thing
and then academia would do its own thing.
And companies, I'm talking very generically here,
companies would look at academic papers and say,
maybe that's their own thing.
Maybe we can't really use it.
But I think that gap has narrowed a little bit.
So industry and academia are a lot closer to each other nowadays in terms of working on similar kinds of problems, similar types of problems.
And so that's led to a flow of information and ideas both ways, from academia into industry, and also from industry into academia.
Having said that, there are still academics who choose to stay largely away from industry and kind of do their own thing in their own ivory tower. And there is some value to some of that,
right? Because scientific research does require some blue sky thinking at some level.
And then, of course, there are companies
and groups within companies that do their own thing that don't read a single paper during a
given year. And that's fine too, you know, there is, there is reason to do that as well.
But I would certainly say the gap has narrowed, and in a very healthy way.
And, you know, I think the challenge for academic researchers
is to see how we can partner with the industry
in order to have our ideas be realized
more easily in practice.
And also so that we are working on relevant problems.
I like to say that it's very easy to work in academia
on a problem pulled out of thin air.
That's one of the easiest things to do, right. But it's not a very productive thing to do. You want to work on real problems
that affect the real world, but not necessarily design solutions that are usable right away.
In the sense that you want to be a little bit forward-looking and look at the deep scientific
value, come up with ideas that can be used right away,
but that's not the end goal.
You look at also the science
and perhaps extensions of these ideas
that are applicable in other scenarios.
So something like you can pull out problems
that existing companies might be facing,
but your solutions shouldn't be constrained
by the constraints of the existing corporation
that's dealing with that problem.
Yeah, so I'll give you an example.
So in around 2011 or 2012, I started looking,
that's when key value stores
and stream processing systems were really coming up,
all these open source systems like Cassandra, Riak, and a bunch of others.
I started looking at their JIRA pages to see what kinds of issues they were facing.
That's one of my favorite activities, by the way.
I just love looking at pages that list bug reports, however minor, however major, however resolved or not, or will not resolve, doesn't really matter.
I just love looking at them. And while looking at them, several things jumped out at me. And you wouldn't get this unless you were just browsing through these pages with no aim in mind
other than just understanding them. So what jumped out at me was that a lot of
bug reports that went unaddressed related to reconfiguration, which means that you had a
bunch of data on a set of servers, and now
you wanted to change something about the way the data was structured, perhaps change the
primary key, or perhaps you wanted to change the set of servers on which the data was stored.
And essentially the way in which most companies were doing it back then
was, they would shut down the data center or their servers, they would migrate the data,
and then they would spin it back up.
And this is terrible for downtime, right?
And so that pattern kind of emerged
out of all the reports that I saw.
And so we said, why don't we try to solve that problem,
the reconfiguration problem,
where you migrate data online,
but still keep the service active.
And we try to assure as low latency
as if there were no migration going on.
And so that's what led us to several systems, not just in the key value store space, but also in the stream processing space and a bunch of other subdomains as well.
Even though the JIRA reports I was looking at were only for key value stores, we were able to think about that problem in other domains as well. So that's, you know,
that's an example of something that a company may not be interested in,
but that academia could contribute to.
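One simple way to picture the online reconfiguration problem Indy describes is a dual-write migration, where the service stays up while data moves from an old shard to a new one. The sketch below is a deliberately simplified illustration of the general idea, not the protocol from the research discussed here.

```python
# Toy dual-write migration: while data moves from old_shard to
# new_shard, writes go to both and reads fall back to (and lazily
# migrate from) the old shard, so the service never goes down.

old_shard, new_shard = {}, {}
migrating = True  # reconfiguration in progress

def write(key, value):
    if migrating:
        old_shard[key] = value  # keep the old copy consistent too
    new_shard[key] = value

def read(key):
    if key in new_shard:
        return new_shard[key]
    if migrating and key in old_shard:
        new_shard[key] = old_shard[key]  # lazily migrate on read
        return new_shard[key]
    return None

old_shard["a"] = 1     # pre-existing data on the old shard
write("b", 2)          # a write during migration lands in both shards
assert read("a") == 1  # old data still readable (and now migrated)
assert read("b") == 2
```

The hard parts that make this a research problem, concurrent writers, failures mid-migration, and keeping latency as low as if no migration were happening, are exactly what this toy version leaves out.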
And so that's super fascinating because you didn't,
it's not like somebody told you about this problem.
You were just looking at public bug reports of open source components.
And you said, oh, it seems like people in the industry are dealing with this problem. And you
just went ahead and solved it. And maybe they can take some of those ideas and apply them.
So did they finally get applied to any of those systems?
Yeah. So when we thought of the reconfiguration problem and we were thinking
about it, and this was the early days of what would essentially
be a long, four or five year NSF project.
We weren't even sure that this was something
that companies thought about, right?
But then when we published the paper,
actually, even in the reviews to the paper,
which are anonymous reviews,
even the reviews, folks said,
oh, this is a problem we face,
but it's a very hard problem to solve.
So I'm glad someone is making a scratch at it.
And that's kind of, you know,
what underscores why we do research in academia, right?
Because directions which seem daunting to industry
and so they don't even want to scratch it
because they don't know how far they can go.
We can make big dents in directions like that in academia.
And then, you know, once we make dents, then others can build
on our ideas. And by others, I also include folks in industry. And so part of the work that we did
back then was also in collaboration with companies. And so some of that made it into
production systems in companies like Yahoo, for instance.
Yeah, because that's the kind of problem that companies wouldn't be incentivized to solve,
at least from being inside a company. It's just such a big problem. And you don't know what a timeline for that would look like. It's not solving an immediate business need, but it is a really
nice service. It's the kind of problem I'd like to think of as an intern project, but in a much bigger sense, where it's really
nice to have if they solve it, but if they don't solve it, it's not the end of the world.
Yeah. Yeah. And you know, internship projects, what seem like small hacks, are great starting
points for research projects. You know, you think, oh, this is just a small
hack. It fixes my issue. But an interesting exercise to do there, something that I recommend
everyone to do when they're trying, quote unquote, a small hack is to think,
is there something wiser underneath this hack that I'm doing? Same thing goes with internships.
So a lot of internships are like, oh, write this piece of code and it won't take you three months, and then your internship is done.
But if you think a little bit deeper underneath whatever you're doing,
sometimes you uncover gems.
It just needs that little bit of think time.
And this goes back to why someone likes a particular area
in terms of doing research.
So usually what I tell people is, and this is something that one of my professors told
me when I asked him, you know, I was just in grad school and asked him, how should I
pick an area to do research?
And the advice he gave me, which I'll narrate here is pick an area that you like thinking
about in the back of your mind, when you're walking to work or walking from work or driving to work, driving back from work,
and that doesn't feel like work. Thinking about it doesn't feel like work.
That's a very important criterion to have for research, right? Because it's in those idle
moments that interesting ideas come and interesting thoughts come.
And that's what keeps you engaged.
As one of my students likes to say, if you love your work, if you love your job, you never have to work for the rest of your life.
That's kind of what summarizes life in academia.
You know, a lot of us are very passionate about what we do.
And so sometimes I think, oh, you know, this university is paying me salary to do stuff I love doing.
There's some scam here, right?
This can't be true, right?
And I've had this feeling since the day I was a grad student.
You know, I'm getting paid to do research.
How is this possible?
But there is a value in doing research, and there's a scientific value in doing
research, which in the US certainly,
and in many other countries too, is still appreciated because it contributes
directly to the economy and contributes to companies.
And so there is space there.
So something that we hear from outside academia
is that it could also be stressful in terms of,
you know, there's like,
you have to think about how you get funding
for your particular research area and all that.
Is that all like overblown or is that like a real concern?
Like how would you encourage someone to think about that if that's what they're worried about?
Somebody who's thinking, I really want to go into academia, but I'm worried about not
getting professorship or fighting about funding all the time.
Yeah.
Yeah.
The fear of funding or fear of writing grant proposals is one fear.
The other fear associated with becoming a professor is just the fear of not getting
tenure.
My philosophy to both has been to ignore them completely, to just do whatever I feel I like to do.
And I think that's easier to do nowadays in general. Look, if you go into academia, let's say, as a professor, and you work on things that you're
passionate about, the chances are good that things will work out.
Tenure rates at top places are very high.
Funding rates for people who are productive in terms of coming up with good ideas are generally very
good.
It's still competitive, but I think if you're passionate about what you do and stick with
what you are passionate about, I think things will work out.
There's no guarantee, of course. So yeah, that's my philosophy. I think
funding is competitive. Acceptance rates for proposals in the National Science Foundation are,
last I checked, they were around 10% or even lower than that. And that's significantly
more competitive than inside companies.
Like if you're running a team inside a company,
getting funding for whatever you want to do
is something that gets worked out
via discussions and with your managers
and things like that.
So it's not really a competitive process.
It's very rare for two teams
to be up against each other
when they're trying to do the same thing.
It's kind of a different game in academia.
But one of the things that comes out in academia
is no ideas are bad ideas.
So dealing with rejection is something that we do all the time,
build a thick skin.
And some of the best papers that have been written
have been rejected multiple times
before they were accepted.
And this is not just true of computer science,
it's also true of physics.
A lot of Einstein's papers were rejected, for instance,
before they eventually became really famous.
And rejection is actually a blessing in disguise. It helps you
rethink this idea of re-looking at things. Again, it makes you rethink how you were communicating
your idea. The idea itself and your execution of it may be fantastic, but communication is
also part of writing the paper, writing proposals, or writing whatever.
So unfortunately, human beings have a very low baud rate
in communicating with each other.
We have to write words and things like that.
And we have to write it a specific way
so that it's communicated and it's understood by others.
So that's part of our role as well, communicating.
So sometimes students ask me, oh, Indy, you're talking about communication and also coming
up with new ideas.
What percentage would you put each as being important in a paper?
How much importance does communication have in a paper? Is it 50%, and the
remaining 50% is the idea itself, or is it more like 60-40? And I like to say, well, a good paper
is 90% good communication and 90% good ideas.
Yeah. And I think an excellent example of that is the SWIM paper that you worked on as a graduate student, I would say 15 years ago now.
And that paper has become really, like I shouldn't say famous, but we're probably using some kind of application that's using some of the ideas in the SWIM paper.
So do you mind just sharing a little bit about what that paper is and why I know about it?
Yeah.
So the story of Swim is actually kind of interesting too.
So when I was in grad school, I decided to do an internship at IBM Research, T.J.
Watson in New York City.
And the internship was kind of unstructured.
And I was very fortunate in being able to talk with not just my manager,
but also my manager's manager.
I would just walk down to his office and talk with him.
And during the internship, I was doing something in the foreground,
but in the background, I was thinking,
what can I do to turn this internship into something
that could be included in my thesis?
So I ended up writing this kind of highly theoretical paper.
I was doing a bit more theory than systems back then.
I ended up writing this paper that was a little bit theoretical on failure detection and a
good, a very close to optimal way of doing failure detection.
Just to interrupt, failure detection is when you're trying to figure out whether
a machine is up or not, or some kind of node
is alive or not in a distributed system, right? Correct. Yes. So you have a collection of servers
and the servers may fail or crash at any point of time. And you want to detect the crash of any
server, which can happen at any point of time fairly quickly, but also very accurately, meaning
that you don't want to make mistakes. You don't want to mark servers as failed when they actually
haven't failed.
I know this failure detection is a very important
sub problem for pretty much all distributed systems to solve.
Everyone needs to solve it.
If you're running a database, for instance,
if you don't do failure detection, you might lose data
or you might lose transactions, you might lose operations.
And so failure detection is a core problem
that every single distributed system needs to solve.
After writing this theoretical paper, I just put it away and it got published in the top
distributed algorithms conference.
A year later, I was back in grad school at Cornell, and my office mate,
who was a first-year PhD student
(I was in my second year at that point, maybe third year, but anyway, he was a junior PhD
student), came to me and said, we have to do this course project and we're just looking for
an idea. And I went, you know, there's this theory paper I wrote, why don't you just read it and see
if it looks interesting. So he and his project partner read it, and they implemented a first version
of it.
And I was pretty impressed by the fact that they actually first picked that paper to implement
and by the fact that they did it so quickly.
And then I sat down with them a little bit and went over some of the designs.
And while we were discussing, some new things jumped up in terms of the systems design portion
of it.
And eventually, that became the SWIM paper.
And it holds a very special place in my heart,
because it was the first paper I wrote in grad school
that did not have any faculty members as co-authors on it.
It was just me and two other students.
You weren't a faculty member at that time.
No, I was a PhD student back then, right?
So it was me and another PhD student, junior to me,
and then another master student. So it was very special, very interesting that, you know, something like that could be
pulled off with zero faculty involvement, in a sense. It also gave me my first taste of what
it felt like to work with other students. So that was the SWIM paper. And then, you know,
we published the paper. We never open sourced the code,
because you know how student code is, right? Going back to the discussion of why we don't
adhere to style guides and things like that. So we forgot about it.
And then many years later, I got to know... This was when I had become a faculty member. I got to
know that a bunch of companies had adopted it, that Uber used a version of Swim
in their internal system, that HashiCorp had implemented Swim in their open source systems,
and those were widely used by other companies.
I spoke with some of the developers and I asked them, why did you pick SWIM? There's a bunch of other failure detection algorithms.
They never reached out to you?
No, no.
They never reached out to me, but, you know,
one of my former students ended up working at their company and then,
you know, that's how we made contact.
And so, you know,
I ended up meeting with the developers and asked them, you know, why did you pick this paper at all?
Was it because it's the most scalable, or was it because it was the
fastest, or perhaps the most fault tolerant?
And I was very surprised when they told me, no, it was because your paper was easier to read.
And that's it, right? So it goes back to communication, that communicating ideas in a
clean kind of intuitive, understandable way is really, really, really important as well.
And this is true, not just in academia, right? I mean, it's also true inside a company.
You might do something spectacular,
but the presentation of it
and how it is seen and is visible
to others inside the company counts for a lot.
Yeah.
And for those listeners,
Nomad by HashiCorp uses SWIM
to this day, I think. I just checked on it recently.
So a lot of systems are implemented using that.
Yeah, I think HashiCorp's Serf and Consul
also use versions of SWIM, and Uber used it in their RingPop system.
I think Uber's system is RingPop.
Yeah, yeah.
I forget what HashiCorp's system is called.
Yeah.
So yeah, I think that speaks to, as you said, right, the
importance of communication. And ultimately,
the work in academia and industry,
especially in something like this, is not
remarkably different, right?
Presenting
your work well, as well
as coming up with new productive ideas,
is what gets you ahead.
Yeah, I think that's
fascinating. And
in terms of failure detectors,
or maybe even just like
distributed systems in general,
what do you think has evolved?
So that is, I think, a 2003 paper,
and then it got implemented in 2010.
So what are some things that are new
that's like a 2020 research paper
that you're excited about?
Yeah. So I think there are a few things. So I think there's a lot of very interesting problems in the area of distributed machine learning. And in general, the area of distributed systems plus X,
where X takes on values such as machine learning or Internet of Things, IoT, or X
is human-computer interaction.
How do humans interact, users really interact with distributed systems?
Distributed systems plus agriculture gives you this entire area of smart agriculture
or digital agriculture.
So the way I look at distributed systems is that it's a fairly mature field now.
And companies have largely figured out how to implement all the basic building blocks
for distributed systems, be it failure detection, be it consensus, be it multicast, or be it
any of these, leader election, mutual exclusion, all these classic problems. So there are some niche sub-areas where researchers could still contribute ideas and have it be
relevant to industry.
Blockchain is one of those areas.
And there is still some activity going on in the consensus area.
But by and large, if I come up with a new leader election algorithm for classical wired
data centers, companies are not going to be interested in that. Google's going to
be like, yeah, we have Chubby or some successor to Chubby and that works. We don't care.
A lot of the open source folks are going to be like, ah, Apache ZooKeeper works just fine.
Why do we care about your leader election or your mutual exclusion algorithm?
So I think a lot of the interesting action nowadays is in the Distributed Systems plus X space,
where when you relook or rethink
about Distributed Systems problems
in a completely new light, under a new set of assumptions,
then the old solutions no longer work as well,
and you have to rethink new solutions.
So I'll give you an example. We recently started working on what happens when you have a set of
workers and they're doing distributed machine learning, but some of the workers are malicious.
They're behaving in what we call a Byzantine mode, and that's the classical distributed systems term. And it turns out
that in classical distributed systems work to reach consensus, consensus algorithms can
provably tolerate only up to one third of your workers being malicious. If you have
a third or more being malicious, then all bets are off. Things can go really bad.
But in distributed machine learning, it turns out that you can actually tolerate
as many as 50% of your workers being bad.
And in fact, in practice,
you can tolerate a lot more bad workers than that.
And so when we proved this formally,
it was very surprising to us.
Like, why is this happening?
But when you think about it intuitively,
it makes sense because machine learning
is naturally very noise tolerant.
There's already a lot of noise built in in data and the machine learning that you see,
whether it's stochastic gradient descent or any other machine learning, is already naturally
noise tolerant to some extent of noise that is inherent in the data.
And so if you have techniques that essentially treat the malicious nodes' sent data or gradients
as part of this noise, then you're getting some
of the same behavior. That noise tolerance is also inherited as Byzantine tolerance.
And that's what gives you a little bit more Byzantine tolerance in distributed machine
learning than you would get in distributed consensus. So that's one of the scientific
insights, which you wouldn't get if you were doing this
like in a company or something.
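One way to see how this can work is a coordinate-wise median over the workers' gradients; this is a hypothetical aggregation rule for illustration, not necessarily the rule from the paper. For contrast, classical Byzantine consensus needs n >= 3f + 1, so four nodes tolerate just one traitor, while a median is unmoved as long as honest workers form a strict majority:

```python
import statistics

def robust_aggregate(gradients):
    """Coordinate-wise median of the workers' gradient vectors.

    The mean can be dragged arbitrarily far by one malicious worker,
    but the median ignores extreme values as long as fewer than half
    of the workers are malicious.
    """
    dims = len(gradients[0])
    return [statistics.median(g[i] for g in gradients)
            for i in range(dims)]


# Three honest workers agree on the true gradient; two malicious
# workers (40% of the total, above the classical one-third bound)
# send enormous values to try to derail training.
honest = [[1.0, -2.0], [1.0, -2.0], [1.0, -2.0]]
malicious = [[1e9, 1e9], [-1e9, 1e9]]
step = robust_aggregate(honest + malicious)  # stays at [1.0, -2.0]
```

The server then applies `step` as if it were an ordinary averaged gradient; the malicious contributions are absorbed as just more noise.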
And it also opens new doors, right?
Now you can think about Byzantine tolerant machine learning
with a lot more confidence.
There are many other kinds of security attacks.
Byzantine is just one.
Poisoning attacks and a bunch of other attacks
are a lot more insidious.
But it opens the doors to thinking about them and aiming for higher, not just aiming for what the classical distributed systems results
told you. So the same problem that had well-known results in classical distributed systems
has completely different bounds, has completely different solutions in
distributed machine learning systems. So I would say distributed machine learning is one area where
there's a lot of action.
I think serverless versions of many classical systems are another area. Serverless or actor-based versions. In general, I think when we build systems, we don't necessarily pay as much
attention to what the users experience. So thinking about user metrics and user experiences, but lower down in the system stack, I think
that's a very important part.
That's an area where I think industry has a much harder time making progress.
That's where I think academia can make a lot of inroads.
Distributed machine learning, distributed IoT, other distributed systems areas are areas where industry will
have as much stake as academia.
But areas like distributed systems plus HCI, it's something that academia may be able to
contribute a little more.
It's fascinating that distributed machine learning is much more resilient to Byzantine
nodes compared to classical algorithms.
I didn't fully
grasp that, so could you go
a little deeper into it?
What's the difference?
You mentioned that
it could work even with a system
that has more than 50% of nodes
being bad. At that point,
wouldn't you rate the bad nodes' data as less noisy compared to the good nodes,
even though that's not correct? Or what am I missing here? Yeah.
So the first big difference is that the consensus, quote unquote, problem
is slightly different in the classical systems world
and in the machine learning world.
In the classical systems world,
essentially in the consensus problem,
you're trying to make a clear decision
between a one or a zero, right?
Should I accept this write sent by a client,
or should I accept another client's write
as the next one in my order, for instance? While in machine learning, if you look at stochastic gradient descent,
for those listeners who know how it works, or at least have heard of it, essentially,
all it means is that you're doing a gradient descent and you want to make sure that you're
descending towards the same minimum and that you're descending at the same rate. Those are the only things that matter. So that's a lot more continuous-mathy
than the discrete-math version
of the consensus problem, right?
So essentially all we needed,
to show that the Byzantine nodes would not have an effect,
was to prove, one, that the minimum towards which
we were converging is the same.
And two, that the rate of convergence is the same
as if you had no failures at all.
And so the problem itself is slightly different.
Okay, and you can just run like a bunch of simulations
to prove something.
That's right, yeah, yeah, yeah.
So, not surprisingly, it turns out that
you can prove something,
but usually the bound is not tight,
which means that when you simulate it,
you can actually tolerate a lot more bad nodes. And then the effort, of course, is to go back and
try to tighten your bound and try to prove that what you saw in the simulation is actually also
true in theory as well. We haven't been able to do that yet, but yeah.
And with distributed machine learning, it seems like there are so many different things you have
to think of, right? You have to partition your data in a particular way. It seems like you would have to think about it in a completely
different way versus the standard way you would shard data in a distributed system.
Right. Yeah. And there's many different kinds of distributed machine learning, right? So the
results that I talked about, about Byzantine tolerance applied to distributed versions of
stochastic gradient descent, or what we call SGD. But there are deep neural nets, DNNs, there are recurrent neural nets,
and all these different categories of machine learning. And some of them have distributed
versions, some of them don't. So there's a big challenge of how do you make distributed versions of these? Say I have a large neural network and it won't fit on my eight gigabyte GPU.
And I need to split it across multiple GPUs.
How do I split it best?
So there's first that step you need to solve.
And after you solve that step, then you come to the question of, oh, now what happens if
some of these workers go offline or some of these workers misbehave?
So that essentially you're kind of revisiting the distributed systems workflow,
but for a completely new domain.
Yeah. It seems like a rabbit hole you can go really deep into.
Yeah. Yeah.
Because everything is easy as long as it fits onto one machine.
Yeah.
But then as soon as that stops happening, you're stuck.
Because if you think about all of these different things.
Yeah.
And with the advent of new models like GPT-3, they just won't fit on regular GPUs.
In fact, even well-known DNNs like Inception V3, ResNet, these are the standard things
that researchers look at.
They won't fit on a GPU that has a couple of gigabytes of RAM.
Just won't fit.
Just imagine trying to do training or inference on cell phones, which don't have a lot of
memory, with any of these models.
And it's just nearly impossible today.
Forget the power-hungryness of these neural networks.
Even the memory constraints are very hard to satisfy.
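The arithmetic behind that is simple. Here is a back-of-the-envelope sketch; the 3x training factor is a rough rule of thumb (weights plus gradients plus one slot of optimizer state), and real footprints are larger once activations and optimizers like Adam are counted:

```python
def model_memory_gb(n_params, bytes_per_param=4, training=False):
    """Rough memory needed just to hold a model's fp32 weights.

    Training roughly triples it (weights + gradients + optimizer
    state); activations come on top and grow with batch size.
    """
    factor = 3 if training else 1
    return n_params * bytes_per_param * factor / 1e9


# GPT-3 has about 175 billion parameters: roughly 700 GB of fp32
# weights alone, far beyond a single GPU, let alone a cell phone.
gpt3_inference_gb = model_memory_gb(175e9)
gpt3_training_gb = model_memory_gb(175e9, training=True)
```

Even a model a thousand times smaller strains a phone once activations and the rest of the runtime are added, which is why fitting models onto small devices is an open problem.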
So I think that's
another area where distributed systems plus X, you look at distributed systems plus IoT plus machine
learning, it has a lot of open problems. How do you take large neural networks and make them fit
and run on smaller devices? Yeah, there's so much. And now I'm just curious, and I want to look up some papers. What is the recent work in this field? several years, several companies, including Jeff Dean's team at Google, they looked at this problem.
What happens if I have this large neural network, but I have multiple GPUs and I need to split the
neural network across these GPUs, right? And essentially it's a problem of graph partitioning.
But the partitioning has to be done carefully. First off,
you want to partition quickly, so that you come up with a plan quickly.
And second, once you have partitioned,
the actual execution time of the placed,
partitioned model should also be pretty fast.
When we looked at this problem two years ago,
we said, okay, let's look at this partitioning problem.
In machine learning,
it's called model parallelism.
That's the technical term.
So a lot of the solutions that existed, including two papers from Google,
essentially were using reinforcement learning to do the placement itself.
So they would profile how it ran, and then they would profile it quite a bit,
and they would use reinforcement learning as the scheduling algorithm.
And the schedules they came up with were very good.
They were comparable to schedules generated by experts.
But the time to generate the schedule would be hours
or in some cases, like two days.
And that's too slow for, imagine a developer
that's trying to quickly play around with a few models
and you tell them, oh, now you have to wait for two days
for your next iteration.
That just doesn't work.
Just to plan how to partition the data.
Yeah.
Yeah.
Just how to partition the data.
So we said, and this is, again, an example of what you may not be able to do in industry
because it's so open-ended that it's not clear you would succeed.
We said, let's look at very old classical literature that's looked at parallel computing,
that has looked at DAGs or directed acyclic graphs of tasks.
And let's see if those partitioning algorithms can be applied.
No reinforcement learning, just good old-fashioned classical algorithms.
And so we actually found a couple of algorithms that were published and that were provably
close to optimal.
It's an NP-hard problem, so you can't really solve it optimally, but you have heuristics
that are close to it.
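A toy version of that "good old-fashioned" approach is classical list scheduling over the operator DAG: visit tasks in topological order and put each one on the device where it can start earliest. This is an illustrative sketch, not the algorithm from the paper, and it ignores communication costs between devices.

```python
def list_schedule(tasks, deps, cost, n_devices):
    """Place a DAG of operators on devices with list scheduling.

    deps maps task -> list of predecessor tasks; cost maps task ->
    runtime. Returns the placement (task -> device index) and the
    overall finish time (makespan).
    """
    placed, finish = {}, {}
    device_free = [0.0] * n_devices
    remaining = set(tasks)
    while remaining:
        # tasks whose dependencies have all finished
        ready = sorted(t for t in remaining
                       if all(d in finish for d in deps.get(t, [])))
        for t in ready:
            earliest = max((finish[d] for d in deps.get(t, [])),
                           default=0.0)
            # pick the device that lets this task start soonest
            dev = min(range(n_devices),
                      key=lambda i: max(device_free[i], earliest))
            start = max(device_free[dev], earliest)
            finish[t] = start + cost[t]
            device_free[dev] = finish[t]
            placed[t] = dev
            remaining.discard(t)
    return placed, max(finish.values())


# A diamond-shaped model: a feeds b and c, which both feed d.
tasks = ["a", "b", "c", "d"]
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
cost = {t: 1.0 for t in tasks}
```

With two devices, b and c run in parallel and the whole graph finishes in 3 time units instead of 4; a heuristic like this produces a plan in milliseconds, with no profiling runs at all.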
And we applied those and we actually implemented those in TensorFlow.
We're like, okay, now that looks like a paper.
And it was crap.
It didn't work well at all.
It was like, what's going on?
And then we had to look under the hood and see what constraints TensorFlow required us
to solve.
There were a lot of constraints related to co-placing operators, forward and backward
operators on the same device, grouping operators so that the communication was minimized, a
whole set of heuristics and a whole set of optimizations, some of which were essential, some of which were needed for efficient operations.
And this entire activity took several grad students, including the lead student, who
was a very persistent student, up to a person-year, one year's worth of effort.
And at no point along the way were we ever confident that we will have a system at the end of this or we will have a paper.
It was just so completely open-ended.
And I still remember the day the student came and said, Indy, we have a positive result.
And he said, you know, remember those neural networks that were being placed in two days by Google's paper?
We are able to place that in 10 seconds. And the placed model is also comparable to the experts.
And so that's an example of how academia can help you re-look at directions. In this case,
it was like, oh, you don't need to apply reinforcement learning. Reinforcement learning
looks like a big hammer, but maybe it's too big a hammer to swat a fly with.
Maybe a fly swatter is just perfect.
Was the kind of problem just that there were so many things that had to be solved?
Or was there just a few key insights that had to be made by those grad students in order to solve this?
And did they ultimately open source their work?
Do you know if it's used in production? Yeah.
Yeah. So the work is open sourced and we publish all of our systems and papers
and code for it. We publish it on my group's webpage, the DPRG page,
dprg.cs.uiuc.edu. It's still a relatively new paper. So this paper was published last summer.
So the inroads to industry are... It takes a while to make inroads into industry.
A lot of the problems that we needed to solve were very
TensorFlow problems. TensorFlow is a complex system to work with. Every system is complex
to work with. Every system is its own beast. TensorFlow has a lot of quirky things.
PyTorch and other systems have their own similar but different quirky things.
And part of our effort was to solve some of those challenges,
but also think consciously about whether those challenges were TensorFlow-specific
or whether they were also generalizable.
So I think that's part of the challenge there as well.
I think some of the techniques we came up with,
especially optimizations where we group operators together,
those would be applicable
no matter what your base system was,
whether it was TensorFlow or PyTorch or something else.
But there were a few other things,
especially related to avoiding cycles, for instance.
Our placements might sometimes generate cycles
among the operators, which was unacceptable to TensorFlow. And so those would be TensorFlow-specific
optimizations. There were a few others. But looking back on it, I would say it was a great
adventure going through all those steps. And there were many things we tried that didn't
work out at all. That's part of doing these things. You try something and it's like, no.
A month later, you're like, no, that didn't work out.
Let's backtrack.
Let's explore another part of the state space search.
And I think that unpredictability would be just very hard to do in industry.
A research group might be able to do it in a research lab. But even there, you know, they have deadlines and things like that.
It's much harder. Yeah.
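The cycle constraint mentioned a moment ago can be checked with another classical tool, a topological sort: contract each placement group to a single node and run Kahn's algorithm; if the contracted graph cannot be fully ordered, the placement has introduced a cycle. This is a hypothetical helper for illustration; TensorFlow's actual checks are internal to the framework.

```python
from collections import defaultdict

def creates_cycle(groups, edges):
    """Return True if contracting ops into their placement groups
    creates a cycle among the groups (which TensorFlow rejects).

    groups maps op -> group id; edges are (src, dst) op pairs.
    Kahn's algorithm: if the contracted graph can't be fully
    topologically ordered, it contains a cycle.
    """
    succ, indeg = defaultdict(set), defaultdict(int)
    nodes = set(groups.values())
    for a, b in edges:
        ga, gb = groups[a], groups[b]
        if ga != gb and gb not in succ[ga]:
            succ[ga].add(gb)
            indeg[gb] += 1
    ready = [g for g in nodes if indeg[g] == 0]
    ordered = 0
    while ready:
        g = ready.pop()
        ordered += 1
        for h in succ[g]:
            indeg[h] -= 1
            if indeg[h] == 0:
                ready.append(h)
    return ordered < len(nodes)


# Chain x -> y -> z. Grouping x and z together while y sits alone
# makes the two groups depend on each other, which is a cycle.
edges = [("x", "y"), ("y", "z")]
```

Grouping adjacent operators (x with y, or y with z) keeps the contracted graph acyclic, which is part of why operator grouping has to be done carefully.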
Yeah, as somebody who goes into distributed systems research,
you don't think that,
oh, I'm going to be mucking around with
the TensorFlow code base for a year.
But that's the kind of stuff you work on.
Yeah, and it needs persistence,
right? So this is one of those things,
all those experiences where
the adage of you must really
love it in order to make progress, comes back. So this student was very persistent, and
I was very impressed by how persistent they were. A different student may have given up at
some point, may have been like, I'm wasting so many months of my life on this, I don't know if
this is going to succeed or not. And part of it is just having confidence that it's going to succeed. And I think just that persistence pays off a lot. So,
you know, persistence pays off, not just in research, it pays off everywhere.
But, you know, sometimes you have to cut your losses and run, right? It's kind of like,
you know, you're in a theater, you've paid some money, and you're like,
this movie is kind of
sucky. Should I go out or should
I stick with it?
And, you know,
if you're wise and you're
persistent, you'll realize that it's a movie
that you can actually change, that you have
control over how the movie goes
and that you could turn it into an exciting
movie.
One of those movies where you can just change your interpretation
and have fun even though it's a garbage movie.
Oh, yeah, yeah.
Like The Room.
Yeah, yeah, yeah, exactly.
Yeah, yeah, yeah.
Yeah, and I guess somewhat related to this is,
I know that you're passionate about this topic,
which is how do you
get industry to collaborate better with academia?
Yeah, definitely. The bigger groups like
Google and Facebook, they publish papers on a regular basis, but the standard software engineer,
especially in a smaller company, or even in different parts of larger companies, they're
not really incentivized to publish papers. I know that
I've thought about it and I just wrote a tech blog and I gave up. It's a lot of work. I don't
know how the review standards would be. I don't read too many papers in the space that I'm in,
but how does one get to change that? Or do you have any ideas on what could industry be doing differently? Yeah. So I think it's understandable that many companies and many groups that write industry systems don't ever publish their work.
You know, when you think about how someone would spend their time, like a few hours in a typical day. You could spend it writing code to earn your company a million dollars.
You could spend the hours alternately looking at logs of a recent outage and understand
some of the causes of that outage.
That's kind of exciting too.
Or you could fix the outage or you could troubleshoot.
Or you could just have a nice lunch. Or you could take a walk. Or you could spend time with family or friends.
Or you could write.
That's like the most boring of all of the options that I mentioned.
It seems like the least productive.
So that kind of indicates that there is an obstacle to publishing cutting-edge papers.
But at the same time, many of the systems that
have been built, whether it's in large companies or small companies, and that have had some degree
of success, even if they are only industry systems, and even if they do not solve a
core scientific problem, even then they are worth publishing because in many cases, the reason the
system is attractive may not be visible to
those who developed it. For instance, maybe they're handling users at a much
higher scale than has ever been handled before. Or maybe they're handling a diversity of use cases
much different than has been handled before. Or maybe just the workload that they're handling is so unique that it will be interesting
to the broader community.
So I think there is a lot of value to publishing papers that are inside a company, a system
that is deep down in the bowels of a company that users will not see.
I'm not talking about the YouTubes or the Gmails.
I'm talking about Google's core scheduler for the data centers, which by the
way, was never published.
Borg has been around inside Google's data center since the early 2000s, and the Borg
paper was only written in the mid-2010s.
So it ran for a decade and a half before it ever saw the light of day. And so I think, you know, I certainly hope that a lot more industry practitioners think about
publishing their systems. And also, I hope that academics and academic researchers are
willing and take the step to partner with some of these companies in order to help them evaluate the
systems, analyze their systems,
because I think that's part of the hard work that you need to do to write the paper, right?
For which sometimes the industry practitioners don't necessarily have time, developers don't
necessarily have time. So that's something that we were able to do with a few companies. With
Microsoft Service Fabric, we worked with the Service Fabric team and helped them evaluate their system and bring
together the different parts of the system all in one paper.
And same thing with a few other companies that we have worked with.
So it's a very interesting exercise.
It's a very different exercise than typical academic research.
Typical academic research is like, oh, what's the new idea I have?
Or what is the new problem I'm solving? Here, the goal is more like, what did this company do?
Or what did this group do? And what did the system do? And what use cases are they solving?
What workloads are they looking at? And how do we tell the story? How do we communicate this
in a way that reflects what the company was trying to do. So I think that's kind of important.
Companies sometimes are more reticent to publish the systems that are deep down in their bowels
because they see a competitive advantage to them being secret.
But to that, I would actually say that publishing papers is a long-term investment strategy. If a company is visible in a conference,
a well-known systems conference, the company is able to attract better employees, researchers
who are grad students, and they're like, oh, Company X is doing cool research in this topic,
and maybe that's a company I want to work in. So it's a recruitment tool.
It's also a way of addressing competition. If you're publishing on a system that you've been running for a while, you're not going to publish on the latest version of
the system. So you're already well ahead of what you're publishing, and well ahead
of your competitors. Look at what Google did, for instance, with MapReduce.
So when they wrote the MapReduce paper, Google has never open sourced its MapReduce implementation
to this day. They only wrote a paper on it. But the paper was then picked up by folks
at Yahoo, who then wrote Hadoop based on it, and then HDFS and a bunch of other systems.
And that essentially is what kicked off this entire big data systems revolution back in
the mid-2000s. And that revolution has benefited a lot of companies, including Google, which wrote the
original MapReduce paper.
And it has helped attract folks to join Google as well.
So that's just an example.
Not every paper you write is going to spawn a new sub-area of research.
Some of them may have more limited use, but every paper you write brings visibility to
the company and gives you an edge over your competitors.
Yeah.
At least it seems like software engineers do want to share what they're working on.
There's corporate tech blogs and then there's presentations.
I think there's also a lot of uncertainty, right?
I'm writing this paper.
I have no idea whether it's going to get published or not or what the requirements are.
So definitely that sense of community or that sense of help from academia that, oh, I'm
working with a grad student who's in this space and they think I have something valuable
to share.
Yeah.
I think that would go like a long way, especially if you'd see a case where a company would
publish an informal blog post saying, this is the system that we've built. And maybe
somebody from academia could be like, oh, that looks like a very interesting candidate
for a research paper. And that might just be the small nudge required to go from a blog post to a research
paper. Yeah. And sometimes the developers and companies who are even building these systems
may not even realize that their systems are a significant step forward in the research landscape.
So I definitely encourage developers to talk with academics as well. You don't need to repeat all
the details of what you're doing, but you can talk at a high level of what you're doing and
that might be enough feedback for you to see that whatever you're working on is really a big
forward step. So I think it can help both ways, right? Researchers can benefit by understanding what companies are doing so that in academia,
we work on problems that are still current and relevant.
And, you know, we are not solving 10-year-old problems, which have already been addressed.
And then companies and industry groups can also benefit from talking with us because
it helps them calibrate whether or not they can give more
visibility to whatever they're doing right now. So I think that's one of the advantages of going
to a conference. So I encourage people from industry, academia, everyone to go to any of
these systems conferences. Even if you feel you might be outside your comfort zone, that's a good
thing. You go outside your comfort zone and that teaches you something. And just talk with people.
These conferences,
it's not just the presentations,
but it's the corridor conversations
you have with completely random people
that lead to some very interesting insights.
I cannot tell you
how many research projects
I've started because of
completely random conversations
in corridors.
There's just so many of them.
Yeah.
As soon as this pandemic gets over,
that's one of the first set of things that I'm excited to do.
Much harder to do corridor conversations over Zoom.
You have to set up a time.
I think that's one of the ways in which I think research in general has suffered a little
bit in this pandemic era or the Zoom era.
It's much harder to have impromptu conversations.
Everything has to be scheduled.
Plus, you know, there is always the issue that you can't really be on Zoom for too long, right?
There are downsides.
The downside might not be really obvious from the get-go, but there are definitely downsides,
let alone like the social interaction and just building relationships and all of that.
So you mentioned right at the start that you were a visiting researcher or like you're
taking a sabbatical to work in industry for a year.
And you'd been in the field for more than 10 or 12 years at that time.
So what were some things that you learned that were interesting by going into industry? Yeah. So August 2011 to August 2012 was the one year that I spent
at Google. I was a part of their infrastructure team. And I think just seeing the discipline that exists inside the company in terms of how they write
code, how they peer review code, and just the style guides they adhere to.
These are things that are known openly.
I think that was eye-opening for me.
Also seeing the kinds of problems that they care about, the kinds of problems they work on was useful
and it benefited my research.
Though I didn't end up working on exactly those problems
after I returned,
it was kind of a change in the mind frame,
the way in which to approach problems.
There were problems that I looked at before I went to Google
and my reaction was, ah,
that looks like an interesting problem.
And then after I came back from Google, my reaction to the same problem was, no, that
has an easy existing solution.
That's not worth looking at.
But this other problem, that looks more interesting.
So it's more of a grounding, I think.
The grounding to reality, I think, was really, really critical.
So I think that mind frame shift was really the most valuable thing that I got from there.
Plus, of course, I was very fortunate to work in a team which had amazing people, and I got to
interact with a lot more people beyond that team. Those are still
colleagues and collaborators to this day on those
and similar projects.
So I think that's part of one of the side effects of being in academia that, you know,
you get to meet and work with so many different kinds and different sets of people that you
make friends over the years, more professional friends and personal friends over the years.
So I think that was, yeah.
So I would certainly say that my one year at Google
changed the way I look at problems,
that changed the way in which I pick problems to work on
because it really, really grounded me.
And I definitely recommend, you know,
something similar for all faculty members,
for all folks who are in academic research,
and to do it periodically,
because industry is so fast moving in the systems era
that you have to do something like this every seven, eight years
in order to stay grounded.
Are you okay with sharing what exactly you worked on or is
that covered under some sort of NDA? Yeah, those things are covered under NDA.
There is a patent we wrote, which I think people can find from my homepage. So it was related to
how you do, how Google's core scheduler, the Borg scheduler, runs in the data centers.
And Borg is pretty much the predecessor to Kubernetes.
Correct. Yeah. And Borg still runs today, right? So Borg still runs internally in the data centers
in Google. And it was developed very early in the days of Google, as the Borg paper published in 2015
says. I think there's a more recent paper on Borg as well. John Wilkes, who was
my mentor during my year at Google, has been very instrumental in
working with folks inside Google to write papers on not just Borg, but
on other things. This is the thing I was saying earlier, where academics need to work with
industry; sometimes the developers inside industry do that themselves. And John
Wilkes has been fantastic at doing that inside Google since he joined. So a couple
of papers have been written on Borg, one in 2015, and then I think there was one more last year, also in EuroSys, if I'm not wrong. I would say those papers don't talk about
the entire 100% of Borg's design, but they talk about its important parts.
So having seen Google's internal implementation of Borg, and having seen the papers as well,
which are the external face, I can see what's missing
and I can see what's new, what experiments are new.
And it all makes sense, right?
So there's still value in publishing those papers.
Yeah, it seems like you definitely need shepherds like John Wilkes to help push that along.
And I can also imagine how hard it would be to migrate from a cluster scheduler
to something like Kubernetes.
No matter how mature Kubernetes seems from the outside,
the amount of bug fixes that would have gone into Borg
over 20 years of its existence, I guess, at this point.
Yeah.
One of the adages that I found in many companies and in industry is that for every system,
there are two versions of the system.
The one version that was old and therefore has been deprecated, and the other version
that is new and doesn't work yet.
The migration isn't complete.
Everything has two versions.
And sometimes the new version becomes mature enough
that they can actually deprecate the old one
and really move to the new one.
But in other cases, they actually abandon the new one.
They're like, yeah, if it ain't broke, why fix it?
Maybe we take lessons from the new one
and put it in the old one
rather than the other way around.
So I've seen both of them happening in Google and in other companies as well.
Yeah.
I remember there was a paper on a scheduler called Omega, which they released.
I don't know if they actually ever deployed that.
They never revealed that information, so I don't know.
But given that they started working on Kubernetes, I'm guessing they didn't, or they might have
tried and rolled it back or something.
But I think that paper is the one that inspired Mesos, if I'm not mistaken. So you might not
fully deploy a system internally, but by writing a paper, you can inspire some work even externally.
Yeah. Omega was built around the same time as Mesos.
And I know that the Omega developers talked to the folks at Berkeley, Stoica's group at
Mesos, and I know there was a lot of interchange and exchange of ideas.
But you're right.
That's one of those.
The Omega paper and Omega work, as well as the Mesos paper, are papers that inspire new
generations of research.
I know that at least at some point,
places at Dropbox, we use Mezos.
So we've definitely been helped by that.
And I think in my experience,
it's been a pretty reliable system
for the kind of workloads that at least we use it for.
Yeah, and I want to ask maybe a couple more questions
and then you can call it a day.
But just super high level, maybe this is repeating something that we've spoken about previously,
but just to home in on the point: suppose I was a student today and I had to decide between
doing research or going into software engineering in industry.
Very frankly, I would say something like,
it seems like in AWS or GCP or Azure,
there's a lot of high-scale, petabyte-scale database work going on.
If you were a student right now and you had to pick,
what do you think you would do?
Would you first go for a PhD as well?
Would you try out industry for a few years,
then maybe do a master's?
How would you go about making that decision?
Yeah, so let me talk about the two angles of it.
One is the mental angle.
The other one is a more logistical angle.
So I hope that students are able to experience both sides,
both the industry side via internship while they're students,
as well as the academic side by perhaps
engaging in research projects with faculty members while they're an undergrad student.
I think it's important to experience both sides because
it's useful to know what you like and what you don't like. And, you know, there is an old Latvian proverb that says, the work will teach you how to do it.
And that's kind of true of a lot of things, but it's certainly true of research.
And sometimes while the work is teaching you how to do it, you grow to love it.
And that work might be the work you're doing in a company.
You're like, oh, yeah, this is clearly for me.
I clearly want to work in a startup.
Or while you're doing research, you might realize, oh, this creativity is great.
Or you might realize while doing research, this is too open-ended for me.
So much flexibility is not good for me.
I need more structure, right?
Either way, I think the only way to realize what you like and what you don't like is just
to experience it.
So I definitely encourage all students, all undergrad students out there to try research
and to do internships and make sure that when it comes time to making the decision, oh,
what is the next step after I graduate, that they're making an informed decision that is
informed by their experiences.
So, I think that's the first side of it, which is the more mental side.
You want to go to a place that you're going to be happy in.
You want to go to a place where you're going to be impactful in, but if you go to a place
that it's a hot area and you really hate the area, that's much worse than picking a niche
area that doesn't have much impact, but you really love the area.
You're like, oh, that's so great. So that brings me to the second part, which is the logistical part.
Nowadays, you can go to industry, spend a few years and then apply back to grad school.
That's certainly possible. You can also work on online master's programs while you're in
industry. This is also a very common thing. So the open course that I teach on Coursera on cloud computing concepts has thousands
of students in it registered at any point of time.
And something like three-fourths of the students are full-time employees who have full-time
jobs, have their own families, and they're doing the coursework at weekends and weeknights.
So that's, I think, something that is possible nowadays. So you can actually get a degree while
you're working as well, if that's something that you want to continue doing. In academia,
we do see a value in students having spent years in industry before they apply back to grad school.
I've had both kinds of students: PhD students in my group who have come directly after their bachelor's, and also students who have spent a few years in industry, and I can see advantages of both.
On one hand, you have some maturity.
On the other hand, you may have more energy, you may have ideas, but you could have energy in both, even if
you come back from industry.
So I think the important thing when graduating is to figure out what is it that you like,
but it's also important to plan ahead a little bit.
Especially if you're going to go to industry, it's very easy to lose track of time.
You join a company and then 10 years later you realize, oh, I've been in the same job
for so long.
What's happened to my career again?
So one thing I definitely encourage students to do is just set a calendar reminder for
yourself same time every year and say, this is one day that I'm going to take maybe like
July 3rd or something like that, to sit down and think about
what my career trajectory is and whether I can improve it in other ways, make these paradigm shifts in it that would make my life happier, career-wise and maybe personally as well.
I think it's important to revisit one's career goals every once in a while, at least definitely at least once a year.
Otherwise, it's very easy to get lost in the next deliverable and the next deadline and the next peer evaluation.
Yeah, you get lost in those things.
Yeah, your answer to this question has made me think of maybe like a dozen more questions around, you know, academia, like interviews and how I should think about getting in.
But I think this is like a good stopping point.
And we should probably have a round two at some point just to dive into those areas.
I think that's also very interesting.
Like the whole PhD comics thing.
How true is that?
Very true.
Yeah.
But anyway, thanks so much for being a guest
on the show
I definitely
enjoyed the talk
and I hope you
have fun
yeah
thank you
bye
bye