Software at Scale - Software at Scale 6 - Distributed Systems with Indranil Gupta
Episode Date: January 16, 2021

Indranil Gupta (Indy) is a Professor of Computer Science at the University of Illinois, Urbana-Champaign. He leads the DPRG (Distributed Protocols Research Group) and runs popular Cloud Computing MOOCs on Coursera. His work has inspired software that runs in many production services, like Serf in Nomad by HashiCorp, and Uber's Ringpop.

Apple Podcasts | Spotify | Google Podcasts

We discussed how academia drives progress in distributed systems, a bit of blockchains, distributed systems and machine learning, doing quality research, working as a visiting researcher in industry, cluster schedulers, and how to choose between going into industry vs. academia.

Highlights

0:00 - Going into academia. "Make a list of pros and cons, throw away that list, and go with your gut"
6:30 - The acceptance of distributed systems in the early 2000s vs. today
7:30 - The emergence of blockchain and how the world treats it. "Re-looking at the wheel vs. re-inventing the wheel"
12:30 - Differences in computer science research in industry and academia, especially with distributed systems
21:00 - The inspiration for solving the reconfiguration problem came from open source bug reports. "Directions that seem daunting for folks in industry since the solutions are unclear can be tackled by academia." Making progress towards solving the online reconfiguration problem. The similarity of research to intern projects
25:30 - How to pick an area to do research in
31:00 - Writing a good paper is "90% good communication and 90% good ideas"
31:20 - SWIM - a paper that's had an outsized industry impact due to good ideas that were well explained
38:00 - "What are you excited about in distributed systems today?" - Distributed Systems + X: machine learning, agriculture, etc. For example, dealing with malicious workers in distributed machine learning
47:00 - Training and inference with new machine learning systems like GPT-3 becomes a distributed systems problem. Partitioning of data in TensorFlow
56:00 - How can industry and academia collaborate so that industry produces more research?
66:00 - What Indy learned by working in industry for a year (at Google). Borg at Google (a predecessor to Kubernetes). Omega + Mesos
73:00 - Advice for a student evaluating a choice between academia and industry

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Thanks folks for joining the Software at Scale podcast.
I have Professor Indranil Gupta, also known as Indy, on the show today.
And Indy is a professor of computer science at the University
of Illinois at Urbana-Champaign. And he leads the DPRG group, which is the Distributed Protocols
Research Group. And the BOCCE Cloud Center. What does BOCCE stand for?
The BO stands for blue and orange. And that was a cloud center that I started several years ago.
It's a bit defunct right now.
So there's more efforts right now, including a new center called Just Infrastructures that I'm working on right now.
Okay.
And blue and orange is the color of the University of Illinois.
And that's where I studied.
And I had a class with Professor Indy four years ago. And that's where my interest in distributed systems definitely increased, especially through the projects that we had.
So thanks for being a guest on the show today.
Glad to be here.
Yeah.
And so you do a bunch of different things.
You teach, but you also are, of course, a researcher.
And you also run an online MOOC on cloud
computing and distributed systems. And just in general, you've been in academia
for a long time, right? You did your bachelor's and then went straight for a PhD,
and that's where you worked on distributed systems. How did you make that decision of
going into academia versus going into industry, especially in a field like distributed systems, where it seems like a lot of companies, at least initially, like Google, Facebook, not Facebook at the time, but definitely Google were working on large systems.
How did you make that tradeoff?
How do you think about that?
So there are two parts to my answer. First off, when I was an undergrad
in IIT Madras in India, I decided to take up research in my junior year, just on a lark,
as something that I just wanted to try. And pretty soon, you know, I fell in love with just the creative aspect of research,
being able to come up with ideas and realize them in practice,
see them working in front of my eyes.
And also just the slightly competitive nature that is inherent in research
where we try to do better than what has already been done before.
So that's when I got into research and that made it fairly, I would say, straightforward
for me to apply for a PhD. When I applied to grad school, I had offers for
direct PhD programs and then these MS slash PhD programs.
And for me, the decision was obvious.
The direct PhD was what I wanted to go for.
And so that's the first part of my answer to your question.
The area of distributed systems is what I worked on from the very beginning.
And what excited me about distributed systems as a whole is the fact that you can have the same or similar code working
at different processes, different machines, different nodes in your system. And when you
have so many different copies of the same machine, what we sometimes call a replicated state
machine, right? You replicate the same state machine at multiple nodes, multiple processes.
The behavior of the overall system would be something unexpected. It would be what we call emergent behavior.
That was kind of exciting to me.
Just the feel of that was exciting to me when I was a very young researcher.
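The replicated state machine idea Indy describes can be sketched in a few lines of Python. This is a toy illustration, with an invented class and invented commands rather than any real system's code: the same deterministic state machine, fed the same ordered log of commands, ends up in the same state at every replica.

```python
# Toy replicated state machine: replicas applying the same deterministic
# commands in the same order converge to identical state.

class Replica:
    def __init__(self):
        self.state = {}  # the state machine's state: a simple key/value map

    def apply(self, command):
        # Commands must be deterministic for replication to work.
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "incr":
            self.state[key] = self.state.get(key, 0) + value

# The same ordered log of commands, applied at three replicas...
log = [("set", "x", 1), ("incr", "x", 4), ("set", "y", 7)]
replicas = [Replica() for _ in range(3)]
for r in replicas:
    for cmd in log:
        r.apply(cmd)

# ...yields identical state everywhere.
assert all(r.state == {"x": 5, "y": 7} for r in replicas)
```

The "emergent behavior" Indy mentions is what you see once real protocols sit on top of this: no single replica decides the overall system behavior, yet the cluster as a whole behaves in ways none of the identical parts do alone.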
Then fast forward to when I was finishing my PhD and applying to jobs, my mind was not made
up on whether I would go to industry or a research lab or become a professor.
And I applied to a bunch of companies and a bunch of universities as well.
It was during the process of interviewing that my mind kind of got made up in terms
of where I wanted to end up at. And that's why I stayed in research and stayed in academia.
So in the interview process, so what was it like, concretely, from these interviews that
you learned the kind of work that companies were doing at the time?
So yeah, so I think, for me, the trade-off was, and I've seen this many times as I've worked in companies,
like you mentioned, at Google as well, where I was on sabbatical a few years ago.
I've seen the same trade-offs in companies.
So it's not an apples and oranges comparison, really.
There are a lot of positives that working in a company provides to you
and some downsides. And then similarly in academia, there are a lot of positives and
then there are some downsides. So it's just a question of evaluating personally for whoever
is making the decision, which ones will... You write down the pros and cons. This is
advice I give to all the students who are
considering multiple job offers. Write down your list of pros and cons for your options. Think
about them very deeply and then throw away your list and go with where your heart tells you or
your gut tells you to go to because you want to be happy no matter where you go. That's kind of
what I did. And for me, you know, the biggest thing was that in academia, essentially, you do not have a boss
and you're free to roam the fields wherever your heart takes you.
And I just couldn't get away from that, just that freedom, the academic freedom of going
in whatever direction you want to go to.
Some of that freedom does exist in companies, especially in big companies and a few smaller companies where you're higher
up in the hierarchy, but largely you don't have the freedom that you do in academia anywhere
else.
Yeah.
So a lot of people go for entrepreneurship and you decided to go for professorship in
that process.
Yeah.
I mean, back in the early 2000s, and I'm old enough now to say that phrase back in the
old days.
Back in the early 2000s, distributed systems was still a very niche field.
Today, distributed systems is all around us.
But back then, when we would publish papers on topics
like consensus and we'd go to conferences, researchers who did core systems, operating
systems research, they would look at us and they would say, oh, there go those guys who
do this consensus or this strange kind of group communication research.
Fast forward 17 years or two decades, and distributed systems are all around
us.
Blockchain?
Well, that's consensus.
IoT?
Well, you're dealing with distributed systems.
You're talking about edge or the cloud?
Well, distributed systems.
So you cannot get away from distributed systems now.
It's really become a lot more mainstream.
So entrepreneurship back in those days in areas around distributed systems was much
harder than it is nowadays.
Yeah.
As somebody who's been in the field for a while, when you first saw the emergence of
a blockchain and a Bitcoin, what was your thought process?
Because I'm sure that would have been such a strange application of everything that you've
learned and studied for a really long time.
Yeah. I was not surprised by the emergence of it, personally, because maybe it's because of
how excited I am by the distributed systems area. I know that these ideas will eventually make it
into practice. I was not particularly surprised by the implementation of it either. The way blockchains
implement consensus is slightly different from the ways in which traditional consensus algorithms,
like Lamport or others who have worked on it for many decades, how they're implemented. It's just
a different implementation. What did surprise me was the surprise with which industry and the world in general would treat the area of blockchain.
It's like, oh, you know, consensus, this new thing, new kid on the block.
And a lot of us researchers who have worked on it for years are like, yeah, it's not really a new kid.
It's an old kid, but it's a pretty mature kid.
And I see that a lot, not just on top of consensus, but this kind of, I don't want
to say reinventing the wheel, but just re-looking at the wheel, which happens a lot in industry,
which I think is a very good thing.
You know, reinventing the wheel is a bad thing, but re-looking at the wheel is a good thing
because new generations and new workloads need new systems and old systems don't always
work.
Okay. So what are some examples of that? So I'm thinking with Bitcoin,
it's a different implementation of a consensus algorithm where the problem space is different,
where you treat every single actor as potentially malicious, versus with traditional systems, where most
of the actors are fine, but sometimes a node behaves badly because its system
clock is off or something like that, and you want to handle that.
What are some other things that you think the industry is re-looking at?
Yeah.
So a great example is databases and key value stores, right?
So relational databases have existed for a long, long time, and they've
provided very strong properties.
At the same time, in the late 2000s or so, folks in industry started realizing that some
of these databases are much slower and have way more features than they actually need.
And that's partly what led to the development of key value stores and NoSQL storage systems, because
they were faster, nimbler, and easier to change.
Of course, what's happened over time is that people have re-looked at even key value stores,
and they've looked at, oh, do we really need such weak guarantees?
Don't we really need transactional properties?
And so they've actually augmented key value stores
and brought them back into the transactional world.
And today, a lot of the databases that run,
even these new generation databases that run
actually support transactions
at high throughput, low latency.
So the re-looking at relational databases in the beginning
and then re-looking at key
value stores again actually led us to come full circle, but we didn't come back to where
we were.
We came back to a better place than where we were.
So, I think that's one of the advantages of not just academia, but even industry, re-looking
at things.
Yeah.
I think I can share a story from Dropbox, which is basically exactly this,
right?
We started with a MySQL database.
The database was pretty much running out of space, and we needed to figure out some way
of sharding it.
And on top of that, from what I've heard, every time a product engineer wanted
to add a new feature, they would have to run a really expensive migration
on this database with all of user data.
So somebody came up with this brilliant idea.
Why don't we come up with a client-side library
that just stores JSON blobs in the MySQL database?
So that way, every single migration
is just adding another field to this JSON blob.
And that's how we came up with our graph database,
EdgeStore, which is kind of inspired by Facebook's design. And then we could easily put that onto more
and more databases, and we didn't have to think about any one of them. But of course, eventually,
we needed transactions, and we needed cross-shard transactions, and then we rebuilt that.
And now we're thinking, can we come up with something that's from the ground up? Or can we just use something like Cloud Spanner or something like that?
Because there's so many options out there.
So, yeah, that's definitely what we've seen.
And I think it's definitely a good thing that industry is continuously re-looking at how to make developers more productive and how to have faster databases and all that.
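The schema-less pattern in this story can be sketched roughly as follows, using Python's sqlite3 as a stand-in for MySQL. The table layout and helper names here are invented for illustration; this is not Dropbox's actual EdgeStore schema. The point is that adding a new product field becomes a client-side change instead of an expensive ALTER TABLE migration over all user data.

```python
# Sketch of the "JSON blobs in a relational database" pattern:
# each row stores an opaque JSON document, so new fields need no
# schema migration. sqlite3 stands in for MySQL here.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, data TEXT)")

def put(obj_id, fields):
    conn.execute(
        "INSERT OR REPLACE INTO objects VALUES (?, ?)",
        (obj_id, json.dumps(fields)),
    )

def get(obj_id):
    row = conn.execute(
        "SELECT data FROM objects WHERE id = ?", (obj_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None

put(1, {"name": "report.pdf"})
# "Migration": a new feature just writes an extra key into the blob.
doc = get(1)
doc["shared"] = True
put(1, doc)

assert get(1) == {"name": "report.pdf", "shared": True}
```

The trade-off, as the conversation notes, is that you give up the database's own transactional and relational guarantees over those fields, which is exactly what pushes systems like this back toward supporting transactions later.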
How would you give advice to somebody
who's evaluating whether they want to go into
academia or industry?
Because both are doing meaningful work.
And if you browse subreddits today,
they'll say, oh, you don't need a PhD in CS.
You can get a job without that.
And I think
that's true, but what are we missing? What are the software engineers missing? What's your perspective?
Yeah. So I think folks who say that you don't need a PhD to do research are absolutely right.
Anyone can do research, really. So I see the difference, let me take a step back.
So I see the difference between what industry does
in terms of research and what academia does
in terms of research as being similar activities,
but with different goals.
In industry, one does research
so that one can improve the company's systems.
One can get better throughput, better latency,
all that stuff. So it's tied to whatever the end goals of that particular company are, whatever
benefits the shareholders of the company if it's a public company, or whatever will get them
towards more funding and an IPO if it's a private company or a startup in stealth mode.
What I described is kind of similar to what we do in academia too.
You know, we build systems
so that we are better than previous systems
with a key difference that we are doing it
for the purpose of science,
which means that the end goal of our effort in academia
is not just to build the system and then that's it,
but rather it's to go a little bit underneath
and understand the intellectual problems underneath this, the generic problems
underneath this, and the intellectual
contributions from this activity, from this project that we have, which may be useful
in other scenarios.
And this means that evaluation, experimental evaluation, both mathematical as well as real workload
evaluation on real clusters is really, really critical in academic research.
Because we want to evaluate, we want to push our systems to the extreme and see how far
they can go in terms of solving the problem that we want to solve, or perhaps even addressing
other things.
So, I think that's a very key difference.
Some industry groups, and especially research labs, do do science. But when the upper administration
in a company is looking at a research lab, they don't necessarily see the science as the first
class product from a research group. They see monetary value.
They also see visibility in the community,
which of course eventually translates to monetary value.
As scientists in academia, however,
we are funded largely by government funds
at least here in the US.
And so we are essentially producing results for, not just for humanity, but also for the
US taxpayer, which translates to ideas that perhaps can be used by other companies as
well.
So I think that's a key difference here.
So the way that translates is in a company, in a big company like Google, for instance, or Apple or Facebook,
when you write code, the researchers or the developers try to adhere to style guides.
There's a lot of peer reviewing.
In academia, less so.
Style guides are rarely heard of.
Peer reviewing is also fairly rare, though it's picking up nowadays.
My students sometimes ask me about this.
Why don't we do peer reviewing?
Why don't we have a style guide?
And the reason is,
one of the big reasons is
the more rules you impose on writing code,
the more it hinders creativity a little bit.
You want to build a quick prototype system
and then quickly pivot in an agile and nimble manner
without having to turn a big wheel.
You want to turn a small wheel in order to pivot a little bit.
So that's part of the scientific exploration of ideas
and the space of ideas,
which makes academia a lot more exciting.
It's a lot more solution-driven than industry.
So that's one of the big, one of the differences,
one of the key differences I see.
And there are, of course, others as well.
Yeah, it's a very libertarian mindset towards developing new solutions.
Yeah, but distributed systems is also an interesting space
where a lot of the large big data problems
are inside corporations.
So I've seen a lot of the research that your group does, you embed inside another company
like LinkedIn, and then you publish a paper together.
And then you kind of get best of both worlds.
Yeah.
Yeah.
I think that's right, that compared to 20 years ago, in the field of distributed systems, at least a lot more research today and a lot more publishable research happens in companies.
Not just big companies, but also, you know, smaller companies. If you look at any of these systems conferences, there are a lot of publications by relatively smaller companies. And the publications are on their industry-scale systems, but they're also well
evaluated and they adhere to all the requirements that we have of a research paper.
And part of what's driving that is just the scale of the data and the workloads at these companies, and of course the diversity of the workloads as well.
That's partly it.
And overall, I think that's a healthy thing
for both industry and academia to be involved
in similar kinds of research
and overlapping sub areas of research.
It wasn't always this way.
Earlier, we'd have companies that would do their own thing
and then academia would do its own thing.
And companies, I'm talking very generically here,
companies would look at academic papers and say,
maybe that's their own thing.
Maybe we can't really use it.
But I think that gap has narrowed a little bit.
So industry and academia are a lot closer to each other nowadays in terms of working on similar kinds of problems, similar types of problems.
And so that's led to a flow of information and ideas both ways, from academia into industry, and also from industry into academia.
Having said that, there are still academics who choose to stay largely away from industry and kind of do their own thing in their own ivory tower. And there is some value to some of that,
right? Because scientific research does require some blue sky thinking at some level.
And then, of course, there are companies
and groups within companies that do their own thing that don't read a single paper during a
given year. And that's fine too, you know, there is, there is reason to do that as well.
But I would certainly say the gap has narrowed, and in a very healthy way.
And, you know, I think the challenge for academic researchers
is to see how we can partner with the industry
in order to have our ideas be realized
more easily in practice.
And also so that we are working on relevant problems.
I like to say that it's very easy to work in academia
on a problem pulled out of thin air.
That's one of the easiest things to do, right. But it's not a very productive thing to do. You want to work on real problems
that affect the real world, but not necessarily design solutions that are usable right away.
In the sense that you want to be a little bit forward-looking and look at the deep scientific
value, come up with ideas that can be used right away,
but that's not the end goal.
You look at also the science
and perhaps extensions of these ideas
that are applicable in other scenarios.
So something like you can pull out problems
that existing companies might be facing,
but your solutions shouldn't be constrained
by the constraints of the existing corporation
that's dealing with that problem.
Yeah, so I'll give you an example.
So in around 2011 or 2012, I started looking,
that's when key value stores
and stream processing systems were really coming up,
all these open source systems like Cassandra, Riak, and a bunch of others.
I started looking at their JIRA pages to see what kinds of issues they were facing.
That's one of my favorite activities, by the way.
I just love looking at pages that list bug reports, however minor, however major, however resolved or not, or will not resolve, doesn't really matter.
I just love looking at them. And while looking at them, several things jumped out at me. And you wouldn't get this unless you were just browsing through these pages with no aim in mind
other than just understanding them. So what jumped out at me was that a lot of
bug reports that went unaddressed related to reconfiguration, which means that you had a
bunch of data on a set of servers, and now
you wanted to change something about the way the data was structured, perhaps change the
primary key, or perhaps you wanted to change the set of servers on which the data was stored.
And essentially the way in which most companies were doing it back then
was, they would shut down the data center or their servers, they would migrate the data,
and then they would spin it back up.
And this is terrible for downtime, right?
And so that pattern kind of emerged
out of all the reports that I saw.
And so we said, why don't we try to solve that problem,
the reconfiguration problem,
where you migrate data online,
but still keep the service active.
And we try to assure as low latency
as if there were no migration going on.
And so that's what led us to several systems, not just in the key value store space, but also in the stream processing space and a bunch of other subdomains as well.
Even though the JIRA reports I was looking at were only for key value stores, we were able to think about that problem in other domains as well. So that's, you know,
that's an example of something that a company may not be interested in,
but that academia could contribute to.
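One simple way to picture the online reconfiguration problem Indy describes is a dual-write migration, where the service stays up while data moves from an old shard to a new one. The sketch below is a deliberately simplified illustration of the general idea, not the protocol from the research discussed here.

```python
# Toy dual-write migration: while data moves from old_shard to
# new_shard, writes go to both and reads fall back to (and lazily
# migrate from) the old shard, so the service never goes down.

old_shard, new_shard = {}, {}
migrating = True  # reconfiguration in progress

def write(key, value):
    if migrating:
        old_shard[key] = value  # keep the old copy consistent too
    new_shard[key] = value

def read(key):
    if key in new_shard:
        return new_shard[key]
    if migrating and key in old_shard:
        new_shard[key] = old_shard[key]  # lazily migrate on read
        return new_shard[key]
    return None

old_shard["a"] = 1     # pre-existing data on the old shard
write("b", 2)          # a write during migration lands in both shards
assert read("a") == 1  # old data still readable (and now migrated)
assert read("b") == 2
```

The hard parts that make this a research problem, concurrent writers, failures mid-migration, and keeping latency as low as if no migration were happening, are exactly what this toy version leaves out.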
And so that's super fascinating because you didn't,
it's not like somebody told you about this problem.
You were just looking at public bug reports of open source components.
And you said, oh, it seems like people in the industry are dealing with this problem. And you
just went ahead and solved it. And maybe they can take some of those ideas and apply them.
So did they finally get applied to any of those systems?
Yeah. So when we thought of the reconfiguration problem and we were thinking
about it, and this was the early days of what would essentially
be a long, four or five year NSF project.
We weren't even sure that this was something
that companies thought about, right?
But then when we published the paper,
actually, even in the reviews to the paper,
which are anonymous reviews,
even the reviews, folks said,
oh, this is a problem we face,
but it's a very hard problem to solve.
So I'm glad someone is making a scratch at it.
And that's kind of, you know,
what underscores why we do research in academia, right?
Because directions which seem daunting to industry
and so they don't even want to scratch it
because they don't know how far they can go.
We can make big dents in directions like that in academia.
And then, you know, once we make dents, then others can build
on our ideas. And by others, I also include folks in industry. And so part of the work that we did
back then was also in collaboration with companies. And so some of that made it into
production systems in companies like Yahoo, for instance.
Yeah, because that's the kind of problem that companies wouldn't be incentivized to solve,
at least from being inside a company. It's just such a big problem. And you don't know what a timeline for that would look like. It's not solving an immediate business need, but it is a really
nice service. It's the kind of problem I'd like to think of as an intern project, but in a much bigger sense, where it's really
nice to have if they solve it, but if they don't solve it, it's not the end of the world.
Yeah. Yeah. And you know, internship projects, what seem like small hacks, are great starting
points for research projects. You know, you think, oh, this is just a small
hack. It fixes my issue. But an interesting exercise to do there, something that I recommend
everyone to do when they're trying, quote unquote, a small hack is to think,
is there something wiser underneath this hack that I'm doing? Same thing goes with internships.
So a lot of internships are like, oh, write this piece of code and it won't take you three months, and then your internship is done.
But if you think a little bit deeper underneath whatever you're doing,
sometimes you uncover gems.
It just needs that little bit of think time.
And this goes back to why someone likes a particular area
in terms of doing research.
So usually what I tell people is, and this is something that one of my professors told
me when I asked him, you know, I was just in grad school and asked him, how should I
pick an area to do research?
And the advice he gave me, which I'll narrate here is pick an area that you like thinking
about in the back of your mind, when you're walking to work or walking from work or driving to work, driving back from work,
and that doesn't feel like work. Thinking about it doesn't feel like work.
That's a very important criterion to have for research, right? Because it's in those idle
moments that interesting ideas come and interesting thoughts come.
And that's what keeps you engaged.
As one of my students likes to say, if you love your work, if you love your job, you never have to work for the rest of your life.
That's kind of what summarizes life in academia.
You know, a lot of us are very passionate about what we do.
And so sometimes I think, oh, you know, this university is paying me salary to do stuff I love doing.
There's some scam here, right?
This can't be true, right?
And I've had this feeling since the day I was a grad student.
You know, I'm getting paid to do research.
How is this possible?
But there is a value in doing research, and there's a scientific value in doing
research, which in the US certainly,
and in many other countries too, is still appreciated because it contributes
directly to the economy and contributes to companies.
And so there is space there.
So something that we hear from outside academia
is that it could also be stressful in terms of,
you know, there's like,
you have to think about how you get funding
for your particular research area and all that.
Is that all like overblown or is that like a real concern?
Like how would you encourage someone to think about that if that's what they're worried about?
Somebody who's thinking, I really want to go into academia, but I'm worried about not
getting professorship or fighting about funding all the time.
Yeah.
Yeah.
The fear of funding or fear of writing grant proposals is one fear.
The other fear associated with becoming a professor is just the fear of not getting
tenure.
My philosophy to both has been to ignore them completely, to just do whatever I feel I like to do.
And I think that's easier to do nowadays in general. Look, if you go into academia, let's say, as a professor, and you work on things that you're
passionate about, the chances are good that things will work out.
Tenure rates at top places are very high.
Funding rates for people who are productive in terms of coming up with good ideas are generally very
good.
It's still competitive, but I think if you're passionate about what you do and stick with
what you are passionate about, I think things will work out.
There's no guarantee, of course. So yeah, that's my philosophy. I think
funding is competitive. Acceptance rates for proposals in the National Science Foundation are,
last I checked, they were around 10% or even lower than that. And that's significantly
more competitive than inside companies.
Like if you're running a team inside a company,
getting funding for whatever you want to do
is something that gets worked out
via discussions and with your managers
and things like that.
So it's not really a competitive process.
It's very rare for two teams
to be up against each other
when they're trying to do the same thing.
It's kind of a different game in academia.
But one of the things that comes out in academia
is no ideas are bad ideas.
So dealing with rejection is something that we do all the time,
build a thick skin.
And some of the best papers that have been written
have been rejected multiple times
before they were accepted.
And this is not just true of computer science,
it's also true of physics.
A lot of Einstein's papers were rejected, for instance,
before they eventually became really famous.
And rejection is actually a blessing in disguise. It helps you
rethink this idea of re-looking at things. Again, it makes you rethink how you were communicating
your idea. The idea itself and your execution of it may be fantastic, but communication is
also part of writing the paper, writing proposals, or writing whatever.
So unfortunately, human beings have a very low baud rate
in communicating with each other.
We have to write words and things like that.
And we have to write it a specific way
so that it's communicated and it's understood by others.
So that's part of our role as well, communicating.
So sometimes students ask me, oh, Indy, you're talking about communication and also coming
up with new ideas.
What percentage would you put each as being important in a paper?
How much importance does communication have in a paper? Is it 50%, and the
remaining 50% is the idea itself, or is it more like 60-40? And I like to say, well, a good paper
is 90% good communication and 90% good ideas.
Yeah. And I think an excellent example of that is the SWIM paper that you worked on as a graduate student, I would say 15 years ago now.
And that paper has become really, like I shouldn't say famous, but we're probably using some kind of application that's using some of the ideas in the SWIM paper.
So do you mind just sharing a little bit about what that paper is and why I know about it?
Yeah.
So the story of Swim is actually kind of interesting too.
So when I was in grad school, I decided to do an internship at IBM Research, T.J.
Watson in New York City.
And the internship was kind of unstructured.
And I was very fortunate in being able to talk with not just my manager,
but also my manager's manager.
I would just walk down to his office and talk with him.
And during the internship, I was doing something in the foreground,
but in the background, I was thinking,
what can I do to turn this internship into something
that could be included in my thesis?
So I ended up writing this kind of highly theoretical paper.
I was doing a bit more theory than systems back then.
I ended up writing this paper that was a little bit theoretical on failure detection and a
good, a very close to optimal way of doing failure detection.
Just to interrupt, failure detection is when you're trying to figure out whether
a machine is up or not, or some kind of node
is alive or not in a distributed system, right? Correct. Yes. So you have a collection of servers
and the servers may fail or crash at any point of time. And you want to detect the crash of any
server, which can happen at any point of time fairly quickly, but also very accurately, meaning
that you don't want to make mistakes. You don't want to mark servers as failed when they actually
haven't failed.
I know this failure detection is a very important
sub problem for pretty much all distributed systems to solve.
Everyone needs to solve it.
If you're running a database, for instance,
if you don't do failure detection, you might lose data
or you might lose transactions, you might lose operations.
And so failure detection is a core problem
that every single distributed system needs to solve.
After writing this theoretical paper, I just put it away and it got published in the top
distributed algorithms conference.
A year later, I was back in grad school at Cornell, and my office mate,
who was a first-year PhD student
(I was in my second year at that point, maybe third year, but anyway, he was a junior PhD
student), came to me and said, we have to do this course project and we're just looking for
an idea. And I went, you know, there's this theory paper I wrote, why don't you just read it and see
if it looks interesting. So he and his project partner read it, and they implemented a first version
of it.
And I was pretty impressed by the fact that they actually first picked that paper to implement
and by the fact that they did it so quickly.
And then I sat down with them a little bit and went over some of the designs.
And while we were discussing, some new things jumped up in terms of the systems design portion
of it.
And eventually, that became the SWIM paper.
And it holds a very special place in my heart,
because it was the first paper I wrote in grad school
that did not have any faculty members as co-authors on it.
It was just me and two other students.
You weren't a faculty member at that time.
No, I was a PhD student back then, right?
So it was me and another PhD student, junior to me,
and then another master student. So it was very special, very interesting that, you know, something like that could be
pulled off with zero faculty involvement, in a sense. It also gave me my first taste of what
it felt like to work with other students. So that was the SWIM paper. And then, you know,
we published the paper. We never open sourced the code,
because you know how student code is, right? Going back to the discussion of why we don't
adhere to style guides and things like that. So we forgot about it.
And then many years later, I got to know... This was when I had become a faculty member. I got to
know that a bunch of companies had adopted it, that Uber used a version of Swim
in their internal system, that HashiCorp had implemented Swim in their open source systems,
and those were widely used by other companies.
I spoke with some of the developers and I asked them, why did you pick SWIM? There's a bunch of other failure detection algorithms.
They never reached out to you?
No, no.
They never reached out to me, but, you know,
one of my former students ended up working at their company and then,
you know, that's how we made contact.
And so, you know,
I ended up meeting with the developers and asked them, you know, why did you pick this paper at all?
Was it because it's the most scalable, or was it because it was the
fastest, or perhaps the most fault tolerant?
And I was very surprised when they told me, no, it was because your paper was easier to read.
And that's it, right? So it goes back to communication, that communicating ideas in a
clean kind of intuitive, understandable way is really, really, really important as well.
And this is true, not just in academia, right? I mean, it's also true inside a company.
You might do something spectacular,
but the presentation of it
and how it is seen and is visible
to others inside the company counts for a lot.
Yeah.
And for those listeners,
Nomad by HashiCorp uses SWIM
to this day, I think. I just checked on it recently.
So a lot of systems are implemented using that.
Yeah, I think HashiCorp's Serf and Consul
also use versions of SWIM, and Uber used it in their RingPop system.
I think Uber's system is RingPop.
Yeah, yeah.
I forget what HashiCorp's system is called.
Yeah.
So yeah, I think that speaks to, as you said, right, the
importance of communication. And ultimately,
the work in academia and industry,
especially in something like this, is not
remarkably different, right?
Presenting
your work well, as well
as coming up with new productive ideas,
is what gets you ahead.
Yeah, I think that's
fascinating. And
in terms of failure detectors,
or maybe even just like
distributed systems in general,
what do you think has evolved?
So that is, I think, a 2003 paper,
and then it got implemented in 2010.
So what are some things that are new
that's like a 2020 research paper
that you're excited about?
Yeah. So I think there are a few things. So I think there's a lot of very interesting problems in the area of distributed machine learning. And in general, the area of distributed systems plus X,
where X takes on values such as machine learning or Internet of Things, IoT, or X
is human-computer interaction.
How do humans interact, users really interact with distributed systems?
Distributed systems plus agriculture gives you this entire area of smart agriculture
or digital agriculture.
So the way I look at distributed systems is that it's a fairly mature field now.
And companies have largely figured out how to implement all the basic building blocks
for distributed systems, be it failure detection, be it consensus, be it multicast, or be it
any of these, leader election, mutual exclusion, all these classic problems. So there are some niche sub-areas where researchers could still contribute ideas and have it be
relevant to industry.
Blockchain is one of those areas.
And there is still some activity going on in the consensus area.
But by and large, if I come up with a new leader election algorithm for classical wired
data centers, companies are not going to be interested in that. Google's going to
be like, yeah, we have Chubby or some successor to Chubby and that works. We don't care.
A lot of the open source folks are going to be like, ah, Apache ZooKeeper works just fine.
Why do we care about your leader election or your mutual exclusion algorithm?
So I think a lot of the interesting action nowadays is in the Distributed Systems plus X space,
where when you relook or rethink
about Distributed Systems problems
in a completely new light, under a new set of assumptions,
then the old solutions no longer work as well,
and you have to rethink new solutions.
So I'll give you an example. We recently started working on what happens when you have a set of
workers and they're doing distributed machine learning, but some of the workers are malicious.
They're behaving in what we call a Byzantine mode, and that's the classical distributed systems term. And it turns out
that in classical distributed systems work to reach consensus, consensus algorithms can
provably tolerate only up to one third of your workers being malicious. If you have
a third or more being malicious, then all bets are off. Things can go really bad.
But in distributed machine learning, it turns out that you can actually tolerate
as many as 50% of your workers being bad.
And in fact, in practice,
you can tolerate a lot more bad workers than that.
And so when we proved this formally,
it was very surprising to us.
Like, why is this happening?
But when you think about it intuitively,
it makes sense because machine learning
is naturally very noise tolerant.
There's already a lot of noise built in in data and the machine learning that you see,
whether it's stochastic gradient descent or any other machine learning, is already naturally
noise tolerant to some extent of noise that is inherent in the data.
And so if you have techniques that essentially treat the malicious nodes' sent data or gradients
as part of this noise, then you're getting some
of the same behavior. That noise tolerance is also inherited as Byzantine tolerance.
And that's what gives you a little bit more Byzantine tolerance in distributed machine
learning than you would get in distributed consensus. So that's one of the scientific
insights, which you wouldn't get if you were doing this
like in a company or something.
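One way to see how this can work is a coordinate-wise median over the workers' gradients; this is a hypothetical aggregation rule for illustration, not necessarily the rule from the paper. For contrast, classical Byzantine consensus needs n >= 3f + 1, so four nodes tolerate just one traitor, while a median is unmoved as long as honest workers form a strict majority:

```python
import statistics

def robust_aggregate(gradients):
    """Coordinate-wise median of the workers' gradient vectors.

    The mean can be dragged arbitrarily far by one malicious worker,
    but the median ignores extreme values as long as fewer than half
    of the workers are malicious.
    """
    dims = len(gradients[0])
    return [statistics.median(g[i] for g in gradients)
            for i in range(dims)]


# Three honest workers agree on the true gradient; two malicious
# workers (40% of the total, above the classical one-third bound)
# send enormous values to try to derail training.
honest = [[1.0, -2.0], [1.0, -2.0], [1.0, -2.0]]
malicious = [[1e9, 1e9], [-1e9, 1e9]]
step = robust_aggregate(honest + malicious)  # stays at [1.0, -2.0]
```

The server then applies `step` as if it were an ordinary averaged gradient; the malicious contributions are absorbed as just more noise.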
And it also opens new doors, right?
Now you can think about Byzantine tolerant machine learning
with a lot more confidence.
There are many other kinds of security attacks.
Byzantine is just one.
Poisoning attacks and a bunch of other attacks
are a lot more insidious.
But it opens the doors to thinking about them and aiming for higher, not just aiming for what the classical distributed systems results
told you. So the same problem that had well-known results in classical distributed systems
has completely different bounds, has completely different solutions in
distributed machine learning systems. So I would say distributed machine learning is one area where
there's a lot of action.
I think serverless versions of many classical systems are another area. Serverless or actor-based versions. In general, I think when we build systems, we don't necessarily pay as much
attention to what the users experience. So thinking about user metrics and user experiences, but lower down in the system stack, I think
that's a very important part.
That's an area where I think industry has a much harder time making progress.
That's where I think academia can make a lot of inroads.
Distributed machine learning, distributed IoT, other distributed systems areas are areas where industry will
have as much stake as academia.
But areas like distributed systems plus HCI, it's something that academia may be able to
contribute a little more.
It's fascinating that distributed machine learning is much more resilient to Byzantine
nodes compared to classical algorithms.
I didn't fully
grasp that, so could you go
a little deeper into it?
What's the difference?
You mentioned that
it could work even with a system
that has more than 50% of nodes
being bad. At that point,
wouldn't you rate the bad nodes' data as less noisy compared to the good nodes,
even though that's not correct? Or what am I missing here? Yeah.
So the first big difference is that the consensus, quote unquote, problem
is slightly different in the classical systems world
and in the machine learning world.
In the classical systems world,
essentially in the consensus problem,
you're trying to make a clear decision
between a one or a zero, right?
Should I accept this write sent by a client,
or should I accept another client's write
as the next one in my order, for instance? While in machine learning, if you look at stochastic gradient descent,
for those listeners who know how it works, or at least have heard of it, essentially,
all it means is that you're doing a gradient descent and you want to make sure that you're
descending towards the same minimum and that you're descending at the same rate. Those are the only things that matter. So that's a lot more continuous-mathy
than the discrete-math version
of the consensus problem, right?
So essentially all we needed,
to show that the Byzantine nodes would not have an effect,
was to prove, one, that the minimum towards which
we were converging is the same.
And two, that the rate of convergence is the same
as if you had no failures at all.
And so the problem itself is slightly different.
Okay, and you can just run like a bunch of simulations
to prove something.
That's right, yeah, yeah, yeah.
So, not surprisingly, it turns out that
you can prove something,
but usually the bound is not tight,
which means that when you simulate it,
you can actually tolerate a lot more bad nodes. And then the effort, of course, is to go back and
try to tighten your bound and try to prove that what you saw in the simulation is actually also
true in theory as well. We haven't been able to do that yet, but yeah.
And with distributed machine learning, it seems like there are so many different things you have
to think of, right? You have to partition your data in a particular way. It seems like you would have to think about it in a completely
different way versus the standard way you would shard data in a distributed system.
Right. Yeah. And there's many different kinds of distributed machine learning, right? So the
results that I talked about, about Byzantine tolerance applied to distributed versions of
stochastic gradient descent, or what we call SGD. But there are deep neural nets, DNNs, there are recurrent neural nets,
and all these different categories of machine learning. And some of them have distributed
versions, some of them don't. So there's a big challenge of how do you make distributed versions of these? Say I have a large neural network and it won't fit on my eight gigabyte GPU.
And I need to split it across multiple GPUs.
How do I split it best?
So there's first that step you need to solve.
And after you solve that step, then you come to the question of, oh, now what happens if
some of these workers go offline or some of these workers misbehave?
So that essentially you're kind of revisiting the distributed systems workflow,
but for a completely new domain.
Yeah. It seems like a rabbit hole you can go really deep into.
Yeah. Yeah.
Because everything is easy as long as it fits onto one machine.
Yeah.
But then as soon as that stops happening, you're stuck.
Because if you think about all of these different things.
Yeah.
And with the advent of new models like GPT-3, they just won't fit on regular GPUs.
In fact, even well-known DNNs like Inception V3, ResNet, these are the standard things
that researchers look at.
They won't fit on a GPU that has a couple of gigabytes of RAM.
Just won't fit.
Just imagine trying to do training or inference on cell phones, which don't have a lot of
memory, with any of these models.
And it's just nearly impossible today.
Forget the power-hungryness of these neural networks.
Even the memory constraints are very hard to satisfy.
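The arithmetic behind that is simple. Here is a back-of-the-envelope sketch; the 3x training factor is a rough rule of thumb (weights plus gradients plus one slot of optimizer state), and real footprints are larger once activations and optimizers like Adam are counted:

```python
def model_memory_gb(n_params, bytes_per_param=4, training=False):
    """Rough memory needed just to hold a model's fp32 weights.

    Training roughly triples it (weights + gradients + optimizer
    state); activations come on top and grow with batch size.
    """
    factor = 3 if training else 1
    return n_params * bytes_per_param * factor / 1e9


# GPT-3 has about 175 billion parameters: roughly 700 GB of fp32
# weights alone, far beyond a single GPU, let alone a cell phone.
gpt3_inference_gb = model_memory_gb(175e9)
gpt3_training_gb = model_memory_gb(175e9, training=True)
```

Even a model a thousand times smaller strains a phone once activations and the rest of the runtime are added, which is why fitting models onto small devices is an open problem.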
So I think that's
another area where distributed systems plus X, you look at distributed systems plus IoT plus machine
learning, it has a lot of open problems. How do you take large neural networks and make them fit
and run on smaller devices? Yeah, there's so much. And now I'm just curious, and I want to look up some papers. What is the recent work in this field? several years, several companies, including Jeff Dean's team at Google, they looked at this problem.
What happens if I have this large neural network, but I have multiple GPUs and I need to split the
neural network across these GPUs, right? And essentially it's a problem of graph partitioning.
But the partitioning has to be done carefully. First off,
you want to partition quickly, so that you come up with a plan quickly.
And second, once you have partitioned,
the actual execution time of the placed,
partitioned model should also be pretty fast.
When we looked at this problem two years ago,
we said, okay, let's look at this partitioning problem.
In machine learning,
it's called model parallelism.
That's the technical term.
So a lot of the solutions that existed, including two papers from Google,
essentially were using reinforcement learning to do the placement itself.
So they would profile how it ran, and then they would profile it quite a bit,
and they would use reinforcement learning as the scheduling algorithm.
And the schedules they came up with were very good.
They were comparable to schedules generated by experts.
But the time to generate the schedule would be hours
or in some cases, like two days.
And that's too slow for, imagine a developer
that's trying to quickly play around with a few models
and you tell them, oh, now you have to wait for two days
for your next iteration.
That just doesn't work.
Just to plan how to partition the data.
Yeah.
Yeah.
Just how to partition the data.
So we said, and this is, again, an example of what you may not be able to do in industry
because it's so open-ended that it's not clear you would succeed.
We said, let's look at very old classical literature that's looked at parallel computing,
that has looked at DAGs or directed acyclic graphs of tasks.
And let's see if those partitioning algorithms can be applied.
No reinforcement learning, just good old-fashioned classical algorithms.
And so we actually found a couple of algorithms that were published and that were provably
close to optimal.
It's an NP-hard problem, so you can't really solve it optimally, but you have heuristics
that are close to it.
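A toy version of that "good old-fashioned" approach is classical list scheduling over the operator DAG: visit tasks in topological order and put each one on the device where it can start earliest. This is an illustrative sketch, not the algorithm from the paper, and it ignores communication costs between devices.

```python
def list_schedule(tasks, deps, cost, n_devices):
    """Place a DAG of operators on devices with list scheduling.

    deps maps task -> list of predecessor tasks; cost maps task ->
    runtime. Returns the placement (task -> device index) and the
    overall finish time (makespan).
    """
    placed, finish = {}, {}
    device_free = [0.0] * n_devices
    remaining = set(tasks)
    while remaining:
        # tasks whose dependencies have all finished
        ready = sorted(t for t in remaining
                       if all(d in finish for d in deps.get(t, [])))
        for t in ready:
            earliest = max((finish[d] for d in deps.get(t, [])),
                           default=0.0)
            # pick the device that lets this task start soonest
            dev = min(range(n_devices),
                      key=lambda i: max(device_free[i], earliest))
            start = max(device_free[dev], earliest)
            finish[t] = start + cost[t]
            device_free[dev] = finish[t]
            placed[t] = dev
            remaining.discard(t)
    return placed, max(finish.values())


# A diamond-shaped model: a feeds b and c, which both feed d.
tasks = ["a", "b", "c", "d"]
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
cost = {t: 1.0 for t in tasks}
```

With two devices, b and c run in parallel and the whole graph finishes in 3 time units instead of 4; a heuristic like this produces a plan in milliseconds, with no profiling runs at all.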
And we applied those and we actually implemented those in TensorFlow.
We're like, okay, now that looks like a paper.
And it was crap.
It didn't work well at all.
It was like, what's going on?
And then we had to look under the hood and see what constraints TensorFlow required us
to solve.
There were a lot of constraints related to co-placing operators, forward and backward
operators on the same device, grouping operators so that the communication was minimized, a
whole set of heuristics and a whole set of optimizations, some of which were essential, some of which were needed for efficient operations.
And this entire activity took several grad students, including the lead student, who
was a very persistent student, up to a person-year, one year's worth of effort.
And at no point along the way were we ever confident that we will have a system at the end of this or we will have a paper.
It was just so completely open-ended.
And I still remember the day the student came and said, Indy, we have a positive result.
And he said, you know, remember those neural networks that were being placed in two days by Google's paper?
We are able to place that in 10 seconds. And the placed model is also comparable to the experts.
And so that's an example of how academia can help you re-look at directions. In this case,
it was like, oh, you don't need to apply reinforcement learning. Reinforcement learning
looks like a big hammer, but maybe it's too big a hammer to swat a fly with.
Maybe a fly swatter is just perfect.
Was the kind of problem just that there were so many things that had to be solved?
Or was there just a few key insights that had to be made by those grad students in order to solve this?
And did they ultimately open source their work?
Do you know if it's used in production? Yeah.
Yeah. So the work is open sourced and we publish all of our systems and papers
and code for it. We publish it on my group's webpage, the DPRG page,
dprg.cs.uiuc.edu. It's still a relatively new paper. So this paper was published last summer.
So the inroads to industry are... It takes a while to make inroads into industry.
A lot of the problems that we needed to solve were very
TensorFlow problems. TensorFlow is a complex system to work with. Every system is complex
to work with. Every system is its own beast. TensorFlow has a lot of quirky things.
PyTorch and other systems have their own similar but different quirky things.
And part of our effort was to solve some of those challenges,
but also think consciously about whether those challenges were TensorFlow-specific
or whether they were also generalizable.
So I think that's part of the challenge there as well.
I think some of the techniques we came up with,
especially optimizations where we group operators together,
those would be applicable
no matter what your base system was,
whether it was TensorFlow or PyTorch or something else.
But there were a few other things,
especially related to avoiding cycles, for instance.
Our placements might sometimes generate cycles
among the operators, which was unacceptable to TensorFlow. And so those would be TensorFlow-specific
optimizations. There were a few others. But looking back on it, I would say it was a great
adventure going through all those steps. And there were many things we tried that didn't
work out at all. That's part of doing these things. You try something and it's like, no.
A month later, you're like, no, that didn't work out.
Let's backtrack.
Let's explore another part of the state space search.
And I think that unpredictability would be just very hard to do in industry.
A research group might be able to do it in a research lab. But even there, you know, they have deadlines and things like that.
It's much harder. Yeah.
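The cycle constraint mentioned a moment ago can be checked with another classical tool, a topological sort: contract each placement group to a single node and run Kahn's algorithm; if the contracted graph cannot be fully ordered, the placement has introduced a cycle. This is a hypothetical helper for illustration; TensorFlow's actual checks are internal to the framework.

```python
from collections import defaultdict

def creates_cycle(groups, edges):
    """Return True if contracting ops into their placement groups
    creates a cycle among the groups (which TensorFlow rejects).

    groups maps op -> group id; edges are (src, dst) op pairs.
    Kahn's algorithm: if the contracted graph can't be fully
    topologically ordered, it contains a cycle.
    """
    succ, indeg = defaultdict(set), defaultdict(int)
    nodes = set(groups.values())
    for a, b in edges:
        ga, gb = groups[a], groups[b]
        if ga != gb and gb not in succ[ga]:
            succ[ga].add(gb)
            indeg[gb] += 1
    ready = [g for g in nodes if indeg[g] == 0]
    ordered = 0
    while ready:
        g = ready.pop()
        ordered += 1
        for h in succ[g]:
            indeg[h] -= 1
            if indeg[h] == 0:
                ready.append(h)
    return ordered < len(nodes)


# Chain x -> y -> z. Grouping x and z together while y sits alone
# makes the two groups depend on each other, which is a cycle.
edges = [("x", "y"), ("y", "z")]
```

Grouping adjacent operators (x with y, or y with z) keeps the contracted graph acyclic, which is part of why operator grouping has to be done carefully.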
Yeah, as somebody who goes into distributed systems research,
you don't think that,
oh, I'm going to be mucking around with
the TensorFlow code base for a year.
But that's the kind of stuff you work on.
Yeah, and it needs persistence,
right? So this is one of those things,
all those experiences where
the adage of you must really
love it in order to make progress, comes back. So this student was very persistent, and
I was very impressed by how persistent they were. A different student may have given up at
some point, may have been like, I'm wasting so many months of my life on this, I don't know if
this is going to succeed or not. And part of it is just having confidence that it's going to succeed. And I think just that persistence pays off a lot. So,
you know, persistence pays off, not just in research, it pays off everywhere.
But, you know, sometimes you have to cut your losses and run, right? It's kind of like,
you know, you're in a theater, you've paid some money, and you're like,
this movie is kind of
sucky. Should I go out or should
I stick with it?
And, you know,
if you're wise and you're
persistent, you'll realize that it's a movie
that you can actually change, that you have
control over how the movie goes
and that you could turn it into an exciting
movie.
One of those movies where you can just change your interpretation
and have fun even though it's a garbage movie.
Oh, yeah, yeah.
Like The Room.
Yeah, yeah, yeah, exactly.
Yeah, yeah, yeah.
Yeah, and I guess somewhat related to this is,
I know that you're passionate about this topic,
which is how do you
get industry to collaborate better with academia?
Yeah, definitely. The bigger groups like
Google and Facebook, they publish papers on a regular basis, but the standard software engineer,
especially in a smaller company, or even in different parts of larger companies, they're
not really incentivized to publish papers. I know that
I've thought about it and I just wrote a tech blog and I gave up. It's a lot of work. I don't
know how the review standards would be. I don't read too many papers in the space that I'm in,
but how does one get to change that? Or do you have any ideas on what could industry be doing differently? Yeah. So I think it's understandable that many companies and many groups that write industry systems don't ever publish their work.
You know, when you think about how someone would spend their time, like a few hours in a typical day. You could spend it writing code to earn your company a million dollars.
You could spend the hours alternately looking at logs of a recent outage and understand
some of the causes of that outage.
That's kind of exciting too.
Or you could fix the outage or you could troubleshoot.
Or you could just have a nice lunch. Or you could take a walk. Or you could spend time with family or friends.
Or you could write.
That's like the most boring of all of the options that I mentioned.
It seems like the least productive.
So that kind of indicates that there is an obstacle to publishing cutting-edge papers.
But at the same time, many of the systems that
have been built, whether it's in large companies or small companies, and that have had some degree
of success, even if they are only industry systems, and even if they do not solve a
core scientific problem, even then they are worth publishing because in many cases, the reason the
system is attractive may not be visible to
those who developed it. For instance, maybe they're handling users at a much
higher scale than has ever been handled before. Or maybe they're handling a diversity of use cases
much different than has been handled before. Or maybe just the workload that they're handling is so unique that it will be interesting
to the broader community.
So I think there is a lot of value to publishing papers that are inside a company, a system
that is deep down in the bowels of a company that users will not see.
I'm not talking about the YouTubes or the Gmails.
I'm talking about Google's core scheduler for the data centers, which by the
way, was never published.
Borg has been around inside Google's data center since the early 2000s, and the Borg
paper was only written in the mid-2010s.
So it ran for a decade and a half before it ever saw the light of day. And so I think, you know, I certainly hope that a lot more industry practitioners think about
publishing their systems. And also, I hope that academics and academic researchers are
willing and take the step to partner with some of these companies in order to help them evaluate the
systems, analyze their systems,
because I think that's part of the hard work that you need to do to write the paper, right?
For which sometimes the industry practitioners don't necessarily have time, developers don't
necessarily have time. So that's something that we were able to do with a few companies. With
Microsoft Service Fabric, we worked with the Service Fabric team and helped them evaluate their system and bring
together the different parts of the system all in one paper.
And same thing with a few other companies that we have worked with.
So it's a very interesting exercise.
It's a very different exercise than typical academic research.
Typical academic research is like, oh, what's the new idea I have?
Or what is the new problem I'm solving? Here, the goal is more like, what did this company do?
Or what did this group do? And what did the system do? And what use cases are they solving?
What workloads are they looking at? And how do we tell the story? How do we communicate this
in a way that reflects what the company was trying to do. So I think that's kind of important.
Companies sometimes are more reticent to publish the systems that are deep down in their bowels
because they see a competitive advantage to them being secret.
But to that, I would actually say that publishing papers is a long-term investment strategy. If a company is visible in a conference,
a well-known systems conference, the company is able to attract better employees, researchers
who are grad students, and they're like, oh, Company X is doing cool research in this topic,
and maybe that's a company I want to work in. So it's a recruitment tool.
It's also a way of addressing competition. If you're publishing on a system that you've been running for a while, you're not going to publish on the latest version of
the system. So you're already well ahead of what you're publishing, and well ahead
of your competitors. Look at what Google did, for instance, with MapReduce.
So when they wrote the MapReduce paper, Google has never open sourced its MapReduce implementation
to this day. They only wrote a paper on it. But the paper was then picked up by folks
at Yahoo, who then wrote Hadoop based on it, and then HDFS and a bunch of other systems.
And that essentially is what kicked off this entire big data systems revolution back in
the mid-2000s. And that revolution has benefited a lot of companies, including Google, which wrote the
original MapReduce paper.
And it has helped attract folks to join Google as well.
So that's just an example.
Not every paper you write is going to spawn a new sub-area of research.
Some of them may have more limited use, but every paper you write brings visibility to
the company and gives you an edge over your competitors.
Yeah.
At least it seems like software engineers do want to share what they're working on.
There's corporate tech blogs and then there's presentations.
I think there's also a lot of uncertainty, right?
I'm writing this paper.
I have no idea whether it's going to get published or not or what the requirements are.
So definitely that sense of community or that sense of help from academia that, oh, I'm
working with a grad student who's in this space and they think I have something valuable
to share.
Yeah.
I think that would go like a long way, especially if you'd see a case where a company would
publish an informal blog post saying, this is the system that we've built. And maybe
somebody from academia could be like, oh, that looks like a very interesting candidate
for a research paper. And that might just be the small nudge required to go from a blog post to a research
paper. Yeah. And sometimes the developers and companies who are even building these systems
may not even realize that their systems are a significant step forward in the research landscape.
So I definitely encourage developers to talk with academics as well. You don't need to repeat all
the details of what you're doing, but you can talk at a high level of what you're doing and
that might be enough feedback for you to see that whatever you're working on is really a big
forward step. So I think it can help both ways, right? Researchers can benefit by understanding what companies are doing so that in academia,
we work on problems that are still current and relevant.
And, you know, we are not solving 10-year-old problems, which have already been addressed.
And then companies and industry groups can also benefit from talking with us because
it helps them calibrate whether or not they can give more
visibility to whatever they're doing right now. So I think that's one of the advantages of going
to a conference. So I encourage people from industry, academia, everyone to go to any of
these systems conferences. Even if you feel you might be outside your comfort zone, that's a good
thing. You go outside your comfort zone and that teaches you something. And just talk with people.
These conferences,
it's not just the presentations,
but it's the corridor conversations
you have with completely random people
that lead to some very interesting insights.
I cannot tell you
how many research projects
I've started because of
completely random conversations
in corridors.
There's just so many of them.
Yeah.
As soon as this pandemic gets over,
that's one of the first set of things that I'm excited to do.
Much harder to do corridor conversations over Zoom.
You have to set up a time.
I think that's one of the ways in which I think research in general has suffered a little
bit in this pandemic era or the Zoom era.
It's much harder to have impromptu conversations.
Everything has to be scheduled.
Plus, you know, there is always the issue that you can't really be on Zoom for too long, right?
There are downsides.
The downside might not be really obvious from the get-go, but there are definitely downsides,
let alone like the social interaction and just building relationships and all of that.
So you mentioned right at the start that you were a visiting researcher or like you're
taking a sabbatical to work in industry for a year.
And you'd been in the field for more than 10 or 12 years at that time.
So what were some things that you learned that were interesting by going into industry? Yeah. So August 2011 to August 2012 was the one year that I spent
at Google. I was a part of their infrastructure team. And I think just seeing the discipline that exists inside the company in terms of how they write
code, how they peer review code, and just the style guides they adhere to.
These are things that are known openly.
I think that was eye-opening for me.
Also seeing the kinds of problems that they care about, the kinds of problems they work on was useful
and it benefited my research.
Though I didn't end up working on exactly those problems
after I returned,
it was kind of a change in the mind frame,
the way in which to approach problems.
There were problems that I looked at before I went to Google
and my reaction was, ah,
that looks like an interesting problem.
And then after I came back from Google, my reaction to the same problem was, no, that
has an easy existing solution.
That's not worth looking at.
But this other problem, that looks more interesting.
So it's more of a grounding, I think.
The grounding to reality, I think, was really, really critical.
So I think that mind frame shift was really the most valuable thing that I got from there.
Plus, of course, I was very fortunate to work in a team which had amazing people, and I got to
interact with a lot more people beyond that team. Those are still
colleagues and collaborators to this day on those
and similar projects.
So I think that's part of one of the side effects of being in academia that, you know,
you get to meet and work with so many different kinds and different sets of people that you
make friends over the years, more professional friends and personal friends over the years.
So I think that was, yeah.
So I would certainly say that my one year at Google
changed the way I look at problems,
that changed the way in which I pick problems to work on
because it really, really grounded me.
And I definitely recommend, you know,
something similar for all faculty members,
for all folks who are in academic research,
and to do it periodically,
because industry is so fast moving in the systems era
that you have to do something like this every seven, eight years
in order to stay grounded.
Are you okay with sharing what exactly you worked on or is
that covered under some sort of NDA? Yeah, those things are covered under NDA.
There is a patent we wrote, which I think people can find from my homepage. So it was related to
how you do, how Google's core scheduler, the Borg scheduler, runs in the data centers.
And Borg is pretty much the predecessor to Kubernetes.
Correct. Yeah. And Borg still runs today, right? So Borg still runs internally in the data centers
in Google. And it was developed very early in the days of Google, as the Borg paper published in 2015
says. I think there's a more recent paper on Borg as well. John Wilkes, who was
my mentor during my year at Google, has been very instrumental in
working with folks inside Google to write papers on not just Borg, but
on other things. This is the thing I was saying earlier, where academics need to work with
industry; sometimes the developers inside industry do that themselves. And John
Wilkes has been fantastic at doing that inside Google since he joined. So a couple
of papers have been written on Borg, one in 2015, and then I think there was one more last year, also in EuroSys, if I'm not wrong. I would say those papers don't talk about
the entire 100% of Borg's design, but they talk about its important parts.
So having seen Google's internal implementation of Borg, and having seen the papers as well,
which are the external face, I can see what's missing
and I can see what's new, what experiments are new.
And it all makes sense, right?
So there's still value in publishing those papers.
Yeah, it seems like you definitely need shepherds like John Wilkes to help push that along.
And I can also imagine how hard it would be to migrate from a cluster scheduler
to something like Kubernetes.
No matter how mature Kubernetes seems from the outside,
the amount of bug fixes that would have gone into Borg
over 20 years of its existence, I guess, at this point.
Yeah.
One of the adages that I found in many companies and in industry is that for every system,
there are two versions of the system.
The one version that was old and therefore has been deprecated, and the other version
that is new and doesn't work yet.
The migration isn't complete.
Everything has two versions.
And sometimes the new version becomes mature enough
that they can actually deprecate the old one
and really move to the new one.
But in other cases, they actually abandon the new one.
They're like, yeah, if it ain't broke, why fix it?
Maybe we take lessons from the new one
and put it in the old one
rather than the other way around.
So I've seen both of them happening in Google and in other companies as well.
Yeah.
I remember there was a paper on a scheduler called Omega, which they released.
I don't know if they actually ever deployed that.
They never revealed that information, so I don't know.
But given that they started working on Kubernetes, I'm guessing they didn't, or they might have
tried and rolled it back or something.
But I think that paper is the one that inspired Mesos, if I'm not mistaken. So you might not
fully deploy a system internally, but by writing a paper, you can inspire some work even externally.
Yeah. Omega was built around the same time as Mesos.
And I know that the Omega developers talked to the folks at Berkeley, Stoica's group at
Mesos, and I know there was a lot of interchange and exchange of ideas.
But you're right.
That's one of those.
The Omega paper and Omega work, as well as the Mesos paper, are papers that inspire new
generations of research.
I know that at least at some point,
places at Dropbox, we use Mezos.
So we've definitely been helped by that.
And I think in my experience,
it's been a pretty reliable system
for the kind of workloads that at least we use it for.
Yeah, and I want to ask maybe a couple more questions
and then you can call it a day.
But just super high level, maybe this is repeating something that we've spoken about previously,
but just to home in on the point: suppose I was a student today and I had to decide between
doing research or going into software engineering in industry.
Very frankly, I would say something like,
it seems like in AWS or GCP or Azure,
there's a lot of high-scale, petabyte-scale database work going on.
If you were a student right now and you had to pick,
what do you think you would do?
Would you first go for a PhD as well?
Would you try out industry for a few years,
then maybe do a master's?
How would you go about making that decision?
Yeah, so let me talk about the two angles of it.
One is the mental angle.
The other one is a more logistical angle.
So I hope that students are able to experience both sides,
both the industry side via internship while they're students,
as well as the academic side by perhaps
engaging in research projects with faculty members while they're an undergrad student.
I think it's important to experience both sides because
it's useful to know what you like and what you don't like. And, you know, there is an old Latvian proverb that says, the work will teach you how to do it.
And that's kind of true of a lot of things, but it's certainly true of research.
And sometimes while the work is teaching you how to do it, you grow to love it.
And that work might be the work you're doing in a company.
You're like, oh, yeah, this is clearly for me.
I clearly want to work in a startup.
Or while you're doing research, you might realize, oh, this creativity is great.
Or you might realize while doing research, this is too open-ended for me.
So much flexibility is not good for me.
I need more structure, right?
Either way, I think the only way to realize what you like and what you don't like is just
to experience it.
So I definitely encourage all students, all undergrad students out there to try research
and to do internships and make sure that when it comes time to making the decision, oh,
what is the next step after I graduate, that they're making an informed decision that is
informed by their experiences.
So, I think that's the first side of it, which is the more mental side.
You want to go to a place that you're going to be happy in.
You want to go to a place where you're going to be impactful in, but if you go to a place
that it's a hot area and you really hate the area, that's much worse than picking a niche
area that doesn't have much impact, but you really love the area.
You're like, oh, that's so great. So that brings me to the second part, which is the logistical part.
Nowadays, you can go to industry, spend a few years and then apply back to grad school.
That's certainly possible. You can also work on online master's programs while you're in
industry. This is also a very common thing. So the open course that I teach on Coursera on cloud computing concepts has thousands
of students in it registered at any point of time.
And something like three-fourths of the students are full-time employees who have full-time
jobs, have their own families, and they're doing the coursework at weekends and weeknights.
So that's, I think, something that is possible nowadays. So you can actually get a degree while
you're working as well, if that's something that you want to continue doing. In academia,
we do see a value in students having spent years in industry before they apply back to grad school.
I've had both kinds of students: PhD students in my group who have come directly after their bachelor's, and also students who have spent a few years in industry, and I can see advantages of both.
On one hand, you have some maturity.
On the other hand, you may have more energy, you may have ideas, but you could have energy in both, even if
you come back from industry.
So I think the important thing when graduating is to figure out what is it that you like,
but it's also important to plan ahead a little bit.
Especially if you're going to go to industry, it's very easy to lose track of time.
You join a company and then 10 years later you realize, oh, I've been in the same job
for so long.
What's happened to my career again?
So one thing I definitely encourage students to do is just set a calendar reminder for
yourself same time every year and say, this is one day that I'm going to take maybe like
July 3rd or something like that, to sit down and think about
what my career trajectory is and whether I can improve it in other ways, make these paradigm shifts in it that would make my life happier, career-wise and maybe personally as well.
I think it's important to revisit one's career goals every once in a while, at least definitely at least once a year.
Otherwise, it's very easy to get lost in the next deliverable and the next deadline and the next peer evaluation.
Yeah, you get lost in those things.
Yeah, your answer to this question has made me think of maybe like a dozen more questions around, you know, academia, like interviews and how I should think about getting in.
But I think this is like a good stopping point.
And we should probably have a round two at some point just to dive into those areas.
I think that's also very interesting.
Like the whole PhD comics thing.
How true is that?
Very true.
Yeah.
But anyway, thanks so much for being a guest
on the show
I definitely
enjoyed the talk
and I hope you
have fun
yeah
thank you
bye
bye