Computer Architecture Podcast - Ep 5: Datacenter Architectures and Cloud Microservices with Dr. Christina Delimitrou, Cornell University

Episode Date: June 5, 2021

Dr. Christina Delimitrou is an assistant professor in the Electrical and Computer Engineering Department at Cornell University. Prof. Delimitrou has made significant contributions to improving resource efficiency of large-scale datacenters, QoS-aware scheduling and resource management techniques, performance debugging, and cloud security. She received the 2020 IEEE TCCA Young Architect Award for leading research in ML-driven management and design of cloud systems. She talks to us about datacenter architectures, cloud microservices, and applying machine learning techniques to optimizing and managing these systems.

Transcript
Starting point is 00:00:00 Hi, and welcome to the Computer Architecture Podcast, a show that brings you closer to cutting-edge work in computer architecture and the remarkable people behind it. We are your hosts. I'm Suvinay Subramanian. And I'm Lisa Hsu. Today we have with us Professor Christina Delimitrou, who is an assistant professor in the Electrical and Computer Engineering Department at Cornell University.
Starting point is 00:00:22 Christina has made significant contributions to improving resource efficiency of large-scale data centers, QoS-aware scheduling and resource management techniques, performance debugging, and cloud security. She received the 2020 IEEE TCCA Young Architect Award for leading research in ML-driven management and design of cloud systems. Prior to Cornell, she earned a PhD in electrical engineering from Stanford University. Today, she's here to talk to us about data center and cloud architectures and applying machine learning techniques to optimizing and managing these systems. A quick disclaimer that all views shared on this show are the opinions of individuals and do not reflect the views of the organizations they work for.
Starting point is 00:01:16 Christina, welcome to the podcast. We're so happy to have you here with us today. Well, let's kick this off with a nice simple question. These days, what gets you up in the morning? A few things. Non-work related, getting the vaccine and going back to normal life. Work related, like you mentioned, there's a few problems that we're working on, both on what's the right way to build servers for modern data centers and also what's the right way to build the applications that go on top of the servers and specifically what is the role that machine learning can play in improving the design and management of these large-scale systems. Right. So what are the major trends that are changing the landscape of these data center architectures
Starting point is 00:01:51 and cloud computing more broadly? Yeah, that's a good question. So if you look at cloud computing five or 10 years ago, its key competitive advantage was that it was using commodity equipment. That was the main difference between cloud computing and HPC, high-performance computing. And that is how it got economies of scale and improved cost efficiency. So even back then,
Starting point is 00:02:11 it's not that it had a single type of server. There were still different generations of servers, different server configurations, but it was a much more homogeneous picture. If you look at what cloud computing systems look like today, there's a lot more heterogeneity, whether that is through reconfigurable acceleration fabrics, whether it's through special purpose
Starting point is 00:02:28 accelerators. And there's a lot of good reasons why people are switching to accelerators. They have performance, power, in many cases, cost benefits, but they also introduce complexity in system management. And then on the software side, you have a similar picture, which is that traditionally cloud applications were built as what we call monoliths. So a monolith is an application that includes the entire functionality as a single service and then is deployed as a single binary. And if at any point in time it needs more resources, you scale out multiple copies of that application across multiple machines.
Starting point is 00:03:01 And as long as the application remains small in scale and in complexity, the monolithic design approach is fine. The problem is that when the application increases in scale and complexity, as with every other area of systems research, you want modularity. So that is the primary reason why you see programming models like microservices and like serverless compute popping up. And again, there's many good reasons why people are using these programming models. We can talk about them in more detail, but they do introduce more complexity on the software side as well.
Starting point is 00:03:31 And it's for those two reasons why a lot of the work that we do is around automating the system design and management and trying to abstract away that complexity. So neither the end user nor the cloud operator has to deal with it on a day-to-day basis. So let's expand a little bit on microservices. Could you tell our audience a little bit about how these microservices are different from
Starting point is 00:03:53 monolithic cloud computing applications? Are there any unique challenges that these essentially bring up that make them harder to manage? Sure. So again, a monolithic design is what you would call a single application. So you have a single code base that compiles down to a single binary and is deployed as a single application. You can have slightly more complex versions of that, where you have a front end, that would be the web server, and then you have a back end, which would be the
Starting point is 00:04:21 database. And then the in-between, the mid-tier, is where all the logic of the application is implemented. So that's not entirely a monolithic application, but as far as microservices are concerned, it would still qualify as a monolithic design. Microservices take modularity to an extremely fine granularity. So you have still the front-end web server, you still have the back-end databases, but the middle tier, which would be one service, can now be hundreds of unique services, and then each of them scaled out to multiple replicas. So the advantages of this programming model are that modularity makes the application easier to understand. So if you are in a big
Starting point is 00:05:04 company and you are developing an application, you don't need to be familiar with the entire code base. You need to be familiar with the microservice that you're responsible for, and then the API to interact with all the other microservices. And usually these are standardized APIs, either through RPCs more commonly or HTTP requests in some cases. The other advantage is that it ties nicely with this idea of a containerized data center where each service is deployed in its own container. And then if it needs more resources,
Starting point is 00:05:33 you just scale up or scale out that service. You don't have to scale out the entire deployment. It also helps in the sense of software heterogeneity. So you're not tied to a single language that the entire application is written in. If, let's say, the front end would benefit from a higher level language, you can do that. If other tiers will benefit from lower level languages that optimize for performance, you can have that. It's a longer discussion.
Starting point is 00:05:58 How many languages do you want to have in your system? Because that adds complexity as well. But it gives you the possibility of having some language heterogeneity. Now on the cons side, it doesn't come without its challenges. So first of all, servers today are not designed for microservices. They are designed for applications that have performance requirements at least in the millisecond, if not multiple millisecond granularity. If you have a service implemented as microservices, then the end-to-end performance target is milliseconds, but then you might have 100 microservices on the critical path. So the target for each of them would be in the microsecond
Starting point is 00:06:36 granularity. So you need much more predictable performance, much more low latency operation, even from the hardware. And then there's all the overhead that the software layers add. There are also challenges in the sense of dependencies between microservices, because even though they are loosely coupled, they're not independent from each other. So the problem that that can introduce is that if one microservice becomes a bottleneck, that can propagate, it can create back pressure to other components of the system, and it can propagate across the system and even become worse and worse. So that is difficult to diagnose because you don't always know what was the root cause of the performance issue, and that motivated a lot of the work that we did on performance
Starting point is 00:07:19 debugging. And it can also take a long time to recover from because essentially you have to go tier by tier and correct the resource allocation until the entire performance of the service recovers. So that's at a high level why people are switching to this programming model. You know, one of the potential aspects of a monolithic system is that you can sort of understand it end to end because it's one monolith. And now in order to sort of distribute complexity into modules, at some point, you still want a sort of monolithic picture, at least, of what is happening. So you don't have sort of like cascading, you know, effects rippling throughout where nobody understands where they start or where they're going.
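To make the back-pressure effect concrete, here is a toy, fixed-timestep queueing sketch in Python; the three-tier chain, the service rates, and the queue capacity are all made-up numbers for illustration, not a model of any real deployment. When the last tier slows down, its queue fills, the tiers feeding it can no longer hand off work, and eventually requests are rejected at the front even though only one tier is actually at fault.

```python
# Toy fixed-timestep model of back-pressure in a chain of microservices.
# Each tier can hold at most `capacity` queued requests and serve `rate`
# requests per tick; a tier only forwards work downstream if there is room,
# otherwise the work piles up in its own queue. All numbers are illustrative.

def simulate(ticks, arrival_rate, rates, capacity=50):
    queues = [0] * len(rates)
    dropped = 0
    for _ in range(ticks):
        # New requests enter the front tier (rejected if even that queue is full).
        admitted = min(arrival_rate, capacity - queues[0])
        dropped += arrival_rate - admitted
        queues[0] += admitted
        # Serve back-to-front so downstream room is up to date when upstream forwards.
        for i in reversed(range(len(rates))):
            served = min(queues[i], rates[i])
            if i + 1 < len(rates):
                # Can only hand off as much as the next tier can absorb.
                served = min(served, capacity - queues[i + 1])
                queues[i + 1] += served
            queues[i] -= served
    return queues, dropped

print(simulate(ticks=200, arrival_rate=10, rates=[12, 12, 12]))  # healthy chain
print(simulate(ticks=200, arrival_rate=10, rates=[12, 12, 6]))   # back tier bottlenecked
# In the second run the back tier's queue fills first, then the middle and front
# queues fill too and requests start being rejected at the front -- every tier
# looks saturated, which is why the original root cause is hard to pin down.
```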
Starting point is 00:08:09 So are there aspects of this where you still need something that's kind of monolithic that's looking at the whole thing? Or is it more that someone needs to be on the side pulling telemetry and then having a monolithic processing sort of program on the other end that's pulling things from multiple aspects? But somebody still needs to understand the whole thing in order to know what they're looking at. Is that one of the complexities you're talking about? Right, exactly. So you definitely need tracing. And most systems, at least that we're familiar with, have some end-to-end tracing where they track once a request arrives in the system,
Starting point is 00:08:49 what is the latency breakdown? So what machines does it traverse? What services does it traverse until it goes back to the user? So you definitely need some global visibility into the system. The problem is that if you task a person with having that global visibility, you still revert back to the very complicated monolithic application where somebody has to understand the entire code base. So you need to understand the topology, but you don't need to understand the details of each individual microservice. Right, so this seems like something ripe for where all those inputs may go
Starting point is 00:09:15 into some sort of ML system that can sort of help you interpret what is happening. So would you say your focus is more on the software side of the story and managing that complexity or on the hardware side of the story and managing that complexity? Because you kind of discussed both. Right. So I am looking at both because if you look at where the performance unpredictability comes from and where the resource inefficiency comes from, both hardware and software are responsible for that. So on the hardware side, what I'm looking at is what's the right way to build servers for these new programming models that can offer more predictable performance, more low latency operation, whether accelerators can play a role in that. And if they can, what is the right
Starting point is 00:10:00 type of accelerator? So one example is that because microservices talk over the network to each other, they spend a large fraction of their overall latency processing network requests. So if you look at a breakdown of latencies, more than 50% in some cases is just processing network requests. So that's clearly very inefficient. You're not doing useful work. You're just receiving RPC packets and then sending RPC packets. So given this observation, what we are looking at is what is the right acceleration fabric for networking in microservices?
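To picture the kind of breakdown being described, here is a small sketch that aggregates hypothetical per-span RPC trace records into a per-service latency share and a network/RPC-processing share; the record format, the service names, and the numbers are invented for the example and are not the tracing format or data from this work.

```python
from collections import defaultdict

# Hypothetical trace records: one entry per RPC span, with total time and the
# portion spent in application logic (the rest is RPC/network processing).
spans = [
    {"service": "frontend",  "total_us": 900, "app_us": 300},
    {"service": "user",      "total_us": 400, "app_us": 180},
    {"service": "recommend", "total_us": 700, "app_us": 420},
    {"service": "storage",   "total_us": 600, "app_us": 250},
    {"service": "frontend",  "total_us": 950, "app_us": 320},
]

totals = defaultdict(lambda: {"total_us": 0, "app_us": 0})
for s in spans:
    totals[s["service"]]["total_us"] += s["total_us"]
    totals[s["service"]]["app_us"] += s["app_us"]

grand_total = sum(t["total_us"] for t in totals.values())
print(f"{'service':<10} {'share of latency':>18} {'network/RPC share':>19}")
for name, t in sorted(totals.items(), key=lambda kv: -kv[1]["total_us"]):
    net_us = t["total_us"] - t["app_us"]   # time not spent in application logic
    print(f"{name:<10} {t['total_us'] / grand_total:>17.0%} "
          f"{net_us / t['total_us']:>18.0%}")
```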
Starting point is 00:10:32 And we've built this system based on an FPGA that offloads the entire networking stack, TCP included and RPC framework included on an FPGA that's very tightly coupled to the main CPU. The reason why you want it to be very tightly coupled is to have very efficient data transfer between the host CPU and the FPGA. And the reason why we're using an FPGA is to have a reconfigurable fabric that can adjust to the needs of different microservices because you won't only accommodate one service.
Starting point is 00:11:01 You might have tens, if not hundreds of microservices that have very different network traffic requirements and network traffic characteristics. So that is kind of a first step in the acceleration that you can do. There are other system tasks, things like garbage collection, things like remote memory access, encryption, machine learning models, that could also be accommodated in reconfigurable accelerators like that, which brings up questions on virtualizing the FPGAs, allowing resource isolation, resource partitioning on the FPGAs. That's more on the hardware side. On the software side it's more questions of how do you manage resources for this application? So one example is what is a cluster manager
Starting point is 00:11:45 that can take into account the dependencies between different microservices in a way that guarantees the end-to-end quality of service? Or the performance debugging that we were talking about before, which is if something goes wrong in the system, how do you figure out what was the root cause of the problem? And how do you also correct it so that it doesn't happen again in the future? What's important with all this work is
Starting point is 00:12:11 can you get some insight, because you can design a machine learning system that has high accuracy, but a problem with machine learning in systems is that it's difficult to get explainable outputs from the machine learning algorithm. So get something where there is an insight that can help you design the system better, not just get better performance right now. So that is a general issue with using machine learning in systems. We have been making some progress with explainable machine learning techniques. So one thing that we've been looking at is
Starting point is 00:12:45 for performance debugging specifically, can you use the output of the performance debugging system to correct design bugs in the application? And right now, this is a limited set of design bugs. So it can be things like blocking connections, maybe shared data structures that create bottlenecks, maybe cyclic dependencies between microservices, things that even though the applications were designed by people
Starting point is 00:13:09 that have a lot of expertise in cloud applications, there are still bugs that are difficult to find. But if you use the machine learning system, it can help you pin down where the problem might be, and then it still needs some human intervention to actually fix the problem. That sounds really exciting. Let's dig a little deeper into the machine learning system that you mentioned, especially
Starting point is 00:13:30 one that's able to pinpoint where there are sources of bugs or performance bottlenecks in the system. So what does this machine learning system look like? Is it different from the machine learning systems that we see for other kinds of tasks? Is it your vanilla CNNs and LSTMs, or do you have some other kinds of structures in your machine learning model that aid this particular task? Yeah, so we built two systems for performance debugging. I can tell you what the first one was. So the first one was using techniques that you would find in other systems.
Starting point is 00:13:59 It was a hybrid network that had a CNN followed by an LSTM. So that was SEER. And the goal of it was to identify patterns in space and in time that, if you don't do anything, if you don't take any action, will turn into a quality of service violation in the near future. And the reason why we were looking at the near future is so that before the problem occurs, you can take an action and essentially avoid it. Because if it happens and you don't notice it, then it takes a long time to recover.
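For readers who want to picture such a hybrid model, below is a minimal PyTorch sketch in that spirit: a 1-D convolution over per-microservice features within each tracing interval, an LSTM over intervals, and a per-microservice probability of becoming a root cause. The feature count, layer sizes, and tensor shapes are assumptions made for illustration, not SEER's actual architecture or training setup.

```python
import torch
import torch.nn as nn

class HybridRootCauseModel(nn.Module):
    """Toy CNN + LSTM: spatial patterns across microservices per interval,
    temporal patterns across intervals, per-microservice root-cause probability.
    All sizes are illustrative assumptions, not the actual SEER architecture."""

    def __init__(self, num_services=20, num_features=5, hidden=64):
        super().__init__()
        # Treat the per-service features as channels and convolve across services.
        self.conv = nn.Sequential(
            nn.Conv1d(num_features, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # one 32-d summary per tracing interval
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_services)

    def forward(self, x):
        # x: (batch, time, services, features), e.g. latency and utilization
        # measurements per microservice per tracing interval.
        b, t, s, f = x.shape
        z = x.reshape(b * t, s, f).permute(0, 2, 1)     # (b*t, features, services)
        z = self.conv(z).squeeze(-1).reshape(b, t, 32)  # (b, t, 32)
        out, _ = self.lstm(z)
        logits = self.head(out[:, -1])                  # use the last interval's state
        return torch.sigmoid(logits)                    # P(service i is a root cause)

model = HybridRootCauseModel()
traces = torch.randn(8, 16, 20, 5)   # 8 windows, 16 intervals, 20 services, 5 features
print(model(traces).shape)           # torch.Size([8, 20])
```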
Starting point is 00:14:30 So, SEER relied on tracing that was both distributed and per machine. So the distributed tracing is your typical RPC-level tracing which collects the latency breakdown of an application from the beginning to the end. So what are the services it traverses? What is the latency for each service? And then the second level of tracing was per-machine utilization metrics or performance counters, if that is something that you have access to. So these are the traces that would stream into the model and
Starting point is 00:15:01 the output would be the probability that a microservice would be the root cause in the near future. Now, this is a supervised learning technique, which means that to train it, you have to give it some annotated traces. Annotated traces with root causes that you know are correct. Now, how do you know when a root cause is correct? Only if you've caused it. So the way we did this was by injecting sources of unpredictable performance through some contentious applications. And that allowed us to know where the quality of service violation started from
Starting point is 00:15:34 and annotate the trace correctly. Now that ended up having very high accuracy. The disadvantage is that when you're in the production system, you can't really go and start hurting the performance of the application when it's live, because obviously the user experience will be degraded. So to address that, and also some other issues with some of the requirements that SEER had,
Starting point is 00:15:56 which was very frequent tracing, quite a bit of instrumentation in the kernel, things that are not easy to do in a production environment where you might have third-party applications that you can't necessarily instrument. So the follow-up to that was the Sage system that you mentioned in the beginning, which is, again, a performance debugging system. It, again, relies on machine learning, but it's entirely unsupervised. So its goal is not to improve the accuracy over SEER. It's to improve the practicality and scalability. And that system relies on two techniques. The one is building the graph topology of the different
Starting point is 00:16:32 microservices. So essentially it builds this causal Bayesian network, which gives you the dependencies between microservices in the end-to-end application. And then the second technique that it uses is what is called counterfactuals. Counterfactuals are these hypothetical scenarios of what would happen to the end-to-end application if I were to tweak something in one of the existing microservices. So if, for example, I see that I'm experiencing poor performance, if I assume that the performance of one microservice was normal, does that fix the end-to-end problem? So that is how Sage works. The accuracy that it achieves is pretty similar to SEER,
Starting point is 00:17:10 which was good. Actually, it was better than we expected because usually supervised learning works better. But it's much more practical to deploy at scale because it doesn't need as much instrumentation at the kernel level. Even at the application level, all you need is the end-to-end RPC level tracing. Oh, that's really cool. So you're able to deploy the latter and
Starting point is 00:17:33 just have it kind of determine things on its own purely by essentially noticing phenomena and then conjecturing about what would happen. Interesting. And so what sorts of systems are you actually deploying and testing these ML systems on? Yeah, so Sage is in collaboration with Google. We have tested it in Cornell's clusters and also Google Compute Engine for a more large-scale experiment. And this is using some applications that we developed.
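The counterfactual idea can be illustrated with a deliberately simplified sketch: given a microservice call graph, observed per-service latencies, and a "normal" baseline, ask for each service whether restoring it to normal would bring the end-to-end latency back under the target. Sage itself learns a causal Bayesian network and generates counterfactuals with ML models; the additive critical-path latency model, the service names, and the numbers below are toy assumptions only.

```python
# Toy counterfactual root-cause check over a microservice dependency graph.
# End-to-end latency is modeled as own latency plus the slowest downstream call
# (a crude critical-path model); Sage's actual causal/ML machinery is not shown.

calls = {                      # hypothetical call graph: caller -> callees
    "frontend": ["user", "compose"],
    "compose":  ["text", "media", "storage"],
    "user": [], "text": [], "media": [], "storage": [],
}
observed = {"frontend": 2, "user": 3, "compose": 2, "text": 4,
            "media": 3, "storage": 35}            # ms; one service is misbehaving
normal = {s: 4 for s in observed}                 # "healthy" baseline latencies
SLO_MS = 20                                       # end-to-end target

def end_to_end(latency, service="frontend"):
    downstream = [end_to_end(latency, c) for c in calls[service]]
    return latency[service] + (max(downstream) if downstream else 0)

print("observed end-to-end:", end_to_end(observed), "ms (SLO", SLO_MS, "ms)")

# Counterfactual: "if this one service behaved normally, would we meet the SLO?"
for service in observed:
    hypothetical = dict(observed, **{service: normal[service]})
    if end_to_end(hypothetical) <= SLO_MS:
        print("likely root cause:", service)
```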
Starting point is 00:18:02 So this is a benchmark suite called DeathStarBench, which people usually ask me why it's called that. It's called that because the dependency graphs between microservices are called death star graphs. It's these bubbles that show all the edges between microservices. The suite has a social network, it has a movie reviewing and browsing site, it has an e-commerce site, a banking system, and then a couple of applications that are more related to IoT, so swarms of edge devices like drones. So both systems we evaluated with those applications and then both on smaller clusters that are fully dedicated and controlled and also public cloud clusters. So do you foresee, this is kind of a joke question, do you foresee something like DeathStarBench replacing, say, SPEC? It wouldn't replace it. It would be nice if it complemented it for cloud applications. This is always a problem with cloud research in academia, right? You don't have realistic applications. You'll never get access to the internal applications
Starting point is 00:19:04 that Twitter or Netflix have, which I mention because those are two companies that were among the first to use microservices. The idea with DeathStarBench was to build something that uses realistic components. So we use individual microservices that you would find in a real system. But of course, the functionality is simpler than what you would have on Twitter or Netflix. It is extensible, though, and also open source. So if anyone that's hearing this is interested
Starting point is 00:19:32 in this research, feel free to give it a try. We welcome feedback and contributions. Very cool. And so is there, earlier you were talking about how these sorts of microservices also have sort of different demands on the kind of hardware that they run on. Does this particular benchmark have a potential dual purpose where, on the one hand, you can use it to sort of train this ML system to figure out what's happening and be able to understand what's happening within this sort of faux microservices deployment, but at the same time use it to consider what kind of hardware changes would need to be made?
Starting point is 00:20:09 Yeah, that's a good point. Absolutely, yeah, you can do that. The caveat there is making sure that the logic tiers are representative. So like I was saying before, this is not a production application. So the logic is simpler than what you would find in a system like Twitter or Netflix. But you can
Starting point is 00:20:26 add any logic that you want on top of it to make it more sophisticated or even simpler. The front end, which is the web servers, and the back-end databases, both in-memory and persistent, those are production class. So those are systems that you would find in a production system today. And in fact, these are also the applications that we're using for the acceleration work. So for the network accelerator, those are the applications that we use to quantify how much time we spend doing network processing and also the performance benefits once you offload the network stack to the FPGA. I see. So, okay. So let me make sure
Starting point is 00:21:01 I understand. So what you were saying is before when you were using ML to sort of trace potential dependencies between microservices and figure out what the problem was, the key there is understanding the relationship between the microservices and therefore the logic inside of them can be very simple because you're just looking at the connections between them. But when you're talking about running it on hardware, the fact that you've kind of like hollowed out what's happening inside of the microservices for the sake of looking at communication now means that it's not necessarily the kind of thing you want to understand when you're running on top of the server, because you have these essentially empty shells that are just communicating. And so in order to validly evaluate what would happen in hardware, you would want to actually fill them out with real semi-complicated logic of what microservices would be trying to accomplish.
Starting point is 00:21:46 Is that? So you can add more complexity to each individual microservice. They do implement the logic that they're responsible for. So they do implement a social network where you can add posts, communicate with others, send direct messages, reply, get recommendations, get advertisements, all these things that you would find in a typical social network. But of course, it's not production class code. So you don't have all the complexity that you would find in a typical social network. But of course, it's not production class code, so you don't have all the complexity that you would find in a social network. So if the hardware study that you're doing depends on
Starting point is 00:22:13 having all the complexity, then you might get different results. But they are simple microservices that you can start from, and you can add more sophistication in individual components if the study that you're looking at requires more complexity. I wanted to circle back to the ML aspect of things that you've been working on. So you talked about how these microservices and cloud-based systems, there's a lot of complexity between these different things. And when you're trying to apply ML to these, normally for a different task,
Starting point is 00:22:44 like for a natural language processing task or something, machine learning model developers try to induce some inductive bias. Is there some semantic understanding of the system, or the way the system is architected, that needs to be induced into the ML model? Or is that broadly not a concern right now, but it would be good if you could induce some of those things inside? Have you had conversations about these with ML researchers as well? Yeah, that's a good question. It depends what is the goal of the ML system. So, of course, if the system has some semantic knowledge of what the application is doing, that would absolutely be useful. The ML techniques that we've been using so far, they are relatively general purpose, so you would find them in other domains. They do take into
Starting point is 00:23:37 account the topology. So even if they don't take into account the logic, that is, the functionality of the application, they do take into account the end-to-end graph topology. And that gets you most of the way there. I think if, for example, the scope of the ML technique is to find design bugs or security threats, things that are not necessarily related to resource provisioning questions, then having some semantic knowledge embedded in the model would be very useful. For the use cases that we've looked at so far, we didn't see that a lot of accuracy was lost by not having the semantic knowledge, but as we expand to the correctness
Starting point is 00:24:18 debugging as well, I think we'd have to somehow expose that. Are there any particular idiosyncrasies about designing ML for systems-related tasks? So you touched upon a few of them. I think in your papers, you have talked about training data and whether you have access to training data or if you can even generate training data and so on. Are there any other kinds of idiosyncrasies in systems design that make it uniquely challenging or require a different mindset compared to vanilla ML tasks? Yeah. So if you look at many ML papers, the primary goal is improving accuracy. And that's not always the case with ML in systems. Sometimes it is perfectly fine to drop some accuracy on the
Starting point is 00:24:58 table as long as you can keep the inference time low, because this is a system that works online. These are applications that have very strict latency requirements, in fact, tail latency requirements, which is even more challenging. You cannot have a performance debugging system or a cluster scheduler that takes minutes to decide where to place the task or how many resources to allocate to it, or even correct the system after it had some performance issue. So it's more about making the system very interactive and very latency sensitive, less about getting the absolutely optimal decision.
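One generic way this tradeoff shows up is a guard around the learned policy: take the model's answer only if it arrives within the decision budget, otherwise fall back to a cheap heuristic. The sketch below illustrates that pattern with made-up numbers and a stand-in for model inference; it is not a description of any specific scheduler discussed here.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

DECISION_BUDGET_S = 0.002     # illustrative 2 ms budget for an online decision

def ml_policy(task):
    """Stand-in for model inference (imagine a learned allocation model)."""
    time.sleep(0.004)                      # pretend inference takes ~4 ms
    return {"cores": 4, "memory_gb": 8}

def heuristic_policy(task):
    """Cheap rule of thumb used when the model can't answer in time."""
    return {"cores": 2, "memory_gb": 4}

executor = ThreadPoolExecutor(max_workers=1)

def decide(task):
    future = executor.submit(ml_policy, task)
    try:
        # Take the model's answer only if it arrives within the budget.
        return future.result(timeout=DECISION_BUDGET_S), "ml"
    except TimeoutError:
        # A slightly worse but immediate answer beats a late optimal one
        # for latency-critical services.
        return heuristic_policy(task), "heuristic fallback"

print(decide({"service": "frontend"}))   # -> heuristic fallback with these numbers
```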
Starting point is 00:25:31 Do you find that sometimes with adding ML into the decision-making processes of managing microservices and such, that the ML itself injects variability that you don't want to accept? Meaning that, given that it could be slightly non-deterministic depending on what the inputs are, it can be a sensitive system where your input is a little bit different, so your output is different, but really the inputs shouldn't be that different, because they're a manifestation of the same thing, it's just that this one is 1.1 and this one is 1.12 or something like that. But the result is different, and if the result is different, now you have variability in the after
Starting point is 00:26:17 effects of a decision where you didn't want to see any. Do you ever see problems like that manifest, where you've injected some amount of non-determinism into decision-making processes? Yeah, absolutely. That can happen. So that's a question of how do you design the technique, whether you fed it with the right training data and whether the technique has some explainability. That's why I was mentioning explainability in the beginning, because if you don't know why the technique told you what it told you, there's no way to say if this is the right output or not. There's still a lot of work that needs to be done with making the output of ML interpretable. But I'm glad to see that the ML community is also focusing on this problem instead of just improving the accuracy or scalability of techniques.
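One very generic way to get a first-order "why" out of a black-box model is permutation importance: shuffle one input signal at a time and measure how much the prediction quality drops. The sketch below, with synthetic data and invented feature names, is only meant to show the shape of that idea; it is not the explainability machinery actually used in these systems.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic "traces": three per-service signals; only queue depth and network
# time actually drive the (synthetic) QoS-violation label. Purely illustrative.
n = 2000
X = rng.normal(size=(n, 3))                       # [cpu_util, queue_depth, net_time]
y = ((0.8 * X[:, 1] + 0.6 * X[:, 2] + 0.2 * rng.normal(size=n)) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
baseline = model.score(X, y)     # (using the training set here keeps the toy short)

features = ["cpu_util", "queue_depth", "net_time"]
for i, name in enumerate(features):
    Xp = X.copy()
    Xp[:, i] = rng.permutation(Xp[:, i])          # break this feature's relationship
    drop = baseline - model.score(Xp, y)
    print(f"{name:>12}: accuracy drop {drop:.3f}")
# Features whose shuffling hurts accuracy most are the ones the model leans on --
# a coarse, model-agnostic hint about why it flagged a violation.
```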
Starting point is 00:27:00 There are already some explainability techniques that we are applying in our systems. And the idea is to essentially interpret the output and use that to gain some insight on how to better architect the system. So that was how we found a lot of the design bugs that we've identified. It's still early days, so there's a lot that needs to be done in that area. But that would address the problem that you mentioned. I see. I see. And so what do you see as kind of the harder problem to solve?
Starting point is 00:27:30 You know, you've been kind of straddling this hardware side and software side throughout this discussion. Yeah, which one seems harder? That's a good question. So I think managing the dependencies is the hardest problem and is the most different from what traditional cloud applications look like. Because I had worked on using machine learning in cluster management before, I was a bit more familiar with solving
Starting point is 00:27:55 that problem. So in that sense, the hardware acceleration was newer to me, but I don't think that makes it necessarily harder than the other. I think both problems, both sides are two sides of a coin. They're both challenging. They just have different challenges. So on the one side it's more of an implementation of a hardware accelerator. You need to understand the networking stack in a lot of detail. You need to decide what parts of the networking stack it makes sense to have reconfigurable or hard-coded. On the other side it's more about selecting the machine learning
Starting point is 00:28:25 algorithms that are suitable for this problem, potentially developing new ones if that's required, and then collecting the traces, deploying the system. It's more of a distributed system design in that case. Right. So this sounds like an incredibly complex problem that touches multiple different fields and multiple layers of the stack. You know, we have talked about ML, we have talked about systems, networking, operating system, hardware, and so on. So how do you think about tackling such a problem, right? Like for someone with a computer architecture background, like are there simplifying, normally researchers do make simplifying assumptions. So are there like simplifying assumptions that you use or are there useful signals that you have had to figure out? Like, here's the part of the stack that makes the most sense to go and address first. So how do you go about scoping out this problem?
Starting point is 00:29:14 It's modularity, right? This is what microservices do. So you break it down to the individual components of the system stack. We're not trying to change everything at the same time. The first thing we did was design the applications and then do a characterization study. So from hardware level to distributed settings, what are the challenges of these applications? And you change one thing at a time. So you look at existing servers,
Starting point is 00:29:35 how well do they behave for these applications? Where are the bottlenecks? We in fact revisited some of these old questions of do you want the big cores, do you want the small cores for this programming model? And then you go to operating system and networking. What are the bottlenecks there? What does it make sense to change? At the cluster management level, what is the impact that dependencies have? And at the programming level, programming framework level, what is the right programming framework for this type of applications? And then based on that,
Starting point is 00:30:00 you prioritize. You see that if 50% of your end-to-end latency goes towards networking, then that's a huge factor. That's something that needs to be optimized. Once that gets out of the way, there are other system tasks that also consume a lot of cycles, but nothing that is as dominant. Similarly, if bottlenecking one microservice can cause my entire system to collapse because this bottleneck has propagated across tiers and made the end-to-end performance, I don't know, a hundred times worse, then that's a big problem to tackle as well. But it's on a different level of the system stack. So not trying to change everything at the same time. Do you find that there's cohesion across your students in terms of understanding the whole problem, all the way from hardware up through software? Or how do you find students that are able
Starting point is 00:30:50 to understand across the spectrum? Or do you sort of find, okay, this one is going to be targeted at this particular problem, this one is going to be targeted at this particular problem, do you modularize your students, I suppose? A little bit, a little bit. I do tend to take students both from the CS and the ECE side. It's not always that the ECE students want to do architecture and the CS students want to do systems, sometimes it's the other way around, but I think it's good to have a mix, because just from their backgrounds, ECE students tend to have a deeper knowledge of how to build hardware, how to work with FPGAs, what are the tradeoffs in architectures. CS students have a more global understanding of how the software stack will affect the
Starting point is 00:31:31 performance of the system, perhaps not always as detailed knowledge of the hardware levels. There is a bit of separation. So there are some students that work more on the hardware side. There are some students that work more on the machine learning for systems. Hopefully everybody understands what the big challenges are. They are all working with the same infrastructure, the same applications, and the same servers. So in that sense, there's a lot of collaboration between them. Great. So this seems like potentially a time to transition into asking about your experience as a junior faculty, since we sort of like touched on, you know, student and student hiring. So I believe you're our first junior faculty on the,
Starting point is 00:32:12 on the podcast. And so we'd love to hear some of your perspectives about the transition from being a grad student to being faculty, how you sort of got going, chose your topic, you know, wrote your, you know, the grant writing process, all that for some of our audience members. Sure. Let me remember, because it's been a couple of years at this point. So what I found, I did not, I should say, I did not radically change my topic when I switched from PhD student to junior faculty. So I still stayed within cloud computing. I started looking at different types of applications. So for my PhD, I had only looked at monolithic applications. There was still this idea of applying machine learning to systems,
Starting point is 00:32:56 but not for the complex applications that we have today. And at the time, microservices were not a thing. And also, I had not worked on hardware acceleration at the time. That is an entirely new topic. What I found most challenging switching from student to faculty was time management first, because as a student, especially as a senior student, most of what you do is working on your project. And at that point, you know how to work. You're very productive. You have good knowledge of the area, good knowledge of your project, you can make a lot of progress. As a junior faculty, first, you're most likely switching to a new place. So that's a very subjective thing, how easily people adjust to a new place and new people. I'm one of those people that don't adjust very easily. So that takes some time. But more than that,
Starting point is 00:33:45 you have to adjust to this idea of advising students as opposed to doing the work yourself. So sometimes it's very tempting to jump in and do a lot of the work yourself, but you're not there to do the work for the students. You're there to train the students to learn how to do the work themselves. Another challenge is that as a student, you know what advising style works for you. But as a professor, you have to adjust to the advising style that works for each of your students. And it's not always the same. So some students might like more hands-on advice, more frequent meetings, going through the implementation. Some students might prefer more high-level discussions, more infrequent meetings,
Starting point is 00:34:25 perhaps longer ones. So finding the right balance between what each student needs to be successful and to make progress, that is challenging at the beginning. And I guess one good piece of advice that I was given and that I also give is to not start with too many students, because people are very enthusiastic when they start as a junior faculty. They want to do a lot very quickly, but people should not underestimate how long it takes to learn how to advise others. Even if you've had some experience with advising undergrads or even other PhD students, it's still different when you're the main person that's responsible for that student as opposed to helping your advisor with a student. So it's better to start with one or two students at the beginning, learn the game and then scale up.
Starting point is 00:35:11 Yeah, no, that's, that sounds like very good advice for someone just starting out. You were asking also about the grants. I can, I can say a little bit about that. So that's part of the time management too, right? You have to learn to divide your time between working with students and teaching, which I was fortunate that at Stanford, I did get a chance to teach a couple of classes.
Starting point is 00:35:30 But again, it's not the same when you are the person that's responsible for the class. And then grant writing, which I didn't really have experience as a student. So that was something that was entirely new. And you have all the service in the department and outside the department. For grant writing,
Starting point is 00:35:44 it's something that you get much better at as time goes by and you learn how to frame what you want to do. So rarely the first grant that somebody writes is going to get accepted. Maybe if it's a small one, maybe if it's an industry grant. But if you're submitting to NSF or DARPA, usually the first one's not going to get it, and that's fine. It takes some time to get used to writing grants, expressing something that you want to do as opposed to something that you have done, which is what you learn to do as a student. But as with writing research papers, it's something that you pick up the more you do it. And again, when somebody starts as a junior faculty, at that point, hopefully, you know fairly well how to write papers.
Starting point is 00:36:37 So from that to learning how to write grants, it's not that hard. I do try to not submit too many grants. I also try to not have a really large group because I want to know what each student is doing and meet frequently with each student. I don't really like to have a hierarchy of multiple postdocs. And because I don't have a huge group, I don't have to submit a ton of grants. So that helps as well to put more effort into the ones that I do submit. You touched upon teaching very briefly here. Maybe you can expand on that a little bit, especially the last year we've had COVID and lockdowns and so on.
Starting point is 00:37:09 How has that affected both teaching classes as well as advising graduate students as well? Yeah, I can tell you about teaching first. So it is challenging, definitely. I am fortunate in the sense that my classes, neither the undergrad nor the graduate one, need physical presence. So students work with machines, but they can do that remotely. So in that sense, they've been fairly easy to transition online. Of course, you have to deal with different time zones. Some people are taking the class at 3 a.m.
Starting point is 00:37:38 So they cannot call in. They have to watch it later. So you have to adjust a lot of things. For example, I used to do short quizzes in the beginning of my undergrad class. You can't do that if it's 3am for some, because it's not fair. So some things had to be adjusted, but it's been easier, I think, for my classes compared to a lot of other people's that have in-person FPGA experiments, have robotics experiments that people need to be there for. So it hasn't been too challenging.
Starting point is 00:38:06 I've seen people struggle much more with teaching online, but of course it has its challenges. Advising students online, it also has the challenge of not being able to sit in front of a whiteboard and just think for a couple of hours. It's usually a Zoom meeting. More often than not, we have a lot of Zoom meetings in a day, so you don't want to make them even longer. If you have six hours, you don't want to make them eight or ten.
Starting point is 00:38:35 So, of course, that limits a little bit the time that you spend with each student. But I am glad that all my students, first of all, are healthy. They didn't contract the virus. They are safe. Most of them are in the U.S. I have a couple that got stuck abroad, but hopefully they will be making their way back to the U.S. soon. But they have remained productive and happy with the research. Okay, they're a bit less productive than they would be normally, and that's okay. It's a difficult time for everybody.
Starting point is 00:39:06 Indeed, indeed. So a little bit more on teaching. What are you teaching, a computer architecture class? What kind of classes are you teaching? Yeah, so the undergrad is a computer architecture class. It's a senior level. So it's not the introduction to computer architecture, but we do simple processors. So the five-stage pipeline, single cycle, FSMs. We do caches.
Starting point is 00:39:33 And then we do a little bit about advanced processors. So out of order, superscalar, branch prediction. Let me see if I remember the whole list: memory disambiguation, and a couple more. What else is it?
Starting point is 00:39:44 Speculative execution and a couple more techniques. And then they have to build a processor in Verilog at the end. So the reason I ask is because I'm actually on the side teaching a class, that same class, right now. And one thing that I've noticed is that a lot of things are the same from when I learned it a long time ago. And one of the things that you were talking about in the beginning is how we may have to change how we build and design hardware, given this new world of cloud computing and how everything looks a little bit different now. Do you think, sort of from a philosophical sense, we need to change how
Starting point is 00:40:20 we teach computer architecture? That's a good question. I think you still need the basics. You still need people to understand what a pipeline processor is. But if you look at computer architecture classes 20 years ago, right, the main thing that they did was, I don't know, single cycle designs or ISA designs, and that was it. And they have been augmented with several things, the classes that you see today. One thing that I have done is I have incorporated a discussion of accelerators and then large-scale systems and very low-power systems, so cloud computing and then embedded devices. It's not the focus of the class. There are higher-level classes, including graduate classes, that are specific for cloud computing,
Starting point is 00:41:01 which is one class that I'm teaching. There are also similar classes for embedded computing, but I do try to mention them even in the undergrad class so that people know that, while you need the basic information, that's not the only way that computing systems look today. Makes sense. Makes sense. Yeah. Yeah. I've struggled with this a little bit myself because you definitely don't want to release a computer architect into the wild that doesn't sort of understand, you know, a five stage pipeline. But at the same time, there are a lot of things that are going on in the field today that are very different from thinking at that level. What are you thinking about for the future? So you've had this beautiful body of work about ML systems and cloud and QoS and all the stuff that we've been talking about.
Starting point is 00:41:50 Do you foresee yourself continuing down this path? Like, is there a lot more to squeeze out of this area? Or are there other things that you're thinking about? What else makes you excited? Yeah. So fortunately, cloud computing is one of these areas that keeps transforming every few years. So you don't run out of problems to work on. I do plan to continue this work on microservices. I think there's still a lot of problems that are open, specifically with applying machine learning to systems.
Starting point is 00:42:16 We've only scratched the surface of what you can do. There's a lot more. For example, what's the right way to design microservices? If you have a monolith and you want to transition to microservices, how do you do that? The way people do that today is very empirically. So they look at the monolith and they start chipping away to design individual microservices. That's not a very systematic way. I think machine learning would help in that as well. I think it would also help with not only performance debugging, but correctness debugging.
Starting point is 00:42:41 So I mentioned in the beginning finding some design bugs. We're still at the early stages of what you can do with that. So part of it is finding the bugs. You might also be able to automatically fix some of those bugs using ML. Same thing with security attacks. You might be able to automatically detect when a system is being attacked and potentially block it. On the hardware side, the networking acceleration is kind of the first step. There are a lot of other system tasks that you can accelerate. There's a lot of work that can be done in programmability for these accelerators, especially since you are exposing them to the end cloud user at this point, who has neither designed the applications nor the accelerator.
Starting point is 00:43:19 So you need an interface that's much more user-friendly. And then slightly, not outside cloud computing, but in conjunction with cloud computing, I also have this project on coordination control for swarms of edge devices. And the idea there is using a programming framework like serverless, which is well-suited for applications with intermittent activity, applications that have short-lived tasks, and a lot of data level parallelism to offload some of the computation to a back-end cloud system.
Starting point is 00:43:52 And in that case, there are hardware questions. What is the right way to build the hardware that goes into the edge devices? The ones that I'm more interested in is how can cloud acceleration help with performance predictability in applications that span the cloud and the edge? What is the right way to manage resources? So how do you decide what to run at the cloud, what to run at the edge, and what's the right interface to design applications that go into the systems, which don't only have the complexity of the cloud
Starting point is 00:44:20 and the multiple components of the application, but also have part of the application that runs on this unreliable edge device, which has very limited resources and unreliable connectivity. I see, that sounds exciting too. That one has been challenging during the pandemic because we can't access the drones. Yeah, I can imagine. Yeah, because now computing is, there's a whole question of how to divide it up between cloud and something that is on the edge, and how to do that divide. Like, architecture sits at the boundary of hardware and software, right? So based on this work, have you had any thoughts or insights into, you know, how should these either microservices based systems or edge based systems or edge plus cloud based systems be designed? Is there something that's missing in our programming abstractions that would enable us to deploy these systems or manage the growing heterogeneity and things like that in a lot more seamless manner?
Starting point is 00:45:30 That's a good question. So architects are not going to like this answer, but higher level interfaces are better for these complex programming models. So of course it adds inefficiency. The higher up you go in the system stack, the more inefficiency it adds. But the complexity is such in many cases that you cannot expect the user to be exposed to all that complexity, whether that is defining all the APIs between microservices, whether it is deciding what's going to run in the cloud versus what will run at the edge, whether it's specifying constraints,
Starting point is 00:45:55 going to run in the cloud versus what will run at the edge, whether it's specifying constraints, scheduling constraints, resource management constraints, security constraints, all this, which application designers have to do today. Even as it is, it adds inefficiency because many times people get it wrong or they don't truly understand what the application needs or how it should run. Yes, high level interfaces add inefficiency, but they also abstract away that complexity. And then if you have a system that can understand the application and understands the system abilities and the system requirements, you can recoup that lack of efficiency there. So do you think the answer then is higher level interfaces for architects to think about? Or do you think that it's inserting a runtime layer where the architectural interface can sort of remain the same, but you still have this kind of another layer inserted
Starting point is 00:46:51 that enables higher level interfaces? Yeah, probably something like that. Of course, again, you're adding another level of indirection, so that has its issues. But the more complex systems become, the less you can expect the user to have a full, global visibility over all constraints and requirements. Yeah. Any words of wisdom or advice that you would like to share with our listeners, especially, you know, younger faculty or graduate students as they're charting this territory and this new and exciting space?
Starting point is 00:47:27 Let's see, this is not very new, but pick a problem or several problems that you love working on, because you'll spend a lot of time doing it and it's better to work on something that you really like, even if it's challenging, even if it's a heavy design project, which is something that many times people might try to avoid just because of the amount of time that it takes. So even though it might take longer, design projects are very useful, especially in the systems community. And it's important to focus more on the quality of work than the quantity. I'm also glad to see, since you talked about junior faculty, that a lot of tenure evaluations are starting to focus more on the quality of work instead of just the number of papers that somebody publishes.
Starting point is 00:48:15 And hopefully that will continue. That's great. So Christina, thank you so much for joining us today. It's been an absolute delight talking to you about these various topics. We were so glad to have you here today. Thank you. Thank you for having me. And to our listeners,
Starting point is 00:48:30 thank you for being with us on the Computer Architecture Podcast. Till next time, it's goodbye from us.
