Computer Architecture Podcast - Ep 19: Arkaprava Basu - Memory Management and Software Reliability with Dr. Arkaprava Basu, Indian Institute of Science

Episode Date: March 17, 2025

Dr. Arkaprava Basu is an Associate Professor at the Indian Institute of Science, where he mentors students in the Computer Systems Lab. Arka's research focuses on pushing the boundaries of memory management and software reliability for both CPUs and GPUs. His work spans diverse areas, from optimizing memory systems for chiplet-based GPUs to developing innovative techniques to eliminate synchronization bottlenecks in GPU programs. He is also a recipient of the Intel Rising Star Faculty Award, ACM India Early Career Award, and multiple other accolades, recognizing his innovative approaches to enhancing GPU performance, programmability, and reliability.

Transcript
Starting point is 00:00:00 Hi, and welcome to the Computer Architecture podcast, a show that brings you closer to cutting edge work in computer architecture and the remarkable people behind it. We are your hosts. I'm Suvinay Subramanian. And I'm Lisa Hsu. Our guest on this episode is Dr. Arkaprava Basu,
Starting point is 00:00:16 an associate professor at the Indian Institute of Science where he mentors students in the computer systems lab. Arka's research focuses on pushing the boundaries of memory management and software reliability for both CPUs and GPUs. His work spans diverse areas, from optimizing memory systems for chiplet-based GPUs to developing innovative techniques to eliminate synchronization bottlenecks in GPU programs. He is also a recipient of the Intel Rising Star Faculty Award, the ACM India Early Career Award,
Starting point is 00:00:47 and multiple other accolades, recognizing his innovative approaches to enhancing GPU performance, programmability, and reliability. In this episode, we get deeply technical on GPU programmability, particularly with respect to software reliability and efficiency. We had a great time nerding out with a few trips down memory lane sprinkled in. We hope you enjoy it as much as we enjoyed recording it. A quick disclaimer that all views shared on the show
Starting point is 00:01:12 are the opinions of individuals and do not reflect the views of the organizations they work for. Arka, welcome to the podcast. We're so thrilled to have you here. Thanks Lisa and Suvinay. And really, a big thank you for inviting me to this interesting podcast. Yeah. Well, we're glad to have you. For our listeners, I've known Arka for a very, very, very long time. He was an intern at AMD research when I was first a young full-time engineer.
Starting point is 00:01:44 And so we go way back and I'm really happy to reconnect here with you now. So let's start with our typical first question of the podcast. What's getting you up in the morning these days? So if I have to literally answer that, then that will be our two-year-old son. Every day I'm excited to figure out what new word he learns, what new tricks he learns. So that is the most interesting thing I would say these days. Turning to the technical side of things: it's how much of our daily life is now controlled or affected by software, and the fact that a lot of today's software relies on accelerators like
Starting point is 00:02:29 graphics processing units or GPUs, and that change has been sudden. For example, for many decades, until a few years ago, it was always the CPU where you ran your software. And there were other devices, including some GPUs and so on, that would hang off the bus on the side. It was always a side show. But now, suddenly, this GPU has become the new CPU, and the software that is written for it is quite different from what we were used to for many decades previously. And that brings new challenges and new opportunities. And since a lot of this software affects our life, how we use these accelerators, how you write programs for them, and how efficiently the software runs on them, really matters.
Starting point is 00:03:26 That's something that I actually work on and that's pretty interesting for me and exciting for me, thinking every day, what can I do, how can I contribute to that. Very cool. So I think the thing that you said there that really caught my attention the most is that the software that's being written for GPUs is quite a lot different than the software that you remember writing for CPUs. So, of course, CPUs and GPUs, they seem like distant cousins. They're kind of related and yet very different, such that from an architectural level, there's a lot of things that we can borrow, but a lot of things that are different.
Starting point is 00:04:02 And of course, that translates then to the software. What would you say is the biggest difference that if someone were to be a primary CPU programmer that was converting to a GPU programmer, what's the biggest change that they would have to wrap their heads around, would you say? Yeah, the whole programming model is different. The fact that there is this hierarchical programming model where you have hundreds and thousands of threads that can actually run concurrently.
Starting point is 00:04:27 It's not a very large CPU. It's not a CPU with many, many cores. It's completely different. You have to actually give a structure explicitly to these threads, and the sheer scale of it is quite different. Then there's the fact that, for example, those who have written some multi-threaded programs would be expecting, say, a kind of cache coherence, to use a little bit of technical terminology: you generally expect that for data
Starting point is 00:05:02 that is written by one thread, you'll see the latest value of it if you read it from another thread. That is something the hardware has always provided for them. Some of these assumptions might not always be true when you come to writing GPU software, because at that scale of concurrency, supporting those nice programming features automatically for the software is hard. And then there are many other nuances. There are a lot of things that GPUs are really good at, matrix multiplication and
Starting point is 00:05:42 so on. But there are things that they're not really tuned for: if you have a lot of if-else conditions, which often happens when you write CPU software, they don't run really well; they don't actually utilize the GPU well. So many considerations come in when you're trying to write high-quality GPU software that a typical CPU programmer might not have thought about. Right. So you talked about cache coherence and closely related problems, consistency, and so on. And of course, in the CPU world, we are used to fairly, let's say, strong semantics on the CPU side. And when you go to the GPU side, maybe it doesn't readily come out of the box.
Starting point is 00:06:31 Maybe we can expand on that. And you talked about applications where you have conditional statements or maybe irregular control or data flows. And that's probably one of the places where you hit this problem. So how do programmers think about this today? What support does the hardware provide? And how do you deal with it in the software side, both with respect to correctness, which you might have to think about explicitly in the context of many, many concurrent threads,
Starting point is 00:06:57 and also in terms of performance? Now the fundamental reason to go to substrates like GPUs is to get those vast amounts of parallelism. So how do you think about both correctness in handling these threads and performance considerations in the context of substrates like GPUs? Yeah, that's a really nice question. In general, I actually think about both of these: the correctness that you talked about, like functional correctness, and then this performance, say efficiencies or inefficiencies. In my mind, they are actually two sides of the same coin.
Starting point is 00:07:30 And in general, the broad umbrella that we can talk about is the quality of the GPU software. And it can have two parts. You want the program that you write, the GPU-accelerated program, to be reliable. It shouldn't be that if you run a program 100 times, 99 times it gives you the right answer, but one time it crashes or gives you something wrong. Because that one time it does something wrong, it can be catastrophic.
Starting point is 00:08:00 And we have seen so many examples in real life where a software bug can have a disastrous effect. Now that a lot of software today relies on GPU programs, a bug that only sometimes manifests can cause real trouble. So in that context, as you rightly hit the nail on the head, one of the big challenges is that when you're talking about GPU software, it is by definition a massively parallel program. Right?
Starting point is 00:08:34 And synchronizing a parallel program has been hard. I remember during my early days of PhD, at the time, there was this whole shift from single core to multi-core. So you needed multi-threaded software, and once you had multi-threaded software, there were these different kinds of bugs, like data races, where the program was not properly synchronized. So sometimes it would give you wrong answers or crash. Now, if you map that to GPU software, this is a problem of a different dimension. Because what happens, particularly in GPUs,
Starting point is 00:09:16 since you have hundreds of thousands of threads that could be concurrently running on a given GPU, synchronizing across that many threads is costly, costly in terms of the overheads. At the same time, given how you have to program GPUs, where you explicitly structure the threads into a hierarchy, it's often not necessary to synchronize globally across all the threads of your program. So there is this unique concept that doesn't actually exist on the CPU: in any of the GPU programming languages, be it CUDA or OpenCL, there is this concept of scopes,
Starting point is 00:09:57 where there are synchronization operations which you can qualify to say what subset of the threads are guaranteed to see, or witness, the effect of the synchronization operation you're doing, whether it's an atomic operation, a fence operation, and so on. Now what happens, and this is again unique to a GPU program, is that you can have, quote unquote, proper synchronization, but the scopes are wrong, where the producer and the consumer of a data item are not within the same scope of the operation. So, for example, on the CPU, if you write a program that has only synchronization operations, you can't have a data race by definition. But you can have such a situation in a GPU program, because even though you have synchronization operations,
Starting point is 00:10:46 they are not of the correct scope or strength. So it creates a new dimension in which your software can be wrong. And these synchronization bugs are particularly notorious because, by definition, they are intermittent. Sometimes they show up and then cause havoc. And most of the time they might not actually show up. But once they do show up, how do you debug it? The first thing that you would try to do to debug that crash, for example, is try to reproduce that buggy situation, to manifest that bug. But if you are unlucky,
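To make the idea of scoped synchronization concrete, here is a minimal CUDA sketch (not from the episode; the names buf and flag and the spin-loop pattern are illustrative assumptions) using libcu++ atomics. The scope attached to the flag decides which threads are guaranteed to witness the release/acquire ordering; a block-scoped flag shared across thread blocks is exactly the kind of correct-looking but wrongly scoped synchronization being described.

#include <cuda/atomic>

__device__ int buf;                                               // the data being published
__device__ cuda::atomic<int, cuda::thread_scope_device> flag{0};  // device scope: GPU-wide guarantee
// A cuda::atomic<int, cuda::thread_scope_block> flag would be cheaper, but its
// ordering guarantee only covers the threads of a single thread block.

__global__ void produce_consume(int *out) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {       // producer lives in block 0
        buf = 42;
        flag.store(1, cuda::memory_order_release);   // publish with device scope
        // With a block-scoped flag here, a consumer in block 1 is not guaranteed
        // to ever observe buf == 42: a scoped-synchronization bug.
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {       // consumer lives in block 1
        while (flag.load(cuda::memory_order_acquire) == 0) { }   // spin until published
        *out = buf;                                  // guaranteed to read 42
    }
}
// Launched as, e.g., produce_consume<<<2, 32>>>(d_out); the two blocks run concurrently.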
Starting point is 00:11:21 you might try many times, but that synchronization error, because of timing, is not going to manifest itself. So that overall affects the quality of the program big time. So before you go on, Arka, that was super helpful. I think maybe two things. One is I'll just provide a quick story, which is when we were both at AMD together, I remember writing a coherence protocol, which is not exactly the same, but the same sort of problem
Starting point is 00:11:47 where it doesn't manifest every time. And then in order to test it, you need to run like 100 million load and store instructions and see if you have a problem. And I remember it was like, oh, found a bug, fixed it. Now I hit it with 10,000 load stores. Okay, now I run again.
Starting point is 00:12:04 Oh, I'm so happy because I made it up to 100,000. It doesn't mean it's bug free. Nope, gotta fix another one. Oh, there it happens again. And now it's like, you have to wait longer and longer and longer until you find the bugs. And it doesn't mean that they're not there. So I totally get your story.
Starting point is 00:12:16 But the thing that I wanted to sort of maybe step back for a second and say is, you know, a lot of our listeners are on the more youthful side. And so I think their coming of age tends to be around a lot of ML stuff. Whereas our coming of age was GPGPU becoming a real thing, and everybody wanted to learn OpenCL, everybody wanted to learn CUDA; I guess OpenCL since we were at AMD, but you know, CUDA has now taken over the world. And so many grad students of sort of around our cohort were very intimately familiar with
Starting point is 00:12:51 GPU architecture and what that meant from a computing standpoint. I feel like it might be a little bit less ubiquitous now. So maybe you can take a moment to just say, like, you know, these scopes that you're talking about, the software scopes, are very tightly coupled with the hardware architecture. So maybe you can take a minute to just say, what is a warp? What is a bunch of threads? What are these scopes that you're talking about? Because you could imagine, like you were saying, you've got these hundreds of thousands of threads. And if you say you wanted to do a fence across 100,000 threads, gosh, that sounds awfully costly. But what is this hierarchical execution
Starting point is 00:13:25 that you're talking about? Just briefly, so that listeners can continue following along as you keep going. In GPUs, since you want to support a very large amount of parallelism, your hardware resources are organized in a hierarchy. If you want to relate it back to the CPU, you can loosely think of something
Starting point is 00:13:46 called a streaming multiprocessor as being like one core of a multi-core CPU; not exactly, but something like that. And you will have hundreds of these in a given GPU. And within a streaming multiprocessor, the basic structure is single instruction, multiple data, where you try to execute the same instruction on different data items. That's one of the key reasons GPUs are so good at parallel processing: single instruction, multiple data. Now this hardware hierarchy is also reflected in the software. And in this hierarchy, there are resources, like the level one cache, for example, that are private to a
Starting point is 00:14:33 streaming multiprocessor, the SM that we talked about. Now, this gets reflected in the programming dialect, say in CUDA or OpenCL. As I said, when you launch a kernel (a GPU function is called a kernel) onto a GPU, you essentially provide a grid of threads that can have hundreds of thousands of threads, right? And you then organize these threads so that, okay, a group of, say, a thousand threads will be part of one given thread block. And the programming language says that, okay, for these thread blocks, all
Starting point is 00:15:08 the threads in one thread block will always be executed on a single SM, or streaming multiprocessor. Which means that those threads have something in common: they can share a level one cache, for example. They can communicate faster among themselves than with threads that are part of different thread blocks. And this is exactly what scoped synchronization takes advantage of. Because you want to
Starting point is 00:15:44 synchronize when you have a producer of a data item and a consumer of that data item; and if I, as a programmer, know that both the producer and the consumer are within the same thread block, the same group essentially, then I know they can communicate faster through, say, the level one cache. If they are not part of that,
Starting point is 00:16:03 then you have to take a longer route, say to the level two cache. And this makes a lot of performance difference. Just to give you a feel of how much it can be: a typical, say, fence operation that you use to make data visible to other threads can be, say, 20 times faster when it only has to communicate across threads that are part of the same group, I mean the thread block that I talked about, than when you have to communicate across multiple thread blocks that are running on the same
Starting point is 00:16:35 GPU but in different parts of the GPU. So these are actually exposed to the programmer. So if somebody is not careful enough while writing the program, and that can actually happen, then if you're lucky, you'll still see the updated data. But sometimes the program actually has an error, and sometimes it would happen that the consumer of a data item would not actually see the latest data, and at that point, all bets are off. So that's the kind of complexity that comes in when you're talking about writing good quality programs for GPUs.
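As a concrete picture of this hierarchy, here is a small CUDA sketch (illustrative only; the kernel, names, and sizes are assumptions rather than anything discussed in the episode). Threads of one block run on one SM and can stage data in that SM's fast shared memory; making writes visible beyond the block requires wider, and costlier, fences.

#include <cuda_runtime.h>

__global__ void hierarchy_demo(const float *in, float *out, int n) {
    __shared__ float tile[256];                  // on-chip memory private to this thread block's SM
    int tid = threadIdx.x;                       // this thread's index within its block
    int gid = blockIdx.x * blockDim.x + tid;     // this thread's index within the whole grid

    if (gid < n) tile[tid] = in[gid];            // stage data in the block's fast memory
    __syncthreads();                             // barrier plus visibility, but only block-wide

    // Making writes visible more widely costs more the wider you go:
    //   __threadfence_block();   // orders this thread's writes for its block only (cheap)
    //   __threadfence();         // orders them for every thread on the GPU (much costlier)
    //   __threadfence_system();  // orders them for the CPU and peer devices too (costliest)

    if (gid < n) out[gid] = tile[tid] * 2.0f;    // trivial stand-in for real work
}

// Host side: a grid of 256-thread blocks covering n elements, e.g.
//   hierarchy_demo<<<(n + 255) / 256, 256>>>(d_in, d_out, n);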
Starting point is 00:17:15 These days, not only are the GPUs getting bigger, but now you have clusters of GPUs themselves, like these DGXs. I'm going to show my ignorance here on programming, actually doing the programming for these sorts of things. With the standard synchronization primitives that are used right now, the ones that your work, ScopeAdvice, is identifying as potentially inefficient,
Starting point is 00:17:41 does synchronization potentially span across an entire cluster of GPUs? Does that get even worse? I can imagine it's 20 times faster within a thread block on one SM than going to another SM. But now if you're potentially also having to go across NVLink, are there synchronizations that even exist that go across the entire thing, and how slow could that be? So far, we have been looking just within a single GPU, and even there, there is that much of a performance difference: a single line of code change can make a program something like 30% or 35% faster. But you are right that there is a lot of scope there as well, in the sense that if you have to go across the different GPUs in a
Starting point is 00:18:24 cluster, then yes, there is a way to synchronize there: there is another, different scope of synchronization for that, the system scope. And if you're not careful, then there's a lot of time you can actually spend just communicating with another GPU rather than doing compute, which is what these GPUs are essentially built for.
Starting point is 00:18:44 So it's very much a waste of resources. Yeah, that makes sense. And I'll say, during my time at Microsoft, this kind of problem comes up over and over again in architecture, right? You want to give a lot of flexibility, and you want to give programmers the freedom to not have to think about the hardware resources underneath, because otherwise, if the hardware changes, the program becomes non-portable or something like that. So you want to give programmers a lot of flexibility. At the same time, if you want really, really good performance, then you want to have a very good understanding of the hardware resources that are underneath
Starting point is 00:19:20 so that you can take advantage of them, right? And so just at a totally different layer, we would have customers in Azure who say, yes, you want to give us a bunch of virtual machines and you want the flexibility. We all want the flexibility, but just spray them anywhere across the data center. But really at the end of the day,
Starting point is 00:19:39 we want these guys on the same rack, because when communicating within the same rack, all you have to go up to is the top-of-rack switch and you're fine. I don't want to have to go multiple rows away. And that's essentially, in an abstract sense, the same thing as going to a different SM. So this notion of hierarchy, which is so useful in computer architecture, and this notion of leveraging the hierarchy, like trying to balance the tension between leveraging the hierarchy for performance purposes
Starting point is 00:20:08 and allowing programmers to be flexible and agnostic, and yet also understanding the hierarchy, like that's the tricky part. So it sounds like at the end of the day, you and your students came up with this tool, ScopeAdvice, which can essentially come in and find errors in sort of synchronization scope in a program.
Starting point is 00:20:27 Can you talk about that for a little bit? Yeah. Actually, there are two parts to it, I would say. One is functional correctness. What happens is that if you have a producer and a consumer of data that are not properly synchronized, in the sense that the scopes are not big enough to cover both the producer and the consumer threads, then you have functional bugs, where things are really bad,
Starting point is 00:20:53 like the program goes wrong, you can have a crash, things can give wrong data, and so on. The other aspect that we noticed while doing that is that you provide programmers with all this flexibility, so you can essentially fine-tune how much cost you want to pay for synchronization, as long as you correctly know where your producer and consumer threads are, right? But programmers also prioritize functional correctness over getting the best performance possible. So what we
Starting point is 00:21:35 have found is that they are also aware of the fact that if they get it wrong, if they pick a synchronization scope that is cheaper but doesn't do the job, then there is the kind of bug that we talked about. So what we are finding is that programmers can actually be conservative, and they might not use these cheaper but, I would say, riskier synchronization operations that lead to bugs if not used carefully. So the work that you are talking about, published recently at MICRO, builds that into a tool to aid programmers so that they can focus on
Starting point is 00:22:17 writing a functionally correct program, which should be their first job, right? And then rely on the automatic tool to figure out where performance is left on the table. That way, you make life simpler for the programmers. And that is the kind of series of work that we have been doing, on both the functional correctness side and the performance debugging side.
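To illustrate the flavor of suggestion such a performance-advising tool might make (a hypothetical example, not actual output from the MICRO paper): when every access to a counter provably comes from a single thread block, a device-wide atomic is stronger, and slower, than necessary, and a one-line change to a block-scoped atomic keeps the program correct while cutting the synchronization cost.

// Each thread block accumulates into its own slot, so only threads of that block
// ever touch per_block_count[blockIdx.x] during the kernel.
__global__ void count_positive(const int *data, int *per_block_count, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= n) return;

    if (data[gid] > 0) {
        // Conservative version programmers often write: device-scope atomic.
        // atomicAdd(&per_block_count[blockIdx.x], 1);

        // Sufficient here (compute capability 6.0+): a block-scope atomic, the kind
        // of single-line downgrade an advising tool could point out.
        atomicAdd_block(&per_block_count[blockIdx.x], 1);
    }
}
// The host reads per_block_count only after the kernel completes, so no wider
// scope is needed for visibility.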
Starting point is 00:22:44 Broadly, we call all of this using tools to help improve the quality of the software. And at this point, I should also acknowledge this Computer Architecture Podcast: long back, I think in 2020, there was an episode I was listening to, 2020 or 2021, but definitely during the COVID time, and Kim Hazelwood was your guest.
Starting point is 00:23:12 And at one point, she talked about how it is so hard to make sense of performance when you have a GPU program. And that also affected our thought process; we thought, yeah, we should actually have better tools that can help programmers write good quality GPU programs. So thanks, Lisa and Suvinay, for having this Computer Architecture Podcast in the first place. Yeah, absolutely. I think this is definitely one of the places where you can exchange ideas and learn about different themes of problems
Starting point is 00:23:48 and maybe how relevant it is in different spaces. So expanding on tools for thinking about both correctness and performance, can you tell us about your approach towards building these tools? What are the key techniques that you're trying to leverage? Do you do static analysis? Do you need dynamic runtime information? And how does this interplay with both correctness
Starting point is 00:24:09 and performance and the nature of the application itself? Some applications, you might know all of your data access patterns statically ahead of time at compile time. And for others, for example, graph processing applications or others, you might actually have runtime or input-based data dependencies. So how do you think about building a tool that can provide the right set of visibility and guidance to the programmer on how to best get the right balance of correctness, but also getting the best possible performance for the application that they are targeting? If it is okay, I'll
Starting point is 00:24:39 answer this question in two parts, since you asked how we actually got into it. There is a kind of personal story behind how we started on this project, which actually started initially in 2020 and 2021. So it goes back a long way. And then I'll come back and answer your question about dynamic and static tools. During my early days, about two to two and a half years into my PhD program, I wasn't sure what I was actually working on; I was just trying to figure out what my topic would be. At that time, I was doing my PhD with Mark Hill at
Starting point is 00:25:21 the University of Wisconsin-Madison. At that time, a new assistant professor, Shan Lu, joined the department. Now, those who know Shan Lu know she is an expert in bug hunting for multi-threaded CPU software. At the time, multi-threading, and finding concurrency bugs in multi-threaded programs, was really a hot topic. So I tried to learn something from her, and even tried to see if we could collaborate and start a project. Ultimately, it didn't completely work out,
Starting point is 00:26:00 because I got interested in something else, but that learning was there: about data races, what kinds of errors can happen, and so on. And then finally, when I started working on GPUs, I initially looked into the GPU memory subsystem, the virtual memory subsystem, because that is what I had actually done for the CPU for my PhD. But then this learning that I had about
Starting point is 00:26:23 what can actually go wrong in concurrent software came back, saying that in the case of a GPU, it's like concurrent software on steroids. And many years later, that started this whole thought process that we need to look at what kinds of unique functional correctness issues, what bugs, can happen in GPU software. So that was a bit of the story of how it all started.
Starting point is 00:26:48 Now, getting into more specifics, like dynamic tools versus static tools, this is a very timely question. And the reason is, currently the tools that we have built so far and actually published are dynamic in nature, because, as you pointed out correctly, under certain circumstances, for some types of applications, you require runtime information. Graph processing, for example, can actually be input dependent, right?
Starting point is 00:27:18 But dynamic tools have their drawbacks, including very large performance overheads. And although it's a debugging tool, it's not only performance overhead but also memory: it takes a lot of memory space for metadata. And in the case of a GPU, the memory capacity is constrained. So what kinds of applications you can run this tool on becomes limited, because the tool itself is taking up a bunch of the memory that you have. So the current tools that we have actually released are dynamic, but we are pretty close to getting something on the static analysis side.
Starting point is 00:27:59 We're actually working on something that hopefully we'll submit in a month or so, where we're looking at a static analysis tool: how much information you can actually get statically. And there are constraints, as I pointed out, like the input information not being there. But we have started from the dynamic analysis tool and are seeing whether static analysis can help, or how much it can help, the dynamic analysis, so that some of the drawbacks of dynamic analysis are mitigated. At the same time, you might not want to rely completely on the static analysis tool.
Starting point is 00:28:29 That is super cool, Arka. First of all, your two-part answer. That is very interesting about how your encounter with Shanlu at Wisconsin, you were interested in the time, you learned some stuff, and it didn't pan out at the time. But then, here we are, what? It's probably 15 years later now, 10 years later, where it has actually been formed like a pretty big cornerstone of your research program. And I think that is really amazing. And so we usually talk career stuff more at the end of the podcast, but it just seems like a good time to say, a lot of times what I see in the youth, I don't know why, these days, everybody wants
Starting point is 00:29:09 to get somewhere faster, faster, faster. So if something is not directly helping you reach your immediate goal, it's like, whatever, I don't want to do it or something like that. I mean, I don't want to generalize too much, but that is our society, right? Fast, fast, fast, go, go, go. So I just thought it was really cool to hear that story about how this kind of new professor comes on. Maybe we can collaborate, maybe we can work together.
Starting point is 00:29:34 Doesn't even work out at the time, but now here we are so much later and it's like the foundation of your work and you have this nice call out for her here. And that's another thing. We have a lot of thank you call outs on this podcast where people talk about someone who has inspired them, someone they learned from. And so this is a really collaborative field. I just want people to really remember that, you know, almost nobody can do any of this by themselves.
Starting point is 00:29:57 So that's a super cool story. And then with respect to the second piece, the more technical piece, the static versus dynamic tools. So on the static side, I guess I've got to sort of go back in my memory a while now. So when you have something like a thread block, which has a thousand threads in it, that is being mapped to a particular SM, and it's being mapped to a particular SM in groups of warps, right? So like that's what, 32, 64 threads at a time. So that presumably is happening in real time, if I recall. So you don't
Starting point is 00:30:29 have a particular schedule of how exactly those thousand threads are being mapped onto an SM. So to what extent, I mean, I guess you start with dynamic because it's in some ways a simpler problem in terms of correctness, right? And so once you go to static and you lose that information of what order things are actually happening in, like I know you said something is going to come out soon, but what would you say is the key difference in what you're able to discover? And is it any different from, you know, sort of CPU static versus dynamic analysis tools? Was there anything that surprised you there? Yeah, one thing that always happens with a static
Starting point is 00:31:12 tool is that you tend to become conservative, right? And the effect of that conservativeness is that you tend to churn out false positives, because you just don't have the full information. You assume something can happen that actually would not happen given the constraints on the inputs. And then you can report something that doesn't actually pan out. So that is one of the problems that happens with static analysis. And I don't want to give away all our thunder for the upcoming work, but at a high level, what we're looking into is whether static analysis can help the dynamic analysis.
Starting point is 00:31:56 It's not necessary to do everything at runtime. To your question about how it is different from the CPU side: GPU programs often tend to be more structured than CPU programs. You can actually draw more semantic information statically, and that can aid your dynamic analysis. So you don't need to do either one or the other.
Starting point is 00:32:30 One can help others. I think that's a pertinent point, which is static analysis, you do lose a lot of information, but the fact that GPU programs are structured allows you to glean some semantic information within the scope of that structure. And if I recollect correctly, this is not the first place where you have made this observation.
Starting point is 00:32:49 Even in some of your other work, you have tried to extract semantic information where possible in order to provide the right scaffolding for your maybe dynamic tools and analysis to either improve performance while guaranteeing correctness or otherwise. For example, I think some of your other work has looked at unified virtual memory, so expanding the scope out of only the GPU memory. So GPUs can also talk to the CPU memory and that's available in recent GPU platforms. And in those cases, also you might want to have
Starting point is 00:33:18 like dynamic tools in order to understand when you should fetch data into your GPU memory, which is the faster HBM, versus when it needs to be in the CPU address space and at what time do you trigger these transfers. Maybe you could elaborate a little bit on how do you think about what is the semantic information that's relevant and how does this change the flavor of the dynamic tool, like whether it's reactive in terms of what's happening in the application versus trying to be a little more proactive in figuring out what's happening in the application versus sort of, you know, trying
Starting point is 00:33:45 to be a little more proactive in figuring out what performance bottlenecks or what performance issues can actually be tackled. And you're absolutely right. Like, you know, some of this, what happens often across multiple projects is that you learn something useful in one project, and then you carry over some of this meta-learning over to other projects. And one of the things that you're actually pointing out, yes, we actually have been looking into this how to oversubscribe the GPU's memory. As we know, typically GPUs, you have a plenty of compute. But if one thing that is in shortage is the amount of the memory that a GPU card can
Starting point is 00:34:27 have on board. At the same time, more and more, there's a need to actually process a larger amount of data. You might be forced to use multiple GPUs just because you're running out of memory in one of the GPUs. GPUs are costly. It's not the compute. The memory is what's causing you the hardship. In that context, as you pointed out, there is a way in modern GPUs where it allows the programs running on GPU to access the DRAM, CPU attached DRAM, which is of course, much larger, you can easily have data bytes of memory there. But the problem has been that in accessing this DRAM, data on the DRAM over the PCI interconnect
Starting point is 00:35:19 and from the GPU is pretty slow, has been slow. And the key observation that we had is that you don't need to be reactive to actually reduce that word. You can be proactive. And proactive in the sense is that as you are currently, as you are pointing out, right? That, you know, the GPUs often have more structure in how the programs are written.
Starting point is 00:35:43 So if you do a static analysis of those GPU programs, you can glean out a lot of semantic information about the memory access pattern. That is, it's going to actually happen. And then use this information, feed it to the driver, GPU driver, that would actually move the data between the RAM and the GPU's memory, the HPM.
Starting point is 00:36:06 In the way that then what is really happening is this driver itself becomes proactive in the sense is that before the GPU program actually requests some data, it's already there because your static analysis already told that this is what is expected like one of this particular program. So that is also a kind of a mix of both static analysis that is informing how at the runtime or dynamically the GPU runtime and the driver is reacting to in managing the overall memory, including over substitution. That is one of the other work that we currently look into, how to over-subscribe the GPU memory.
Starting point is 00:36:54 So with UVM, I think the main thing that I heard you say is you do a bunch of static analysis to figure out access patterns so that you can have the driver, you maybe pre-perform some memory accesses so that data can be ready for the GPU. This allows you to potentially oversubscribe GPU memory. I guess I just wanted to confirm that what I heard more or less is that you use static analysis to search for access patterns such that now your driver is almost functioning like a software prefetcher. Is that pretty much the case? That's part of it and it's an important part of it. So also what happens in a GPU is that you launch so many threads on a GPU card and
Starting point is 00:37:36 like you want to actually launch a work on the GPU, the hardware may not be able to execute all of those threads at the same time. But through static analysis, you can actually find a relation between, say, you have a large data structure that won't actually fit in your, say, GPU memory. But you can actually find through static analysis, like, what subset of thread will access what part of that data structure. And now you also know, given this the part of runtime, the dynamic information coming in that on the GPU that you would run that code, okay,
Starting point is 00:38:09 it has, it can just run only this subset of threads at a given time, which then with the static analysis, you know only a part of that large data structure and which part of the large data structure would be accessed when these thread blocks are running. You just try to keep that on the GPU memory while you know this other part of the data structure, all we need it. So that is the static analysis informs the driver not only like what to prefetch, but what data to kick out whose use has been used. It's not going to be actually required anytime soon.
Starting point is 00:38:49 That is the information you're actually coming from the static analysis. Okay. So I guess what I'm trying to figure out exactly is how you would decouple the dynamic execution pattern of a thread block on an SM and say, like, oh, well, I've prepared something except oops, you know, this time we're doing the warps in this order instead of that order. And so then now it's messed up. It sounds like what you're saying is that you're avoiding that kind of pattern search. And what you're looking for is, okay, if this warp is happening at this time, these patterns tend to happen maybe in warp. And therefore, when the next warp shows up,
Starting point is 00:39:28 I can identify a pattern and then make sure everything that in this warp is gonna be hunky dory. So this is just one position. Like it's not, we look at slightly at a higher level, like this is a thread block level, which is like instead of 32 threads, it's more of a thousand, thousand 24 threads. But at a high level, what you're actually saying is
Starting point is 00:39:47 right. Look at things like what data would those threads actually require. And there is one more thing is that we also found out that in an application, you have different types of data structures. It's not like one data structure that you'll actually access in a GPU. So what we've also done is that in some cases, you just can't actually, static analysis is not good enough. Like you can't actually figure out what that access pattern would be. Like you have a pointer-based data structure in a pointer chase. Like how do you figure out what data you will access? It just depends, right? But the observation there also is that, again, you don't have to do all or nothing because even in a, say, application that is doing,
Starting point is 00:40:32 having a data structure that is a pointer chase, that is doing pointer chase, but there will be other data structures that whose access pattern you can actually figure out. So what you can actually do is that for the parts of the virtual, the data structures whose access pattern you figured out, you do the static analysis and guide the driver what to keep, what not to keep. The other place essentially you can just say that
Starting point is 00:40:57 static analysis cannot help. So you are falling back to how do you do it, react as you see those memory access pattern, and then you actually fetch them. So it's not all or nothing. That makes sense. I think then if you're looking at things, so two things, if you're looking at things from within the thread block, I guess the thing that I was wondering is, would the dynamic scheduling of warps within a thread block affect your prefetch patterns?
Starting point is 00:41:24 If you're doing static analysis, then you know, there's maybe a particular pattern that might be different if you're doing it if the order of the thread blocks ends up being different. So that's one question. Like, how are you making sure that it's not that, I suppose? And then the second question being then, you know, if it is more or less like a software prefetcher, and not just a software prefetcher, but you're also like a replacement guide, I suppose, as well. What ends up being the trigger? Because you've got this static analysis,
Starting point is 00:41:50 presumably for a given program, and now is it saying, oh, I know this particular program, because maybe it's different patterns for different programs. So what is the actual trigger that tells it to do something? So those are the two separate pieces.
Starting point is 00:42:04 Yeah, so I think we do actually multiple things in that app, but there are actually multiple triggers. One of the triggers that is for this, what you call, which is rightly so, it's kind of a stop-start. Preparation, what also happens is that in a GPU accelerator software, you typically also have this pattern where you iteratively call a kernel.
Starting point is 00:42:25 Within a loop from the CPU side, you're actually calling the same kernel, but it would be operating on different parts of the data structure. Now you see that the trigger is this end of a loop on the, or you can say kernel launch for different parts of the data structure to pre-fetch what data would actually be required. So pictorially, if you think of like, this is pretty common, you have a loop on the CPU,
Starting point is 00:42:54 which actually launches a kernel that will actually operate on different disjoint part of a data structure. And your trigger is that every time you actually finish your iteration and you want to actually get back, you can actually, you know, pre-fetch things up. Before the kernel starts running and start requesting data that then you would have otherwise have to actually service
Starting point is 00:43:16 and pay the cost of moving those data in the critical part. So this goes off the critical part. So that is one example of trigger. So about like what thread blocks are actually running. So first of all, we also did a little bit of reverse engineering of figuring out this is not too hard, what order these thread blocks will be actually executed.
Starting point is 00:43:39 So that is something that we do take advantage, but that is not necessarily absolutely needed. So, for example, you know that if this set of thread blocks going to run, they will touch a given part of the data structure. And you see the first access, you can immediately push all the data that would be required for, you know that it would require for those thread blocks. So So that also we actually do, but we also take advantage of this reverse engineer information in which order the thread blocks will actually get scheduled. They're pretty fixed across the generation of the GPUs, doesn't change. So we talked about a single GPU and multiple GPUs and then we have expanded the scope to GPUs and CPUs in the context of unified virtual memory.
Starting point is 00:44:29 Maybe expanding that a little further, we have various storage layers in our hierarchy. You talked about the memory hierarchy in the GPU, and then there is memory on GPU plus the CPU. Now, people want access to even more data for different applications. Some of your upcoming work sort of tackles the problem of things like KV stores or key value stores. And those require way more memory than even what's maybe available in a DRAM.
Starting point is 00:44:55 And so you need to look at other forms of storage in the entire hierarchy. So expanding on that, you looked at sort of persistent memory. It's traditionally been in the realm of CPU systems. It's also a relatively new memory technology in the sense that it's not been deployed as widely in real production settings and turns out to be somewhat finicky to use both in terms of correctness and in terms of performance. Can you maybe double-click on some of these applications that require
Starting point is 00:45:22 this vast amounts of storage, and how that intersects with tooling to understand what is the best opportunity to leverage different parts of your memory hierarchy, how that intersects with concurrency and performance, like CPUs and GPUs have different attributes, both in terms of their computational capabilities, their memory footprint and bandwidth, and also ease of programming. And ultimately, it intersects with the application
Starting point is 00:45:46 and workload characteristics. So tell us a little bit about these applications that require different forms of memory and compute, or different phases in the application, and how that intersects with your techniques and tools to enable these kind of things. That's a nice question. So, you know, this also goes back a little bit back in time.
Starting point is 00:46:09 So, I think in 2021-22, that time frame, you observed that a lot of compute have started using a lot of software, different software have started using GPUs as their primary compute platform. But there isn't one, I would say, class of domain of software that needed persistence, like a key value store or databases, that are pretty important in the general software architecture are kind of missing out. And then this, if you just to even how important this could be, like anytime, for example, we go to Instagram or Facebook, in the backend, in all this data that we actually tend to fetch, go to like, probably a software stack, this name's Rockstabuse, it's a parser-stacky value store,
Starting point is 00:46:59 it's kind of a NoSQL database, it remembers things, you can actually log even your clicks and everything. And you can actually look up the data with an identifier, which is called keys. And we are talking about internet scale software, which means that the throughput is very important. You need to serve a lot of requests together. In that context, the GPUs are the throughput engine. However, now you regret two things, which doesn't so far like did not gel well, right? You need throughput, which GPUs can provide.
Starting point is 00:47:32 The same time, the GPUs are not designed to deal with storage of the power systems as such in general. And the reason could be like the storage or the power system memory that we actually talk about are, you know, hang off the CPU, not like the storage or the persistent memory that we actually talk about are you know hang off the CPU not from the GPU. So it's naturally it is natural that you look at how CPU can actually leverage them not the GPU. Now that is why we actually started thinking hey look there isn't the way that we actually try to think is that is there an interesting use case from the software point of view and we we thought that, yes, there is.
Starting point is 00:48:07 Not only this kind of this parser-stripped key value store that I actually mentioned, but you remember when you are actually doing many of the GNN training from time to time, it is very common to actually checkpoint this learned weights and so on to us somewhere that you would not actually forget, somewhere in parser-stripped storage and so on to us, somewhere that you would not actually forget, somewhere in the first step, like storage and so on.
Starting point is 00:48:28 And same happens for really long running, high performance, this scientific computing, like computational fluid dynamics, because you don't want to lose if something goes wrong, you don't want to lose this, and all what you have actually learned through hours or days of computing. So there is a need.
Starting point is 00:48:45 Like a lot of computers move down the GPU, but they don't have a good access, a direct access to the storage or the persistent subsystem. So that triggered like in how, you know, we can create this enable this ability for the GPU to directly access and read, write to persistent memory. So that was startup work like this S plus 22 and 23 works and so on. And then we started thinking a little bit more digging more into the specific more commercially deployed software systems that makes use of
Starting point is 00:49:19 persistence. And that's where we actually started looking into something called this Rocks TV, which is used by Meta and actually from actually from Meta. And similar actually, software does exist from Google, for example, it is called the LevelDB. And what we found was something interesting. What we found is that even in the software, that is, you know, we are not talking about the GPXL software, but we are just talking about like tech software that is running on the same CPU today and commercially deployed and you see you tried to break down how much time is spent on whack. Found out that even though those are what you call the persistence aware software because they need persistence, they want to
Starting point is 00:49:57 actually store something, it's not that they're actually spending all their time there. Majority of the time actually goes in the software manipulating in-memory data structures. And then only a small part really goes in writing, reading or writing from the persistent media. This is particularly true if you actually run them with fast storage systems of today, including persistent memory. So what it means is that you don't always,
Starting point is 00:50:27 even for the persistent program, you don't have to always worry about, oh, how much bandwidth I'm having to the persistent medium or not, or how long it is actually taking. You have a lot of time that you're spending on seeing memory data structures, and you can actually speed them up on the GP,
Starting point is 00:50:44 particularly when you have an internet skill in a throughput, that you have an internet skill service and you are trying to support a large amount of throughput. So the trick there is to figure out what are the parts of the current software that can be actually accelerated, that can actually have a lot of parallelism and can be accelerated well on the GPU. Fork them off to the GPU. Don't try to do everything from the GPU. Leave the parts that are not too much critical to the performance and let the CPU handle that.
Starting point is 00:51:17 So that division of labor becomes extremely important. And that's what we actually showed in our upcoming paper, SIGMOD, where we show how you can take a commercially deployed software stack, break it up neatly between CPU and the GPU, and you can have very high throughput because you're certainly able to use GPUs. So this is interesting, Arka, because as I was telling you before we started this episode, when you were an intern at AMD and we were starting to think about GPGPU very, very early days, the use case of cooperation between CPUs and GPUs was a big question. Was it going to be fine-grained? Is it going to be coarse-grained? And at the time, we really didn't know.
Starting point is 00:52:01 And so there was the question of how we were going to design these things without having a solid understanding because there was no existence of a use case. And here we are again now. Gosh, this has got to be like more than 15 years later, where here you have found a use case of how to partition work between a CPU and a GPU. So my question to you here is, so it
Starting point is 00:52:23 sounds like this particular use case that you found is you're dealing with in-memory databases, you're doing a lot of work in the in-memory databases, and then at some point you want to save off a bunch of stuff to persistent memory. So it sounds like what you're saying is, you do the in-memory database manipulation because that's big and highly parallelizable on the GPU,
Starting point is 00:52:42 and you do the saving off to persistent storage on the CPU. Because initially, when you first started talking about this observation, I was like, oh man, are you saying that we want to hang Flash off of GPUs now too? Are they just really going to become a whole cousin of CPUs where they have the same sort of resources? But eventually, you came to know we're going to send that stuff back to the CPU to store. I guess the question is, presumably in this kind of use case, you have a lot of in-memory data manipulation. But then you're able to boil that down to something smaller to ship back to the CPU to store.
Starting point is 00:53:16 Is that right? I imagine it would be a problem if the thing you had to store was also very large, because you wouldn't necessarily want to ship that back. Yeah, and then the other thing that actually happens is that, and this is, I think, so when this software is written, like, you know, the developers, while they should have written this CPU, they're all aware that the storage would be slow, right?
Starting point is 00:53:43 So once you just start writing a software, assuming that you use data structures, like they use like log structure, March Tree, and so on, LSM Tree, that are suited for that kind of scenario. You want to actually reduce your time going to the storage. And now if you actually attach a very fast storage system,
Starting point is 00:54:05 that time that you actually already your software is designed to reduce your time to going to a storage. And now it's the hardest, if you have a very fast, say, fast system memory, that time actually shrinks big time. That suddenly opens up the opportunity for you to, because you're now, if you look at the end to end time, you're spending the majority of time at the end-to-end time, you're spending the majority of time
Starting point is 00:54:26 manipulating the in-memory data structures before you actually write, as you say, only in a limited amount of time and then there's some compaction and so on, and then finally you actually write something to the storage medium. And you can actually let it be in the CPU. And then that is okay because you are already the time that you have been in the Amdus law coming. This is a small part, leave it alone. The bigger part that you are actually still spending on this manipulating in-memory data structures, you can also do something like GP. So that is what comes out of that. There are different challenges comes out on that because what happens is that when you were in the software
Starting point is 00:55:11 expects you to execute something sequentially, but you are actually executing them parallel in GPUs. At the same time, you want to make, keep the pretense that to the higher level software that is actually using the RockSleep in the the back again, is that nothing really changed. It's just dropping. So what it means is that you need to then start playing this all these trees, which you keep multiple versions to hide the fact that you ran something in parallel, but you
Starting point is 00:55:39 don't want to show that result to the customer. The customer's software is just using, say, this backend software, RocksDB, in the background. So a lot of tricks come in there, and that's where a lot of the fun is, and the contributions are as well.
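As a rough illustration of the multi-versioning trick being described, here is a minimal sketch; the two-slot layout, the epoch scheme, and all names are assumptions made for illustration, not the actual design behind the RocksDB-compatible backend:

```cuda
// Each entry keeps two value slots tagged with the epoch that wrote them.
// A GPU kernel updates entries in parallel into the older slot; readers only
// trust slots whose epoch the host has already published as committed, so the
// client never observes a half-finished parallel batch.
#include <cuda_runtime.h>

struct VersionedEntry {
    long long          value[2];   // two versions of the payload
    unsigned long long epoch[2];   // epoch in which each slot was written
};

__global__ void batch_update(VersionedEntry* table, const long long* updates,
                             int n, unsigned long long next_epoch) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Overwrite the older of the two slots, preserving the most recent one.
    int slot = (table[i].epoch[0] <= table[i].epoch[1]) ? 0 : 1;
    table[i].value[slot] = updates[i];
    table[i].epoch[slot] = next_epoch;
}

__device__ long long read_committed(const VersionedEntry& e,
                                    unsigned long long committed_epoch) {
    // Return the newest value whose epoch has already been committed.
    int best = -1;
    for (int s = 0; s < 2; ++s)
        if (e.epoch[s] <= committed_epoch &&
            (best < 0 || e.epoch[s] > e.epoch[best]))
            best = s;
    return best >= 0 ? e.value[best] : 0;   // 0 stands in for "no committed value yet"
}
```

The host would advance the committed epoch only after the parallel batch, and whatever needs to be persisted, has completed, so clients see either the old state or the new one and nothing in between.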
Starting point is 00:56:16 Yeah, so circling back to the top of your answer, you touched upon how evolving technologies, in this case storage technologies, change the performance trade-offs, so the bottlenecks shift across your application, and that also reveals new opportunities for optimization. One consideration I had here: when you talk about the division of labor between CPUs and GPUs, one constraint that often comes up is that GPUs are typically attached to the CPU host via a PCIe bus or PCIe link, and those are typically fairly low bandwidth. I wanted to get a sense of how much this plays a factor in how you think about the division of labor, but also in terms of emerging technologies. So, for example, NVIDIA has the Grace Hopper-based systems, where they actually provide very high bandwidth between the CPU host and the GPU. And so if you have things like NVLink that provide sufficiently high bandwidth, would that enable different avenues for optimization, either for these applications in terms of the division of labor, or maybe for other applications where you have found the PCIe bottleneck is the reason you're hamstrung.
Starting point is 00:56:54 And if you actually had higher bandwidth, you could maybe go to much higher QPS, queries per second, or throughput, or you could support a different form of division of labor between the CPU and the GPU. This is an excellent question. We actually had to handle this. As we were talking about division of labor, in general it feels like, okay, this is a division of labor between what happens on the CPU versus the GPU. But there is another division of labor that happens on the memory side of things.
Starting point is 00:57:22 For example, in a key-value store, your keys are relatively small compared to the data, right? The data can be, you know, multiple kilobytes. So one of the things that we did is that we kept the values in the DRAM and put pointers to them in the in-memory data structures that we kept in the GPU's HBM, the GPU memory. And why was it important?
Starting point is 00:57:50 Because a pointer is always eight bytes, but the data can actually be multiple kilobytes. And if we had to move this data to and fro over PCIe, from the DRAM to the HBM and then back, then the benefit of the speedup that you get by doing things in parallel on the GPU would be lost. Most of it would be lost. So we had to also think about not only
Starting point is 00:58:17 the division of labor in terms of what happens where, but also where the data, the values and the keys, is actually placed. So that is also a key part of the decision-making process.
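A minimal sketch of what that placement could look like, assuming the values sit in pinned, mapped host DRAM that the GPU can dereference over PCIe while the small index entries live in HBM; the structure names and the zero-copy mapping are illustrative assumptions, not the actual system's layout:

```cuda
// Index entries are small (key + 8-byte pointer) and live in GPU HBM; the
// multi-kilobyte values stay in host DRAM, mapped so a kernel can follow a
// pointer only for the few values a query actually touches.
// (Error checking omitted for brevity.)
#include <cuda_runtime.h>
#include <cstdint>

struct IndexEntry {
    uint64_t key;
    void*    value;   // 8-byte pointer into host DRAM, not the value itself
};

int main() {
    const int    n           = 1024;
    const size_t value_bytes = 4096;        // multi-KB values

    cudaSetDeviceFlags(cudaDeviceMapHost);  // allow zero-copy mapped host memory

    // Values: pinned, mapped host memory, addressable from the GPU over PCIe.
    char* host_values = nullptr;
    cudaHostAlloc(reinterpret_cast<void**>(&host_values),
                  n * value_bytes, cudaHostAllocMapped);
    char* dev_view_of_values = nullptr;
    cudaHostGetDevicePointer(reinterpret_cast<void**>(&dev_view_of_values),
                             host_values, 0);

    // Index: small entries, resident in GPU HBM.
    IndexEntry* index = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&index), n * sizeof(IndexEntry));

    // Build the index on the host and copy it over once; only 16 bytes per
    // entry cross PCIe here, never the 4 KB values themselves.
    IndexEntry* h_index = new IndexEntry[n];
    for (int i = 0; i < n; ++i)
        h_index[i] = { static_cast<uint64_t>(i),
                       dev_view_of_values + i * value_bytes };
    cudaMemcpy(index, h_index, n * sizeof(IndexEntry), cudaMemcpyHostToDevice);

    delete[] h_index;
    cudaFree(index);
    cudaFreeHost(host_values);
    return 0;
}
```

A query kernel can then walk the HBM-resident index and dereference only the handful of value pointers it needs, so the bulky payloads are not shuttled back and forth wholesale across PCIe.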
Starting point is 00:58:45 Now, coming back to your other question: hey, if you had a much faster link between the CPU and the GPU, would you have designed something differently, and would it even make some things much faster? The answer is certainly yes. For example, one of the things that we did not talk about much is that in this, say, persistent key-value store, although ultimately you persist things from the CPU and so on, at the same time, when you are writing something, you log as well. And in our case, all of those actually
Starting point is 00:59:05 happen from the GPU directly to the persistent storage. Those logging requests actually cross the PCIe, which can easily add around 300 nanoseconds of latency each way. If that were faster, performance would improve significantly. You could actually do much more, because the logging overhead is non-negligible: if I remember correctly, 15 to 16 percent of the time goes into logging, even though you are doing it from the GPU and there is
Starting point is 00:59:36 parallelism; it still takes time, because the persistent medium, the storage, is not attached to the GPU. It is on the other side of the PCIe, on the CPU.
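For a rough sense of scale, taking the 15 to 16 percent logging figure quoted from memory at face value, even making logging entirely free would bound the end-to-end gain at roughly 1.2x; a faster NVLink-class link would shave the roughly 300 ns per-direction PCIe crossing off each logging request rather than remove logging, so the realized benefit would sit below this bound:

```latex
\[
  \text{Speedup}_{\max} \;=\; \frac{1}{1 - f}
  \;\approx\; \frac{1}{1 - 0.16}
  \;\approx\; 1.19\times
\]
```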
Starting point is 01:00:08 Certainly looks like a very rich area for further research as newer systems and technologies come to the fore. And yeah, we look forward to reading about your work at SIGMOD this year. So, you've talked to us about a variety of different projects, and we have peppered this with the origin stories for those projects, but maybe this is a good time to wind the clocks back and ask you about your own origin story. How did you get interested in computer architecture, and how did you end up at IISc? I'll be happy to provide my perspective. I really wanted to understand how systems, how the computer, actually work. I think that is where I got interested in computer architecture. At the end of the day, whatever we do in the algorithm or in the software ultimately needs to run on the silicon. Without that, we get no value. That is why it felt like, okay, I'll start from the bottom of the stack and figure
Starting point is 01:00:46 out what's happening there. And then what really happened, and this is also a personal story: as I mentioned, I started working under Mark Hill at Wisconsin. And as you can see from our earlier conversation, I was searching for what my thesis topic could be, because none of what I initially worked on actually went into my thesis. So what happened there was, I would say, maybe a stroke of luck during my time at Wisconsin.
Starting point is 01:01:22 Typically, what they used to do is that there would be rooms where the students sit, and the computer architecture students would always sit together in a set of rooms. So you are in a room with the computer architecture students, then there are some rooms where the systems students sit, and then another floor where the theory students sit, and so on. That's how it was organized. In my case, what happened, later in the program, in the second year
Starting point is 01:01:51 of my PhD, is that I got assigned to a room where the other person in the room was a systems person. It happened to be Haris Volos, who worked on operating systems. And at that time, he was working on persistent memory on the CPU side, the memory subsystem. Now, looking back, we spent a lot of time discussing ideas. A lot of them, looking back,
Starting point is 01:02:20 were not so well thought out, and so on. But we discussed a lot. And I think how I got into the virtual memory system, why I did so much co-design of the operating system and the hardware in my PhD, some of that happened because I was sitting next to a systems person
Starting point is 01:02:41 and listening to what he was doing, and so on. There is a kind of induction that happens. And that's how I got into virtual memory and so on. After my PhD, I joined AMD Research, and until then I had never actually spent any time thinking about GPUs. When I joined AMD Research, the exascale project was in full swing, and it was pretty clear that if you want to get to those exascale systems, the flops are
Starting point is 01:03:15 going to mostly come from GPUs. So for me, there was no option but to work on the GPU, which turned out to be a big blessing. That's why I started working on GPUs, but I knew virtual memory. So you do what you know, right? I started doing virtual memory for GPUs. And then, as I got more and more comfortable with GPUs, eventually I had to move back to India.
Starting point is 01:03:39 There were some personal reasons, my dad was falling sick and so on. So I said, okay, I'll move back to India. And then, you know, I had an opportunity to join industry, Microsoft Research in Bangalore, if I can say. But I had also come to understand that one of the things I really liked and enjoyed was mentoring juniors,
Starting point is 01:04:06 the students and the interns. So I thought, okay, I'll give it a try and see if I can help create students who are passionate about doing systems research in India. And that's how I landed at the Indian Institute of Science. That's quite a history, with a lot of serendipitous things, right? You come across this person. I think the thing we're learning here is that you come across certain people, you learn some stuff from them, and eventually that just provides fertilizer for future work, right? And I think that is something our listeners could take away, which is that every encounter with anyone, cross-field or same field,
Starting point is 01:04:48 whatever it is, is potentially fertilizer for future work. And you're a case in point. And now here you are at IISc, you know, a premier institute in India. And we're happy to have you, except now you're on sabbatical, right? At ETH, is that right? Or was it? I'm actually on sabbatical at EPFL. EPFL, yeah, awesome. So how long have you been at EPFL? I've been here a couple of months, almost three months now. I'll spend the next few months here before heading back, so that's the plan. Cool. India, pretty different from Seattle. You've been all over the world,
Starting point is 01:05:28 doing computer architecture research, systems research, all over the world. Yeah. Yes. As I said, I learn from people, and that's what I'm here for. Hopefully, I'll bring back some learnings that in the future might become the start of another thing, another project. Yeah, the traveling computer architecture researcher, it reminds me of Paul Erdős, the mathematician who used to hop from place to place,
Starting point is 01:05:56 go and visit different mathematician friends in different cities and universities, and everywhere he went he would pick up a new problem based on the conversations, put it in his notebook, and then solve it either right then or at some point later. So maybe in your journey as well, you've interacted with multiple people, and, as you said, serendipitous interactions, and structured ones as well, lead to the seeds of new research projects and ideas.
Starting point is 01:06:22 Very cool. So before we close, do you have any words of advice for our listeners? I hope there are young students, maybe PhD students and so on, who are also listening to this. One of the things that I have learned over the years is that as long as one is learning something new,
Starting point is 01:06:43 we don't need to look for immediate results. If you learn something new and learn that thing well, careers are long; you never know how it will come back to help you later on in your profession. So you don't always have to look at the clock. And that way, one can enjoy the research journey more. I love it, because it's true. If you think of it as a long game, then everything is future fodder.
Starting point is 01:07:15 If you think of it as a short game, then we're constantly feeling like we're failing. So it is a long game. I mean, I've been around a lot. I feel like I've sort of watched you grow up in some ways, Arka, because you were just like a baby-faced intern, tons of enthusiasm. And now here you are, like this well-established, well-published professor at the top institute. And so it's just been very cool to come back and talk with you. I really appreciate this opportunity. Thank you.
Starting point is 01:07:43 We are very happy to have you here. Yeah, absolutely, Arka. I think it was a delight speaking with you. And to our listeners, thank you for being with us on the Computer Architecture Podcast. Till next time, it's goodbye from us.
