Computer Architecture Podcast - Ep 23: Cross-stack Design and Tooling for Large-scale Distributed AI Systems with Dr. Tushar Krishna, Georgia Tech

Starting point is 00:00:00 Hi, and welcome to the Computer Architecture Podcast, a show that brings you closer to cutting-edge work in computer architecture and the remarkable people behind it. We are your hosts. I'm Suvinai Subramanian. And I'm Lisa Xu. On this episode, we were joined by Dr. Tushar Krishna, who holds a PhD from MIT and is an associate professor in the School of Electrical and Computer Engineering at Georgia Tech. Tashar's work shapes how the computing community designs modern, large-scale distributed AI systems, spanning specializing specializing. accelerators, memory hierarchies, and communication fabrics, as well as driving design space exploration with pioneering tools like Astrosim, Chakra, and Garnet. A member of the Iska, Micro, and HPCA Halls of Fame, his impactful research has garnered over 21,000 citations, and the 2025 DAC Under 40 Innovators Award. He also actively shapes future AI computing standards as a co-director of Georgia Tech's CRNCH, and it's a co-ch

Starting point is 00:01:00 of the ML Commons Chakra Working Group. Tashar joined us to discuss how he got his start with network on chips but gradually made his way up the stack and into larger systems, all the way into large-scale distributed AI systems where he spends a lot of his focus now. Keep listening to find out how maybe for Shakespeare all the world is stage,

Starting point is 00:01:18 but for Tashar, everything is a network. A quick disclaimer that all views shared on this show are the opinions of individuals and do not reflect the views of the organizations they work for. Tashar, welcome to the podcast. We're so excited to have you here. Thank you, Lisa. Thank you, Savina for inviting me. I'm really, really excited. Yeah, we're excited too. And it sounds like you have listened to the podcast before. So the first question is going to be, what's getting you up in the morning these days?

Starting point is 00:01:50 I'm actually not a morning person. So what really gets me up in the morning is my daily shot of caffeine, which is often chai that I brew to really wake up all of my transistors. What's really getting me very excited, to be honest, to wake up these days, is just reading about AI, right? I think it's actually been very sumbling and also exciting to see a lot of the role like myself and like my broader community is playing in enabling AI. So I think that's something that's been fascinating. I think the last few days looking at how, you know,

Starting point is 00:02:22 how effective Claude has become in terms of coding or like seeing people talk about massive, massive improvements across different fields. I think that's just been very, very exciting. And the scale of systems that people are building to Raniai is also something that's very, very exciting. Of course, it comes with its own challenges with respect to building these power and so on. But I think in terms of what it would take to build these large, massive data centers, people are talking about putting these data centers in space now.

Starting point is 00:02:53 So, I mean, everything is moving at a pace, which is very, very exciting. So I think that's been something that's drawing me a lot towards just waking up and kind of getting to work. Yeah, yeah, sounds good. I mean, totally. Everything feels very, very accelerated with respect to this AI, right? I feel like when I was a lot younger and people would find out what I did, I was a computer scientist or whatever, what people would ask me like on airplanes or whatever is like, so what computers should I get? How much RAM should I get?

Starting point is 00:03:23 And I would be like, that is, I mean, I'll answer that question for you. but it's not the most exciting thing. But now, if you walk around, what people are asking is like, what's with chat GPT, what's with Gemini, what's with this banana nano thing, what's with Quad Code. And it was especially exciting, I feel like, because prior, or immediately after 2012, AI became a big thing, but it was still stayed within our community, right? The broader public did not know about it, but now the broader public not only knows about it,

Starting point is 00:03:53 but is using it and is being encouraged to try to accelerate themselves, even if not being technical people. And so it's like a totally different landscape. You're right. It's exciting. Yeah. I think it's also reached a point where I feel like it's both very fast, right? Like literally something that you talk about, like in two months it's obsolete.

Starting point is 00:04:12 Something fancy has come out, right? I mean, which is very, very unique, I think, in this kind of a domain. But it's also in the grand scheme of things quite slow because we are really still like the infancy stage of AI. So I think that's been very fascinating. Yeah, so I've known you for a long time, Tushar, right? You were an intern at AMD while I was there, so many, many years ago. And since then, you made a name for yourself working on network on chips. And now it seems like you are working on AI sort of things as well.

Starting point is 00:04:44 So why did you tell us a little bit more about that, as well as your transition in this area? Yes, yeah, yeah. So I can talk a little bit about my journey. So I remember we met, I think it was 2009. or 2010, I was interning at AMD. My PhD work was on on-chip interconnects, and a lot of talk that time was about many core and how do you scale to hundreds of cores on-chip,

Starting point is 00:05:10 and specifically how do you scale coherence. So I was looking at new kinds of knock architectures for that, and my PhD work ended up being in the area of trying to design knocks where you can traverse multiple hops within a single cycle. I was looking at how do you kind of leverage wire delays and avoid having to stop at every router. So that I think, so that got me to Intel right after my PhD where I joined VSSAD.

Starting point is 00:05:40 This was a group run by Joel Emmer in Massachusetts. And the group was looking at custom acceleration. This was a little bit pre-AI. And the idea was to look at accelerators. And what was exciting now was instead of like CPU cores, these were more like customized processing units, but you still had the problem of kind of communication between them. These were smaller cores.

Starting point is 00:06:02 So I guess some of the ideas I had in my PhD of going many hops in a cycle made a lot of sense because you could actually jump through many of these within a clock cycle. So I started looking at like knock fabrics there. After that, so that group kind of disbanded after about a year. this was again, as I said, pre-AI. So there was no killer app. And so unfortunately, I think that the company didn't see a lot of value in such an

Starting point is 00:06:28 accelerator at that time. And so that's when I ended up moving to academia. So in the interim, I actually got a chance to work on this chip called Iris with Vivian Z at MIT, Bassoon, Eushin, and Joel. So I was kind of transitioning between my time at Intel to, I was in the process of interviewing. And then they were like, hey, you know, we're building this chip. and we would love to have somebody help design and all. So that kind of actually introduced me to the kind of two convolution neural network.

Starting point is 00:06:55 So this was, so the ImageNet moment had happened. AlexNet was there and the excitement was, okay, there is this very interesting workload called AlexNet. And it seems to be showing some promise and we can do this well on these, whatever the mobile GPUs were at that time. But let's see if we can design something that's more specialized. And so that kind of is where I think I really kind of got to learn, about Kibb.

Starting point is 00:07:21 So unlike a lot of my knock work where we were trying to really design for very general traffic, I think what was very exciting here was given this specific neural network, you knew exactly in each layer once I partitioned it, this is the exact traffic that will flow, right? And so that was very exciting where we could kind of strip down the knock to something that was much, much simpler and gave us very, very high performance for this workload. So once I kind of got to join technology, this kind of generally got me excited. And I think while we were still, of course,

Starting point is 00:07:51 I mean, the idea of domain-specific acceleration was an area that I wanted to look at, I think since AI was kind of becoming popular. And I think the advantage of that was also, so as an architect, it's always, well, you can kind of design systems. How do you get access to workloads? It takes a while till the community gets to a suite of benchmarks. I think the nice thing with AI was these models were coming out. You could kind of get access to these.

Starting point is 00:08:15 And so that's what got me kind of tailored towards looking at acceleration for AI. And so again, that same mindset. I was like, okay, I'm fundamentally designing a knock between these processing elements. I know how to do this. It's really about understanding the traffic patterns. So that kind of started moving me a little bit, I would say up the stack, right? I mentioned in my PhD, it was a lot more about the architecture of the knock and understanding wire delays and so on. I was now moving up the stack to, okay, how exactly is this workload going to get mapped onto this substrate because that will exactly determine the traffic pattern. And if I know that traffic pattern, I can design a knock that's a lot more domain specific.

Starting point is 00:08:54 And so that's where we started doing work in more kind of flexible AI accelerators. I think I still kind of had this view and this vision that we still want things that are more flexible because, yes, it's AlexNet today. At that time, it was like, it's AlexNet. Now we've moved on to Google Net and Resnets are coming out. So things are still changing. People were starting to talk about sparse convolution neural network. So it was like, okay, we still want something that is more general,

Starting point is 00:09:25 maybe something that's not as general as a mini-court knock, but something that's still more general, but at the same time tailored for the traffic pattern. So that got me into the domain of data flow, where it was kind of the specific kind of spatial and temporal scheduling of the workload because that had direct implication on what the traffic. So I think concurrently, of course, a lot of work from, again, Joel, Emer, Vivian, and Eushen on data flow. I think that got me excited, a lot of work that was happening.

Starting point is 00:09:55 Again, with my former Intel colleagues who are now at NVIDIA research, Michael, Aung-Sho, who were all looking at data flow. So we kind of started collaborating around thinking about the data flow and the implications of how that leads to the traffic requirements and then how do you design like interconnect architectures. So we started kind of thinking about this. went on to design a lot of interesting accelerators. Mary was one very exciting accelerator that I'm very proud of from my group where we were

Starting point is 00:10:23 trying to design a extremely flexible AI accelerator where you could essentially kind of configure it down to individual PEs doing their own individual tasks. And then depending on how you map, you can have a group of these PEs act as a vector engine together or a matrix engine together. And so that was kind of the flexibility we were trying to offer. So that, I think, got me, as I said, like, that kind of got me moving higher and higher up the stack. So then we started thinking more about more generic data flows.

Starting point is 00:10:55 How do you formalize thinking about data flows? We did this work called Maestro where we were trying to think about a formal language to describe data flows in terms of how they reflect, how data is getting mapped and kind of moved up the stack even further. And so that's kind of been. by journey into AI accelerators. And today I'm very excited to see many of those ideas have made it in some form of the other, right, to a lot of custom NPUs that do exist in industry.

Starting point is 00:11:25 Of course, there's these GPUs that have been dominating, but I think a lot of the custom AI6, both in the data center space and as well as on the edge, have kind of moved towards these directions around data flow optimization and data reuse optimization. Right. That's a fascinating journey. And I think you talked about the earlier. stages or early signs of code design, right? So in the prior world, when you're looking at many core architectures,

Starting point is 00:11:48 like standard CPUs, but multicores, we focus on communication patterns with general workloads, and then the transition to AI, and especially the fact that they have fairly predictable communication patterns, and how it's coupled very tightly to the data movement, the data flow, some of the early stages of code design, and that, of course, led to your series of works like Mary, Maestro, and a variety of others.

Starting point is 00:12:09 So just expanding that piece a little more, how do you think about cross-tag design for AI? I think in the early days, it would have seemed somewhat unusual, coming from a standard processor background, to start looking at applications with very well-defined data flows where you know exactly what's going on

Starting point is 00:12:25 and then looking for opportunities to optimize them. I'm sure this now extends not just across the chip, but maybe even further across the entire system. So how do you think about the multiple layers in designing all of these AI systems, starting from accelerators at the bottom-most layers, at the chip level, you have intra-accelerated data flow. And then as you start putting all of these chips together to larger and larger systems,

Starting point is 00:12:47 you have similar flavors of themes of problems, but maybe with slightly different constraints or with a set of more interesting trade-offs that you can explore. Yeah, that's a great question. So I think that's, so you're right. I think this was early days of, kind of maybe defining and kind of trying to formalize what we mean by co-design. Because I think one of the things, as you alluded to, right, what AI brought was,

Starting point is 00:13:11 somewhat of a rethinking of the entire stack. So there was, of course, there's always been the very classic stack in terms of architecture and micro architecture that we've all learned about and leveraged in the CPU world. And so as long as you were below the ICE, you could do a lot of these tricks with the micro architecture, ISA was the contract, you never wanted to break. And then you had, of course, a lot of the compilation and the PL above it.

Starting point is 00:13:36 But what was interesting with AI was, I think, A, the ability and the opportunity to squeeze a lot of performance meant that you could break down some of these stacks and break down some of these layers. And also, some of these layers were probably designed for a very general purpose computing platform and computing paradigm. While here, there was an opportunity to rethink a lot of this. So I think with that, I think it was interesting to play a role in trying to very actively think about what these layers should be. So in some sense, the architect, like so that like how I talked about data flows, right? So what parts of this data flow are kind of getting literally like going into silicon? What parts are being exposed up to the software stack?

Starting point is 00:14:24 What can you really program? What does it mean to program something that is configured? There are all these interesting conversations coming up in terms of how do you do code design. Because fundamentally, I think while if you kind of go into the domain of, even folks who've been doing, say, FPGA design for a while, right? People know that if you have a very specific algorithm in mind, you can literally create RTL for it and get performance. I think that concept was known,

Starting point is 00:14:50 but it was never economically viable to just go build a chip for each algorithm. So AI was a domain where it was economically viable. People were like, this was a stage of time when, again, a lot of AIAC companies were also popping up. So it made sense to kind of think about this. But again, you didn't want to over-design as, well and over-optimized because then you would not be future proof. So that's where I think this co-design kind of conundrum came in where you want to do

Starting point is 00:15:17 co-design, but what are the right abstractions? How do you ensure that you can still get this cross-layer optimization benefits while still maintaining enough flexibility that as your systems evolve and as your workloads evolve, you can still get good performance? And I think one big example of this was, say, when the move from CNNs to now LLMs, where I think over-designing for convolutional layers, which is an example of what Iris did,

Starting point is 00:15:47 Iris, as I said, was really, really optimized for AlexNet. But if you kind of run, say, a transformer model on that today, a lot of those design decisions in the hardware may not be, may not make sense. While I love, in fact, what, say, Google did with the TPUs, where they went with a more general. generic abstraction of a matrix multiplication unit, which kind of stood the test of time, even though we moved from CNNs to transformers.

Starting point is 00:16:10 So I think things like this were the kind of challenges you started thinking about in terms of co-design. And as you said, right, I think as workloads evolved, as we started getting to training, not just like smaller models, models were becoming larger. I think as transformers came out, I think the first, I think, at least for me, the exposure to more distributed ML came with,

Starting point is 00:16:34 looking at training these models. So you needed not just one accelerator, but many accelerators together to train a model for the compute throughput or even for the purposes of memory. Like training footprints are large. You cannot put everything into a single memory of a single GPU. Or if you have very large models, you may not even be able to fit the body weights in a single GPU.

Starting point is 00:16:56 So all of this started kind of exposing me to more distributed ML. And in fact, I think what you mentioned, so when I actually remember a conversation, we had many years back where I think this same abstraction idea where I started with, ultimately it's a knock between PEs, right? Now it was basically a P.E was now a chip. And again, ultimately we're still thinking about designing the interconnect between these. So the abstraction level went up even higher.

Starting point is 00:17:21 So I, as I said, going from now links and like chips to packages to a larger system. The fundamental problem was still similar. You have some workload. you are trying to map it onto these different quote-unquote compute units, which are now bigger rather than a single processing element or a Mac unit. And as soon as you do the mapping, that again affects the traffic pattern. So I think from my lens, it was the same problem. I'm still trying to figure out what is the traffic pattern between all of these accelerators,

Starting point is 00:17:50 but now it's kind of from a high level of abstraction. So that actually introduced me to the idea of collective communication, so especially with distributed training. And again, I mean, now that's also become common with inference becoming distributed, but at least at that time with distributed training, it kind of, again, got me into that very systematic thinking about, okay, this is the workload, this is how I partition it, this is the exact traffic pattern. And this is in this collective nature where all of these accelerators, again, GPUs,

Starting point is 00:18:19 but not are all communicating. And again, what are the right fabric architectures for these? I think that's where some of the exciting questions came in. One thing I'll also mention from a core design, perspective that I felt and I feel is much more challenging once you go to a larger system, is that you are also dealing with a lot of different kinds of link technologies. So I mentioned you're trying to design a network, but on a single accelerator, ultimately you are designing a network on that chip and at least it's the same,

Starting point is 00:18:53 whatever, processing, a process technology. Once you're getting to a larger system, this network now includes on-chip link, It includes on package links. There's a whole packaging, whatever the packaging technology and whatever its constraints are come in. You're talking about links within a rack. And so now you're talking about whatever high-speed link technology you're using there.

Starting point is 00:19:15 So as an example, NVIDIA say today uses something like NV link there. And then you're talking about links across racks in the scale-out domain. And this is where, again, are you using Ethernet or Infini Band and or like even optics today? So I think kind of the interplay of a lot of link. technologies and the role they play in trying to design the entire network begin. Again, from my perspective, great. So even more networking questions and a lot of my learnings and a lot of my building blocks in terms of understanding how do you build networks was now playing a role here.

Starting point is 00:19:48 Again, with the same constraint of the traffic pattern is more deterministic. So I can do something more concrete there. Yeah, so that was really interesting, Shishar. I think I want to ask a little bit further about this co-design question, right? Because when I worked in Data Center, we had a lot of some of the similar questions, except just along the intra-data-center traffic route, right? Where you've got multiple communication,

Starting point is 00:20:11 then server-to-server communication, inter-rass communication, all that kind of stuff. It was just like not necessarily AI. We're talking about general workloads. One thing that caught by ear when you were talking about this co-designed piece was the decision of exactly how to, because you still have to create abstractions, right? You've got this whole new playground.

Starting point is 00:20:28 It's a new workload. It's a killer app. It's enough of a killer app that you have to design custom things for it. And now you've got this empty, clean sheet, where are we going to draw the lines so that you can remain flexible? And then on top of that, you've got these link technologies like you were talking about. So I guess two questions. Did you find there to be some sort of general principles about how to draw the lines to make sure that you could future proof, like some sort of rules of thumb? And then secondly, when it comes to things like these link technologies, to what extent do you feel like it's about the protocols that are going over these links or the physical aspect of the links themselves?

Starting point is 00:21:10 What does that interplay like when you're thinking about the design? Yeah, that's a great question, Lisa. So I think to your first question about how do we even, if you're starting clean state, how do we even think about these lines? So I'll answer it in two ways. So one is, what's interesting is that some of these lines just get drawn based on the expertise. What do I mean by that? So, I mean, once you're kind of now designing this entire vertically integrated system, if you think about the expertise, a lot of the expertise of the AI models is really with the machine learning practitioners, right?

Starting point is 00:21:42 And then you're kind of coming down to folks who are thinking about the software stack. And so this is, I would say, folks with a lot of experience on like on, on, of just software infrastructure software systems down to architecture, down to even I mentioned circuits. So I think there's some natural boundaries just get created based on what are parts of the system that you are aware of, which are natural, but which is also, I think, the abstraction layers that sometimes we've had to break to get even more performance. I mean, just taking a tangent.

Starting point is 00:22:14 For example, if I, oh, I really want to get, I mean, these days people are looking at numeric formats and so on. So here you're talking about something at the ML algorithm level, but you're really thinking about bit precision, which is like a very, very low level going down to how many literally wires do I lay out in Parglin so that I can get and how many can I fit on a chip. So that's, I think, so that's been, I think, the opportunity of code design, but ultimately I think that's one granularity where some of these lines get drawn.

Starting point is 00:22:43 And I think that's what's happened, I would say, like naturally, right? Like there is certain abstraction layers that just got added based on, some of the tooling ecosystem that got created. So, PyTorch, like, I mean, early days it was cafe, did it actually pie torch, TensorFlow, jacks. I think they kind of subsumed a lot of what was the traditional programming languages layer and even a lot of what was traditionally the compilation layer, at least some of the front-end compilation.

Starting point is 00:23:10 Moving down to like different companies having their own kind of operate, like whatever libraries for compute and communication, that kind of became the layer at which. which you entered an accelerator. And now you had like the software stack and the hardware of the accelerator. Ideally, as architects, we really want to very clean ICSA, but that's something that's not really happened. So that became one piece.

Starting point is 00:23:32 And then as I said, you go down further to the actual implementation. So with that kind of with that high level idea, I think what's now interesting is, as you said, right, like how do you decide, right? How much to code design and how do you future proof? And I think this is where I think overall what's helped. And again, I'll kind of. I will credit a lot of the front-end software stacks for having done this. I think the idea of operators that you would like to run became a lot more prevalent here.

Starting point is 00:24:02 So, I mean, ultimately, I think what's happened is if I have a framework like a jacks or a pie torch, there's these kinds of operators that I would like to accelerate. And so then it just simply becomes how much do I envision this operator kind of changing. And so that's where this operator is something that seems to be. be quite robust now across many, many generations of these models. This is something we would like to really, really optimize for and like harden in silicon versus there are some of these operators that may not be, may not have kind of reached that level of robustness yet.

Starting point is 00:24:37 And so here maybe let's not over design. Let's actually keep it quote and quote more programmable. So as an example today, I think some of the operators that, like for example, matrix multiplication and some of the vector ops are really baked in. into silicon while a lot of the operators related to activation functions, you typically use more programmable hardware like your vector engines or DSPs for running those. So I think that's at least the direction where things have moved, which seems to have kind of been somewhat stabilized.

Starting point is 00:25:10 I think that's the way I'll view it, even though there's always an opportunity to optimize, as I said, if you harden things more. So that's, I think, the first part, I think, it's kind of a, again, I think, As I said, it's one of those domains where I still feel I think we need better abstractions and a clearer boundary, but at least, as I said, I think just based on the economics of things and expertise of things, we've kind of navigated to these layers, which seem to be reasonable so far. Could you remind you what was the second part of your question? It was sort of related to this like abstraction layer. When you're thinking about design

Starting point is 00:25:44 Yes, roadocles versus links, yes, yes, yes. That's actually a great question. So I've actually had the time to think about this. So it's a bit of both. At some level, right? You are right. Like, I mean, in some sense, ultimately, what we care about is the protocol that's running. And like the physical layer is really just a mechanism of implementing that protocol.

Starting point is 00:26:03 And it may have certain tradeoffs. I think what's, so in terms of functionality, I think that makes sense. And you can actually, as long as you design the right protocols, you can get these things to work. But again, AI is this interesting domain where we're really. trying to squeeze out every ounce of performance. So now that's where some of the underlying physical characteristics coming. So for example, I have links that have a certain bandwidth.

Starting point is 00:26:29 That's the only amount of bandwidth I can drive. And I have this bandwidth mismatch between, say, links on chip versus links on package versus links that are, let's say, on cross racks. And so that's the physical constraint. The sizes of the messages that are going to be going on those links is really, dependent on my AI model and how I've partitioned it and so on and so forth. So that's kind of the message sizes that will have to be communicated. And so now I have this message that has to go through a series of these pipes that have varying

Starting point is 00:27:01 kind of widths because some of them are fatter, some of them are narrower. And one interesting nature of a lot of this communication is the collective communication especially is it's not just communication. You're also doing computations. You're actually doing reduction. So some of these messages are kind of getting combined. and becoming larger and larger. So as you're going through the network,

Starting point is 00:27:21 your messages are kind of becoming larger, but now trying to go through narrow pipes. So there's all of these interesting, I would say, challenges that come in, which again makes it imperative to do a very good partitioning early on. And that's where you need to be a little bit more aware of the underlying physical links to get good performance. I think the analogy I have in mind is often, like, for example,

Starting point is 00:27:44 in the HPC world, right? People would try to write code in a way that it really fits in the caches and it's aligned to the cash lines, even though that's technically a micro-architecture thing. So functionally, yes, as long as you don't really need to do that, but if you really want to get performance, make sure your tiles are aligned to the cash size, I think something similar is needed here

Starting point is 00:28:04 and is something that is happening, which also makes things more challenging. And that's why I think a lot of the physical constraints and the physical characteristics are affecting design decisions as you go out to stack. So is there something akin to, what was it called it, like the Intel software, something manual. I forget what it's called now. But there was that big fat book that came out with each processor that basically said,

Starting point is 00:28:28 okay, if you really want to squeeze performance in this processor, these are the things that you need to do that would maybe potentially go down to the instruction where are there equivalent things? Yeah. I think something like that is, I think, missing. So I think what's happening today is that this kind of expertise, exists inside companies. And I don't think they've,

Starting point is 00:28:52 I would say, specifically release manuals to, because there's still, I would say a big competition in terms of, say, I system and AI hardware. So if I'm a company,

Starting point is 00:29:01 let's say, I'm an Ingridia or I'm an AMD, I have these customized libraries that are aware of a lot of these underlying nuances. And so if you kind of give me a certain operator, I will kind of compile it down based on that, which is,

Starting point is 00:29:15 which also becomes part of their secret source. And so which is also why I think any time there's a new chip that comes out, even though in principle it can give me very good performance, you really need kind of this intermediate layer to kind of get that performance. And I don't think these manuals exist. And in part maybe because things are not even, I mean, things are moving so quickly that there is no systematic way of doing this today. I think if we reach a point where we, in Intel's case,

Starting point is 00:29:44 it's dominant player, it has a CPUs, it and it really wants all the customers to be able to get like really good performance and it kind of releases all of this for you to be able to do that. I think we have not reached that stage yet, which is why a lot of it just feels like weird, dark magic and suddenly you get like, okay, really, really good performance after tuning your software. That's really what's happening behind the scenes. I mean, some of these things that are tuning to low level hardware to, you know, what you're running from the software. Yeah. We're saying sometimes you tune your software and just boom, you get this jump in performance. And you're just like, yay.

Starting point is 00:30:21 And it made me think about how, and part of the reason why we don't have these manuals is because companies don't want to expose their secret sauce, right? And so back in the day, Intel would have the software developer's manual and then you go through that and use its wisdom to try and figure out what to do. But at the same time, you had Tom's hardware, something new comes out, and they've got extremely targeted workloads to figure out what is happening on the inside. So like, oh, you have bad performance. Oh, we had a performance spike. That means that the microarchitecture looks like this. I guess what I wonder is, is there any sort of similar effort happening within the AI space

Starting point is 00:31:00 where you've got some secret sauce that's come out? And maybe you've got a suite of workloads to figure out, oh, this is how they've laid out their network because suddenly it fits or suddenly, it's faster. Is there anything like that? It's a great question. So I think at a high level, I think that's part of the goal of ML Commons. Like, I mean, they started ML Perf and now they've,

Starting point is 00:31:26 ML Perf, I mean, ML Commons became ML Commons because they're like, ML Perf is just one thing and we want to create more tailored, I would say, benchmark. So there are benchmarks that are, I mean, as an example, there's a benchmark that just stresses the storage subsystem to kind of, kind of understand it. But I think what you're about about one level deeper.

Starting point is 00:31:45 It's not even about stressing parts of the system, but within that, are there certain kinds of benchmarks that help you kind of or maybe like figure out what some of the bottlenecks are. So I would say that again,

Starting point is 00:32:00 so there are parts of the system that, I mean, parts of the like system architecture that are quote and quote fairly public still. So example, like n number of GPUs, connected, like NBL 72, like 72 GPs connected. Those things are known.

Starting point is 00:32:15 I think a lot of the, some of these nuances that we were talking about is really like within a GPU or within a package where I guess you could have like some customized kernels to stress test this. And I believe there are some, but I don't know of a sustained effort in that sense. It's actually a great, great point. I would say that Lisa, what's I am. It also relates to the previous things about learnings. One thing I've realized now talking to a lot of these companies.

Starting point is 00:32:42 And he's even when I've asked questions around this, it seems like an industry right now, especially with AI moving so quickly. One of you, I was talking to a friend who's in a company that does like a lot of AI influence deployments. He said right now, the aim is just to get things working and deployed. I think trying to really optimize for performance costs. I mean, those are things that will come. But right now we go with, say, in this case, I go with an Nvidia.

Starting point is 00:33:08 and Vendidia gives me the software stack and I kind of just use it to kind of run things and it works, right? And that's what as I said, why also I feel it's been challenging for some of the newer players. But I think as we, as this things gets more, gets more mature, maybe at least on the on, like on the accelerator side, I can see efforts like this coming in, which will also mean that these efforts I've said, one is they'll help you identify certain things. And it'll also in, in turn, like help you optimize like the software stack for different kinds of systems. So I think there's a great opportunity here to kind of do what you just said. Actually, that's a great question.

Starting point is 00:33:43 Right. So just discerning a couple of themes from the conversation, right? It looks like on the one hand, we have been through a period of rethinking our entire computing stack, especially for AI. And even within this, there's been multiple inflection points that you alluded to. So we started off from CPUs to domain-specific accelerators at the chip level. And then after that, we went from CNNs to transformers, and then transformers to LLMs, which do it up from single chips to multiple chips. We did LLM training and now we have LLM serving reasons for reinforcement learning.

Starting point is 00:34:13 Those are all like multiple inflection points with very unique challenges that have come up with each stage. At the same time, I think you talked about the fundamental problems have somewhat chematically remained similar. So you talked about like paralyzation and mapping. It could be within the confines of a single accelerator and your intra-accelerator data flow and so on. So there's paralyzation and mapping. And then there's data flow and communication. So that's sort of how you've assembled a lot of the building blocks and pieces of the puzzle

Starting point is 00:34:37 as you've been building the systems and moving with the times as these changes have happened and so on. And yes, of course, we have maybe organically sort of settled into a set of abstractions that seem to work well currently, but there's still a lot more pedagogy and dissemination of all of this that still needs to happen. I want to switch into a slightly parallel thing. So we talked about the vertical code design across multiple layers of the stack. At the very top of the conversation, we touched upon how we are seeing accelerated cadences and compressed timelines. Things are moving very fast. we're deploying these systems at tremendous scale.

Starting point is 00:35:08 At the same time, technology is not static. There are new packaging technologies, new link technologies. You know, you have chiplets, you have chip-to-check communication, all the way up to the data center. So people are designing these systems at breakneck pace, and you need to make the right system design choices. Right. So one other, I think, contribution of your research group's work

Starting point is 00:35:27 is not just the ideas that come across and the abstractions that you have developed, but also the tooling ecosystem that backs this up. Because you've talked about the vertical stack from chips all the way to data center, but then for each of these layers and cross-cutting these layers, you need a set of tools if you need to be able to make these decisions

Starting point is 00:35:43 with the appropriate velocity that this time demands. So can you talk a little bit about this parallel plane, in addition to all the layers of the stack, how the tooling ecosystem sort of underpins this helps people make design decisions in a fast-moving space where you need to get the performance

Starting point is 00:35:59 that you need out of the system, make these decisions in a very compressed timeline very quickly with velocity, be nimble and agile as the world moves and be able to understand what's going on because the landscape might evolve. You need to repivit your design decision depending on how things they evolved. So how do you think about the tooling that supports all of these things and all the work that's come out of your group to support this as well? Yeah, that's a great question.

Starting point is 00:36:21 So it's something that I'm very passionate about. So I think you hit the nail on the head. So first, it is a hard problem. So we're already talking about redefining the stacks. But I think it's a hard problem because A, yeah, As you said, right, like, the workload is evolving. Like, I mean, we've literally just within the last 10 years seen CNNs being the dominant workloads to, I think LSTMs had their time, especially I remember once the TPU paper came out, it had LSTMs that got people very excited to the transformer paper to transformers. I mentioned, these are larger models, training to now inference.

Starting point is 00:37:00 And so right now, LLM inference is kind of the hot topic of the day, and everybody's talking about pre-filled decode and so on. So that's the workflow, right? If we kind of view it from a system perspective, what's also interesting in the three things I just mentioned, like CNNs to LLM training to inference, is the switch from things that were heavily compute bound. Like CNNs were great.

Starting point is 00:37:25 They had a lot of reuse. They were heavily compute bound, and that's why a lot of interesting data flocked and so on to kind of get even more reuse you. You were not bottleneck my memory bandwidth. To LLM training, which was fundamentally and is like very, very network bound. So that's where a lot of the challenges with networking communication, especially I mentioned things at scale and all of the things about heterogeneity in the network,

Starting point is 00:37:49 to now inference, especially the depode stage, which is heavily memory bound, right? So that has its own challenges. So I think just kind of, so that already shows that as the workload is evolving, the balance within the system is changing. a very, very course level. So that's one channel. The other reason is a hard problem, as you also alluded to,

Starting point is 00:38:05 is the scale, right? So the kinds of things you care about on chip versus on package, versus on rack versus rack to rack are also different. And when I say the kinds of things, it's also like literally, right? Like on chip, you're literally trying to squeeze out nanoseconds, right? Like cycles and I have a, I mean, I remember my own papers in oxide.

Starting point is 00:38:25 I don't get 10 cycles speed up, right? Once you kind of go to a package or a rack, you're talking about microseconds. and there's, I mean, those few cycles don't matter as much. And then you're kind of talking rack to rack, and you're talking about these large-scale distributed platform, then you're even going into, into like, as I said, micro to milliseconds. So what that also means is how do you kind of design or, like,

Starting point is 00:38:48 think about tools that can span this entire piece very well, right? And so then we already touched upon, like, how much code design. So, again, that's that's an open question. And I think the other big piece, at least now that's becoming more challenging here is heterogeneity. So these systems are also heterogeneous. I alluded to like heterogeneity in terms of just a network, but especially once we get to allow LLM inference,

Starting point is 00:39:13 the system is heterogeneous. You have, especially with disaggregated serving, people are doing pre-fill on certain GPUs, decode on the other, and then you have tool calls for a genetic system. So again, so wrong so short, it is a hard problem with all of this. And so now this is why tooling here is hard, because the different kinds of tools that people care about for each of these is also different. I mentioned, right?

Starting point is 00:39:33 Like, if I am somebody trying to do something on chip, like, I mean, I collaborate with companies that are a lot of, let's say, the semiconductor manufacturing companies or the vendors, right? They are talking about, oh, my new fancy S-Ram technology or like this technology that will give me so much speed up, right? They're literally talking about evaluating things like that versus you're talking to somebody who's a system these. researcher and they are trying to think about this knob in RDMA, like how much benefit would that give and should I have some new kind of congestion control? So, so in some sense, the scale and the language is also so different that how do you kind of reason about all of this in one infrastructure? So that's very challenging. So based on I think the way we've tried to approach this and I think, and again, it's, I'll kind of describe the journey to where we are today.

Starting point is 00:40:22 So I would say today I can say, oh, I'm very excited that thanks to my group and a lot of collaborators, we've built an infrastructure that can actually navigate across the stack as well as these scales. And I think the way to do that has been that, of course, in the beginning, as I mentioned, we were building like networks, like NOx and accelerators, we started putting together tooling for each. So if I'm just building an accelerator, I care about data for optimizations on chip. I'm trying to do study like on-chip networks or on-packaged networks. We built a simulator for that. So as I said, like for accelerators, my group had developed a simulator called Scale Sim.

Starting point is 00:40:57 for most systolic rebis-based accelerators and maestro, which were all around like single accelerators and really studying data flow. That's all they could do. Single accelerator maybe. We had even based from my PhD work and even my time at EMD back in grad school, we had developed Garnet and enhanced it to study like on chip or on package systems. As we started moving to more distributed systems, that's where we started developing this infrastructure called Astrosin.

Starting point is 00:41:24 And the key idea here, and Astrosin also started. of literally as, hey, I would like to study on package networks. Let me just create a, I mean, Garnet is in GEM-5. It's very generic for like many core systems. Let's actually create a traffic generator for that, which is tailored to AI traffic. So since we already understand AI traffic. I mentioned how you partition it.

Starting point is 00:41:47 So it kind of started off as a traffic generator into that simulator. But I think as we started, as I said, we started looking at training and scale, the other realization I had talking to, again, different people I literally mentioned, right? Like this, people care about different things and different generalities was that there is actually no one-size-fits-all. People will want to use different kinds of simulation infrastructure for their piece, like want to study their piece in a certain level of fidelity. And then I think trying to have kind of squeeze everything into this one infrastructure that models, everything is going to be challenging. So I think the design decision of tooling here,

Starting point is 00:42:25 that we went towards was that given that this is this cross-domain problem with a lot of different expertise and kinds of things you care about, everybody has their own tools and pieces that they understand. And what we really need is a mechanism for all of these to work together in some form of coherent manner. So if I want to, so if I can create like this AI traffic generator, that can generate traffic into an on-chip or on-package simulator, you can really study what are the effects on individual links and, and I don't know,

Starting point is 00:43:00 maybe you have a simulator that can measure thermals for whether or not your package is kind of working at whatever thermal limits you had versus you could use this traffic generator to inject traffic into a simulator like NS3 where you're studying things like RDMA. So I think this kind of an understanding is how Astrosin evolved into this ecosystem. So we basically realize that all we need to create a bunch of these. APIs. And just like very classic software engineering, let's try to think about designing these simulators in that manner. So APIs, that's the contract. And you can kind of plug in your own individual pieces. And that can be like open. That can be proprietary. So that I think

Starting point is 00:43:41 that design philosophy is how I think we were able to create this two league ecosystem. So Astrosim, as it started off as, okay, traffic generator into Garnet to study on package networks. NS3 came in through a network simulator, it allowed us to study networking protocols. We wanted to study things at scale. We developed and plugged in another network simulator to help us do topology studies. And again, each of these has their trade-offs. None of these network simulators can model everything,

Starting point is 00:44:11 but ultimately the plug-in-co ecosystem allows us to study all the way from like low-level details inside a chip or a package to things at scale. And similarly, there's a, then started evolving the traffic generator. Every time I've been calling Astrosema as a traffic generator, here we realized that, well, you know what? Again, we talked about these abstractions and co-design, right? There are different kinds of collective communication algorithms you can run.

Starting point is 00:44:37 So, Nvidia has a bag of these algorithms it runs, but we really wanted to build something that is obviously also vendor agnostic. So we said, okay, let's think in a very principal manner where from the workload, if we can get an understanding of what the, again, we talked about operators earlier, are the operators that I need to run, collective operators or computer operators, then you can now plug in, again, your own library to kind of transform that operator, say an all-reduce into an actual all-reduced algorithm, which is really the sends and receives. So you can, you can use astrosim. It has its own implementation,

Starting point is 00:45:12 or you can use some other library that you want to kind of customize that layer of the stack. And so on and so forth. So this whole philosophy, in fact, went all the way up. So, in fact, one big effort that spawned out of Astrosim was this effort called Chakra. We realized that we're talking about workloads and distributed AI workloads. There is a lot of, even beyond like trying to do simulations, architects are always thinking about designing the next system. But there's a lot of value in benchmarking of existing systems just to kind of optimize the

Starting point is 00:45:40 performance. There's so many software notes, right, that you can see around with. So I think here is where, again, thanks to one of my close collaborators at Meta Shinva Shidhar. And I think he was one of the original kind of co-developers of Albuquer, Astrosim back when he was at Intel. And so that's why Astrosim started off as an architect's view of, hey, I want to design, as I said, on package network. What should the thing be?

Starting point is 00:46:02 When he was at Meta, that's when we started looking at scale and RDMA, the network API came out. And that's where we realized that what companies like Meta are actually running a lot of jobs and collecting a lot of these traces of these jobs, which they typically feed back into some infrastructure to kind of to understand different kinds of software optimization. And so we kind of realized that there's an opportunity now to kind of combine all of this into a unified, standardized infrastructure.

Starting point is 00:46:31 So Chakra became this effort where we said, you know what, if we can agree on the format with which you describe a distributed AI workload. So remember, a distributed AI workload is fundamentally just a distributed graph of operators. Again, that same idea of operators, very course level operators. Fundamentally distributed AI workload is just compute and communication operators. that are running, right? So each GPU, TPU, NP, whatever is running a series of these operators. How exactly it runs it is where its own internal software stack comes in,

Starting point is 00:47:02 but fundamentally that's what it's running. So as long as we can agree on the format of this trace or graph, that's what we call as chakra graph. Then the advantage is this same graph can be used by several different tools. It can be used by a simulator like Astrosim to kind of simulate future. systems, or it can be used by proprietary simulators, which is what some folks do. Or it can be taken and replayed on the existing system to play around with, or partly replayed. I can just replay, say, the communication operators, ignoring the compute, just because I want

Starting point is 00:47:38 to study parts of the networking stack. And so that, I think, observation led us to then go and realize that this is this opportunity in the community, especially as we said, we're kind of in the space, we're trying to define things, let's go and actually pitch it. So we pitched it to ML Commons as an effort to kind of standardize the whole infrastructure. And so that kind of led to Chakra becoming its own standalone piece of the stack. So Chakra today is this workload benchmarking, I would say, methodology. You have like the standardization of the chakra trace.

Starting point is 00:48:12 They're support in PITORCH to collect traces in the chakra format. They support in NVIDIA's Nemo to also collect traces in the chakra format. So as long as you get traces in the chakra format. So as long as you get traces in the chakra format from real systems, or you could synthetically generate traces as well, as long as they're in the same format, right, they are compatible. And now you can use it to feed a simulator like Astrosim,

Starting point is 00:48:32 which internally, and you can plug in your own components in Astrosim, I mentioned like network simulators, or you could use chakra traces to drive other simulators. And there are companies that are using it to drive their own internal simulators or amulators. So I think, so basically I think in all of this, the key idea was as long as we can work, can agree on some of these APIs, it allows us to kind of plug and play a variety of these

Starting point is 00:48:56 tools to build like a tooling infrastructure that helps kind of everybody, right? Because otherwise, everybody is reinventing the wheel for similar kinds of things. So that's, I think, been the journey for some of the tooling. When you talked about Astrosim as a traffic generator, what I thought was really interesting, or an ML traffic generator specifically, my first thought was like, well, how are you defining what the traffic looks like? Because it seemed like your back in was at so many different layers. I mean, you talked about how it was going to garn it and it was going to thing and three and we're like where now you're thinking about RDMA versus like on chip. And so then I was like, but that is very different traffic. And so how is it more a body of

Starting point is 00:49:38 traffic patterns or is it more a way to take a high level traffic pattern and say which layer it is so that it breaks it down? I mean, I was, I didn't quite know. I did. No, I great question. I think It's more of the latter. So I think the thing I've realized, and that's why I think I've been using the word traffic everywhere is because, as I said earlier, it is all ultimately operate, it's all graphs and operators.

Starting point is 00:50:00 I mean, to me, I guess coming from a knock background, everything is a network, it's just a network of nodes and edges. So, like, for example, I mentioned, like the chakra is a graph, which is saying this is compute and communication, very coarse-grained. This is the compute that will run on this GPU.

Starting point is 00:50:14 These are all the operators, and this is a communication operation. The GPU's job will be to break down that compute into low-level kernels. And again, depending on, as I said, the software stack of the GPU, you will make it out. Similarly, for communication,

Starting point is 00:50:29 at that level of the granularity, this is, I would say, within Astrosim, maybe one level. So if I think of Astrosim as workload, then let's say the system software equivalent, and then let's say the network simulator. So Chakra is workload to system software. That's the boundary.

Starting point is 00:50:46 So, okay, at that bar granularity, it's coarse grain, compute communication. operators and the communication operators now enter Astrosim. Now, or whatever, the system software layer of Astrosim. Now, within that, this is analogous to a collective communication library like nickel or Rickle, where they get broken down into individual, send and receive messages. Let me just call them messages. These are, okay, this GPU has to send this like one MB of data from here to here.

Starting point is 00:51:14 That's it. And that's what I said. You can have your own algorithm to do this. You could use something existing and so on. Again, it's all plug and play. And now I enter, so at this point, I'm entering the network simulator. So at this point, quote and quote, traffic is at the granularity of individual point-to-point messages. So it is messages.

Starting point is 00:51:34 But now, depending on the network simulator, its job is to break these messages down further. So that's where, as we said, if I'm a garnet, I will take this message, break it down into like packets and flits and so on. for sending these out. If I'm like NS3, these messages will actually also involve some RDMA handshake. So that kind of message is more of a trigger to some other code in NS3 to do certain things, which will in turn generate the actual packets. So it's all kind of, as you said, like it's all kind of going down the stack to ultimately, like eventually there'll be some actual bits going around on some wires within the simulator.

Starting point is 00:52:17 But that's what it depends on the simulator. If it's a very naive simulator, what I called as a message here, it might simply take some equations and compute some delays and send it back. Like it doesn't even simulate any of those, any of that traffic. So I think long story short,

Starting point is 00:52:33 I mean, to me, everything is traffic like breaking down, like from one high course operator to finder to finder to find it as you're going down the stack. And we're kind of intersecting it at different points. And that's those are those abstraction layers. And for each of them, we are trying to use a specific tool.

Starting point is 00:52:48 That's kind of the high-level architectural idea behind the simulator. I see. I see. Okay, so within Astrosim, you have many potential different layers of breakdown, and your APIs are per layer. Okay, so I think the thing that's like that I'm tripping up over is a graph that's operating at, say, an operator granularity and a graph that's operating at, say, a flip granularity or in a graph that's operating at like an RDMA package. granularity, because graphs for the same workload look different.

Starting point is 00:53:19 And so I guess what I'm trying to figure out is you don't use the same graph for all the same for different factors. We do not. Yeah, yeah. Yeah, yeah. So we basically, I think, so there is a graph that comes to a simulator, that graph, the operators of those graphs get broken down further and further. And so for example, once the compute operator comes into Astrosim, that operator will potentially get broken down into further things for a compute simulator and simulate the

Starting point is 00:53:45 communication operator will get broken down further. And AstraSys API will call some network simulator, which will again break it down even further. So that's what I meant like. So ultimately, I mean, leaving the APIs aside, the overall strategy is, of course, operators getting broken down, broken down, broken down, depending on what is being run with their dependencies and then eventually everything getting coming together to see what the overall time was.

Starting point is 00:54:13 Yeah, that's the very high level idea. That sounds super handy. It sounded like Astrosim started, like was not necessarily planned to be a big thing. You had a need and you built it. But then it sounded like by the time you got to Shakra, there was a little bit more planning. And I guess from a meta sense, when people are developing these sorts of tools, everybody is faced with a choice. I can make this quick and dirty to make it do what I needed to do right now. or I can presume that I want this to have some amount of staying power

Starting point is 00:54:47 and kind of try and architect it. So was your approach to those two different tools different? Was one of them more quick and dirty first and then redeveloped and then the other one was more planned? Or were they both quick and dirty and then became what they are? I guess I'm kind of curious about that process. That's a good question. I think the very first iterations was of as to say it was quick and dirty

Starting point is 00:55:10 because it really, in fact, started off as a thing, a class project for one of my students for this collaboration I had with Intel. I think as we started doing that, and especially, as I said, I think, again, I have to credit my kind of collaborator, Shunovas. Like, I remember once he moved from Intel to Meta, and he was like, this is great because we already have this infrastructure. It'll be great if we can actually extend it to kind of handle these other networking questions.

Starting point is 00:55:36 And so that's where the API got developed and so on. I think maybe at that point, it was a bit of a pivot even for me where I saw that there is an opportunity for this to actually become larger and let's be more, let's actually be a little bit more systematic in terms of doing this. In part, I would say that, again, we talked about my kind of at AMD, right? I think I was certainly during my own grad school influenced and kind of saw the whole Gem 5 effort. And when I was at AMD, that was the time most of us interns were trying to kind of, again, and make sure the infrastructure is in a way that even though we develop things internally for AMD for our internships, we're developing features and code that can go into the public domain,

Starting point is 00:56:22 participated in early tutorials and Gem 5. So I think that's something that did stay with me. And even as an academic, I think what happened was I realized there's a value in making these tools a little bit more generic because I can then potentially use them for my classes to teach labs. And so I think with all of this, there was a sustained effort to make this. more than just a quick and dirty solution for one or two PhD students to use for their thesis. And again, I think, and again, this is just the thing. Because I was learning about AI systems along the way and I was kind of, I think I enjoyed that process.

Starting point is 00:56:55 I was like, I would love to educate more people in the community. So we started kind of organizing tutorials. And so once you sign up to do a tutorial and you're like, over the tutorial, we'll have a demo. I remember that just meant that suddenly it cannot be quick and dirty. It has to be something where now it's being released to the public and they're going to try it out and never tell by students. We don't want like a barrage of emails that, oh, this didn't work. That didn't work. So it kind of got to a point where we started thinking very systematically about doing it in a more robust manner.

Starting point is 00:57:24 I still feel like we didn't. I mean, coming from academia, I think it was still, I don't think it was as polished as say an industry product might be in terms of making sure there's like all of the, like, all of the, like, CICD happening on the code base and all kind of regression tests. But still, I think the evolution of the architecture of the simulator was keeping this in mind that we would like it to become more generic and kind of become more sustained. So that did become more of a planned effort, which is why I think once we got to the chakra piece, again, it was kind of a very natural evolution that let's create this new API and kind of separate these pieces out.

Starting point is 00:58:04 Because each of these does take effort, right? I mean, it works today, quick and dirty. It is an effort to kind of revamp the codebase, these interfaces. But luckily, I've been able to kind of convince my students and collaborators to do that and evolve it. So you've built a nice culture of quality software architecture, which is awesome.

Starting point is 00:58:24 So maybe metaphorically popping up the stack now. So you've built an incredible tooling ecosystem, a variety of ideas have come out of that tooling ecosystem as well. We've cultivated many partners across industry, academia and so on, maybe trained many students and practitioners who have used these tools, understood these ideas, developed pedagogy and so on. So what's next for this overall ecosystem? What's next for you? That's a great point. So you're right. I think one thing that's been fascinating has been a lot of the partnerships with a lot of consortia. So of course, I mentioned some individual

Starting point is 00:58:57 companies. I was very lucky to collaborate with Intel Meta. And right now, AMD is in fact one company that's very actively driving a lot of the Asterisim work and not surprising, like a lot of the ex-GEM-5 folks at AMD are really helping push this further, right? There's also been value with more consortium, as I mentioned, like semiconductor research cooperation, which funds a lot of my research, has a bunch of companies. So they found value in this, ML Commons OCP. But I think to even touch upon what you mentioned earlier, Sovone, which was that, and maybe even Lisa's question about doing this systematically, how do you kind of make

Starting point is 00:59:33 sure it is robust and so on. And so I think one thing that I'm excited about is I'm actually in the process of commercializing some of these tools. So I think I see a value in something that's a more engineered version that has a lot more of the robust support that you would need. Because ultimately, if we want companies to go and bet on a lot of decisions with tools like this, either for optimizing current systems or for future systems, I think there is value in a robust ecosystem for these tools.

Starting point is 01:00:05 So that's something that I'm personally excited about. I've been very happy to see the visibility of the tools. Also as an academic, I think I see that challenge where students who develop it graduate. So I feel like for daily maintaining continuity, I think it will be good to have a commercial entity that can manage. So that's something personally is a next step for me that I'm very excited about. So hopefully in some time, like I'll be able to. to share a lot more details about this.

Starting point is 01:00:32 That's exciting to Shar. Yeah. It would be your first commercial? It is. Yes. Yeah, it will be. Yes. Yeah.

Starting point is 01:00:39 Yeah. So far I've been kind of, yeah, I've spent a little bit of time in industry, mostly getting a lot of collaboration, but this would be the first time kind of venturing into that world. And that's also where it's, I think it's been an exciting learning opportunity right now, just to even talk to people to understand, kind of things that we care about purely from an academic hat versus like a commercial hat,

Starting point is 01:01:06 even, I mean, some of the practical things, right? I mentioned like for whenever we've written papers, even on these tools, there it's a lot more about the tool and like what is some of the unique capabilities it brings in, what is the novelty, right? While if it's something that's related to a commercial entity, it's a lot more about robustness and guaranteeing it will work and like making sure it works in a lot of production set up. I think just kind of that transition learning for me personally has been very, very exciting. Yeah.

Starting point is 01:01:35 So that part is also something I'm very excited about just to kind of learn how these things are actually done in industry, having not been in industry myself for very long. Well, that sounds cool. I think it might be a good time to then maybe listen to the Dan Soren episode or the Karu episode because I remember very clearly Dan Soren talked about when they were starting their company. It's like you tell people about a technology and they say, I want that. And I remember I can hear him saying, and then you say, like, well, what about this? Would you pay for this?

Starting point is 01:02:03 And he says, they say, I don't want that. And so then Karu also talked about that, like the proposition of getting somebody to pay you for this thing. They think it's cool until you say, yeah, but now how much would you pay for it? Yeah, exactly. So that's what I've been told. So it's great. Like, you'll hear that it's cool. Like, and that's the, but I mean, how do you translate that it's cool to, okay, I'm going to sign a check to pay for this?

Starting point is 01:02:25 I think that's the transition that is, that will be an exciting thing to do. about. Yeah. Well, best of luck with that. Thank you. Yeah, certainly wish you all the best with this new venture. I'm sure one year down the road when we're talking to you, you'll have a crisp set of learnings once again on ongoing. On the sale, getting people to pay for the product and delivering value, right? Awesome. Yeah, no, that'll be great. I hope I can do that and again talk about all of my learnings along the way. Yeah. Yeah. Exciting. Wonderful. So we did talk about some of your early career from grad school to Intel, getting to Georgia Tech. But let's sort of wind the clocks back a little more, maybe reflect on the overall arc of your journey, but maybe if you have some words of wisdom

Starting point is 01:03:08 to our listeners, because you have donned multiple hats, you've seen like multiple inflection points, train many, many students and so on. You have a passion for teaching. So just looking back, what got you interested in compute architecture, how did you get here? What were your biggest learnings, any words of wisdom to our listeners, many of whom are students, industry professionals as well. So I think we would love to hear more about your perspective. Yeah, no, good point. And I, of course, I think I'm sure a lot of the listeners have heard this line many times now, but this being the golden age of computer architecture, especially, you know, once, like I remember Hennessy Patterson,

Starting point is 01:03:43 mentioning that during their Turing award lecture, but it is, I think, so I think looking back for me, I mean, this was pre-again AI, pre all of this. And I think what got me excited, to be honest, about architecture was, I think I loved the fact at that point that I could, they were all, especially once I was thinking, I mean, what got me even closer to like on-chip networks that time was thinking about all of these interesting Lego blocks that I'm playing around with and building these interesting systems, right? So I think the, you kind of could think at various scales.

Starting point is 01:04:17 I think that's something I love, right? You can, you can kind of imagine a lot of these systems. You, I mean, unlike, say, a lot of my friends who are, circuits where they really had to kind of get it to work and get like the timing to be. I think here you had the, at least the ability to first think really out of the box about crazy ideas. Then you start, of course, going down to practicalities and thinking about what will work. So I think that that excited me for sure.

Starting point is 01:04:42 And I think that's what got me into architecture. I think what also excited me about being an academic was within, I would say, the realm of architecture itself, I was able to look at a lot of different kinds of problems. And again, I think AI just helped. As we mentioned, we talked about so much in this whole conversation about the full stack of AI, it just meant that there were so many interesting problems to solve across the stack. And today, with all of this cross-layer co-designed, everything is architecture. So we've even crossed the boundary of traditionally architecture and this, whatever, micro-architecture,

Starting point is 01:05:17 anything you do, I mean, if you're doing, well, like, I mean, you are basically changing your models and making them more hardware. Aware is also architecture. You publish an architecture conferences. you're doing a lot of software optimization, that's architecture, doing hardware. So I think that has also been exciting. So I think one thing is for sure,

Starting point is 01:05:33 like be open to ideas and kind of talk to as many people as you can. As I said, given how vibrant this domain is, I think the more people you talk to, the more interesting things you learn and you can kind of apply your learnings there. So I mentioned how for me, everything was a network, right? Even though like my PhD was on a relatively narrow domain of on-chip interconnects, I have been able to view all of these things as just networks of components and translate those learning.

Starting point is 01:06:01 So I think that is a very valuable skill. So if you have certain skills that you pick up as an architect, I think those are really, really things you can translate. I have seen a lot of my colleagues also mentioned that. This was actually a question I had when I became a faculty member of asking colleagues, how did you move from domain A to B to C? And ultimately, they kind of shared that even though it looks like domain A, B, C, a lot of the fundamental problems and their solutions are very similar.

Starting point is 01:06:26 Right. And so when you alluded to this, I mean, today I see a lot of research on AI, right? Everything around speculative decoding and so on. And these are a lot of our architectural ideas coming. So I think that is a nice thing to know because it just means that your skill sets are really transferable. So I think that's something I would say be open to talking to people learning about their fields and then seeing how you can apply those skillsets. The other thing I'll also mention we talked about tools is I definitely want to encourage people

Starting point is 01:06:52 to continue working on tools and what is traditionally called, quote-unquote, more engineering. I think there is a lot of value you learn a lot. You also get a lot of visibility as you kind of talk about your work with different people. And I think so I think there are a lot of these intangible skill sets that you can learn which are going to be really, really valuable. So even though as a student, it's hard, there is a bit of a basically a pressure of getting. the next disc of paper, I would say that some of these efforts, which take more of your time to kind of do like a little bit more of a broader outreach, eventually have a lot of value. Because I think when these same students, like when you're going on the market, right,

Starting point is 01:07:37 you do have an edge. Like I've seen like for my own students where companies are very excited to hire them because they're like, oh, you have all of these interesting things. And so it's always valuable to kind of do both like, of course, very innovative. innovative research, but also I think spending time in some of the engineering kind of really makes you very, very well-rounded. So that's something I'll also encourage a lot of the listeners to do. Awesome. Thanks. Yeah, I think you're right. I remember in one of our recent episodes, Ricardo, when he talked about what he looks for in hiring for his group, he wants people who can

Starting point is 01:08:12 engineer, not just do research, because ultimately producing some graphs through a paper, that is an accomplishment for sure. But then really getting something built, that is another level of accomplishment and getting something shipped, that's another level of accomplishment. And so those require a Venn diagram of skills. Thanks, Tushar. We've been really, really, really excited

Starting point is 01:08:33 to have you talk with us. There was a fascinating conversation, Tushar. Thank you for joining us today. It's been an absolute delight. And to our listeners, thank you for being with us on the Computer Architecture Podcast. Till next time, it's goodbye from us.

Computer Architecture Podcast - Ep 23: Cross-stack Design and Tooling for Large-scale Distributed AI Systems with Dr. Tushar Krishna, Georgia Tech

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.