@HPC Podcast Archives - OrionX.net - @HPCpodcast-83: Attack of Killer Chiplets, w John Shalf

Episode Date: May 10, 2024

Special guest and last year's ISC 2023 program chair John Shalf joins Shahin and Doug to discuss the rise of specialized architectures in the post-Moore's Law era, a topic John will also cover in his Wednesday night keynote at the ISC conference in Hamburg, Germany next week. John is department head for computer science research at Lawrence Berkeley National Laboratory (LBNL). He formerly was CTO at the National Energy Research Scientific Computing Center (NERSC).
Audio: https://orionx.net/wp-content/uploads/2024/05/083@HPCpodcast_John-Shalf_Attack-of-Killer-Chiplets_20240510.mp3

Transcript
Starting point is 00:00:00 In the ever-evolving tech landscape, HPC drives innovation, enhancing efficiency, and solving complex challenges. Lenovo has crafted a suite of HPC solutions that are not merely keeping pace with, but are driving the momentum of this digital revolution. Visit lenovo.com slash HPC to learn how. We need to rethink the way that we design and deliver these supercomputers. There's a Moore's law for engineering more transistors onto a chip, but there isn't a Moore's law for engineering more pins on a chip. As we figure out where the right points are to modularize your silicon into chiplets, then it'll take a similar period of time to build up that repertoire of reusable chiplets. If you look at the Heterogeneous Integration
Starting point is 00:00:51 Roadmap, which is an IEEE Electronics Packaging Society effort, but it's got a lot of industry and SEMI support behind it, you can now build, using these chiplets, chips that are larger than the reticle size. From OrionX in association with InsideHPC, this is the @HPC Podcast. Join Shahin Khan and Doug Black as they discuss supercomputing technologies and the applications, markets, and policies that shape them. Thank you for being with us. Hi, everyone. Welcome to the @HPC Podcast. I'm Doug Black, and with me is my podcast partner, Shahin Khan of OrionX.net. Hi, Shahin. Hey, how are you doing, Doug? Really, really looking forward to this conversation. Our special guest today is John Shalf, department head for computer science research at Lawrence Berkeley National Laboratory. He formerly was CTO at the National Energy Research Scientific Computing Center, and, by the way, was also last year's ISC 2023 program chair. John, welcome. Thank you. It's nice to be here.
Starting point is 00:02:02 Great. So we're going to be talking about the rise of chiplet-based specialized architectures. In fact, this is a topic John will be discussing at the Wednesday night keynote at ISC next week in Hamburg. And John, please give us an overview of your topic. When we spoke recently about it, you emphasized that the rise of specialized architectures is driven by the slowing of Moore's law. Yes, we're observing a slowdown, not in our ability so much to shrink transistors, though that is slowing too, but the cost effectiveness of using that to drive performance improvements and energy efficiency improvements for our microelectronics.
And so the go-to approach for computer architects, when you run out of steam from shrinking transistors to get more performance and efficiency, is specialization. It's something that is already applied to cell phones. You know, there's dozens of different kinds of specialized accelerators inside of your iPhone or your Samsung Galaxy phone. And it's also been deployed in things like the Google TPU, or also in Amazon with their Graviton series of processors. They're creating specialized architectures in order to serve the high value parts of their workload. And it's something that we should pay attention to in the HPC space as well. And so, one of the approaches in order to do that cost effectively: it's very expensive to do an entire chip from scratch,
and a lot of the cost is the platform cost. So that's all the software, the compilers, the firmware, things like that, the debuggers, Linux, but it's also, of course, the silicon cost. Interestingly enough, the platform cost is more than the silicon cost, but I digress a little bit there. So one of the benefits of chiplets is it allows you to modularize your architecture so that you can retain your general purpose components like your CPUs or your GPUs, but you can drop in or rearrange those chiplets in unique ways in order to deliver
specialization, but at a cost that's much lower than attempting to own the entire chip like we used to in the old days of HPC when we had customized processors for ourselves. I'm interested in the origin of this whole notion of chiplets. Did it come from the hyperscalers? And how far does this go back? My guess is three to five years kind of thing. From a research concept, I think I first heard about chiplets in 2005. And the person that we consider the father of chiplets is Subu Iyer. He was a UCLA professor, formerly an IBM person. And Subu has also pointed out, well, we should call them dielets technically. It's not a
chip. It's an unpackaged die that we put together. We co-package and then we build the package around it. And that's a chip. But the name chiplets has caught on, and he's succumbed to that change of nomenclature. But it shouldn't be too surprising, Subu being an IBMer: if you remember the IBM 3090 and the early Power series, IBM had a very advanced ceramic co-packaging technology, dating back to the 90s and even the 80s. This is the thermal conduction module? Yeah, exactly. Now, they had very fancy liquid cooling with the thermal conduction module, but you had multi-chip modules that had a ceramic substrate with the wiring to connect a bunch of discrete chips together into the same package. And then they had those cool spring-loaded thermal cooling things, which are having a resurgence now. But people forget that history.
The thing that was interesting with chiplets, though, is that Subu showed that you could go with an obsolete technology node like 130 nanometer, take a wafer of silicon, put your wiring layers on there, make it completely passive, just wiring layers, and bond your silicon die with little tiny soldered micro bumps onto that to connect them together into a larger package. And Subu was doing this probably in the 90s, early 2000s. It kind of took off, though, when DARPA made a big investment into making chiplets a step closer to a commercially deliverable platform. But there were still a lot of open engineering issues,
which is why it took a while to take off. When I was deputy director of the Exascale Computing Project, deputy director of the hardware technology portion of that, DOE made substantial investments in individual companies in order for them to create an internal version of chiplets. So that was AMD primarily, and they developed their semi-custom business around partitioning up their designs into the customer-portion chiplets and then the part of the design that AMD provided. And our very first exascale supercomputing system, in fact, is using AMD's internal chiplets. Now, where we have the hyperscalers driving things, they're trying to drive an open chiplets economy, which is that they could
have multiple vendors interoperate their chiplets to deliver platforms to the hyperscalers. And so that's where a lot of big investments are happening. And that's where it's really driving the microelectronics industry in that direction, actually: the pull from the hyperscalers. John, how far away are we from the time when you just go on a menu and click, click with whatever permutation of specialized capability, and then out comes a chip that does exactly that? Right. So, a long ways, I think. If we look at the licensable IP market, that kind of provides you a little bit of a roadmap for how this might play out. So ARM is an example of licensable IP, but there was also a lot
of MIPS stuff, and Cadence and Synopsys have a library. Over time, they've built up a portfolio of reusable IP blocks. And the amount of reuse in 2010 was down in like the 10 or 20% range. We're now up to 80 or 90% reuse of those IP blocks, but that's of course for monolithic silicon. So you have to imagine over time, as we figure out where the right points are to modularize your silicon into chiplets, that it'll take a similar period of time to build up that repertoire of reusable chiplets. That's just my guess, is that things will play out very similarly to the way that the licensable IP market played out. Do you see RISC-V being an accelerator there,
or just kind of vying for the bigger IP market like everybody else? You know, I think that you want your IP to be bulletproof and commercially supported, and the amount of investment it would take to make an open source thing as bulletproof and turnkey would mean that you need to sell it for about as much as you'd have to sell any other commercialized IP. So I think that RISC-V will probably play a role. It's certainly playing a bigger role in Europe than it is in the US. And something that I'm excited about, at least, is the notion that the RISC-V processors, they are pretty solid.
I don't know if they're as solid as ARM, but they're getting there. However, the thing that's not solid and is very, very difficult to pull off are high performance memory controllers and things like PCIe. It's very expensive, very boring, and there's only a handful of providers that do a good job of it. And that's the high-risk part of the design; if you could modularize those pieces of the design and combine them with RISC-V chiplets, I think it could be a very powerful combination. John, you've said there's a looming crisis in advanced scientific computing, that HPC infrastructure is unable to keep up with the growth and complexity and scale of scientific
problems. Pick up on that theme, if you would, a little bit, that we've run the course with the approach that we have now. Yeah. I mean, in the 90s, there was the attack of the killer micros. That's when we learned that we have to go with the larger market forces, but we can still have... it's not like you've given up on HPC by doing that. You're actually leveraging larger economic forces, but still specializing in the areas that are necessary to differentiate it and deliver what we need for our scientific customers. And we've ridden that wave for the past three decades. We went from Cray vectors to MPI distributed memory cluster systems. But if you look at the
past eight or so years, performance improvement, at least as measured by the Top500 list, which uses LINPACK, which is arguably an obsolete benchmark, but nevertheless, it's also considered a very easy benchmark. And even measured by this easy flops-oriented benchmark, we've gone from 1,000x performance improvement per decade down to 10x or less per decade. And that's a huge drop. And I would say that it's going to be very difficult to justify the purchase of your next generation system if it's only going to be, you know, 2x or 3x faster than the previous generation. So not only do we need to rethink our benchmarks and how we measure our success, because moving forward, if you look at LINPACK, if you talk about zettaflops, you know, all that's going to be is more cost and more power. And it's unclear if that's the way that we should measure our success if, after a decade, we only get a 10x boost in performance. We need to look at alternative ways to measure our success.
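(A quick, hedged aside to make the slowdown concrete: the snippet below just turns the 1,000x-per-decade and 10x-per-decade figures quoted above into equivalent annual growth factors. It's illustrative arithmetic in Python; nothing here comes from the episode beyond those two rates.)

```python
# Illustrative arithmetic only: convert per-decade performance improvements
# (the 1,000x and 10x figures cited in the episode) into equivalent
# per-year growth factors and a typical 4-year system-refresh gain.

def annual_factor(per_decade: float, years: float = 10.0) -> float:
    """Per-year growth factor implied by a total improvement over `years`."""
    return per_decade ** (1.0 / years)

old_trend = annual_factor(1000.0)   # ~2.00x per year
new_trend = annual_factor(10.0)     # ~1.26x per year

for label, rate in [("1000x/decade", old_trend), ("10x/decade", new_trend)]:
    print(f"{label}: {rate:.2f}x per year, {rate**4:.1f}x over a 4-year refresh")
# Prints roughly 2.00x/yr and about 16x per refresh for the old trend,
# versus 1.26x/yr and about 2.5x per refresh, the 2x-3x John mentions.
```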
Starting point is 00:11:34 So, you know, rethink what we use as our metrics for success. And the second is we need to rethink the way that we design and deliver these supercomputers because the current approach with just using commercial off-the-shelf commodity chips isn't delivering the kinds of benefits that it historically has. And that's where the chiplets or the specialization comes in, where we can get ahead of the slowing down of the Moore's law curve by adopting specialization, but also adopting what the hyperscalers are doing right now in industry, which is to find ways to use things like chiplets and other techniques to specialize for their target workloads. I think it's a new attack of the killer micros, but now it's attack of the killer hyperscalers. And we look at what the
broader industry is doing, look where the money is going. When you follow the money and learn to adopt those techniques for the next generation of systems, you get a lot of leverage. And I'm concerned that we're overlooking that there is a new revolution happening already for warehouse-scale data centers, and HPC needs to be a co-traveler, just the way it was when we did Attack of the Killer Micros and switched to COTS servers. To MPPs, right? And then clusters. Yeah, that's true.
We did that intermediate step of SGI shared memory systems. And we rode that as far as we could until, you know, the next shoe dropped and we went full distributed memory. That's right. With Beowulf and then, etc. Yep. I think the history of Attack of the Killer Micros might be interesting, but it was a 1990 phenomenon on, was it comp.arch, when it first came out? I read it in a New York Times thing, but I heard that Brooks was actually originally credited with coming up with that
phrase. The only reason I bring up Attack of the Killer Micros is I feel like we're lulled into... we've had this 30 years of fairly consistent architecture that we continue to grow through terascale to petascale to exascale. And people have this notion that that's the way it always was and the way it always will be, and that we can't do a transition like that again. But that was where I started my career, during that transition. And I remember how tenuous it seemed. But the history prior to the Attack of the Killer Micros was constant technology upheaval in the HPC space, and people forget about that. So one thing that I think about Attack of the Killer Micros, I don't know whether it got it right or wrong. But anyway, the takeaway was microprocessors are here to rule the roost. But it got a little bit conflated with the instruction set and the architecture rather than the form factor.
And I know there were discussions saying that, you know, the same vector architecture could be formulated into a microprocessor. And if you give it enough volume, it may actually be quite all right. And then fast forward a few decades, we had VIS, like SPARC did, and AVX. And, you know, so all these vector instructions did come back. They did. And we even had, you know, VLIW. So I think the real salient point was the form factor of a microprocessor versus the actual architecture. Are we seeing something similar here? Sort of. I mean, vectors did come back, not in the true Cray form, where vectors in Cray were not SIMD, they were true vectors and they did good latency hiding. And that's something that GPUs are much better at doing than CPUs, because the CPUs are really doing more or less SIMD,
Starting point is 00:15:06 which is more like a CM2 than it is a Cray. So we didn't learn the latency hiding lesson. But one thing that was happening as the old vector architectures were kind of slowly tapering off was that Cray had 16-way banked, then 128-way, then I think the Cray 3 was going to be 3,000-some-way bank switched in order to deliver the bandwidth.
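(For readers who don't carry the vector-era math around in their heads, the quantities this exchange keeps circling, vector startup, n-one-half, and memory parallelism, are usually written in the standard textbook forms below. A minimal sketch for context, not equations taken from the episode.)

```latex
% Hockney's vector timing model: r_inf is the asymptotic (peak) rate and
% n_{1/2} is the vector length needed to reach half of it, i.e. a measure
% of startup latency that long vectors amortize away.
T(n) = \frac{n + n_{1/2}}{r_\infty},
\qquad
r(n) = \frac{n}{T(n)} = \frac{r_\infty}{1 + n_{1/2}/n}

% Little's law applied to a memory system: sustaining bandwidth B at
% latency L requires roughly B \cdot L worth of outstanding requests,
% the memory parallelism that banked memories and, later, distributed
% memory clusters were built to supply.
\text{concurrency} \approx B \cdot L
```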
And CDC was long vectors, yeah. Oh, yeah, the CDC. Cyber 205. Cyber 205, right, and the ETA-10. ETA, that's right. So super long vectors. Everybody's forgotten about n-one-half, which is your latency hiding formula. But the real thing with the microprocessors was that we couldn't deliver the amount of memory parallelism necessary to deliver the bandwidth that we could using the traditional approach; the wiring was getting just impossible. So going to MPI with distributed memory was really as much about delivering more memory parallelism, but in a different format than banked memory, and delivering it in a way that
Starting point is 00:16:16 could be packaged more cost effectively. So there is the aspect of vectors and delivering more floating point, but the big ticket item that forced us in that direction was also the memory bandwidth delivered through memory parallelism. HPC is more than workloads and calculations. It's about solving humanity's greatest challenges with the power of HPC. From car and airplane design, oil field exploration, and financial risk assessment, to genome mapping and weather forecasting, breakthroughs in computing systems have made it possible to tackle immense obstacles at an exponential rate. As the global leader in HPC, Lenovo develops, integrates, and deploys technologies of exascale-level computing to organizations of all sizes.
Starting point is 00:17:02 Learn more at lenovo.com slash HPC. John, you've mentioned a few instances from your background. One thing we like to ask our guests is how you became interested in HPC in the first place. Oh, yeah. You know, I was always interested in computers. There's a physicist at Randolph-Macon College. It was a small college in my hometown. Small town. The college was as big as the town, I think. During the summer, I'd help him clean his lab, and I learned how to build computers when I was like 10 or 11 out of, I think it was Z80 chips. So that was like in the late 70s, early 80s, and had 128 bytes of memory. And man, I was hooked from day one after that, thinking of all the
things you could do with a computer. But I was in college working on reconfigurable computing in the late 80s, an FPGA-based computer. I was developing compilers for that. So I went to my first supercomputing conference, I think in like '91, maybe. And when I went to that conference, I was very frustrated with working on compilers. It's a frustrating area to work in. And FPGAs are also very frustrating sometimes. But I saw this guy standing in a CAVE, which was a virtual reality environment. All the walls were 3D projectors. And he had computer simulations from a Thinking Machines CM-5 being fed to it through a HIPPI-6400 cable direct to this CAVE.
And they were colliding galaxies in 3D and colliding black holes. And it was just fantastic. I saw Mike Norman and Ed Seidel in this CAVE. And I was like, I wanted to drop everything. I was, you know, I want to do this, whatever it is. We call that metaverse now. Yeah. And I did indeed, you know, kind of drop everything. We had lunch together. We became fast friends, because they were contemplating having to switch from the Cray vectors over to these parallel machines, like the Thinking Machines. And they were very unhappy about that. And everything seemed kind of difficult. But for me, it was like, I loved parallel computing. I was really excited about that. And whenever I'd interview with, for example, Intel, they would say, parallel computing, what are you, nuts? And these guys were doing completely the opposite. They're like, we have to figure out how to do this. And so I went with them. I became more or less a developer of codes for HPC and astrophysics and cosmology. I helped Mike Norman's group with their AMR code for cosmology, but I was one of the co-creators of the Cactus code,
Starting point is 00:19:37 which was for general relativity and colliding black holes. And I had a blast. It was seven years or so of that. But eventually, I found my way back to computer architecture again, after parallelism became mainstream. That's fabulous. John, I wanted to ask you, as we go through more and more specialization and customization, what about software? How do you use these? And especially if you get an IP menu situation, where the instruction sets could be all over the place, are we just counting on interpretation from now on, but then that doesn't quite give you the performance?
Starting point is 00:20:14 How do we get to have the cake and eat it too in terms of performance and portability? Yeah, like I was saying, this will grow slowly over time. So it's not like, okay, drop everything. We're all specialized now. And the sky falls and we're all doomed. We're going to continue to have general purpose instructions processors like GPUs or CPUs for the foreseeable future. We're going to build up our repertoire of specialization slowly.
Starting point is 00:20:39 I do think now is a good time, though, to do some pathfinding research, which we haven't done in the HPC side, to look at different ways to expose these capabilities of specializations to the end user. We'll continue to always be able to fall back on our general purpose processors while we sort this out. But I do believe that it's a big ticket item. I don't think it's solved by any means. As I said before, the big cost with taking on specialization is the software aspect. But if you are at least tied to a general purpose platform like the ARM ecosystem or the Intel x86 ecosystem or the various GPU ecosystems, I think most of the software costs are borne by creating the software environment for those general purpose instruction processors.
Starting point is 00:21:28 And we'll have to build up over a long period of time our capabilities with specialized accelerators. I do think it can be different in that when you create a fully general purpose processor, there's so many different applications, so many edge cases that it can be quite complicated to create the software environment for that. If you have things that are more narrowly targeted, like a memory controller, it's almost completely invisible to the end user. And yet, a very worthy chiplet, something that accelerates like BLAS or FFTs could be an easy or an early target. Those don't need to be like a full up compiler that can handle every single edge case. It really is the parameters that you feed to BLAS, the basic linear algebra libraries into the FFTs to do their thing. So I believe that it won't be as complicated as supporting a fully general purpose compilation environment. I see. Now, the other thing, there's just so many
topics that one could bring up. But let's talk about energy usage, because that is a looming, if not already massive, problem, and it just must be solved one way or another. Yeah. What do you see on the horizon that really could move the needle there? There's a guy, Shekhar Borkar, who did a deeper analysis of this. He was at Intel. I think he's at Qualcomm now. But he pointed out that after 7 nanometer technology, the amount of energy going into moving data around is actually much, much larger than the amount of energy going into the compute, into the transistors on a chip. And copper is about as good a conductor as you'll get at room temperature. So we're kind of boxed into a corner when it comes to the on-chip interconnects. But specialization is really about optimizing your data paths for the kinds of computations that you're putting onto the chip.
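(To put rough numbers behind Borkar's observation: the sketch below uses order-of-magnitude picojoule figures of the sort commonly quoted in architecture talks. The specific values are assumptions for illustration only, not numbers from this episode; what matters is the ordering of compute versus on-chip versus off-chip data movement.)

```python
# Illustrative, order-of-magnitude energy costs in picojoules (pJ).
# The exact values are assumed for the sake of comparison; the ranking
# (compute < on-chip movement < off-chip movement) is the point.
energy_pj = {
    "64-bit floating-point operation":    20.0,
    "move 64 bits across the die":        50.0,
    "move 64 bits off-package to DRAM": 1000.0,
}

flop_pj = energy_pj["64-bit floating-point operation"]
for action, pj in energy_pj.items():
    print(f"{action:35s} {pj:7.1f} pJ  ({pj / flop_pj:5.1f}x the flop)")
```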
It's not so much about the flops. It's really about the data paths and how you move the data through those floating point operations or other kinds of computing. When it comes to going off-chip, the really big up-and-coming technology is these photonic links. Right now, you take a huge hit when you go off-chip. You have these super high-speed SerDes, because there's a Moore's law for engineering more transistors onto a chip, but there isn't a Moore's law for engineering more pins on a chip. It's been stuck at a little bit less than a millimeter pin pitch separation between neighboring pins on a chip, and it hasn't improved very much. So you've got this huge bandwidth bottleneck getting off of
the chip to just about anything. So with the silicon photonics, and also with these chiplets, the co-packaging of the silicon photonics next to a chip: now, in the electrical domain, you've got resistance and capacitance, and it limits the length of the wires that enable you to escape bandwidth off of a chip. But if you go to photonics, once you get in that photonic domain, then you can go just about anywhere, and with these massive, astounding bandwidths. So we're seeing a lot of companies going from just laboratory prototypes. They say that silicon photonics has been like the cold fusion of, you know, link technologies, but I'm seeing actual companies delivering
commercial products now, like Xscape Photonics and Ayar Labs. And so... Starting to happen. It's starting to happen. It's really driven by these large language models, too, which are requiring a massive amount of bandwidth. I mean, one of the challenges with the large language models is that they need high memory bandwidth, like the high bandwidth memory, HBM, but they also need more memory capacity, which is very difficult to do
when you can only go a couple of centimeters through the package before you run out of steam. So going to the photonic domain, now you can cobble together more GPUs, more memory, a larger memory pool, and do it at these multiple terabytes per second bandwidths. And you don't have the losses that you have when you try to do it all with copper. Now, there's a danger here that lasers consume power too. But the thing that I learned in working on these ARPA-E projects, they had an energy efficient data centers project that Keren Bergman was leading from Columbia, and our partners were NVIDIA and Microsoft. Initially, we were really focused on reducing the energy that went into the links.
But after the first couple of years, we realized, well, the maximum we could improve energy efficiency is only 30%. Even if you had an infinitely efficient link technology, that's the max: 30%. But as we learned, a lot of these workloads are held back in performance by the bandwidth that's available. The other side of the value equation is performance, and you have to hit both ends. You want energy efficiency, and there's two ways to get it. Energy efficiency by reducing the amount of power consumed, but you can't forget that energy efficiency is also tailoring machines so that you deliver more on the numerator of that equation, the performance. And we keep on working on each
side. You get to the point where you're like, if you could just deliver me eight terabytes per second of bandwidth, then I could get through this performance bottleneck. And then you try to engineer that eight terabytes per second and you burn a hole in the side of the machine. So then you work on the link side, and you just go back and forth between those two. And that's the path to energy efficiency. It isn't just energy use alone. Right. Excellent. One other thing, well, actually on data movement: the best way to address data movement is to just not move it. Yeah. We can always go back to Monte Carlo if you want. Yeah, I think that that's true. And, you know, we learned a lot of that in the course of migrating from the Cray vectors to these more constrained microprocessor-based systems. And they even have stuff like Jim Demmel's communication avoiding algorithms. Right. And Jack Dongarra... they spent the first half of their careers, you know, when data movement was the least
expensive thing, and it was the flops that were expensive. And so they were engineering algorithms to consume more memory bandwidth in order to reduce the number of flops executed. We have entered a period of time where we want to unroll a lot of what they've done. And of course, the two of them have been very much involved in communication avoiding algorithms. Something that's kind of exciting about chiplets, though, is that if you look at the Heterogeneous Integration Roadmap, which is an IEEE Electronics Packaging Society effort, but it's got a lot of industry and SEMI support behind it, you can now build, using these chiplets, chips that are larger than the reticle size.
So just massive... you know, you can work your way over time up to wafer scale. But the thing that's interesting is that the amount of bandwidth that you can engineer into chiplets, the chip-to-chip bandwidth, is set to double every technology generation for the next five generations of technology. And so when you look at that, combined with co-packaging of optics, there's a huge opportunity here. Oh, wow. People in HPC would be excited about that. Absolutely. Yeah. Yeah. They've been begging for this, right? Totally. So this leads me to processor-in-memory and memory-centric computing, and all these large language models are very memory hungry. Yeah. What that means for system architecture. Sure. We've always had this love affair with processor-in-memory; it actually predates John von Neumann. You know, the machine that von Neumann wanted to build actually had the logic and the memory
co-integrated, but they realized there weren't enough vacuum tubes in the universe to do that at the time. So they created this... now, the von Neumann bottleneck, you know, the memory and the compute separated. But that wasn't what he wanted to build; that's just where he ended up. So I feel bad for von Neumann that his namesake is this memory-compute dichotomy that he didn't even like. So yeah, I think we've flirted with processor-in-memory
since the dawn of computing, and it always comes back. And there's so many different permutations of it. The challenge is, just like these large language models, you never know when you need a lot of memory for an application versus when you don't need a lot of memory. And we came up with this von Neumann separation so that you could pool your memory as a resource, and you could carve off little pieces when your application needed a little bit and larger pieces for kernels that needed a lot of it. And whenever you look at processor-in-memory, you end up with kind of a memory-poor
architecture. When you do PIM, you get a lot of memory bandwidth, but you don't get a lot of capacity. So it's a trade-off, and it's always difficult. So whenever somebody says, well, we're just going to get a good processor-in-memory, I'm like, which of the 150 permutations of processor-in-memory have we got here? So I do like processor-in-memory, don't get me wrong. You know, most of our memory is based on DRAM, which is the passive memory. It's the highest density, but also offers us good performance. But it's not working out the way it used to, in that the cell cycle time, the amount of time it takes to read out a little bit of capacitance when you select a row in a
DRAM technology, and then you've got to amplify it up to logic levels, the rate at which you can do that has not improved substantially over a number of generations. So the result is that when you select any amount, a cache line, a byte, you know, whatever, you end up having to light up about eight kilobytes of memory, rows of memory, in order to get the bandwidth. So we're running into a similar problem as the old Crays with the bank switching and stuff. Interesting. So for example, if you were to embed computing, I think Samsung has a product that has some compute built into the sense amps of their high bandwidth memory.
And the disappointing thing is that for linear operations, you can get an 8x boost in performance by doing that. But it's not 100x, it's not 1000x; it's 8x. I think that there needs to be a significant revision of memory technology, having something that's as dense as DRAM cells but has a faster cycle time. I have seen a few examples of this, some interesting technology that's SRAM-like that uses negative capacitance that could fit the bill. But I think that there is somewhat of a crisis in memory technology that
would be necessary to unlock that performance boost of doing true processor-in-memory. What about technology that enables mix-and-match chiplet building, but involving multiple vendors? I would assume that's a big issue. It is. There's progress in that space. So yeah, there's so many outstanding engineering problems. There's thermal issues. There's how do I bond the things? Do I use copper pillars? Do I have to make the industry agree on the spacing of those pillars? Which foundries can do it? All those things have to be worked out. And then the protocol for speaking between those chiplets, even the wiring map, needs to be standardized. So earlier I talked about what pushed chiplets over the edge. The thing that really started chiplets was the emergence of high bandwidth memory that forced the JEDEC
standard that forced industry members to collaborate. So there really are third parties delivering chiplets, but only high bandwidth memory. We're talking about now moving chiplets from just being the high bandwidth memory co-packaged with the GPU to other functionalities that are normally on a monolithic chip, partitioning those up into separate modular chiplets. And that's something that AMD was a leader in. They figured out all those engineering issues internally, so they only had to agree with themselves for that. Similar things have happened with Intel. So Intel has its Foveros and all of its other names for its advanced packaging technologies, again, entirely within their own foundry and within their
Starting point is 00:33:10 own organization. What's happening now is we're seeing things like UCIE emerging as a potential multi-vendor protocol for die-to-die communication for these systems. There's a new JEDEC standard that came out for describing a chiplet, all of the mechanical, all of the electrical, the thermal, and the die to die interfaces so that EDA tools can now be used to engineer these co-packaged chiplets and a lot of work on standardizing the manufacturing process for assembly. There's still more things that need to be standardized, but the first steps, some very mighty trees have fallen in terms of the path towards multi-vendor chiplets in what they call the open chiplets economy. I still think it might be five
years out or so before we really see this pick up steam. A lot of people think it'll be maybe two companies collaborating together initially, and then three, and then they start growing from there. But a lot of the engineering work that's needed to make this a reality is starting to happen. It's just a matter of time, or a big question about when, that applies to other functionalities that are in the package. That's awesome. You mentioned in our pre-call your effort and your colleagues' on the topic of reinvention of supercomputing. What are the pillars of that reinvention, some of which we've talked about? The hyperscalers are going full steam at that with things like the TPU and the Graviton. We need to ask the question, though: well, what if AMD or one of these vendors did offer us a space, a chiplet, inside of their package? You know, what would a scientist put there? What's the most productive thing that we could do with our limited engineering resources? And then build from there. I think it's a deep
Starting point is 00:35:01 issue. We've gotten out of the practice of getting into that depth of hardware design. Like you mentioned earlier, we don't understand what's the most effective software interface to access these chiplets. I think that the most efficient thing to do right now, our paradigm is based on invoking optimized software libraries. And yeah, you could hide a chiplet behind that, where when I call that subroutine call, it goes to that chiplet, and then it returns a result. And that's all great, except that the most effective thing to do is to flow the data through all of the chiplets that have all their specialized functionality in a data flow fashion, so they're all lit up at the same time.
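(A purely hypothetical sketch of the two styles John contrasts here: hiding an accelerator behind an optimized library call that returns to the host each time, versus describing a pipeline up front so data could stream from one specialized unit to the next. The chiplet dispatch is imaginary; none of this is a real vendor API.)

```python
import numpy as np

# Style 1: the library-call model. The caller sees an ordinary subroutine;
# an FFT or BLAS chiplet, if one existed, could be hidden behind each call,
# but control returns to the host between calls.
def solve_step(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    spectrum = np.fft.fft(a)                    # could offload to an FFT unit
    filtered = np.real(np.fft.ifft(spectrum))   # ...and back again
    return filtered @ b                         # could offload to a BLAS unit

# Style 2: a dataflow-ish formulation. The stages are declared as a pipeline;
# this toy version still runs them one after another, but a dataflow runtime
# could overlap them so each specialized unit stays busy at the same time.
def pipeline(stages, x):
    for stage in stages:
        x = stage(x)
    return x

result = pipeline(
    [np.fft.fft, lambda s: s * 2.0, np.fft.ifft, np.real],
    np.random.rand(1024),
)
```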
Starting point is 00:35:39 We don't write code like that right now. We don't have compilers that can transform code so that it can do that effectively. So that's a whole paradigm shift in the way that you express algorithms. It'll require applied mathematicians, not just compiler writers, but applied mathematicians have to rethink their algorithms. And then there's the economic model aspects. I think that there's no future where we plow this road on our own. We need to understand what the hyperscalers are doing. We need to understand the economic model.
Starting point is 00:36:08 We need to figure out where HPC could plug in and get the most leverage from an ecosystem that is being built to serve the hyperscalers, not us. But I think we did it before with the attack of the killer micros, the commercial off the shelf. And it's time for us to really consider how we plug into the new economic reality that's emerging with these specialized systems. John, there are also semiconductor advances that try to move towards reconfigurability in some fashion, maybe not quite on the fly, but close enough.
Starting point is 00:36:41 Oh, yeah. Do you see those as a manifestation of this? Potentially. Customization. Yeah. So, you know, I have a long history with FPGAs. I thought the future was reconfigurable computing until I spent four years writing a compiler for one of those.
Until place and route became normal. Well, it would take eight hours and then you'd find there was something wrong, or it would take eight hours and you'd find that it was going to run at 10 megahertz and you don't know why. It can be frustrating. The challenge with FPGAs is that 80% of the wires on an FPGA are there just in case you need them, but you won't use them; they're there just in case. And so because of the data path, the granularity of reconfigurability is bit level, you know, in an FPGA,
it leaves a lot of performance on the table, at least if you're predominantly doing single or double precision floating point. It can be very challenging to extract the full performance capability out of an FPGA. So that's where these CGRAs come in. If a lot of the density challenge in an FPGA is that it's bit reprogrammable, even though I didn't need it to be bit reprogrammable, you have a much more effective data path with these emerging data flow systems. And we've had some really great collaborations with SambaNova and Cerebras, reconfiguring algorithms so that they can take advantage of those coarse-grained reconfigurable architectures. So I do think that they are also a promising direction. So rather
Starting point is 00:38:10 than having Chiplet as your means of specialization, using these kind of data flow-like systems is also very interesting. I'll observe that a lot of the programming challenges about redesigning your algorithm so that it can be amenable to that data flow are common to the chiplets and the CGRA-based systems. So there's a shared opportunity there to reimagine our programming environment, and people should be very clear that there isn't any future, whether it's data flow or it's chiplets, where you don't have to rethink how our programming paradigm is going to play out in the future. But they are also a promising approach. I'd caution you that I am interested in chiplets, but that doesn't mean it's the only approach. There's a lot of different ways to get that
Starting point is 00:38:53 kind of specialization. Right on. So John, thank you so much. What a pleasure to go through all of this. It is very clearly an exciting time in supercomputing as the reinvention takes place, and I'm just delighted to be exposed to the work that you do and your colleagues do. So thank you. Great. Thank you. I'm sure I can go for hours more. And I look forward to an opportunity to do that. Right. Okay. Sounds great. Perfect.
Starting point is 00:39:17 Thanks so much, John. Thanks. That's it for this episode of the At HPC podcast. Every episode is featured on InsideHPC.com and posted on OrionX.net. Use the comment section or tweet us with any questions or to propose topics of discussion. If you like the show, rate and review it on Apple Podcasts or wherever you listen. The At HPC podcast is a production of OrionX in association with InsideHPC. Thank you for listening.
