Big Compute - How Supercomputing Touches the World(s)
Episode Date: December 22, 2020
From the case on your phone to rovers on Mars to vaccines -- supercomputers have played a role in just about everything around us. And many of those projects have rolled through... one of the biggest supercomputing centers in the world -- the Texas Advanced Computing Center (TACC). In this episode, we talk to undercover superhero Dan Stanzione, executive director of TACC, about the many discoveries and innovations his supercomputers have had a role in, and what it’s like to oversee it all. Whether it be Rommie Amaro’s recent COVID-19 breakthroughs or assisting emergency responders after a hurricane, Dan and the TACC are making a real difference behind the scenes of society.
Transcript
If this podcast is getting too serious, I need to stop.
Hello, everyone.
I'm Ernest DeLeon.
And I'm Jolie Hales.
And welcome to the Big Compute Podcast.
Here we celebrate innovation in a world of virtually unlimited compute, and we do it
one important story at a time.
We're talking about those stories behind scientists and engineers who are embracing the power
of high-performance computing to better the lives of all of us.
From the products we use every day to the technology of tomorrow, high-performance computing
plays a direct role in making it all happen, whether people know it or not. So for this episode, we're going to do something a little bit
different than we usually do. Yes, because we had the opportunity to talk to a who's who
in supercomputing. And rather than just listen to my voice explain a topic today, we thought that
we'd let our expert do most of the talking.
Yeah, considering he's worked with high-performance computing projects in
every category and doesn't have just one focus.
Right. People we've talked to recently specialize in an aspect of COVID,
or they specialize in tsunamis. Whereas Dan, since he's a who's who in supercomputing,
has pretty much worked with all of that.
Though, as I understand it, before we even get going, Ernest, you have something you
want to ask me first?
Yes.
Jolie, do you love Mars?
Mars like the planet or the candy bar?
Or wait, is that a candy bar company?
I think it's actually both.
Now that you bring that up.
I was primarily referencing the red planet.
Do I love Mars?
I've never been.
I hear it's wonderful in the winter.
I'm sure it would be really cool.
So yes, yes, Ernest, I love Mars.
Tell me more about Mars.
I think that's where we're going.
Have you seen the movie The Martian?
I have seen the movie The Martian.
This will come as quite a shock to my crewmates.
And to NASA.
And to the entire world.
But I'm still alive.
Surprise.
So let me tell you a little story.
Okay, I like stories.
In the movie The Martian, the protagonist Mark Watney ends up stranded on Mars due to an unfortunate series of catastrophic events on the Martian surface.
Who's the actor that played that?
Wasn't it Matt Damon?
It was Matt Damon.
That's right.
Cool.
Sorry to interrupt.
No worries.
Just got to get that picture in my head, especially if it's of Matt Damon.
And Andy Weir is the author of the novel, if anyone wants to read it.
I've read the novel. It's excellent.
Oh.
So I won't ruin the movie for you, or the novel, which I highly recommend you read before you see the movie.
Oh, too late.
But let's just say that Mark Watney is in a pretty bad spot.
Now, you can either accept that, or you can get to work.
The telecom equipment that he has is unusable at a certain point, and his communication with Earth is now dead silent.
Oh, that's so uncomfortable.
Yes. After some time, Mark Watney realizes that there is one other place
that has telecom equipment he could maybe use. And I emphasize the maybe here. So he sets out
to find NASA's original Mars rover, Sojourner, but more importantly, the lander named Pathfinder.
Pathfinder has the telecom equipment Mark needs to reestablish communication with Earth.
Unfortunately, the original Pathfinder and Sojourner team ceased communicating with Earth back in September of 1997.
Is that a real thing?
That's a real thing.
And, of course, this movie takes place even in the future from now.
So you're talking even more decades between when this original thing went offline and, you know, the movie timeline. And we'll get into that here in a second.
Mars rovers have a long and storied history, starting with Pathfinder and Sojourner and continuing with Spirit, Opportunity, and Curiosity.
Breaking news this morning: Curiosity is a smashing success. The NASA
Mars rover touched down this morning right there on the red planet,
a daring mission with more to come.
The landing capped a journey
that lasted more than eight months
and covered more than 350 million miles.
The original Pathfinder and Sojourner team
had a total service life of 83 sols.
Let me interject here and note that a sol
is equal to one rotation of Mars, similar to an Earth day.
On Mars, however, that rotation takes approximately 24 hours, 39 minutes, and 35 seconds by Earth time measurements.
Wait, wait, wait, wait, wait.
Because I know sol is sun in Spanish.
That's what I know of a sol.
Okay, so a sol is a measurement of time?
Right.
And it's equal to one rotation of Mars.
Right.
So like here on Earth, we call that a day or a day-night cycle, right?
Yes.
24 hours.
However, Mars takes-
A little bit longer.
A little bit longer.
Okay.
Right.
So 83 sols is like 83 days and change.
Exactly.
Okay.
So-
That's not very much.
No, and there's a reason for that, right?
So the design of these things has evolved over time.
The Spirit rover landed on Mars on January 4th, 2004
and sent data back to NASA for over six years.
Whoa.
It sent its last communication on March 22nd of 2010.
And that was the first Mars rover?
No, the first was Sojourner.
This is the second one, yes.
Gotcha.
The Opportunity rover landed on the Martian surface
on January 25th, 2004,
and thus far it has surpassed all previous service lives
at 5,352 sols,
which is approximately 5,498 Earth days, which is almost 15 Earth years or eight
Martian years.
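Those sol-to-Earth-day conversions are simple arithmetic. Here's a minimal sketch, assuming the sol length quoted above (24 h 39 min 35 s); the function name is just illustrative:

```python
# Convert Martian sols to Earth days, using the sol length quoted above:
# one sol is about 24 hours, 39 minutes, 35 seconds of Earth time.

SOL_SECONDS = 24 * 3600 + 39 * 60 + 35   # 88,775 seconds per Martian sol
EARTH_DAY_SECONDS = 24 * 3600            # 86,400 seconds per Earth day

def sols_to_earth_days(sols: float) -> float:
    """Return the Earth-day equivalent of a span measured in sols."""
    return sols * SOL_SECONDS / EARTH_DAY_SECONDS

# Pathfinder's 83 sols is "83 days and change" (about 85.3 Earth days),
# and Opportunity's 5,352 sols comes out near the ~5,498 Earth days quoted.
print(round(sols_to_earth_days(83), 1))
print(round(sols_to_earth_days(5352)))
```

Each sol is only about 2.7% longer than an Earth day, which is why sols and days stay close until a mission runs for years.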
What?
Yes.
How come it lasted so long?
Well, again, the technology is getting better and better as these things evolve.
When this little rover landed, the objective was to have it be able to move 1,100 yards and survive for 90 days on Mars,
90 sols. And instead, here we are 14 years later after 28 miles of travel, and today we get to
celebrate the end of this mission. Opportunity sent its last communication to Earth on June 10th, 2018. So
you're seeing a common thread here. The original three units are no longer in operation. The
Curiosity rover landed on Mars on August 6th, 2012. And as of December 2020, when we were recording
this podcast, over eight Earth years later, the Curiosity rover is still in service and still sending data back to Earth.
Interesting. So Curiosity is actually in use today.
And then the other rovers are just hanging out on Mars, not doing anything now?
Right. They are currently defunct.
Huh. I guess I always pictured us taking them back to Earth, but that wouldn't make sense to do so.
Right. We can't retrieve them.
Yeah.
Now, keep in mind that Curiosity is not the longest serving rover.
It's eight years and Opportunity was almost 15 years in Earth time.
So a little over half, but it's still in service.
Mm-hmm. So imagine for a minute, if NASA had the technology
to extend the lifespan of the rover significantly, right? So not eight years, not 15 years, but let's
say 50 years or a hundred years. Instead of Mark Watney risking his life to go find a dead rover
and lander, NASA could have located the nearest functional rover to Mark and redirected
it to go to him before he lost his original telecom equipment as a precautionary measure.
Yes. So extending the distance traveled and overall service life of future Mars rovers
is just one of the many problems that NASA scientists are trying to solve
with the help of supercomputers.
Oh, I see where you're going with this.
Yes.
Okay.
More specifically, the supercomputers at the Texas Advanced Computing Center, or TACC for short.
Much of the computational heavy lifting is done within supercomputers at the TACC, but NASA is looking into onboard high-performance computing for rovers, and the TACC is helping lead the charge.
This isn't the first time the TACC has surfaced in recent stories about supercomputing, some on this very podcast.
I was going to say, I remember talking about the TACC. We talked about it on one of our COVID episodes.
Yes. So we wanted to delve a little deeper into what the TACC is and why it is so important.
Enter Dan. Is it Dan Stanzione? Is that how you say your name?
That is correct. There's a broad family split between Stanzione, Stanzione, and the Italian
Stanzione, but I'm a Stanzione. Yes. I love it when there are family disputes on how to spell
or pronounce last names.
My name's been butchered my entire life, so one day I'll share that story.
Suspense.
Dan is a pretty interesting guy.
When asked about what he does for fun outside of work, he said...
There's stuff beyond work in life? I wasn't aware of that.
I know, isn't that crazy?
And you know, it's election day, so today it's reloading election results over and over again.
Hitting the reload button on various and sundry websites over and over again.
I bet our listeners can tell when we recorded this with Dan.
But no, I'm pretty committed to this stuff.
Spend some time on a boat here and there when I'm not doing this or, you know, living in Texas.
I obviously watch a lot of football.
What is your title and what are your responsibilities?
The titles are long and sound more important than they probably are, but I'm the Associate
Vice President for Research at the University of Texas at Austin. And at universities,
if you're an Associate Vice President, that means you have a good parking spot. And that's really
the only purpose of the title. But then beyond that, I'm the Executive Director of the Texas
Advanced Computing Center. We're in a center here of about 180 people who are involved in building really big computers
and then finding ways to use large computers to do science and engineering work for both
the University of Texas and for other people doing unclassified research all around the
country and the world.
Having a good parking spot definitely means something.
I didn't have one at my last company. And let's just say I had a couple people hit and run on my car because it
was parked on the street. That was fun. Man, that sucks. I know. The TACC is doing some really great
work in the world today. But before we delve into that, what does a day in the life of Dan look
like? I'm sort of a scientist who's never
had to pick a field. One day we might be working on hurricane forecasts and impacts on buildings
and structures and doing simulation for that. The next day it's genomics and drug discovery.
When you're involved in computing, you're involved in sort of every facet of science and engineering
around the world. A lot of it is the less exciting part of keeping everybody paid and writing
reports, but the fun parts are getting to work with the really cool scientists that we work
with all around the world and getting to design and build the next generation of the world's
biggest machines. That is so interesting. I mean, I'm curious, what did you study to land you in
this place? Because it sounds like now you're probably very well versed in everything from physics to chemistry
to, you know, life sciences. You probably know at least a little bit about all of it at this point.
A little bit is probably the functional phrase there. So I have enough to talk to people in
those things. By training, I'm an electrical engineer as an undergrad for my bachelor's
degree. And then my graduate degrees were actually in computer engineering. I wanted to
design and build really fast computer chips. The sort of what I do now was not a thing
that really existed when I was going through school. And I just sort of got sucked into it.
By the time I was doing my PhD, I was working with a genomics center and the sort of chemistry
and material center being the computing guy for the scientists. And, you know, it started out
just something I did in grad school and then started doing projects with the folks here at TACC and started doing more and more.
And I came over here as the deputy director in early 2009 and took over as director in 2014.
It's always interesting to see where people land and what they go through to get to where they are
in their careers, because it's it's never what we'd think.
Never. Even in my case where it's very close, it's not, you know, what I originally thought.
It's crazy. Like my mentor when I was getting my master's in film.
Peter Jackson.
How did you know? He does have an Academy Award in screenwriting for writing The Sting.
But the whole way that he launched his career is he got in a car accident
with somebody who worked for a film company. And then he ended up giving a script to that person
and that launched his career. So he was a great mentor. But when it came to actual like, how do
you jumpstart your career advice? It was not very useful because I couldn't figure out who to crash
into. Yeah, it was literally luck. Yeah.
Dan also spoke about the evolution of supercomputing from something that only government and military really had access to because of the cost to where it is today.
This just wasn't a thing.
You know, the National Science Foundation started investing in this in sort of the mid 80s.
The rise of microprocessors and then the cloud has just made it much more accessible to so many more researchers and types of researchers that these centers have sort of grown and spread.
So it's been an interesting time.
What do you enjoy most about what you do? You know, we get to work with some really fantastic scientists around the world and do fascinating, fascinating things.
You know, I think in almost any job, the thing that is most rewarding is the relationships and the people that you get to deal with.
Graduate fellowship deadline for NSF was last week, and I was writing reference letters for just some remarkable young people who've done amazing things we've gotten to work with.
And, you know, you learn about their lives and stuff like that.
But at the same time, we get to do really impactful science.
And I mean, we're not the ones necessarily out there doing it, but we're making sure it can be done.
We've assisted on Nobel Prizes. We've assisted on, you know, really some groundbreaking discoveries, some that are more sort of theoretical and basic science, some that are just fun, like better spaceship engines.
And, you know, this year it's been a huge amount of work around COVID vaccines and the structure of the virus and things that are going to have in the very short term, you know, real impact on people's lives. And I get to play with really big computers and show them off to thousands of people
every year. So there's lots of things that you enjoy in this kind of job, but it's got a huge
variety. And like I said, in the end, it's the people that make it worth doing. It's always about
the people we work with, isn't it? Hands down, absolutely. But for our listeners, why don't we go into more
detail on what the TACC is? Like, what do they do? Good idea. So TACC, or the Texas Advanced Computing
Center, we are, in my humble opinion, the best of the academic supercomputing centers and one of the
largest in the world. We run the largest university-based supercomputer in the world,
and we run a bunch of other computing and data systems for folks, but really we exist to help people do scalable things for the challenges
that we face in science and society, right? If you're working on a problem and almost every
scientific problem has a computational piece now, whether it's simulation or data or AI that you're
dealing with, and eventually you're going to scale off your laptop and that's where we get involved. And it's the hardware and the people that make that happen.
Got it. And I already know the answer to this question, but I'm going to ask it anyway for
our listeners. Where are you located? We are here in Austin, Texas. We're part of the
University of Texas at Austin. We actually live at the J.J. Pickle Research Campus. So we're about
eight or nine miles north of downtown Austin, but we really serve users all around the country and around the world.
This might be a dumb question, but are the supercomputers actually physically located
there as well?
We are essentially one of the cloud providers for academic supercomputing.
So most of our users don't ever actually see or touch the machines, but we have the actual
physical data centers and physical machines in the building I'm sitting in right now where
we can supply about 10 million watts of power to keep them going. How and when was the TACC founded?
It was founded in 2001. And we really got on the map when we won the Ranger system,
which was one of the big National Science Foundation systems that was the number four
machine in the world when it first came up. And that was in 2008. And that's when TACC sort of
became one of the real leaders in providing things, not just here at UT, but around the country.
Now, Dan, we interview a lot of undercover superheroes on this podcast, and we think you may know one, Rommie Amaro.
Oh, yeah. Rommie has been our biggest user the last six months because of all of her COVID work.
And you see her enthusiasm and energy and her ability to change the world. Who wouldn't want to help people like that?
That's mighty big praise from the executive director of the TACC.
Yeah, I could totally understand that. I mean, after talking to her, we were like, that woman is amazing. I totally agree. And I mean, props to you for helping her out and getting
the research off the ground as quickly as you did. I mean, we did the math on the show and the amount
of supercomputing resources,
like you said, that she was using,
it was quite the chunk of compute power.
And we started those runs at pretty large scale
the last week of February.
That was our first COVID-related research project.
And by mid-March, it was the largest one we were running.
So, and there's been 50 something others since then.
But we got started at scale quickly, largely because again, it's back to the people in the relationships, right? I already knew
Rommie and I already knew what she did and she's used our systems for years. So she knew how to
get on and be effective right away doing what we're doing. And, you know, rather than going
through some complicated bureaucratic process, you know, when COVID was becoming what was obviously
going to be a big thing. She sent an email.
And she said, hey, Dan, I don't know what you guys are up to, but, you know,
this virus, this virus is looking pretty serious.
I think we probably need to do something with it.
And I said, I know you, I know the work you're doing is great.
So we can make that happen today.
It was amazing.
I mean, it was amazing to have that level of support,
but that was really key in sort of getting time to solution very quickly for this effort.
We were off and running. And, you know, again, we've done 60 something other projects with different people since in the COVID space.
But you get off the ground quickly because the infrastructure is in place and the relationships are in place and the knowledge and the training are in place to make all of this stuff happen.
And that's why we could start so quickly on that work.
And I was so grateful for them. The output of Rommie's work has been the
input to some other work more upstream in sorting through billions of possible compounds for good
drug candidates. It was done by some people at Argonne National Labs and the University of
Chicago and Rutgers and University College London, and just a whole bunch of other people around the
world. Took that sort of basic structural work that Rommie did and built it into this AI-enabled vaccine discovery pipeline. The 4 billion compounds we
started with, they handed about 30 off to medicinal chemists to start fabricating and start clinical
trials on by July or August. And so getting that work done early was key. And she's kept discovering
new things. I'm sure she told you about, you know, figuring out that the spikes on the coronavirus sort of wrap themselves in a sheet of sugar and then, you know,
they're going to get into a cell. Right. And all of that. And that's what helps it hide from the
immune system because it just looks like sugar molecules. Right. And so, you know, none of that
is stuff that we knew in January and we know it all now. Also, you said 60 different projects
that TACC is involved in when it comes to COVID-19
right now. Did you say 60, 6-0? Yes, I think that's about right. It changes a little bit every
week. It might be 59, it might be 63, but we're in that neighborhood of different projects we've
supported. Some of them are at the sort of structural level, like Rommie, where we're working
at the molecular level, you know, and that's 20 or so projects, 25, 30 in that realm. We have another
10 or 15 that are more on the sort of human side of that, right? You know, you're modeling
societies or using cell phone data to figure out how much people are interacting and where,
but there's other pieces of that, right? You know, what are the causes? What are the projected
spread? How do the aerosols spread on a plane? You know, all sorts of things like that. And then we
have some that are sort of in the middle, looking at the genomic scale, figuring out the sort of evolutionary history of the virus,
which helps, you know, if you know what the virus is, it's related to, that gives you some insights
and treatments, but also the people it infects, right? We know beyond any doubt at this point
that it affects different people differently. You know, there's a lot of sort of preexisting
conditions that feature into that, but a lot of it is also genetic, right?
You know, what strings of genome do you have that somebody else doesn't have that make you more or less vulnerable, right?
And the more we can understand that, again, we might be able to isolate and build therapies based on that or figure out who actually needs different kinds of vaccines to do a more personalized approach.
And we actually started in March with a number of the other supercomputing centers and the cloud providers through the White House and the Office of Science and Technology Policy.
We put together the COVID-19 HPC consortium.
Yeah. And so at this point, more than half of our projects have come through that mechanism where they write to the consortium and then they get stuck here at TACC or at the San Diego Supercomputing Center, or maybe on Amazon or Microsoft. So it's interesting to think about how
you have multiple projects that are using data
that was collected from other projects.
So like Rommie Amaro's work, right?
A lot of the data that she's been able to gather
is now being fed into the supercomputers
for different research.
And I mean, that's pretty cool.
It feels like the supercomputing world now
is allowing us kind of this time machine forward.

From supersonic jets to personalized medicine, industry leaders everywhere are accelerating innovation with unprecedented speed and efficiency by using Rescale, the intelligent control plane that allows you to run any collaboration on hybrid cloud. Rescale empowers IT leaders to
deliver high-performance computing as a service with software automation with incredible security,
architecture, and financial controls. And as a proud sponsor of the Big Compute podcast,
Rescale would especially like to say thank you to all of the scientists and engineers out there who are working to make a difference for all of us. Rescale, powering science and engineering breakthroughs.
Learn how you can modernize HPC at rescale.com slash bcpodcast.
I really love to hear about the great work that scientists and researchers are doing around the world with supercomputers.
And also love to hear about the tech specs of the machines that they are working on.
I remember when we were talking to Dan, things got pretty technical and it was so fascinating.
Absolutely.
So I asked Dan, what are the names, sizes, and the tech specs of the various machines that occupy the TACC?
We have about 15 different production platforms at this point.
The biggest one right now is Frontera, and that's our sort of leadership class system.
It's actually a collection of several different kinds of systems, but the biggest piece is an Intel Xeon-based,
a little over 8,000 compute nodes that can do about 40 petaflops with about 425,000 Xeon cores that make that up.
There's also about 1,000 GPUs attached in various subsystems focused more on the machine learning
side of things. It has about 50 petabytes of fast file systems from data direct networks.
The network comes from Mellanox. We have a 200 gigabit InfiniBand interconnect for it.
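As a quick back-of-envelope check on those Frontera figures, here's a sketch using only the rounded numbers from the conversation (not official specs):

```python
# Back-of-envelope arithmetic on the Frontera numbers quoted above.
# All inputs are the rounded figures from the conversation, not vendor specs.

peak_flops = 40e15        # ~40 petaflops of peak compute
total_cores = 425_000     # ~425,000 Xeon cores
compute_nodes = 8_000     # ~8,000 compute nodes

cores_per_node = total_cores / compute_nodes        # roughly 53 cores per node
gflops_per_core = peak_flops / total_cores / 1e9    # roughly 94 GFLOPS per core

print(f"~{cores_per_node:.0f} cores/node, ~{gflops_per_core:.0f} GFLOPS/core")
```

Numbers in that range are plausible for a dual-socket Xeon node of that era, which is a useful sanity check when specs are quoted from memory.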
And then Dell was our integrator who put all the servers together. Although on the GPU side, we also used Green
Revolution for some cooling systems. We use Cool IT for the water cooled parts to the chips. We're
using very high powered chips. So we just pump liquid directly across them at this point. And
the GPU nodes we immersed in mineral oil. So IBM and NVIDIA and a whole bunch of companies were
involved in doing all of that. But that's sort of
our newest and sort of largest system. It debuted at fifth in the world. It's been about a year and
a half, so it's dropped down to about eight in the world at this point. They age just like any
other computer does, but we'll run that one for another four years or so. Our other large scale
system is Stampede 2, which is also an Intel-based supercomputer with about 6,000 nodes. It has a mix of Xeons and what were called Knights Landing cores, the
Xeon Phis, so it also has around 400,000 cores. It's about a 20 petaflop machine,
about 30 petabytes of disk. Frontera does a few dozen of the largest scale
projects, so people get very large allocations, mostly running bigger jobs.
Stampede's sort of our broader mission machine. It's a couple of years older, but it has more
like 3000 projects on it. So, you know, that one has 15, 18,000 users competing for time on it.
And so those are sort of our traditional supercomputers. We have other machines
that have different missions. Chameleon is our sort of cloud test bed, but that one is where
we focus more on computer science research. We have a whole host of storage systems and data intensive computing systems.
We have some more for interactive use. We have some more for visualization. We try and just
provide that sort of whole computing ecosystem that we think you need for modern science and
engineering. Awesome. So one of the things, not surprising to me, but I know it's been
surprising to a lot of people in this industry, is the rise of the ARM processor and ARM
supercomputers. Like Fugaku in Japan. Like Fugaku in Japan. And I'm curious,
what are TACC's plans right now? What are you looking at in terms of ARM and the future and
kind of the deprecation of the traditional x86 platform over, you know,
obviously a very long period of time. Yeah. So it will not surprise you that we have some of
those chips along with a whole host of other things. And certainly the AMD chips, the NVIDIA
GPUs and other ones. There's a number of interesting things about ARM, but specifically the Fugaku
machine and what my colleague Satoshi Matsuoka has been able to do there is they had a very long-term partnership to really purpose design a chip with Fujitsu for this
sort of big national supercomputer.
So that machine was many years in the planning and design because most ARM chips, the kind
that are in your cell phones and things like that are conventional processors.
They have some differences, but fundamentally they work just like the AMD or Intel processors
you have in your laptops and your servers.
But architecturally, they're the same.
But what makes Fugaku unique is that ARM chip they built with Fujitsu is not only a very nice processor,
but it has a bunch of very high bandwidth memory that is integrated directly on the package.
So you don't have to go off the pins to a separate memory chip somewhere else on the motherboard.
That gives you much higher bandwidth.
Right. So it has the memory bandwidth that a GPU would have.
Interesting. So the RAM is on die.
I believe it is actually stacked on there.
Oh, not necessarily on die, but it's on package.
What does that mean? RAM on die? What does that mean?
That means, like, so in a traditional computer, you know, you have a motherboard, and then you have a CPU, and then you have RAM sticks, and then you have, you know, a hard drive or whatever else.
The new chip from Apple has unified memory and it's on die.
And what that means is the RAM is now in the CPU.
Oh, interesting.
Instead of a stick that you plug in, it actually comes on the CPU. It's inside the CPU.
So now you don't have to have that latency of, like, leaving the CPU socket, going across the motherboard, hitting the RAM stick, and then coming back.
Okay.
And in Apple's case, they put the GPU in the CPU also, so you don't have a separate GPU anymore. Like, everything is all in one. Entire computers in one chip.
That's crazy.
Now, to be fair, they had already been doing this, sort of, in iPads and iPhones, but this was like another evolutionary step, because even on those, I believe the RAM was still separate, but it was soldered onto the motherboard or the PCB.
But now they just rolled it all in so that the one chip handles everything.
But are there any supercomputers that actually have the RAM on die or wouldn't they mostly
be separate pieces?
They would almost all be separate pieces.
But Dan did note right here that in the case of the Fugaku one, it's actually stacked.
So what they did is like they have the CPU at the bottom layer and then they put the
RAM on, they stacked a layer on top of it.
Okay, so it's like 3D instead of it being flat. And what's the advantage of doing that?
It's just extremely fast and available.
Less latency?
Less latency. It's much faster. But then there's a downside in that, uh, you can only fit so much in that space.
Whereas, like, with a traditional computer, you can put a terabyte or two terabytes of RAM on a node.
There's no way you can fit that much on a CPU, whether it's stacked or on die doesn't matter.
Thanks for letting me know.
And I remember you asking him that.
And I also remember not quite understanding what you guys were talking about.
It's also very dependent on the software, right?
Like if the software is written to take advantage of the architecture, it could be a lot faster
than traditional computing where the software isn't as optimized, right?
You know, I believe the memory bandwidth on Fugaku is something like a terabyte a second
per node, right?
So for a single CPU socket, right?
Which is about five times better than we can get out of a current,
you know, sort of mainline CPU socket.
But they're seeing some fantastic performance
out of that on some applications.
Now, the downside of that, of course,
is that in that particular design,
because they've squeezed all the memory onto the chip,
they have a lot more bandwidth,
but they have a lot less capacity.
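To make that bandwidth trade-off concrete, here's a small sketch using the rough figures from the conversation: roughly 1 TB/s per Fugaku node versus roughly a fifth of that for a mainline CPU socket. The working-set size is a hypothetical buffer, purely for illustration:

```python
# Time to stream a fixed working set once at the two memory bandwidths
# quoted above. Figures are conversational estimates, not vendor specs.

fugaku_bw = 1e12        # ~1 TB/s per node (on-package stacked memory)
mainline_bw = 0.2e12    # ~200 GB/s, about "five times" less, as quoted

working_set = 100e9     # a hypothetical 100 GB buffer to stream once

t_fugaku = working_set / fugaku_bw      # seconds on the high-bandwidth node
t_mainline = working_set / mainline_bw  # seconds on the mainline socket

print(f"high-bandwidth node: {t_fugaku:.2f} s, mainline socket: {t_mainline:.2f} s")
print(f"bandwidth-bound speedup: {t_mainline / t_fugaku:.1f}x")
```

For a purely bandwidth-bound kernel, that factor flows straight through to runtime, which is why the trade against capacity is worth making for some applications.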
So looking to the future,
what are the TACC's plans
in terms of new supercomputer designs? You might imagine, you know, we're always designing
the next machine. And right now we're planning what the National Science Foundation will call
the Leadership Class Computing Facility. It will ultimately replace Frontera and some of our other
infrastructure in sort of the 2024, 2025 timeframe, Congress permitting. That's a whole other story.
You know, so we're in design for that
machine now, and we're sort of taking apart the application space. And it's really more than
applications, right? You have to understand the field of science, but you also have to understand
the method they're using, right? But when you change algorithms, that can be a big deal for
some of these. And we're sort of mapping that to the chip space across GPUs and CPUs and now TPUs
and all these sort of coarse-grained arrays that people are
building for AI chips and field programmable gate arrays, but also how can we get enough
memory bandwidth to it? How can we get enough network bandwidth to it, right? Do we want to
have a single type of chip on a node? Do we want to have heterogeneous nodes with a mixture of
accelerators and conventional processors, right? And trying to find that right mix for a few years
out is what we're spending a lot of our time on now.
So I know we have a ton of cool stuff
coming down the pipe right now in the world of silicon.
While Intel held a seemingly insurmountable grasp
on the x86 chip market for years,
AMD has now surpassed them, as has Apple
with its new Apple Silicon.
Although I did use AMD back in the day
when they, for a brief period, had more
powerful processors, but then they fell behind. And of course I used Intel until now when AMD
released the more powerful processors. The irony is I waited so long for these things that I just
jumped ship entirely and went with Apple Silicon. And you're a pretty diehard Apple person, and I'm pretty diehard not.
And we'll leave it at that. And yet we coexist. I know, somehow we manage to live on the same planet.
Mars. Mars? Yep. Yeah, it's one of the things that I love to talk about in general, is just
the feedback loop that happens, right? So, we've had the ideas for innovation long before the technology was
produced. And that was mainly a function of material science and the ability to engineer
these things. And now, especially in this sector, there's a feedback loop because the same
supercomputing machines that are used to run the simulations for chip design and the simulations for material science that go into building these chips are being powered by the generation that was produced before.
So the faster we advance our computing ability and AI with predictive analysis, as well as material science in general, it just creates this kind of vortex where it starts going
faster and faster and faster. And part of it may not move as fast as another, but each of these
feed into themselves at some point. Yeah, I think that's absolutely true. I've studied a little
of this and it's often we're inventing the mathematics and the algorithms for these things
decades before they actually become useful, right? The fundamentals of digital circuits are Boolean algebra, right? And George Boole designed Boolean algebra in 1854,
and I don't think he had Intel chips in mind. The math leads the application sometimes by 50
or 100 years. You need stronger computers to create stronger computers. That's really interesting.
Exactly. And that's an excellent way to boil it down, I think. So one of the things we'd love to
discuss here on the Big Compute podcast is cloud computing.
More specifically, the intersection of traditional on-premise supercomputing and cloud HPC.
You know, in many ways, I think our overarching stance on that is just simply that we are the cloud, but we are a very special kind of cloud for academic scientific research, right?
So in some ways, we're a bit of a specialized cloud,
but we're tuned both in our hardware and in our support and in our software stack around
scientific simulation and AI and analytics. So which means for services that aren't the thing
that we're good at, we tend to rely on commercial clouds, right? But there have been impacts of just
the sort of ubiquitous adoption of the commercial cloud that I think have been good for us and technologies that we've transferred back and in places
where we partner.
So in the Frontera project, we have bridges to the major commercial cloud providers, and
we see our users using sort of both and doing crazy hybrid things in a good way.
So one of them is climate simulation, right?
We do a bunch of massive climate forecasts
of very high resolution on Frontera
that take millions of processor hours.
And then they dump out a vast amount of data.
They do the simulation piece on Frontera
and then they push the analytics piece to Azure.
And we push the data back and forth and publish it.
I think cloud has changed the sort of expectations
of users and the usage model.
Part of it is that we see less tolerance to wait in batch queues and more demand for interactive things.
But the other part is this sort of shift to almost ubiquitous but persistent web services.
And so, you know, we wrap RESTful APIs on top of our supercomputers now, right?
We still have tons of traditional users who come in, log into a Unix command line, run batch jobs, and work in that environment. But we have more and more where
the supercomputer is just sort of an automated resource living behind an API for data processing
for the Large Hadron Collider, for some robotic phenotyping work we do. There's just so many
different things where it's becoming sort of HPC as a service. And that has grown out of innovations that have happened in the cloud space.
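As an editorial aside, here's what that "supercomputer behind an API" idea might look like in miniature. This is not TACC's actual API; every endpoint-style name and field below is invented purely to sketch the shape of an HPC-as-a-service job submission.

```python
import json
import uuid

# Illustrative sketch of "HPC as a service": a thin wrapper that turns a
# REST-style job submission into a record a batch scheduler could act on.
# All names and fields are invented for illustration, not TACC's real API.

JOBS = {}  # stands in for the center's job database

def submit_job(app, inputs, nodes=1, wallclock_minutes=60):
    """Accept a job request the way a REST endpoint might, and queue it."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {
        "app": app,               # e.g. a registered simulation code
        "inputs": inputs,         # paths or URLs to input data
        "nodes": nodes,
        "wallclock_minutes": wallclock_minutes,
        "status": "QUEUED",
    }
    return job_id

def job_status(job_id):
    return JOBS[job_id]["status"]

job = submit_job("storm-surge-model", {"track": "hurricane_track.nc"}, nodes=128)
print(json.dumps(JOBS[job], indent=2))
```

The point is that a data pipeline, say for the Large Hadron Collider, can drive the machine programmatically without anyone ever logging into a Unix command line.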
You know, at this point, I think we're one of what will be a set of sort of boutique
niche specialty clouds that will exist within the context of the larger clouds, but will
be linked together by services.
And a lot of our users will use them sort of synergistically, I would hope.
You know, it's sort of not an either or.
And I think whenever it becomes an either or, it's a sort of silly conversation, right?
It's a both.
Yeah.
And I love to hear that, by the way. That TACC has taken the approach of kind of bringing
the two worlds together is really going to make a huge difference in terms of not just
the usability and the extensibility of the services you offer, but the ability for others to
interact and engage with the data coming out of there in a larger context, right? A global context.
Yeah. There will be specialties of the commercial cloud that we want to use, right? I mean,
you know, image tagging, right? Things like that, where there's already a service,
you know, language processing. There's some of these where we can just use a cloud service as
part of a larger scientific workflow or use the cloud as the virtual desktop interface to let people get to the supercomputing resources.
There's so many places where I think collaboration is not only possible, it's just the right thing to do.
I love Dan's vision here that TACC is the cloud, just a special kind of cloud, and that interoperability with commercial clouds is just the right thing to do. I'm curious about the process that researchers go through in order to be able to access
TACC supercomputing. So if I'm a scientist or researcher, I need access to supercomputing.
What do I do to get it? And is it completely complimentary?
There's a few ways to do it. And so for our big National Science Foundation supported machines, there is a
process in conjunction with the NSF. And there's another project called XSEDE that does allocations
across the NSF supported supercomputing centers. So if you're a scientist, you write a proposal
for time, you know, in addition to your proposal for funding, and you say, all right, I have this
project, and I need X computing resources to do it.
And an independent review panel gets together and sort of looks at the suitability and recommends different machines for them to go on.
And so essentially the NSF pays us to build the machine and operate it.
And then we make an amount of time available on that machine every quarter.
And then this sort of neutral third party project comes through and hands out those chunks of time to the users. And in that model, we're not charging the users directly. Right.
All the time is paid for by the NSF in support of what they're doing.
So it is free to them. But it is not free, as I like to remind them.
It costs many millions of dollars. Right. For the time.
We do also, you know, we have some Texas funded machines that are
a similar process, but open to Texas researchers. And then, yeah, the sort of third fallback
mechanism is, yes, some people just pay us for time, right? When they can't get time through
another process, corporate partners, and then some academic researchers or labs who want dedicated
time will just come in and buy a chunk of time to make sure they have it. So they don't have
to go through that proposal process every time they need more time. So we're all about undercover superheroes
and we consider you to be one of those, Mr. Dan Stanzione, because you're helping all of these
projects move forward and these projects are changing the world. So what are the most memorable
or interesting projects that TACC has been a part of? There are so many, and very few of the things we do
change the world directly.
We're letting other people do that, right?
And this being 2020, you can't talk about anything
without talking about COVID, right?
And the fact that we've been able to help world-class people
who do incredible science really fight back
against this pandemic and do things that
probably less than a year
from when we run the computation will turn into therapies or policies or vaccines that will have a
sort of immediate impact. And I sort of call that segment of what we do urgent computing. But the
more I've thought about it, the more I realized that a huge amount of what we do is urgent
computing, right? We just started a new computational oncology partnership with MD Anderson in various kinds of cancer research, doing sort of personalized
dosage levels for particle and proton therapy, the sort of most advanced kind of radiation,
instead of just sort of eyeballing it. And, you know, to people who have cancer,
that's no less urgent than the work we do for COVID. And, you know, this year we've had 12 major Atlantic hurricanes.
We've run simulations for all of those. We do the storm surge models that lead to the evacuation
orders. You know, we run a ton of that stuff and that, you know, within days of doing it
becomes part of people's lives. We see the longer term stuff too. When a hurricane comes through,
we have teams that we work with from universities around the country that go out in the field and
start taking pictures right afterwards, particularly buildings and all the structures
that are damaged. And then we bring all that data in and do analysis, one to make, you know,
to sort of understand and make building codes better. And there's a ton of what's called
vorticity, these little vortexes that form right where a roof corner is. And if they can lift
off the corner, because it's not tacked down right, then water can get into the house and everything
else bad happens. Right. So we're starting to look at, you know, potential AI methods
where you could go and say there's been an earthquake in a city,
you have 50,000 buildings you need to go inspect, right?
Which are the priorities, right?
That's a great AI problem because if we can just help the first responders
to prioritize, these are the buildings you really need to go look at
before you can let people get back in, then that's a huge help. We did some work, you know, when a hurricane hits Galveston,
we have enough data that we can do a model of the storm surge. We can overlay a GIS model of
all the houses in Galveston and all the buildings. And we know what the height of their foundations
are. And we know that the electrical outlets are 18 inches above the foundation height. And we can
model everyone that would have been inundated to the level of electrical outlets,
which means FEMA has to go in and inspect
before they can let people go back
into those homes or buildings.
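The Galveston inundation check Dan describes boils down to a simple comparison per building: did the modeled surge reach the electrical outlets, 18 inches above the foundation? Here's a minimal sketch of that logic; the building data is made up for illustration.

```python
# Illustrative sketch of the Galveston-style inundation check: flag buildings
# where storm surge reached the electrical outlets, which sit 18 inches above
# the foundation. Building data here is invented for illustration.

OUTLET_OFFSET_FT = 18 / 12  # outlets are 18 inches above the foundation

buildings = [
    {"id": "A", "foundation_ft": 2.0},
    {"id": "B", "foundation_ft": 5.5},
    {"id": "C", "foundation_ft": 1.0},
]

def needs_inspection(building, surge_ft):
    """True if surge water reached outlet height, so FEMA must inspect first."""
    return surge_ft >= building["foundation_ft"] + OUTLET_OFFSET_FT

surge_ft = 4.0  # modeled storm surge height at this location
flagged = [b["id"] for b in buildings if needs_inspection(b, surge_ft)]
print(flagged)  # → ['A', 'C']
```

The real workflow runs this comparison against a GIS model of every structure in the city, fed by a storm surge simulation rather than a single hand-set number.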
There's new battery technologies
that ultimately will really change the world.
There's first observations of gravitational waves
that were predicted in 1915
and actually first observed in 2015.
Or the Higgs boson discovery
had an enormous amount of computation, right?
The sort of subatomic structure of the universe.
Food production.
How do we do hybrids of corn to increase yield per acre, right?
There's such an incredible variety of things we get to play a part in
that it's just wonderful.
And so my favorite one usually is the one I've worked on in the last few weeks.
For the layperson, my mom's a pianist and a semi-pro pickleball player, right? She doesn't do supercomputing. She doesn't even
know what a supercomputer is. What would you tell somebody like my mom who has no idea what's going
on behind the scenes if you were to describe what supercomputing is doing for her? What would you say
to somebody like that? When you think about what supercomputers can
do, right? If you look at your cell phone, there's the amount of computation that went into the design
of the case so that when you drop it, it bounces and doesn't crack and rattle the motherboard, of
the chips that are inside it and how they work, of the materials that make up the batteries so that,
you know, we can now take this tiny little device that fits in your pocket and have it run for days while wirelessly communicating.
There's computational design in all these things.
When you have a problem that's too big to solve on any other computer, that becomes a supercomputing problem.
When you have a problem where you just have too many of them. We need to look at every possible design for how we reduce the drag on a wing for an airplane to make it
more fuel efficient. You can't build a million 747 prototypes. You do most of that computationally.
And then finally, when you have deadlines, right? There's a hurricane. It's two days from reaching
the coast. I can't spend six days figuring out where it's going computationally, right?
When we can sort of predict how things move, right, with Newton's laws of motion, or if I can build a mathematical model of a physical process in the universe,
if I want to ask a question of that model, I need to run a simulation, right? That's what a
simulation is, and that's what we do in computing: ask questions of these models that we can build.
Or when I have instruments that are looking at the sky and I need to process data, or read all the data off a genome sequencer,
right? That's analytics. And we're doing that computationally. And then finally now where we
have these huge corpuses of data, like everything everyone has ever tweeted or every cell phone
position of every person in the country, or, you know, any of these other vast data sets,
the genomic data for everyone who's ever had COVID, right? And we want to crunch through all
that data and understand it. And we can't, for things like genomics, build the mathematical model. We can build the statistical model of
those. And that's basically what we call artificial intelligence at this point,
is asking questions of these sets of data. And supercomputers do all those things too.
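Dan's definition of simulation, asking a question of a mathematical model, can be shown in a few lines. Here's a minimal sketch using Newton's laws: given a launch speed and angle, where does a projectile land? The numbers are arbitrary; the point is that we step the equations forward in time instead of throwing the real object.

```python
import math

# A minimal example of "asking a question of a model": given Newton's laws,
# where does a projectile launched at 20 m/s and 45 degrees land?
g = 9.81                    # gravity, m/s^2
v = 20.0                    # launch speed, m/s
angle = math.radians(45)

x, y = 0.0, 0.0
vx, vy = v * math.cos(angle), v * math.sin(angle)
dt = 0.001                  # time step, seconds

while y >= 0.0:             # integrate until it returns to the ground
    x += vx * dt
    y += vy * dt
    vy -= g * dt            # gravity decelerates the vertical velocity

# analytic answer for comparison: v**2 * sin(2*angle) / g, about 40.8 m
print(f"Landing distance: {x:.1f} m")
```

A hurricane forecast or a climate run is this same idea, with vastly more complicated physics and billions of grid points instead of one.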
I heard JPL was working with TACC on Mars rover tech.
We have some work doing data crunching with Jet Propulsion Lab, where we're looking at
orbital insertion for future Mars missions and how you compute that.
It's not classified, but it's considered sensitive code and data that we deal with.
You know, it's export controlled and restricted because in this case, we're looking at Mars,
but there's only so many ways to insert a big piece of metal into the atmosphere from
orbit.
And you want to restrict how well distributed that information is.
So we have to protect a lot of the data around that stuff.
But I know we're doing some, you know, what are good orbits and good landing vectors for
getting into the atmosphere with JPL.
What are you most looking forward to as you stare down the future path of technology advances,
specifically in supercomputing?
For us, it's not just the technology, it's the science we're going to do with it.
I mean, with the merger of sort of AI into the scientific workflow and the potential for that, I think,
you know, we're going to do problems that we haven't even thought of yet at scales that we
can't dream of. But for me, the next big step is building on this LCCF machine that comes after
Frontera, which will involve us building a bigger data center and everything that goes with it and
a bigger training program and picking that next set of technologies that are going to work. So that's going to be an awfully daunting process,
but an exciting one as well. Well said, Dan. What an excellent way to put a cap on the meat
of this episode. Hear, hear. Where can our listeners find you on the internet? We're on
Twitter and Facebook and all the usual places you would expect to find us. If you go to the website
and you want to opt into one of our mailing lists, you can get our annual magazine, Texas Scale.
It's like Exascale,
but everything's a little bigger in Texas.
So that's cute.
Some of our coverage of the science stories
and the new machines that are coming down the pipe.
We will be sure to link some additional information
in the show notes for this podcast.
Thank you so much, Dan.
Thank you for being an undercover superhero.
Thanks very much, Jolie and Ernest. Appreciate the time and the coverage of this.
Thanks again for listening to the latest episode of the Big Compute Podcast.
Believe me when I tell you that I love recording these for our listeners, and I know Jolie does
too. This seriously is such a treat for me. I love learning about all of these interesting
subjects and all of this technology that I hadn't been exposed to. And I love sharing it with all of you. And if anyone out there wants
to help us get the word out about the Big Compute podcast, you can leave us a five-star review and
follow us on Apple Podcasts or your favorite... Or Google Podcasts. I was about to say it.
Just making sure. Or your podcatcher of choice.
Yes.
And if you have any ideas of what we should talk about on our next podcast episodes, feel free to send those in at bigcompute.org, where you can also find a lot of great information.
Anything to get you down the rabbit hole of the awesomeness of supercomputing.
Until next time.
Adios.