Big Compute - Rethinking HPC in Academia

Episode Date: March 4, 2019

Gabriel Broner hosts Marek Michalewicz, Director of ICM, the HPC center at the University of Warsaw, to discuss Rethinking HPC in Academia. With the advent of HPC cloud platforms, ...we may give every user access to systems on-premise, across multiple centers and in the cloud, to enable new research and accelerate time to research.

Transcript
Starting point is 00:00:00 Hello, I am Gabriel Broner, and this is the Big Compute Podcast. Today's topic is rethinking HPC in academia. Traditionally, HPC in academia has been on-premise. A system acquired by the institution is kept for five years. Utilization is high, so user jobs wait in queues, and relative performance declines over time. With the advent of HPC cloud platforms, we wonder if it's time to rethink HPC for academia. Instead of one institution, one system, we enable access to systems on-premise, across multiple centers, and in the cloud.
Starting point is 00:00:48 At the same level of spending, we may be able to accelerate time to science and enable new areas of research by having access to multiple architectures, to the latest technologies, and by reducing time waiting in queues. To discuss HPC in academia, our guest today is Marek Michalewicz. Marek is the director of ICM, the High Performance Computing Center at the University of Warsaw. With many years in the industry, Marek has headed academic and research HPC centers, including the A*STAR Computational Resource Centre in Singapore. Welcome, Marek, to the Big Compute Podcast. Good morning, Gabriel.
Starting point is 00:01:34 It's morning in Warsaw. It's nighttime at your place. Great to talk to you. I'm very happy to answer your questions and to share some of my thinking on HPC in the cloud or the progression of academic HPC computing. Marek, it's fantastic to have you here with your experience, so I look forward to having this conversation with you. Maybe we can start from the beginning. What are your views on HPC in academia today?
Starting point is 00:02:06 And what are the challenges we face? And maybe since you're in Warsaw on a cold morning, we can start with the University of Warsaw and how you see things from there. Well, I think there are numerous challenges that are typical of the academic environment in Poland or in other countries. The two biggest challenges are these: on one hand, there's an insatiable appetite and requirement for computing and for storage, sometimes for storage separate from computing. And on the other hand, the funding cycle is not predictable. So it's very difficult to make long-term plans for expansion, for maintaining the quality of service,
Starting point is 00:03:00 when very often the sources of funding are ill-defined and varied. Sometimes they come from within the university, sometimes from various ministries and other bodies, sometimes from external grants, in our case, European Union grants. They are substantial, of course, and they help us meet the needs that we are charged to satisfy, but the planning is a very big challenge.
Starting point is 00:03:38 Of course, in our industry, it's very difficult to predict expansion. So, for example, around me I see an incredible explosion of interest in quantum computing. A very new thing. It's fueled by curiosity and excitement among young people. And, of course, there is no readily available hardware or resources to let young people explore and experiment with this. It would be absolutely fantastic if a great variety of modalities, of different possible computing platforms, were available for young people, for researchers, for academics.
Starting point is 00:04:29 This is great to hear. So, on the one hand, there are challenges like funding: when are you going to get money? On the other hand, there are possibilities like the interest in quantum computing, and the question of how you are going to access it. So, challenges and opportunities at the same time. Can you tell me a bit more about the funding situation? Is it typical for universities to get funding cycles, or is it grants that different research projects receive? How are you going to receive money, or is it very unpredictable for you? It's not entirely unpredictable. In the case of a center like mine, ICM, we are funded, and there are five academic high performance computing centers in Poland. They are all in an equal league; each one has got a
Starting point is 00:05:19 really brand new data center and a computing engine of the order of a petaflop. So the funding is provided through a three-year funding cycle from the Ministry of Science and Higher Education. So it goes directly from the ministry. However, over the last three years this funding was not sufficient. We experienced, we actually enjoyed, a very fast development cycle. About four or five years ago Poland was extremely fortunate to receive larger funding from the European Union that allowed us to expand in an unprecedented way. But with very fast growth comes a period of unmet ongoing needs.
Starting point is 00:06:15 So operationally, all five centers suffered. And they, of course, deal with that problem in different ways. One center is extremely renowned and active in networking, so of course they can manage somehow with different sources of income, but others were not so fortunate. So it was a difficult three-year period. Right now we are on the verge of a new three-year funding period. We'll see how it will go, but I believe that this is typical not only of Poland, but of other countries too.
Starting point is 00:06:58 Of course, there are interesting mechanisms of funding at the European Union level, and right now we are on the verge of embarking on the EuroHPC program. But of course EuroHPC, I believe, will be more appropriate for extremely seasoned users. And I always care about the group of newcomers. I think we have to really think always in an academic environment
Starting point is 00:07:43 of that specific group of people who have not tried HPC yet. And for that group, there are special needs. They don't necessarily need scale, but they definitely have to have a feel for how it is to use resources that can expand practically without limit. That's great to hear. Yeah, I think when you talk about the newcomers, it's fantastic that you're thinking about them, the many people coming into HPC without the 20-plus years of experience.
Starting point is 00:08:21 And I assume those are the people you want to be bringing along, growing the community, educating. Are you seeing growth in the people who have not used HPC before coming to HPC today? Look, I see it because I want to see it, and actually I'm also actively looking for people who have not experienced HPC. And that's one of the reasons we have started training students for the Student Cluster Competition, which is a fantastic kind of event run at SC, at ISC in Germany, and also in China by the ASC organization. I started the student team in Singapore. They are extremely successful, to the extent that after a few years,
Starting point is 00:09:10 they won the competition in America at SC last year. That was two years ago. It was absolutely brilliant. And when I moved to Warsaw, I started the Warsaw team. And this year, they will go to their sixth final of the competition. So now I have two teams that I started competing against each other at various competitions. Great to see.
Starting point is 00:09:32 And those people come to HPC not knowing anything. Of course, they are brilliant. They're very, very talented young people. And Gabriel, I tell you, it's great fun to see people who come. They are curious. They have bright minds. And suddenly they get excited about HPC. But that's, you know, a very select group.
Starting point is 00:09:54 You know, they become almost professional after a few years of training. But what I'm thinking is that right now, we have of the order of a thousand registered users on our HPC systems and some semi-grid systems. And within the university, we have of the order of 50,000 students. It's the largest university in Poland. And I don't see any reason why almost anybody at the university should
Starting point is 00:10:27 not have access to expandable resources. When I say expandable, I mean resources that can go from one core to thousands of cores, or tens of thousands of cores or more, depending on the needs. Nowadays almost everybody has some computing needs, and I think the university could provide that, not necessarily through its own resources, as we know, of course, that lots of academics and students already use commercial clouds. Yeah, so you'd like to see not a thousand users but 50,000 users using high performance computing.
Starting point is 00:11:06 This is a fantastic goal, a fantastic vision. Gabriel, I would like to completely obliterate, remove the obstacles to access to computing resources. And I mean computing at any scale. Of course the requirement has to be justified, but the entry should not be difficult. Yeah, this idea of democratizing access for everyone is great to hear. I like your vision. I'm rediscovering the world by listening to you. Congratulations to you and your students on the progress they're making, and I look forward to getting to the 50,000 students. Let's make it happen. Let me ask you: you said a bit earlier, there are
Starting point is 00:11:58 five centers in Poland. Can you tell us a bit more? We're not always in Poland and we don't know. Are these five centers different in terms of specialization? Are they similar? Does each of them focus on a different area, or how does it work? They are different. First of all, a slight correction. There are five centers that are directly funded through the Ministry of Science and Higher Education. There is a sixth center which is slightly off-site but of course belongs to the same category. That's the center of the National Nuclear Research Institute. And of course that center is mostly focused on nuclear research. They run a nuclear reactor, they are very closely connected with very large-scale European experiments, so that is a very focused facility. Of course there is very good equipment there, the facility is great, it's not very far from Warsaw,
Starting point is 00:13:01 and we collaborate with them. There are five centers, and they more or less began their operation at about the same time; it all started about 25 years ago. My center is in Warsaw; it's part of the University of Warsaw, and we traditionally were focused on very large-scale computations, more of a capability type. And traditionally, we had very interesting computing equipment. There was a Cell computer at one stage, Blue Gene/P, Blue Gene/Q, a POWER7 machine, water cooled. So some really, really interesting and more exotic types of computers. Among the other centers, the two major ones are Cyfronet in Krakow,
Starting point is 00:14:00 related to the Academy of Mining and Metallurgy, a very highly regarded school. They actually operate the largest supercomputing equipment right now, of the order of two and a half, close to three petaflops. Incidentally, it's an HP system, liquid cooled, of the Apollo kind. And we also have Poznan Supercomputing and Networking Center.
Starting point is 00:14:34 That center is our leader in networking. So, of course, they have very substantial computing capacity, roughly equal to ours. They also focus more on capacity, and it's a cluster system, fat-tree connected. For example, our system is a Cray XC40, so it's Aries. Again, different from most other centers in Poland. So in terms of equipment, we differentiate slightly. Going back to Poznań: one of the most brilliant things they have done, and they were the leaders of this initiative in Poland, is that about 10 years or more ago they started a project called PIONIER, and they have built an academic optical network that is fully owned by this organization, PIONIER. It's shared by all the metropolitan centers and HPC centers. So now in Poland, we have 7,500 kilometers of optical fiber,
Starting point is 00:15:39 and we don't have to pay commercial carriers for that. It also allows us to do all sorts of experiments and tests, and in the sense of networking, we really are world leaders. And this is of course due predominantly to the work of Poznan Supercomputing and Networking Center, with whom of course ICM collaborates very closely. So, Marek, you're very familiar with these different centers and the capabilities they have. So, I think it's a great segue into the question I also wanted to ask you, which is, how do
Starting point is 00:16:17 you view the possibilities, assuming all these centers become a pool for us to use? So if I'm a user, I'm not just a user at the Warsaw center, but I'm a user of this community. And when I submit jobs, the jobs are going to run on the best place to run a job. So we move from one system, one center, to now I have access to all the systems, I have access to a variety of architectures, I have access to systems that may have a shorter waiting queue than other systems. I as a user benefit from that variety. If I'm an academic, a researcher, I get the advantage of the multiple architectures
Starting point is 00:17:01 and the new architectures, or even the reduced waiting queue. How do you see that as a possibility? What are your views on that? Gabriel, actually, we are going directly into cloud, HPC cloud solutions. But the interesting thing is that sometimes it's very difficult to be original. And in a sense, here, we are not original. What we are talking about here is, in a certain sense,
Starting point is 00:17:31 a refresh of the technology, an expansion of something that has already existed. Because one of the very, very neat things that they have introduced in Poland is a solution called PL-Grid. And actually, PL-Grid was driven by another center, by the Cyfronet center, and that's their huge achievement.
Starting point is 00:17:53 And of course, we are part of it. So basically, we did have it, but it was not as scalable and not as flexible as what you can achieve with cloud solutions, because PL-Grid was in a certain sense something like the XSEDE program in America, or, you know, a predecessor of XSEDE. But of course it's rather rigid. It's basically a grid system with on solutions and some users, a large group of users in Poland,
Starting point is 00:18:27 academic users, were already using resources from various centers. And we have one or two very substantial clusters that were exclusively reserved for this PL-Grid work. And they were actually acquired through special funding from this very, very large project. This project has been going on for three years, or rather for three rounds: there were three stages of the PL-Grid project. But what I see is that this move to much more sophisticated and flexible technology
Starting point is 00:19:08 that is allowed now through cloud, HPC cloud solutions and provisioning, is a very natural progression, a very natural step. I can't see a way out of it. Okay, that sounds very interesting because we could say that grid has existed for some time. It's not massively
Starting point is 00:19:33 adopted today. The concept has existed. You're seeing positives now with HPC cloud platforms. Are there any elements of that that you think make it more attractive than the way we used to approach grid in the last 10 years or something like that? Are you seeing some aspects of that that you particularly like? I'm curious.
Starting point is 00:19:56 Well, I always saw advantages of grid computing. And actually, when I look back at how all of those things developed, they're a very natural progression. I remember back in the 90s when I was in Canberra, one of my pals, Russ Standish at ANU, was using cycle harvesting from all sorts of resources at the university, and then we had the Condor solution and similar solutions. So there were solutions that were addressing the problem of wasted resources, an extremely huge waste of computing resources. Then we had this progression to things like grid.
Starting point is 00:20:46 You see it in places like Poland with the grid solution. But they are very rigid. The configuration of each system
Starting point is 00:21:01 is still different. So we had to wait for a time when certain technical and business solutions were found. And I'll give you examples. The first thing in the context of HPC is provisioning of topology and network. That was not possible; you couldn't do it in an easy way. Nowadays you can do it. Then, in order to merge those things,
Starting point is 00:21:30 you have to have interplay between cloud provisioning and batch provisioning, queuing systems and schedulers. If you can merge them, you can actually mix interactive and batch processing. And
Starting point is 00:21:49 of course grid was also batch processing. So it had these features of a typical HPC environment and was rather rigid, whereas with cloud we can actually treat supercomputing as your
Starting point is 00:22:04 desktop, and moving from batch to interactive, I think, makes a huge difference. Then there are other very important things. Nowadays, with the development of containers, Docker and Singularity and whatnot, you can actually package not only your program, not only your problem, but the whole environment. And then you touch on the two most important things, which I think are also emerging, although of course they have been well recognized forever: correctness and reproducibility.
Starting point is 00:22:43 Of course, you need to run it on a system that is stable, but you also have to understand that your results are correct and that you can repeat them. And you can repeat them not only on different pieces of hardware, but also at different points in time. So after a few years, you can go back to your computations and have similar results. The next aspect that was causing a bit of difficulty was data movement.
Starting point is 00:23:17 But nowadays this obstacle is also being removed. You have various concepts like in-memory computing, and of course you also have huge pipelines and you can move data. And also, if you make the computing resource ubiquitous, then it doesn't really matter where you compute. You actually should move your compute part to where your data
Starting point is 00:23:50 is. And if there is a proliferation, or there's a widely accepted cloud solution, it doesn't really matter where you compute. Then again, that will lead to a reduction of costs. So all those things lead me to think that there's absolutely
Starting point is 00:24:07 no way out of these things. So people have been talking for a long time about utility, computing as a utility. And you see it's happening. And I have to really congratulate the people who started UberCloud, for example. UberCloud has just got some accolades and was recognized. And you see other things in industry. For example, the fact that IBM has acquired Red Hat is interpreted by many analysts as a move to cloud and provisioning of that sort. And then you will see convergence, and you already see it: commercial
Starting point is 00:24:52 enterprise kind of cloud providers are slowly moving and encroaching on HPC territory. So suddenly you can have Crays available as cloud instances, or FPGAs or GPU-enabled hardware as cloud hardware. And that's actually perfect. That's exactly what should be happening, the merging of the worlds. I like your thinking. I like how you go from, you know, we were always trying to get there, like grid was trying to get there, but maybe it wasn't as flexible. With the advent of cloud, now we have more flexibility
Starting point is 00:25:41 to enable this pool of systems to be together. And in addition, you have this variety of architectures you can take advantage of. So I think we always wanted to do this, but it's getting much, much closer with the cloud platforms being developed today. So I think your vision is one that maybe you see happening, and you've been seeing it happening for a while. It's materializing slowly. The question I would ask you now is
Starting point is 00:26:09 what challenges do you see in terms of this transformation for HPC in academia? Is everybody with you, or are you fighting this battle in a bit of a lonely way? Not lonely. There is a battle for sure, and there will be a battle for sure.
Starting point is 00:26:27 There are various kinds of obstacles. One is psychology, the human factor. People are actually very afraid of losing control and losing their own territory. And I've seen it:
Starting point is 00:26:43 when I worked in Singapore, my colleagues, excellent technical people, were so skeptical and critical about cloud. Of course, the time was different. That was about 10 years ago. And surely certain pieces of the solution were missing, like provisioning of the interconnect, InfiniBand, for example. But now those technical obstacles are being conquered. They are not a problem anymore. But people are always, constantly,
Starting point is 00:27:16 you know, afraid of losing their special position or their expertise. But I never worry about it, because whatever way you provision resources, you still need to have expertise to guide users, and there will be hordes of new users. So I think it would be a boom, you know; it would be an excellent time for us to offer our expertise. It's not threatening at all for me. Of course, there are other things, like the way of thinking. People who live in the enterprise or commodity type of
Starting point is 00:27:59 environment don't necessarily fully understand, or are not attuned to, the specific needs of HPC and supercomputing. And so when you merge those two worlds, people might not fully understand each other. Of course, there are very smart people on both sides, but there will be some differences. And I can give you an example. In many places, you have certain resources available as cloud at various universities.
Starting point is 00:28:37 But usually those resources are managed, administered by a different group of people who come from this commodity-world way of thinking. And when we start talking, you know, they are also afraid that we would be encroaching on their territory, because basically it is about removing the barriers. But for me, those barriers are for users, not for operators. So, of course, the difficulties are on the side of those who manage, those who get funding, those who decide what kind of resources should be acquired or merged or made accessible. And so the greatest factor is always human psychology.
Starting point is 00:29:34 Not technology, not hardware, not space, nothing else. Of course, money too. Yeah, of course. There's always money in the equation. Marek, I'm left with this great impression of your vision, which is that we are moving to this new world where cloud technology and cloud platforms are going to enable this merging, this ability to use multiple systems across academic institutions, to use multiple architectures as they become available, whatever they are. We always tried to do this. In previous times, we tried with grid, but it wasn't as flexible.
Starting point is 00:30:12 It's happening now. And maybe because it's happening, it represents change. And some people are seeing this as encroaching on their territory. But that will happen, and people will find new ways of working, and we will all benefit from the changes coming along. So it's great to hear your thinking in the process, and I try to learn as I hear this. Before we close, I'd like to ask: is there anything you'd like to add to this? Well, I think just two extra things that
Starting point is 00:30:45 I was thinking about when collecting my thoughts before our discussion. There are two things. One is this human-in-the-loop thing that really interests me, and things
Starting point is 00:31:01 that are related, that is, interactive programming and visualization, and also the ability to test arbitrary kinds of hardware that might be very rare or very expensive or exotic, with cloud and with globally distributed resources. Incidentally, we worked on such globally distributed resources by building the InfiniCortex project for three years, from 2014 to 2016, with about 40 to 50 different organizations
Starting point is 00:31:40 in the world. So that was moving toward basically breaking down all the barriers of distance, country borders, and the divisions between continents. If you merge it all, you can build one huge, humongous resource that can be chopped up in different ways. Sometimes it could be used as one single humongous computer of unprecedented scale.
Starting point is 00:32:10 On the other hand, it can be used by a great number of people for smaller tasks, because a very, very interesting scientific problem, an academic problem, doesn't have to be huge in size. So I would like to very much thank our guest, Marek Michalewicz, director of ICM, the High Performance Computing Center at the University of Warsaw, for sharing his experience and his vision
Starting point is 00:32:38 to help us understand the future of HPC in academia. Until next time, I'm Gabriel Broner, and this was the Big Compute Podcast. Thank you. Thanks. Thank you.
