PurePerformance - How CERN analyzed 1 PetaByte per second using K8s with Ricardo Rocha
Episode Date: March 3, 2025

One PetaByte is the equivalent of 11000 4K movies, and CERN's Large Hadron Collider (LHC) generates this every single second. Only a fraction of this data (~1 GB/s) is stored and analyzed using a multicluster batch job dispatcher with Kueue running on Kubernetes. In this episode we have Ricardo Rocha, Platform Engineering Lead at CERN and CNCF Advocate, explaining why after 20 years at CERN he is still excited about the work he and his colleagues are doing. To kick things off we learn about the impact that the CNCF has on the scientific community, and how to best balance an implementation of that scale between "ease of use" and "optimized for throughput". Tune in and learn about custom hardware built 20 years ago and how the advent of the latest chip generation has impacted the evolution of data scientists around the globe.

Links we discussed
Ricardo's LinkedIn: https://www.linkedin.com/in/ricardo-rocha-739aa718/
KubeCon SLC Keynote: https://www.youtube.com/watch?v=xMmskWIlktA&list=PLj6h78yzYM2Pw4mRw4S-1p_xLARMqPkA7&index=5
Kueue CNCF Project: https://kubernetes.io/blog/2022/10/04/introducing-kueue/
Transcript
It's time for Pure Performance!
Get your stopwatches ready! It's time for Pure Performance with Andy Grabner and Brian Wilson.
Welcome everyone to another Pure Performance episode. This is one of the rare moments when you tune in and you don't hear the voice from Brian
Wilson but you hear the voice of Andy Grabner, which means Brian is not there with us today.
But nevertheless, even though I miss him dearly, I have a great guest today that I was fortunate enough to
bump into at the recent KubeCon in Salt Lake City. And without further ado,
Ricardo Rocha, I hope I kind of got the name right, the pronunciation. Welcome so
much to the show and thanks for having time with us.
Yeah, it's a pleasure. Thanks for the invitation.
Hey Ricardo, when I saw you on stage at KubeCon, you had a talk, a keynote called
Multicluster Batch Jobs Dispatching with Kueue. I was fascinated.
First of all, the topic is really cool, but then you also did a live demo on stage.
And that's always very nerve-racking if you know you have like 10,000 people in
the room.
But you did it folks, if you want to watch the recording, the link to the YouTube video
is part of the description of the podcast.
Before Ricardo, I want to go into some of the Kubernetes topics, the performance topics
and the solutions that you've built for CERN.
I would like to learn a little bit more about you.
I looked in your LinkedIn profile. You spent overall about 20 years at CERN, is this right?
This is right. Yeah, it's been around 20 years. I came as a student originally at CERN, so it's been quite a ride. I had the pleasure to change roles quite a bit, but yeah, it's been a while now.
And so I think you had two times almost 10 years.
I mean, you're still at CERN.
What did you do in the middle?
Yeah, so I came to CERN and at some point I thought I would like to experiment a bit
with something different, especially in the industry. And I was kind of fascinated since I was a child with remote places.
And so I moved to New Zealand.
And while I was there, I helped build the first public cloud provider in New Zealand.
That was really a big challenge, a great opportunity and a huge pleasure actually.
I had a great time there.
But eventually I came back to stay closer to my family as well.
Yeah.
Well, I think building a public cloud provider is not something that many can put on their
resume.
So that's quite phenomenal.
What type of service is this? I guess this was about 10-15 years back?
Yeah. So it was early days and it was open source based. So the company is called Catalyst
New Zealand and they wanted to experiment with something different. So they gave us
some freedom to experiment with launching such a service.
It was the first API-based cloud in New Zealand. It was based on open source tools. At the time,
it was OpenStack. We launched it and it was quite popular. It's still there. So if you're
living in New Zealand, it's likely that if you have some cloud services, you're
running some stuff there.
So I still keep in touch and yeah, really an enormous pleasure to work there.
Yeah, cool.
I know we have some listeners from that part of the world, and I also have some friends there. I will make sure that they listen to this once it airs.
I guess open source, that's then also an interesting segue, because besides working for CERN, you're also very active in the CNCF, the Cloud Native Computing Foundation.
Can you tell me a little bit more on what brought you into the CNCF and what you do
there right now?
Yeah, so currently I have a couple of roles in the CNCF. This came out of the work
on containers and Kubernetes. So when I first proposed this internally, I saw that we would need,
it's like, it takes a village. So we would need to engage with the community,
get really our use cases seen,
and collaborate as much as possible.
So I got CERN to join the CNCF, the Cloud Native Computing Foundation, as an end user member.
This introduced me to the community.
I also had been invited for a couple of talks at KubeCons right at the start.
We had some cool use cases, so people are always willing to hear about them.
And in the end, this gradually grew
into becoming more involved, especially
in the technical oversight committee, which
I'm still part of.
And also, I helped build the new end user technical advisory
board, which I'm also part of.
And this is really where we stand.
We are very engaged, but we are there representing end users
and helping as much as we can.
And maybe as I have you on the line on this topic,
can you explain, because some people are confused,
what does end user mean in the CNCF?
Can you briefly explain to me what does qualify,
what is an end user qualified for,
or what does qualify an end user? That's the right way to phrase it. Absolutely, that's a very good question. So this is also
something we struggle with in the technical oversight committee because, well, the main role of the
technical oversight committee is to oversee the maturity of the projects. So when projects
join the CNCF, they are reviewed and when they apply to graduate between the different
maturity levels, which is sandbox, incubation, graduation, they also have to go through some
due diligence, but in particular they have to go through what we used to call end user interviews, and we now call adopter interviews. And this is where end users have a huge impact in the community, which is we will interview
not the vendors, not the project maintainers, but we'll interview the end users, the actual
people using the project and they will explain to us in which level of usage they are.
Is it production, pre-production, just experimentation? And what's their experience with interacting
with the project and the community
around the project?
And this is really, I would say, this is also
kind of my view of the whole community.
But I think this is the core of the CNCF
and the Cloud Native community is the end users.
We are the ones that will eventually
say if the projects are successful or not successful
by adopting them.
So an end user in the CNCF is someone
that is an adopter of the project,
but does not have anything to sell
or doesn't necessarily have a huge engagement in the project,
apart from contributing feedback and testing and improving the project, just by providing feedback as a user.
Yeah, cool. Yeah, I remember that years ago we launched the open source CNCF project Keptn and brought it from Sandbox to Incubation. We also launched OpenFeature a couple of years ago. And I think, as you said, as you mature a project in the CNCF you need to prove not only that the project is stable but that it's adopted. And I think that's exactly what I find great. And also, when you look at the
CNCF landscape there's obviously a lot of sandbox projects,
but once you make it over the next hurdle,
you really know there is a community behind it,
adopters behind it, and it's actively being developed.
So that's great to know.
Exactly, so there's the part of helping the projects
in this journey towards maturity, towards graduation
and helping them build a community and have all the best practices that are needed
for a sustainable project.
And then there's the end user part, which is we provide some sort of certification or
some sort of like badge towards the maturity of a project, which simplifies the work of
an end user when they're choosing the best project for a task.
They get some assurance from this kind of diligence that we do in the TOC towards the
sustainability of projects.
And this is all together, like we work together.
It's not like challenging projects,
it's really helping them out,
building this kind of maturity.
Well, thank you so much for giving this brief explanation.
I wanted to bring it up because I know
the question often comes up,
what is an end user?
And thanks for the detailed explanation.
Now I wanna go back to your line of work and also what brought you to speak at KubeCon last year. What I learned, and I didn't know, is that you are analyzing all of this data,
at least the way I got it from the talk, using Kubernetes as the core platform.
And then you're using Kueue as a way to schedule the jobs because obviously GPUs are very expensive. And then you had a very nice talk where you are explaining how you can best use Kueue to then
schedule all those jobs so that they most efficiently
leverage the underlying resources that are available.
Now before I go into those details, I first would like to ask a couple of questions about how much data we are talking about. If you can give any numbers: what data gets collected on a regular basis?
Yeah, this is one of my favorite topics.
I'm glad to go through it.
So, as you mentioned, the main experiment right now at CERN
is called the Large Hadron Collider, which is a large particle accelerator, the biggest scientific experiment ever built, actually: a 27 kilometer perimeter particle accelerator which is 100 meters underground. And what we do is we accelerate protons to very close to the speed of light
and we make them collide in these special places where we've built very large detectors that act
as sort of like gigantic cameras that will look into what happened in these collisions. Of course these are not like traditional cameras; there are several layers in these detectors.
We collect a lot of data and we try then to store and analyze it. But we actually collect something like one petabyte of data per second on each of these experiments, which is not something we can... Exactly, one petabyte of data per second, okay? Yeah. So this is not something we can store and analyze with current technology. So what we do
is we have these filters that are very close to the detectors. And these filters will filter
the majority of the data very quickly, on the nanosecond, and then we will actually store something like 10 gigabytes per second per experiment that then we have to re-process and analyze. Still this comes to around 100 petabytes
of data every year which we need to store and analyze. So it's a lot of storage, it's a lot of computing capacity to handle all of this, but the main challenge
we have at CERN, and it's not new, it's not new with Kubernetes, is that we always have
to do more with the same budget. So what we're talking about is the experiments will always
push more. So for example, we are doing an upgrade in a couple of years
that's called High Luminosity LHC.
And what this is translating to is 10 times more data.
But we have to analyze, store and analyze
10 times more data with the same budget.
And this is why we are constantly searching
for new technologies, new paradigms
that will allow us to do a lot more with the same resources.
And this is where Kubernetes, Kueue, and all these tools, GPUs, come into the picture.
This is what we are looking into to be able to handle these levels of data in just a couple of years.
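To put those numbers in perspective, here is a quick back-of-the-envelope sketch in Python, using only the figures Ricardo mentions (roughly 1 PB/s produced per experiment, roughly 10 GB/s kept after the online filters, and around 100 PB stored per year). The exact rates vary per experiment and per run, so treat this as an illustration rather than official CERN accounting.

```python
# Rough, illustrative arithmetic based only on the figures mentioned in the episode.
PB = 10**15  # petabyte in bytes (decimal)
GB = 10**9   # gigabyte in bytes (decimal)

raw_rate = 1 * PB      # ~1 PB/s produced per experiment at the detectors
stored_rate = 10 * GB  # ~10 GB/s per experiment kept after the online filters

reduction = raw_rate / stored_rate
print(f"The online filters keep roughly 1 byte in {reduction:,.0f}")
# -> about a 100,000x reduction before anything reaches permanent storage

yearly = 100 * PB      # ~100 PB per year is stored and analyzed offline
print(f"Stored per year: about {yearly / PB:.0f} PB")
```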
Hey, I got maybe a little controversial question first, and then I want to go back
to the way you collect the data.
With every layer in every system that you add, you're obviously always losing throughput or efficiency, right? And with every layer, I mean, if you think about it: we have our physical hardware, we have our virtualized operating system, we have layer over layer.
Is Kubernetes the right choice then? Because you have so many layers
and you need to squeeze out the most out of the hardware.
So I think it's, I guess it's always the balance
between convenience and also like efficiency.
So I'm just curious, have you ever thought about implementing it closer to the hardware, rather than having the orchestration layer in the middle?
This is a brilliant question, and I can explain how we do things today and what we are trying to do, and I think it will answer at least part of your question. So up to now, this very quick, nanosecond-level filtering that we are doing for one petabyte per second is actually done with custom hardware. These are electronics we built 20 years ago
that will filter this immediately with very low latency.
And then we will have what we call level one filters
or high level filters that come right
after that will be CPU based.
And this is a CPU farm that is very close to the detector that will then still already
be able to reconstruct some of the events and do event selection a bit slower, but still
on the microsecond or millisecond latency.
Now, what is happening is that technology has evolved to a point where we can actually consider replacing
definitely the CPU farm, but also some parts of the custom electronics part
with things like FPGAs, with GPUs, and this is what we are looking at.
And what you mentioned about convenience is exactly what's happening. What was custom hardware, custom software,
custom deployments in these special farms close to detectors is actually
being replaced with very large Kubernetes clusters for the next round
of upgrades. And in there we are planning to move a large fraction of the event selection to GPUs, in some cases to machine learning and model inference, that are served and managed by Kubernetes because this gives us all the convenience, all the monitoring, all the nice things we know about Kubernetes.
And we got to a point in technology where this is possible.
So this is something that is happening now.
For example, one of the very large experiments, called ATLAS, is already changing in the next run,
we call it, coming in a couple of years, to replacing custom electronics in some parts and then their custom deployment for the
farm with one single Kubernetes cluster with around 5,000 nodes. Yeah, thanks for that. So
that's really interesting. So the progression that you're making goes from the custom hardware, level 0, to CPU-based. And as you said, right, the custom hardware was built 20 years ago?
Something like that, a lot of it, yes. There have been upgrades in between, but the initial design for this hardware was around 20 years ago. It came into operation around 15 years ago.
So yeah.
Yeah, obviously, you know, a lot of things have improved
and changed on the hardware side.
So it makes a lot of sense.
Coming back then, even though you explained it a little bit now: the filtering, and what I call the data pipeline, right? For me, a data pipeline is where the data originates until it ends up where you can then actually store and analyze it. You mentioned that at level zero you're filtering from petabytes to gigabytes per second.
How do you make the decision?
Do you really filter?
Do you aggregate? Do you
have a fear of losing relevant data? I mean, that must be a challenging decision to make.
Yeah, so this is really on the side of the physicist and it's very specific to each detector.
So I could bring a physicist to explain this much better. But in a very quick summary: the majority, the large majority, is really noise. We know it's not useful.
The remaining things from the first level filtering are the ones that then we need to
do some sort of very quick event reconstruction to do selection on the interesting events.
But this is already a very small fraction of the original data.
And this is where we can optimize the most. Of course, some physicists would say that if we could store everything, it would be great, because this is where you might find unknown physics, new things. The rest is kind of tuned to things that we know we want to look for.
But still, the large amount is noise.
But the way it works is that we actually tune the knobs, like we turn the knobs so that
we store based on the budget, on the computing and storage we have.
So if we had 10 times more computing, we would probably turn the knobs a little bit to be less strict. But this is
how we do it. So if we manage to come up with new paradigms, new ways of doing computing,
then the physicists will start having ideas of course.
Yeah, I was going to add, so just to complete, so this is what we call traditionally online
computing, which is the stuff that is really close to the detectors and the data coming
out.
Then we store that and that's what we call raw data.
And this is where we start what we usually call offline computing, which is the reconstruction
of the events from the raw data to see what
actually happened there, and then all the different steps to come up with what we call
analysis objects, which is what the physicists are actually looking at.
And this is where it's more this kind of very high throughput computing, with very large batch farms that are at CERN and in many centers around the world, as well as public clouds, HPC farms, everything we can get hold
of we try to use for this kind of offline high throughput computing.
And this is also then, hopefully I got this right, where you're now leveraging projects like Kueue to schedule those jobs and find the right hosts, nodes, resources that have the right specs, so that you can then run your batch jobs in a fast way.
Yeah, so the big advantage or thing that we look into when we look at Kubernetes is that it became kind of a commodity, something that everyone is sort of exposing.
And this simplifies the access to resources quite a bit.
So around 20 years ago, and this is actually my first project when I came to CERN as a student, we built this grid computing infrastructure, which is a sort of middleware that abstracts different infrastructures around the world, with some interfaces that were sort of common. That would allow us to make use of around 200 centers around the world and expand our computing
capacity from something like 400,000 or half a million CPUs to a million CPUs. This is what we
have been relying on for the last 20 years. Now, all this middleware was built on the pre-cloud era.
We had to actually write all the software ourselves because no one or very few people had big data 20 years ago.
But as it became sort of a commodity,
we have tools like Kubernetes and all the ecosystem around it.
So what we do, what we try to do is simplify this stack
or even replace it completely in some cases
with something that is more standard,
like exposing a Kubernetes API.
So a lot of these sites now have replaced all their stack
with a Kubernetes endpoint,
which means we integrated that into our infrastructure.
But it also means that integrating a hyperscaler, for example, as long as we have a Kubernetes endpoint, becomes much easier; we can integrate those resources into our existing tools.
The motivation is really to get access to things we don't necessarily have on premises.
And the demo I did was especially focusing on GPUs, things like AI ML,
where we don't have necessarily a lot of resources on premises right now.
But we would love to have access to more GPUs, to
specialized accelerators like TPUs or other sorts of specialized hardware
accelerators. And this is where we are exploring the flexibility that Kubernetes
and tools like Kueue offer. For many years we struggled a bit with Kubernetes because it was really designed initially for IT services. So we advocated a lot for scientific computing: the need for batch primitives, things like advanced scheduling, co-scheduling, queuing, priorities. It took a while, but with GenAI, suddenly everyone wants it. So we got a huge investment and we are really benefiting from that.
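To make that a bit more concrete, here is a minimal sketch of what submitting a batch job to a Kueue-managed queue can look like, using the Kubernetes Python client. The namespace, queue name, image, and resource requests are hypothetical placeholders, not CERN's actual setup; the essential parts are the kueue.x-k8s.io/queue-name label and the suspend flag, which let Kueue decide when and where to admit the job based on available quota.

```python
from kubernetes import client, config

config.load_kube_config()  # assumes kubectl access to a cluster with Kueue installed

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(
        name="sample-analysis",
        # Hypothetical LocalQueue name; Kueue picks the job up via this label.
        labels={"kueue.x-k8s.io/queue-name": "gpu-queue"},
    ),
    spec=client.V1JobSpec(
        suspend=True,  # created suspended; Kueue unsuspends it once admitted
        completions=1,
        parallelism=1,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="worker",
                        image="registry.example.com/analysis:latest",  # placeholder image
                        command=["python", "run_analysis.py"],          # placeholder command
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "4", "memory": "8Gi"},
                            limits={"nvidia.com/gpu": "1"},  # one GPU per worker
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

This assumes a ClusterQueue with matching quota and a LocalQueue named gpu-queue already exist; without them, the job simply stays suspended until Kueue can admit it.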
How do you solve this? So now with Kueue, you're distributing your workloads, right? You're distributing your jobs, but how can you make sure that these jobs have the data local to
them?
Because I think that's the big problem, right?
How do you get the right data?
Are you already, when you're collecting the data, are you then already distributing it
to all these different data centers?
Because in the end, data residency is a big challenge, because otherwise you need to constantly pull the data from some remote location in the world, I guess.
Yeah, so this is one of the benefits
we have from having built this computing grid infrastructure is that we already
had a distributed infrastructure of something like 200 different centers
where we had to deal with these issues. So we built the services that allow us to distribute the data to where the computing capacity is
or might be. So we have these data management systems that allow us to have some sort of subscription system for datasets, where we can define subscriptions for the type of data that should go to the different computing centers, and we do this in advance.
This simplifies the workload scheduling because in many cases we already have the data where
we want to send workloads.
In some cases we don't and you will have to pull and wait a bit. We do have also an advantage, which is we have a pretty extensive network infrastructure
with 100 gigabits, or in some cases,
multi-hundred gigabit connections
between these centers.
And we've extended to public cloud providers as well.
So we have dedicated links to those regions
that we depend on.
So you will feel the latency, but in terms of network capacity and so on, we are actually quite good.
Yeah, yeah, it sounds like you won't have the problem of starting a Netflix movie and having to wait a long time for a data set. It should be okay. Well, do you have SLOs somehow defined on how fast your team needs to kind of provide
this compute to come up with results?
Is there any type of SLO concept like service level objective or whatever KPIs that you
have to say, hey, we need to ensure that this type of data, this amount of data is processed in this amount of time,
because otherwise we will not be able to analyze all this data that comes up in a year.
Yeah, yes, but they are very predictable. So all this is well known in advance. Each experiment has what they call the technical design report, which kind of predicts what
will be the data rates for the next run, which usually will last a couple of years.
So we know very well what data will be produced and the computing capacity required.
Of course, we have some safeguards, some buffers in case of disasters, so that we don't lose the data.
So there are some buffers close to the experiments that
will basically cover for some major issues in the data
centers.
But they are pretty large.
And so we don't really have to deal with this very detailed, up-to-the-minute or up-to-the-second availability.
We know for the large majority, we know how much capacity we need.
Where things get interesting is more for the interactive analysis, for the kind of chaotic
analysis from the physicists, and especially now that everyone got interested or more interested
even in machine learning. There it's a little bit more chaotic and this is where we try
to use as much as possible on-demand capacity and opportunistic capacity because we don't want to procure for peak
capacity because that would be too expensive.
We want to procure for what we know is the nominal capacity we need, and that is predictable,
and then complement that with opportunistic resources and this is where this kind of more
flexible infrastructures come to play.
And then on this ad hoc analysis where data scientists come in, I guess this is also where the convenience of Kubernetes comes in again, because I guess they can write their, and again, I'm not at all informed, so I have no clue how they write their algorithms, but they basically package it up in a container and then they send it over into your queue, and then you take it and deploy it, and then in the end give them the result back.
Is that kind of like the high level view of how this works?
So this has been the case even pre-Kubernetes. Physicists are quite IT knowledgeable. They got used to working with very low-level interfaces, submitting to HPC centers, which in particular has never been super easy. So they know what they're doing. But basically that's it: they wrap their job in some sort of definition that is submitted. The interesting part here is that actually, for the software distribution,
traditionally we actually built our own internal systems of hierarchical caches for efficient software distribution.
And we have a system called CernVM-FS, or CVMFS, which is basically a hierarchy of caches.
And this is the way that we still distribute software today.
It is a read-only file system.
So in some sense, it's kind of similar to containerization,
but we don't have this kind of notion with that system of a single unit of a container
that can be easily shared and packaged.
This is where containers brought some benefits to our ecosystem as well: this notion of reproducibility became clearer. So we are actually in a kind of complementary world where people use containers,
but they will grab some additional software from the default releases from the experiments from this CVMFS system.
So it's a very interesting world, but there are cases where the full software is packaged in containers. And this
is a challenge because these containers are not necessarily how you would expect them
to be. So we have containers with several tens of gigabytes. And yeah, this has been
other parts of the work we've been doing to improve things in the ecosystem for these
workloads.
So we talked about kind of your early days. You started 20 years ago, or you've had 20 years, you know, at CERN in total.
If you look a little bit into the future, if you think about KubeCon in let's say 2030
in five years from now, what do you think will
be something you will be talking about?
What will have changed?
What is your goal of having things optimized?
Yeah, so that's a question we keep asking ourselves also in the technical oversight
committee.
This is always something that we get asked about.
What's next for cloud native?
I think right now the main effort is to make AI
work properly and make cloud native
the best place to run AI.
And there are reasons for that.
The main reason internally,
and I think for other end users as well, is that we made such
a big investment into having an infrastructure that scales and that works well with this
kind of ecosystem.
Doing something totally different for AI would be too expensive, too much of a change, and
there's no real push for that. So it's actually the last
year or year and a half has been all about ensuring that these tools are there.
So I would say that in five years the challenge will be to adapt and that's my
view, it's not CERN's view necessarily, but I think the challenge will be to adapt to the new
types of hardware that we are seeing. We lived in a pretty calm world of CPU expansion with Moore's
law and very clear technology evolution. We had a big change with the advent of GPUs and sort of paradigm shift on how we develop software
and how the software works
and how the machine learning platforms
are using this kind of hardware.
I think what I see is this becoming quite big.
Like if we look at the trend of the size of the GPUs
and the capacity of these GPUs,
the needs for very low latency between the different nodes that host these GPUs.
It's kind of funny, because before I joined CERN we went from very large mainframes, which is how scientific computing was done, to sort of commodity computing and scaling out to multiple nodes.
And we, I don't know, I kind of see a trend
of putting things back together
with very low latency interconnects
and things that do even lower latency than InfiniBand
with things like NVSwitch from Nvidia.
And they start offering these very large hardware accelerators, or, I would almost call them mainframes, where people will actually get slices of compute in these very large computing devices. And this is a challenge, I think mostly for the data centers, because we learned how to manage individual nodes at scale, with a very large number of nodes.
We'll need to learn again how to host mainframe-like computers.
But it's also a challenge for the software because we will stop thinking about nodes quite a bit with Kubernetes
in terms of workloads, but we will need to learn how to share very large pieces of hardware
between users efficiently. So I think that will be the main challenge.
But I would assume this should not be the concern of the developer that builds the workload.
It will be the concern of whoever builds that orchestration layer, right? To make
sure that, like, coming back to the Kueue project, right, you have like 10,000 jobs that need to be executed and then Kueue figures out where to best place them, and whether this is now 10,000
nodes around the world or slices in one big mainframe, like modern mainframe, it should
not concern whoever submits the job.
I think you're right.
So this is one of the advantages of Kubernetes and the separation of concerns that we got
from the declarative APIs of defining the workloads and the actual orchestration
and execution of those workloads.
But we see things like there is a feature
that I think I also mentioned or Marcin mentioned
during the keynote, which is called DRA, Dynamic Resource
Allocation.
This is, I think, a key feature that
is being developed right now, and it will mature
during this year.
But this will be key to actually expose these resources in a very flexible way
and even allow the orchestration and the scheduler to reshape the resources dynamically.
Things like partitioning GPUs, which right now is kind of a manual process that has to be done in advance, have to become more dynamic. The system has to become smart enough to optimize the whole infrastructure, partition it if needed, and assign slices as needed for each workload. I think a lot of it will be in this kind of optimization of resource usage, but hopefully this will have little impact on the upper layers and the end user workloads.
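As an illustration of the "manual, in advance" partitioning Ricardo describes: today an operator typically pre-slices a GPU (for example with NVIDIA MIG) and the device plugin advertises each slice as an extended resource that a pod requests by name. The sketch below, again using the Kubernetes Python client, is hypothetical; the exact resource name (here nvidia.com/mig-1g.5gb, a commonly documented MIG profile) depends entirely on how the cluster's GPUs were partitioned and how the device plugin is configured. The promise of DRA is to replace this kind of fixed, pre-declared slicing with claims that the scheduler can satisfy and reshape dynamically.

```python
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="inference-worker"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="inference",
                image="registry.example.com/inference:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    # Request one pre-created MIG slice instead of a whole GPU.
                    # The slice must already exist: partitioning was done manually,
                    # in advance, by whoever configured the node.
                    limits={"nvidia.com/mig-1g.5gb": "1"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```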
Ricardo, did I miss anything? Is there something in the end where you say, hey, I would have wished Andy would have asked me about this, or anything you want to say about CERN, about the CNCF, as some closing remarks?
I would say I think I've repeated this in other places but this is something I'm very keen on
repeating as much as possible which is all the community effort and all the ecosystem, every single component that is available in the ecosystem is making a huge impact.
Not only at CERN, definitely at CERN because this is where I see it, but I would say in the whole scientific community.
This is making it way easier for anyone to approach this kind of work without having to have access to very large laboratories or scientific infrastructures.
This is really changing the way we do science, improving things quite a bit.
Even if it's not obvious from the daily work, I always like to express that appreciation for everyone's
contributions. So, whatever you're working on, I hope to meet everyone at KubeCon, to meet everyone in the CNCF, in the different CNCF bodies, and have an opportunity to exchange ideas, but also give my appreciation for everyone's work.
That actually means the next KubeCon is coming up in a couple of weeks. I guess we will see you in London.
Absolutely, looking forward to it. See you everyone there.
Perfect. All right, Ricardo,
thank you so much for the time. I know it's tough to take time out of your busy day and out of
you know not only the work day but also the personal day because we're recording this at
a late hour. Thank you so much for doing this and I'm looking forward to seeing you in London.
Thank you, thank you for the invitation, see you soon.