Tech Over Tea - Developer Of ZLUDA: CUDA For Non Nvidia GPUs | Andrzej Janik
Episode Date: November 15, 2024CUDA is one of the primary reasons people buy NVIDIA GPUs but what if there was a way to have this compute power on AMD and Intel GPUs as well. Well there is a project to make that happen and it's called ZLUDA. ==========Support The Channel========== ► Patreon: https://www.patreon.com/brodierobertson ► Paypal: https://www.paypal.me/BrodieRobertsonVideo ► Amazon USA: https://amzn.to/3d5gykF ► Other Methods: https://cointr.ee/brodierobertson ==========Guest Links========== YouTube: https://www.youtube.com/c/ericparker Twitter: https://x.com/atEricParker ==========Support The Show========== ► Patreon: https://www.patreon.com/brodierobertson ► Paypal: https://www.paypal.me/BrodieRobertsonVideo ► Amazon USA: https://amzn.to/3d5gykF ► Other Methods: https://cointr.ee/brodierobertson =========Video Platforms========== 🎥 YouTube: https://www.youtube.com/channel/UCBq5p-xOla8xhnrbhu8AIAg =========Audio Release========= 🎵 RSS: https://anchor.fm/s/149fd51c/podcast/rss 🎵 Apple Podcast:https://podcasts.apple.com/us/podcast/tech-over-tea/id1501727953 🎵 Spotify: https://open.spotify.com/show/3IfFpfzlLo7OPsEnl4gbdM 🎵 Google Podcast: https://www.google.com/podcasts?feed=aHR0cHM6Ly9hbmNob3IuZm0vcy8xNDlmZDUxYy9wb2RjYXN0L3Jzcw== 🎵 Anchor: https://anchor.fm/tech-over-tea ==========Social Media========== 🎤 Discord:https://discord.gg/PkMRVn9 🐦 Twitter: https://twitter.com/TechOverTeaShow 📷 Instagram: https://www.instagram.com/techovertea/ 🌐 Mastodon:https://mastodon.social/web/accounts/1093345 ==========Credits========== 🎨 Channel Art: All my art was created by Supercozman https://twitter.com/Supercozman https://www.instagram.com/supercozman_draws/ DISCLOSURE: Wherever possible I use referral links, which means if you click one of the links in this video or description and make a purchase we may receive a small commission or other compensation.
Transcript
Good morning, good day, and good evening. I am, as usual, your host Brodie Robertson, and today
we have the developer of a project that...
I don't know, I'm sure a lot of you have probably heard of it, but maybe haven't had a chance to use it just yet.
Is it pronounced Zaluda? Because that's how I've been saying it.
It's a very good pronunciation. It's not 100% correct, but I don't think I can give Polish lessons to the whole world, so let's stay with Zluda.
You can certainly try. How would you say it if you were trying to say it correctly?
Well, it's written incorrectly, but you would pronounce it like the Polish word, "złuda."
Ah, okay. I was never going to get to that from the way it's written.
Yeah.
So how about you just introduce yourself and we can go from there.
Yeah.
So I'm Andrzej Janik.
It's nice being here.
I'm a software engineer.
I wrote Zluda.
I'm still working on Zluda and hopefully I'll be still working on Zluda in the near and far future.
Hello.
Hello.
And I have my tea because I heard it's a requirement for this podcast.
I didn't even mention it.
It's in the name.
Yeah, yeah, yeah.
It works.
Hey, look, if you actually bring it, you are better than most people, and better than me most of the time as well.
So I'm going to try not to look really shiny right now, because I'm in Australia and it is, oh, 9:30 p.m. (what time is it? yes, it's 9:30 p.m.) and it's still 29 degrees Celsius. So...
Oh, wow.
Yeah.
If you notice me looking shiny, that's going to be why.
I've got my fan on, but we'll see how it goes.
Hopefully it cools down a bit later.
Anyway, enough of me complaining about the terrible weather here.
I guess we can just start with what the project actually is and why was it initially created?
Oh, sorry. Those are two difficult questions. Yeah, it's a long story. What it is is the easier one, I think. Hopefully I will get to the end of it before you collapse from the heat.
So basically, it started because I thought I will not get in trouble writing this project.
So it went wrong from day one.
So it started when I was working at Intel.
So I was working at Intel on a certain project, which...
Okay, it will remain unnamed.
Let's call it Project X.
Sure.
I worked on Project X with...
Those two guys are important for this story.
Alexander Lyashevsky and Alexey Titov.
They are the grandfathers of this project. So, work at Intel was... some details I can provide. It was a certain GPU library for a certain workload, for, obviously, Intel GPUs.
It was one of the first projects actually using Intel's SYCL implementation, called DPC++. So we can think of it as sort of Intel's answer to CUDA. This was a very, very early version of SYCL,
and we had a lot of discussion about how useful it is, how good it is. And in one of those discussions, or arguments, I don't remember who, Alexander or Alexey, told me something that stuck with me: okay, SYCL might be good, might not be good, but what our customers want, they want CUDA. They want CUDA on Intel GPUs. I don't remember what we discussed in detail, but that was really the gist of the discussion. He was right. I had no comeback to this, but the thought stuck with me.
So time passed. Project X was canceled, and I started working as a manager. As a manager, I had almost no opportunities to write code, and it started to become slightly frustrating. If you don't write code, every day you get slightly more divorced from the daily realities of writing code, especially if it's Intel GPU code, where at the time it was pre... what is the public name? Alchemist. It was pre-Alchemist, still in flux, still moving quickly. So I was getting a little bit more frustrated, and then 2020 came in. With COVID, we were all stuck inside our homes, me included.
And at the same time, Intel, when it comes to Intel GPU development, started moving to a different host code library.
So previously, most of the development used OpenCL, but OpenCL is relatively high level.
And if you want to have some kind of extensions to OpenCL, you have to, to some degree, negotiate with other vendors.
so that you don't create something that is useful to you, and another vendor comes up with something useful to them, and these are just very minor differences.
So Intel came up with something called Level Zero.
This was the new premier way to write host code.
And it looked fairly good.
And we were all just starting to use it, starting to learn it.
And I wanted to know more about it.
So in my mind, I had two choices.
Either resurrect Project X, which was canceled.
I was specifically asked not to do this, not to do this in the open source.
And I thought, well, OK, if I do this, I'll get in trouble.
And there's another project I thought would be interesting,
implementing CUDA on top of level zero.
And my thinking at the time was, okay,
what I really wanted to do was learn level 0,
understand how it works, what are the...
what I can do with this that I cannot do with OpenCL.
If it works, well, it's not going to work,
but if it works, it's going to be very cool.
So I started working on it,
and I worked on this through most of the 2020.
And well, that's how Zulda came to be.
Surprisingly, it worked.
Surprisingly, it reached decent performance.
And I released it sometime late 2020.
What would be decent performance in your mind?
Like if you were to take an equivalent Nvidia GPU and Intel GPU,
like what percentage of the performance were you sort of getting at?
Right, so this is a difficult question.
I guess it depends on the workload, yeah.
It's difficult to compare the performance because
you need a workload where you have a native CUDA implementation and an OpenCL implementation that works natively on Intel. So a good target was Geekbench; well, Geekbench was pretty much the only thing that was working. So in my mind, something like 80% of the native OpenCL performance would be really good, surprisingly good.
I don't remember the exact performance. It was something like 90%. So Geekbench is actually a suite of tests. Surprisingly, some tests were even faster than the native versions.
Wow.
Some were slower. But the performance was, I think, around 90% overall, which was much better than I expected.
Well, it's much better than the alternative, which was zero.
Yeah, yeah, yeah. So for anyone who may be unaware of what Zluda actually is, at a high level, can you just explain what the general purpose of the project is?
The general purpose of the project is: you have your software, and your software doesn't work on your hardware, so Zluda makes your software work on your hardware. That is it.
Okay, maybe you can level up on that.
Yeah, no, no, no. So you understand what is good about this project. It specifically targets applications, libraries, plugins, whatever, using CUDA. And the reality is, if you have a GPU, maybe you use your GPU for gaming, which is horrible. Why would you do this? GPUs were meant to multiply matrices. GPUs were meant to calculate chemistry and physics. So if you have a gaming workload, if you have a game, then you're going to use a specific API to do the things that games do. I don't know what they do. Generate polygons?
Yeah, yeah.
So things like DirectX, Vulkan, OpenGL. But if you want to do computations, so physics simulations, chemical simulations, machine learning, whatever, then you have specific APIs for compute, and by far the biggest one is CUDA. And the thing about CUDA is that it's written by NVIDIA; it's written and created for NVIDIA's GPUs.
But it turns out, and this is what Zluda does,
you can implement the runtime environment of CUDA and then run it on some other GPU, from another vendor. You can use your application on another, non-NVIDIA GPU. From the perspective of the application, it's a normal NVIDIA GPU. It's slightly strange, it has a little bit of a strange configuration, some things don't work, but from the perspective of the application, it's the same.
It's similar to how Wine works on Linux. You can run Windows
applications on Linux.
There's something I was going to say... I completely blanked on it. A lower-level explanation: if you have a CUDA application... so those of you who have written CUDA might be thinking, okay, how is this possible? Because if you write CUDA... CUDA has many things, but other than the runtime environment, it's also a programming language. So you write your program in CUDA. It's a dialect of C++, and you can mix GPU code and CPU code. But this is an illusion, because what happens at the end, at the sort of bottom level, is that you have a normal application which calls into nvcuda.dll or libcuda.so. It's a normal application which uses certain functions provided by a certain library. There's no magic. And if you implement this runtime driver library, then you have Zluda. Zluda implements this library.
And this library can be relatively complex, because you need things... So what the runtime library does is it allows you to control your GPU. You can query your system: okay, I have GPU number one with this name, with this much RAM, with this many flops. I have GPU number two or three. And on this GPU I can allocate memory; on this GPU I can execute a computation. The runtime also contains a compiler for the virtual assembly. This is a bit more complex, but you have to provide this as well. And Zluda does it.
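To give a sense of what that host-side surface looks like, here is a minimal sketch against the public CUDA driver API. This is ordinary application code of the kind Zluda has to answer, not Zluda's own source, and error checking is omitted:

```c
#include <cuda.h>
#include <stdio.h>

int main(void) {
    // Initialize the driver and ask the same questions the runtime answers:
    // how many GPUs, what are they called, how much memory do they have.
    cuInit(0);

    int count = 0;
    cuDeviceGetCount(&count);

    for (int i = 0; i < count; i++) {
        CUdevice dev;
        char name[256];
        size_t mem = 0;

        cuDeviceGet(&dev, i);
        cuDeviceGetName(name, sizeof(name), dev);
        cuDeviceTotalMem(&mem, dev);

        printf("GPU %d: %s, %zu bytes of memory\n", i, name, mem);
    }
    return 0;
}
```

A replacement library has to export each of these entry points, plus the allocation, kernel launch, and virtual-assembly compilation paths he just mentioned, and translate them to whatever backend it sits on.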
So obviously, NVIDIA likes CUDA being on NVIDIA GPUs. Were you ever concerned about NVIDIA having any issue with you doing this?
I was initially, but the case has been decided, Oracle versus Google, and I have been contacted by a fair number of organizations.
I mean, initially, including Intel.
So, and pretty much everyone tells me,
they checked with their lawyers and lawyers
think that after Oracle versus Google,
implementing APIs is fine.
You can just do this.
And actually, I saw this when I was at Intel.
So I released at the end of 2020, and the next version early 2021. I don't remember how much later, but I think three months later Oracle versus Google was decided. The judgment was made in the United States Supreme Court, and shortly after I was contacted by Intel, and they told me, hey, Oracle versus Google is decided, we think we can do this.
Yeah, I know about that case. I didn't realize it was only decided that recently.
Yeah, well, three years ago.
Yeah, still, it's still pretty recent. When did the case actually start, if you remember what year it was?
Oh, I don't know, maybe 2018, 2019. Those cases take a long time.
Sure, yeah. But basically you just effectively assume, okay, that case has gone well, now that it's actually done.
Yeah. By the way,
I'm not a lawyer, neither of us are lawyers here, so don't take anything we say...
Why is the camera following me? Don't do that, Jitsi. Why is...
Yeah, assume that nothing we say is legal advice here.
Oh no, now where is it even focusing?
Oh, it's off there.
Is it going to move?
No?
What is it?
I hate, I hate Jitsi.
I don't know.
On my screen, the camera is fixed.
Oh, okay.
I'll, sure.
Okay, I'll just fix it up on here then.
Fine.
Anyway, getting distracted by that. Technology's hard.
Yeah, there's probably a setting to turn that off; I just don't know where it is. So what is it about CUDA that makes it such a driving force in this space? Why does everybody want access to CUDA? Is it just the fact that NVIDIA has that market share now, or is there more to it than just that?
There's much more to it. So if you look at how the current market for server GPUs looks, for GPU compute: you have three vendors, right? NVIDIA, who is competent, and Intel and AMD, who are not. And I'm talking about software, because that's my expertise. I can't really speak much about hardware. Well, other than some embarrassing failures, the hardware looks fine, but the problem is with the software.
So, if you are a developer, you expect a certain level of quality in your software: your compiler will not miscompile your code, your profiler will give you profiling information, your debugger will work. And those things are, for some reason, possible in CUDA land. They're not totally possible in HIP land and DPC++ land, in whatever it's called, Intel land. And that's it.
Things just work on NVIDIA.
So it's not like NVIDIA stack is perfect.
For example, we run into a miscompilation
in the NVIDIA compiler.
So they have their own bugs,
but things work.
You have profiler, it works.
You have compiler, it works.
You have debugger.
So I can specifically talk more about AMD
because at Intel, I was working in pre-Silicon environment.
So there was expectation that things will not work.
So I didn't really use the public version as much.
But on the AMD side, imagine that in your runtime, for example, an operation on images doesn't work, like it silently gives you wrong results. And it's arbitrary, only for some formats. So you have an image with 32-bit floating point pixels: it works. If it's 16-bit floating point pixels, you get wrong results for no good reason. You have 8-bit unsigned integer: it works. 8-bit unsigned works, signed gives wrong results, for no reason. And it's that sort of thing: you just cannot live a life where you have to second-guess everything you do with your compiler. And some tools are just not usable. So if you're writing GPU code, you're not doing this for your deep love of GPUs.
You do it because you want to have performance.
And what you need, you need a profiler.
You need a profiler to tell you, okay, what are my performance problems?
And you're going to use it a lot when writing GPU code.
I think an average GPU programmer has a higher need for one than an average CPU programmer. So, one example: the AMD performance profiler. You profile a workload, and the workload will have, let's say, performance events.
And let's say coarse-grained performance events,
meaning copying data between CPU and GPU
and dispatch of a kernel
and some other things that take time on a GPU.
And in a normal workload,
you will have tens of thousands,
hundreds of thousands,
maybe low million number of those events.
And imagine this: you open an AMD GPU profiler, and, well, it has certain limitations. It can only capture a certain number of performance events. So I want you to guess how many it can capture.
If we're talking about dealing with millions, I would hope that you could handle that much. But if there's a problem here, maybe in the range of a few hundred thousand?
Fifty.
Oh.
Fifty.
Oh, okay.
That's the limit.
At least it was last year when I was using it.
So, you cannot measure performance.
I see.
That sounds like a problem.
It is a problem.
I think they improved it.
You can now capture low000, maybe low 100.
I mean, it's better, but it's still worse.
Or is the magnitude better?
Yeah, but still, if you have NVIDIA profiler,
you can just capture the things.
So I only work with smaller workloads
when it's maybe tens of thousands, and it just works.
So that's more of an AMD problem.
So, I mean, going back to a higher level: if you look at AMD, they have the right strategy and bad execution. Whereas Intel had a bad strategy, but, I think, decent execution of the strategy.
And could
that work?
Let's talk more about Intel.
I want to talk more about
Intel because right now they are poor.
There's no risk that they are going to sue me.
And Intel has the...
So if you look at what AMD is doing,
their compute stack is relatively similar to CUDA.
Meaning the goal here is you have your CUDA code,
existing CUDA code,
and porting to AMD world is,
if it works, it's relatively easy.
What is the AMD?
Yeah, I was going to ask,
what does the AMD stack actually look like?
They're following CUDA, basically.
So the APIs, the runtime, the performance libraries are relatively similar.
The tools might be different,
but the tools are not super important.
They're not part of your program, right?
So imagine you have a function to allocate memory. On the CUDA side, it would be called cudaMalloc, and it takes some arguments. On the AMD side, it's going to be pretty much exactly the same, but it's going to be named hipMalloc. And this is what programmers actually want.
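To make that concrete, here is a hedged sketch of the thin portability shim this naming makes possible. The gpuMalloc macro is invented for the example, but cudaMalloc and hipMalloc are the real entry points being mirrored:

```c
// The same allocation code targeting CUDA or HIP; the calls differ only by prefix.
// Build with -DUSE_HIP (and hipcc) for AMD, or without it (and nvcc) for NVIDIA.
#ifdef USE_HIP
#include <hip/hip_runtime_api.h>
#define gpuMalloc hipMalloc
#define gpuFree   hipFree
#else
#include <cuda_runtime_api.h>
#define gpuMalloc cudaMalloc
#define gpuFree   cudaFree
#endif

#include <stdio.h>

int main(void) {
    float *buf = NULL;
    if (gpuMalloc((void **)&buf, 1024 * sizeof(float)) != 0) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    gpuFree(buf);
    return 0;
}
```

This near one-to-one naming is why porting between CUDA and HIP can go in either direction, which comes up again below.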
Because the world we live in is a world where most of the GPU compute code is already written, and written using CUDA. So this is the objective reality.
Intel rejects this objective reality and lives in their own dream world, where there's no existing GPU code and every programmer wants to write code from scratch. If you look at their DPC++, or SYCL, in some ways it is actually better than CUDA if you look purely at the APIs they have. But the problem is the API is very different from CUDA, so every port you do is a one-way ticket to the Intel ghetto. And there's also the social aspect. It's somewhat difficult to trust Intel, who hasn't seen much success in the GPU compute world, enough to port to the Intel platform and just abandon your existing, working CUDA code.
With HIP, you can
always port back HIP to CUDA.
You can do this.
With DPC++, it's
much more difficult.
It's very different.
And another problem with DPC++ is it's very
strongly
C++ based. With CUDA
and HIP, you can have your application written in some other language
and it's going to interoperate
with your CUDA C++ code relatively easily.
So there's a C API, which you can use from any language.
With DPC++, it's not so easy.
Everything is C++.
There's no C API.
So using it from something like Python is a tad difficult. And even if you do the mapping, it's very difficult to have interoperability back and forth between C++ and Python. And we
are talking about C++ and Python, but there are other languages that also want to do some degree of GPU compute.
And they have the same difficulty.
And you're not going to be able to interoperate between all of them.
And you need to have strong C++ support.
So it's already a sort of losing position.
So this strategy is just not good. It has no future.
Unless... unless we port Zluda onto their GPUs.
Hopefully we'll do this soon and rescue them from their own
wrong ideas.
Yeah, when you bring up supporting things like Python: Python is a really popular language when it comes to the research space. You have a lot of people who are already doing a lot of their math computation with Python, and they're going to do a lot of their other stuff with Python if they can. So it just makes sense to make it easy to interact with other systems through that as well, where this Intel system sounds like it's a whole different way of approaching things that adds some extra challenge.
CUDA makes it easier.
Python is actually in a relatively good spot
because, as I understand,
they have some good ways to bind to C++ libraries.
Other languages are not in such a strong position.
Right, sure. That makes sense.
So it's slightly less of a problem from a Python perspective.
Okay, so we kind of went down.
I don't even know how we exactly got to this bit.
But let's shift focus a little bit.
So now that the project has been through its revival into what it is now, what are the current goals of the project? What sort of applications are intended to be supported? What would be a dream to support that maybe right now is a bit outside the scope? And then, what are things that you could support theoretically, but you're just not going to touch?
Right. So the main goal didn't change: total world domination is something we want to do. But more specifically, since our team is relatively small, just me and external contributors, for whom I'm very thankful, there's a limited amount of time, so we need to focus. Right now we are focusing on machine learning workloads, all the PyTorch, TensorFlow, starting with something smaller like llm.c. But we want whatever machine learning workload you have to run smoothly on Zluda.
And we're also making some other choices.
So we are focusing less on Windows, because Windows needs extra support. It's going to work, but it's not going to be as smooth. Well, it wasn't really smooth before, but it's going to be even less smooth. We are going to support fewer GPUs, only focusing on the GPUs that are sort of similar to NVIDIA GPUs, and this is specifically RDNA GPUs: RDNA 1, 2, 3, and future RDNA GPUs.
Yeah, these are the main areas of focus.
What is it about the Windows support
that actually makes it a challenge?
Right, so loading libraries.
It's tricky.
So in the perfect world,
how it would work?
You have your executable.
It doesn't matter what sort of executable.
It doesn't matter if it's Python or, I don't know,
or Blender or something else.
Language is not important
because at the bottom level
you are talking to a library.
You launch your EXE through some kind of Zluda launcher, and every time you load a library,
we check is it a CUDA library?
If it's a CUDA library,
we replace it with
Zluda library. So it's all
transparent and efficient.
And every time
you launch a new
subprocess, we
also insert ourselves into this sub-process
and if the sub-process loads a library,
we also replace it with Zluda library.
It turns out it's not really possible.
So what Zluda launcher settled on is...
And why it's not possible.
Firstly, surprisingly creative ways you can
load CUDA into your process.
So some applications, the main executable, will actually have a dependency on nvcuda.dll, and, well, we can replace it; it's fine. But some other applications, think of things like Python: Python actually doesn't have a dependency on nvcuda.dll. It's just that some Python script is going to load nvcuda.dll using the dynamic loading APIs, LoadLibrary.
So, okay, we can override your LoadLibrary and replace the libraries,
but there's also some applications
where you load DLL,
and this DLL has a strong dependency
on nvcuda.dll, and we cannot detect this.
In the Linux world, it's slightly cleaner, because any time you load a library, the system runtime loader will go through the public API; it will be a call to dlopen. On Windows, it's split. If you're doing it yourself, it's the public API, but the system loader will have its own function, and we don't want to overwrite system functions from kernel32.dll or whatever. It's not a robust way.
So what we settled on: you run your application under the Zluda launcher, and the launcher will inject our own nvcuda, whether you want it or not; you're getting it. That's already not nice, but presumably you want this. And then if you're explicitly loading a library, we explicitly replace it: if you're explicitly loading nvcuda, we're overriding it to explicitly load our own nvcuda. Just this part of implicitly loading Zluda into every process, whether it's necessary or not, is sort of not nice, but it's the most robust way. And you need to have this support for every DLL, and with those DLLs there's some degree of complication, because you also have sort of two kinds of DLLs on Windows.
When it comes to CUDA, you have a loader DLL, which lives in Windows\System32, and you have the real library, which lives in one of those driver paths. On Linux, it's simpler. You don't have this loader DLL, and Linux gives you an officially supported way to inject yourself into a process and all its children through environment variables. So it's less work. And it works.
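The environment-variable mechanism he's describing is the dynamic linker's preload hook, LD_PRELOAD, which child processes inherit. As a hedged illustration of the general technique (not Zluda's actual loader; the libzluda.so path is a placeholder), here is a tiny shared object that intercepts dlopen and redirects requests for the CUDA driver library:

```c
// interpose.c: build with
//   cc -shared -fPIC interpose.c -o libinterpose.so -ldl
// and run an application with
//   LD_PRELOAD=$PWD/libinterpose.so ./your_app
#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>

void *dlopen(const char *filename, int flags) {
    // Find the real dlopen so everything else is forwarded untouched.
    void *(*real_dlopen)(const char *, int) =
        (void *(*)(const char *, int))dlsym(RTLD_NEXT, "dlopen");

    // If the application asks for the CUDA driver library, hand it the
    // replacement instead (the replacement name is illustrative).
    if (filename && strstr(filename, "libcuda.so"))
        return real_dlopen("libzluda.so", flags);

    return real_dlopen(filename, flags);
}
```

Windows has no preload mechanism that covers both explicit LoadLibrary calls and static DLL imports, which is exactly the gap the Zluda launcher has to paper over by injecting itself into every process.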
If you could just suddenly force all the Windows developers
to do things in a certain way,
what would you want them to be doing?
I don't want to force them to do things my way
that is easy to me because they just have different...
It's a different system, different needs.
This is kind of baked into the system, the way the dependencies are resolved. It's more difficult for Zluda, but in some ways it's actually better than on Linux, because on Linux you have a dependency on a function. Say the simplest function there is, malloc, right? On Windows, what is baked into your executable or library is the name of the DLL and the name of the function. On the Linux side, it's just the name of the function; the name of the library is not included in the dependency resolution, and it quite often leads to conflicts.
So, I mean, for my purpose, the Linux way is easier and smoother,
but it has its own problems.
That's understandable. It does sound like, though, that the Windows side just makes things more complex.
For me, yes.
Because what we do is not a normal way of doing things. Alternatively, we could just throw away your official nvcuda.dll and install ourselves into Windows\System32. But I don't want to do this, because it can crash some applications. Maybe we don't support something, and even if you don't have an NVIDIA GPU, it's going to be more robust if you use the official library, because there are some hidden APIs, there's the dark API, we might come to this in our discussion. But if it's not supported on our side, your application is more likely to crash, and it's going to happen with games.
So suddenly you installed Zluda into your system and some of your games start crashing
for no reason.
And it doesn't show that it's the fault of Zluda; it just crashes, and you don't know why. We don't want to do this. We want you to launch your application with the Zluda launcher, so if it crashes, you know it's not something else.
No, that actually makes a lot of sense, because my understanding is there are certain parts that you don't, at least right now, want to bother touching, just because, as you said, there are only so many people working on the project, there's only so much time to work on things. So there are things that just are not going to be included, at least for now.
If you look at the NVIDIA API, if you look at the CUDA API, it's huge.
It's gigantic.
And, you know, it's an 80-20 thing. If you look at applications, they're going to use certain functionality much more than other functionality. Pretty much every application is going to want to allocate memory on a GPU. Every application wants to launch a kernel. Not every application will have multi-GPU support. Not every application will want to do runtime linking of kernels and some other niche functionality. So generally, how we approach things in Zluda: we don't track a CUDA version and add the APIs added in that CUDA version; we look at applications. If an application uses certain APIs, then we implement those APIs. And the next application, what APIs does it use? We implement those APIs. So sometimes I get the question: which version of CUDA is implemented in Zluda? I don't know. It's whatever applications are using; it's a mix. We want applications to work. We don't care about some artificial standard that doesn't exist.
Right. Actually, that's, I think, a really sensible way to approach it, because you could approach it as, I don't know how CUDA versioning works, let's just say version one: start at version one, get a complete, perfect implementation, but then nothing else, no modern applications are supported, because all you're supporting is that basic core functionality, and you're going to take a really long time to get to the point where things are actually working with real-world software.
Yeah, yeah. We care about the application. We want your stuff to work. That's the goal, whatever it takes.
Right, so you're not trying to pass, like, Vulkan conformance tests.
No.
So you bring up that term, dark API. You said that is one of the things you want to talk about. I've never heard that term before. What does that actually mean? And then we can go into what you're actually talking about with the CUDA dark API.
Yeah, it's a pain. So, something you should understand: CUDA is obviously not an open source API, but there's something more to it, which is that your code is always a second-class citizen on the CUDA platform. You may not realize it, and I will explain why. So, first thing: there are really two and a half APIs in CUDA. There are two APIs most CUDA engineers will be aware of: what they call the runtime API and the driver API. They're extremely similar. The runtime API is slightly higher level, the driver API is slightly lower level. We implement the driver API.
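For orientation, here is a hedged sketch of the same allocation done through each of the two public APIs; the driver API, the lower-level one, is what Zluda exposes. Error checking is omitted:

```c
#include <cuda.h>              // driver API, cu* functions
#include <cuda_runtime_api.h>  // runtime API, cuda* functions

int main(void) {
    // Runtime API: context management is implicit.
    float *rt_buf = NULL;
    cudaMalloc((void **)&rt_buf, 1024 * sizeof(float));
    cudaFree(rt_buf);

    // Driver API: you initialize, pick a device, and create the context yourself.
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr drv_buf;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&drv_buf, 1024 * sizeof(float));
    cuMemFree(drv_buf);
    cuCtxDestroy(ctx);
    return 0;
}
```

The runtime API is itself a client of the driver API, which is why implementing the driver layer (plus the hidden pieces discussed next) is enough to catch most applications.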
And there's also one more API hidden from a public view.
So we have an API. Typically, if you have an API, you have the name of a function, the number, type, and name of the parameters, the return value, and some documentation. With the dark API, you have nothing. Every function that is exposed by the dark API is some kind of unique identifier, a GUID, and an index. So you ask your CUDA driver: hey, give me the table of function pointers for this unique key.
And this is used by the CUDA runtime, really for no good reason, I don't know why they do this, and by some first-party libraries. Because there are some things NVIDIA doesn't want you to know, things they don't want you to be able to do. And a classic example is ray tracing.
So you might think, okay, I have my GPU, and I want to know if my GPU is capable of hardware ray tracing. And NVIDIA doesn't expose this information. At least, it doesn't expose it to you. It exposes this information to its own libraries, its own runtime, and it goes through this hidden API. It has no name, it has no documentation, so we call it the dark API. I don't know what its proper name is. Nobody knows.
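Mechanically, the entry point is an exported but undocumented driver call that returns a table of raw function pointers for a 16-byte key. Here is a hedged sketch of what calling it looks like; the key bytes below are placeholders, not a real identifier, and the meaning of the returned pointers is exactly what has to be reverse-engineered:

```c
#include <cuda.h>
#include <stdio.h>

int main(void) {
    cuInit(0);

    // cuGetExportTable is exported by the driver but carries no public
    // documentation: you pass an opaque 16-byte key and get back a table.
    CUuuid key = {{ 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
                    0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f }};  // placeholder key
    const void *table = NULL;

    if (cuGetExportTable(&table, &key) != CUDA_SUCCESS) {
        printf("no export table for this key\n");
        return 1;
    }

    // What the entries mean is undocumented; a reimplementation has to work
    // it out by watching how NVIDIA's own libraries use them.
    const void *const *entries = (const void *const *)table;
    printf("first entry: %p\n", (void *)entries[0]);
    return 0;
}
```

The real keys are just byte patterns observed in the traffic between NVIDIA's first-party libraries and the driver, which is why the only practical way to support them is the workload-by-workload approach he describes.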
And the dark API was a frustrating experience for something like Zluda, because we have to implement it. It actually almost killed Zluda, because the first time, I was enabling a very simple application that just adds two numbers. All it does is add two numbers on the GPU.
So I have this simple application, and I look at the interactions between the application and the CUDA driver, and everything goes as expected. Every function that I wrote in the source code is being called. And then there's one small addition I did not write: there's a call to this cuGetExportTable. And I look at it: it's some kind of unique key I've never seen before, and it gives me a table of pointers.
I don't know what those pointers are. OK, I said, set a breakpoint on each one of them. And it calls one of them with, for some reason, no arguments. I don't know what this function does. I decided that's too much for me. I give up. And I gave up for, I don't remember, two weeks. After two weeks, I thought, well, okay, maybe, well, I'm in the mood for some pain. Let's try it. It's probably not going to work. And I look at this function, I look at what functions it calls, and it's four instructions.
It calls an allocation: it calls malloc to allocate some memory and returns this memory to the driver. Okay, I thought, well, that's not so bad. Actually, it really motivated me to keep going, and it was pure luck, because it was by far the simplest and easiest function in the dark API. All the other ones have been slightly more difficult. And generally, it was the wrong approach to look at what the function does, because most of the time it's not super useful.
Nowadays what I do is look at what the inputs are and what the outputs are, because usually that's your first clue. If a function returns some kind of pointer, and you see across the rest of the application that this pointer is used as a context, then you know: okay, this dark API function is creating a context, with some internal bits being set or unset. And those bits are not really important. We might as well create a normal context in Zluda, because it probably does something, but across the applications we had, there's no observable effect of this function that is different from the public API.
So that's how it goes. And we implemented only those parts of the dark API that are necessary to run your applications, mainly interactions between the high-level runtime API and the driver API. The runtime API doesn't always talk to the driver API through the public APIs; it also talks through the dark API. And then I noticed that the first-party libraries use the dark API for various reasons, which I don't care about.
So it's basically just those internal functions that there's no point in documenting, because you're never actually supposed to call them yourself?
I mean, I think they should document them.
Well, they should.
Because it would be nice to know how I can use everything that I paid for with this GPU, but they decided not to.
There might be some thinking why they do this.
It's obfuscation.
Maybe some of those things
they just don't think are useful.
Maybe some things
they just want to hide.
Well, when it comes to ray tracing,
they definitely want to hide those things
because OptiX also doesn't expose those things.
Whatever the reason for it,
it makes your life a lot more complex.
Yeah, yeah.
It probably also makes the life of CUDA engineers more complex, because some things they do in this dark API, I don't know why they are doing them. For example, there's something they added recently, and my suspicion is they wanted to make my life more difficult, but the compiler optimized out everything they wanted to do in this function. Because what does this function do? Let's call this function foobar. It returns whether the function foobar, so itself, starts at an even or odd byte in memory. And this is completely, totally pointless, because your compiler will pretty much always align your function, not only to an even address, but most likely to your natural size, to 64 bits.
It's just a mystery to me.
And literally two instructions.
Or three.
I don't know. I don't know.
I don't.
I haven't answered for you either.
But my expectation is there was more complex body inside.
They wanted me to implement,
but optimizer optimized everything out.
Who knows what they are thinking.
It's a mystery.
I only care about the applications running not about their creative ideas.
Right, right. So with these dark API functions, the way you're basically approaching them
is effectively like a test suite where you're throwing data into it, you're looking at the
data that comes out of it, and you're hoping that whatever you're doing in between
is getting you the result
that would happen on an actual
NVIDIA GPU with CUDA.
Yeah, most of the time
it's sufficient. Sometimes you have to look
what it does internally.
But it's usually too complex to do this
because
if you have a function,
it's going to call any number of internal functions.
So observable properties is what matters.
If it sets a flag in some kind of object, it tells me nothing.
We're going to do the same thing without
setting this flag.
So you said earlier that
CUDA is a really big API.
How big actually
is it?
I can open my... if you give me a minute, I can open my IDE and tell you.
Too good?
Just give me a second. So, it's not 100% accurate, it might also count some function pointers, but just the driver API is 575 functions exported. We don't implement all of them. Well, currently we implement nothing, but the old Zluda implemented below 100, or maybe slightly above 100, and it was enough to run a whole lot of applications, most of them. Okay, so this is the driver API, and there are also the performance libraries, which have their own APIs, and they also have a lot of functions.
So those hundred are a lot of the main functions that sort of everything is going to need to deal with, as opposed to a lot of these little functions here and there that are important for particular workloads but may not necessarily be something that most applications are using?
Right, if you have... we implement, as I said, we do enablement workload by workload, so application by application.
And we can see that there's a core of operations, functions that everyone is going to use, and then it gets less and less common.
Some of them might be used by nobody, because they were added for completeness: some kind of getters for properties, that sort of stuff.
One thing I didn't really ask earlier is, what does the name of the project actually mean?
Why is that the name? I kind of get the "uda", as in CUDA, but why Zluda?
Yeah, so, okay.
I have to give an excuse for myself.
So, the name of the project,
I literally came up with it the day before release.
Initially, and for the whole of 2020,
and you can say it if you go back enough in history,
it was simply called not CUDA.
And, you know, I'm not a lawyer,
but I'm relatively certain that I cannot release a project name like this.
And the day before release, I said,
well, let's go with something that sounds slightly like CUDA.
There's no such words in English, I think, really.
Let's try Polish.
I'm Polish.
And this sounded nice.
So "złuda" means something like mirage, illusion.
OK.
That's actually, you know, that sounds kind of cool, actually.
There's no hidden second meaning. I thought it was going to be nice: use Polish, have a word that sounds nice. And it's not going to get used to it.
Yeah. But I was aware that it was going to be impossible to pronounce, so it's spelled differently; I simplified it a little bit.
Yeah, as I said, I was never going to pronounce this word correctly, but I don't think anyone was going to. If it was spelled correctly, how would it actually be spelled?
I can write it in the chat.
Oh, yeah. Let's see if I can...
I don't even know what that's...
Yeah, yeah. This L is
spelled differently.
Okay.
I don't even know what that symbol is.
It's like English W, the pronunciation.
Oh.
For anyone who
is just listening,
it's an L with like a slash through it.
Yeah.
Yeah.
I have Polish listeners who will be like,
you're a moron.
I can barely speak English in the best of days.
Don't get me started on that.
Don't get me started on Polish.
That's what I get for being Australian. Yeah. So, hmm. Where do we go from here? Oh, we haven't talked about it being in Rust yet, have we? I don't think so. Yeah, I get this question a lot.
So Rust is perceived as still sort of an exotic alternative to the more mainstream languages in this area, for solving this sort of problem. The mainstream solution would be C or C++. But the thing is, I have known Rust for a long time, for over a decade; I learned Rust before version 1.0.
Yeah, I was gonna say, if you've known Rust over a decade...
Yeah, okay, over a decade. So I've been interested in it since early on. Professionally, I've been writing C# and then F#, but I always had a certain level of interest in lower-level development.
And I always had a suspicion that lower-level development, systems-level programming, is difficult not only because it's maybe less mainstream and tricky, but also because C++ is just not a good language. So the first time I learned about Rust and saw the sort of semantics it has, I realized: wow, that's something I always wanted for system-level programming.
And I would never write Zluda in C++. It's just too much pain, too difficult. And I knew Rust, and I always wanted to do a project in Rust. The things I wrote in Rust before were sort of small projects that were never released, to try some features or try something out with the language. So I think this is the first big mainstream project I did in Rust.
And I'm really happy with the language.
So language is relatively good for writing system level code.
Well, there are some problems with it, but I'm in a very small minority here, in my opinion at least, so maybe don't listen to me: the build system is really anemic. It works well if you have a relatively simple project that is all in Rust, but we have things that are slightly more complex. There's interop between languages, there's codegen, and Cargo is just not good enough for this purpose. So we have our own solution, but it would be better if Cargo had a real build system. I actually prefer CMake. I'm one of the few people, probably the only person, who prefers to have CMake or MSBuild or anything else other than Cargo for building. Cargo does package management, and it's relatively good at package management, but the build system part is just not it.
But
when it comes to language semantics,
the availability
of libraries, it's
good. That's what I wanted.
Why would C++
be painful? What about the language
is a problem?
It doesn't have the features I want in the language. So I want to have enums, discriminated unions. I have professional experience writing F#, and in F# you are going to use discriminated unions a lot. And a lot of really fundamental types in Rust are expressed as enums, F#-style or Rust-style enums, where you have not only the value but also data associated with it: discriminated unions. And so things like an instruction, this is an enum. Statements, directives, pretty much everything in the compiler. And C++ doesn't really have good support for this.
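For anyone who hasn't met discriminated unions: below is a hedged sketch, in C, of the tag-plus-union pattern you end up hand-rolling in C or C++ for something like a compiler's instruction type. The variants are invented for illustration; Rust's enum expresses the same shape directly, and match forces you to handle every variant.

```c
#include <stdint.h>

// A hand-rolled discriminated union for a toy instruction type. In Rust this
// would be one `enum Instruction { ... }`, and the compiler would check that
// every `match` covers all of the variants.
typedef enum { INST_LOAD, INST_STORE, INST_ADD } InstKind;

typedef struct {
    InstKind kind;                                // tag: which variant is active
    union {
        struct { uint32_t dst, addr; } load;
        struct { uint32_t src, addr; } store;
        struct { uint32_t dst, lhs, rhs; } add;
    } as;                                         // payload: data attached to the variant
} Instruction;

static uint32_t destination(const Instruction *i) {
    switch (i->kind) {                            // nothing stops you from forgetting a case
    case INST_LOAD:  return i->as.load.dst;
    case INST_ADD:   return i->as.add.dst;
    case INST_STORE: return UINT32_MAX;           // stores write no register
    }
    return UINT32_MAX;
}

int main(void) {
    Instruction add = { .kind = INST_ADD, .as.add = { .dst = 0, .lhs = 1, .rhs = 2 } };
    return (int)destination(&add);                // returns 0
}
```

That built-in exhaustiveness checking is a big part of why the compiler-heavy parts of a project like this are less painful in Rust.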
Other things, like memory management: once you learn how to deal with the borrow checker, it's much easier to do in Rust than in C++. And obviously not the whole world subscribes to Rust's ideas about memory management and code safety. So if you look at the places where we have to interact with the outside world, like emitting LLVM assembly, the code to emit LLVM assembly is just every second line is unsafe. But it's okay; unsafe in Rust has the same semantics, or even stricter semantics, than C++. So we're still coming out ahead with Rust. That's what I want, basically.
So Rust is basically just,
at least for you, the better tool for the job.
Yeah, for me it's a better C++. And as I said, Rust is not perfect. There are some areas where it's just unusable, like writing GPU-side code. C++, and specifically Clang, has all those little attributes that are useful for me, and things like support for address spaces. These are very niche features, but I want them for writing GPU code, because if you look at the Zluda compiler, certain functions are implemented as calls to full-blown functions in an LLVM module which we link in during compilation. And this LLVM module is written in C++, right? C++ compiled to LLVM. And you cannot really write this sort of code in Rust, because it doesn't have good support for writing GPU code.
Fair enough.
I thought you were going to say more there,
but no, no.
The delay sometimes
throws me off when people are going to stop talking.
Yeah, you're literally on the other end of the world,
so there must be a fair bit of delay.
Yeah, I can make it work, though.
I've made it work so far, so, you know,
it hasn't gone horribly bad.
One thing about the project itself,
I don't think you really talk about this anywhere,
but the choice of license on the project,
and again, neither of us are lawyers,
so don't get on our case about any specific points. But it seems to have two licenses attached to it, the Apache and MIT licenses.
Yes.
Why are there two licenses, and why those licenses in particular?
So I want my software to be used by everyone.
Fair enough.
For any purpose you want.
So this has been sort of... actually, it comes from the Rust community. It's just been a popular solution in the Rust community to dual-license under the MIT and Apache licenses; you pick whichever one you want. And, as we agreed, we are not lawyers, but my understanding is that this is the path to being the most compatible with all other open source and closed source licenses. That's why.
So your goal is basically just making it so people can actually use it, rather than, you know, ensuring some free software perfection about the software, where everyone that uses it has to also be free software, all that sort of stuff.
Yes, yes, yes.
Actually, while we're
down this route,
what is your general stance
when it comes to open source and free software?
Clearly, from this,
you're more in favor of the open source side,
but do you have a position you stand on more generally?
So my thinking is, and this comes from working at relatively big companies: if you want your software to be used, the way to go is to use the MIT or Apache license. If you're licensing under GPL, big companies will not touch it, unless it's something that is extremely critical, like the Linux kernel. Those companies have explicit policies that, hey, if it's MIT or BSD or Apache, then you can use it, link it into your software; if it's GPL, not really. Before release, and this is for external stuff, or rather stuff that is being released from within the company, there's going to be a legal review, and they will check if you're using something that uses GPL. And if it's using GPL, then it's probably not going to be released with GPL. So maybe that's your goal, that corporations are not using your software; then I would even recommend using GPL. That's not my stance or my politics, but this is the reality I observed.
Yeah, the Linux kernel was in a weird position when it came along, because it was there to replace the proprietary Unix systems that were coming up. The BSD world did exist, but it was this weird mix of proprietary BSD, and then 386BSD was there as well. And we could just turn this into me ranting about the early history of Linux, because this is one of the topics I really, really enjoy: why the whole GNU Hurd thing just didn't work, and why it should have been based on 386BSD, but then they didn't end up wanting to do that and chucked away that entire project to wait on this mythical kernel that was never going to come around anyway. And then Linux came along before they even started the project, so no one cared about Hurd after that. Anyway.
Yeah, but just be aware that GPL-licensed stuff has a sort of special position when it comes to licensing in corporations. They're going to use it, they're going to contribute, but it has to be big enough. So they're going to use the Linux kernel, they're going to have their own fork of GDB, which is also GPL-licensed, and stuff like this. But if it's something smaller, then the lawyer who is giving you a review is probably not going to give you a special exception for it. If your goal is to make a library, just don't even bother with a GPL-style license. I mean, I'm not going to tell you how you should live your life.
Sure, sure.
If you want maximum usage, then GPL is probably not the solution.
So what is your background in programming?
I don't just mean your corporate background.
When did you actually start doing programming?
How did you actually get yourself interested in it?
Well, it was during my first programming lesson at university. I didn't program before going to university. So I hope it's not going to be a letdown, but I always used computers; I just wasn't really interested in programming as such.
I went into computer science degree because I was broadly aware that you can have a good
life if you have a computer science degree.
Whatever you do, programming, security, databases, it's going to be fine
Those were different times. That was...
Oh wow. So long ago.
A long time ago. Still, I understood that it was going to be great. And I was actually, how do you say, overwhelmed? Maybe not the best word, but I did not see myself
as a possible programmer, because my thinking was
that programming, extremely difficult.
Every programmer has this sort of galactic brain
and the tools they are using are really high technology
and everyone is excellent.
And I started programming, and what I realized is that all the programming languages are old garbage from 30 years ago.
Programmers,
they cannot program a FizzBuzz.
So,
you know, it doesn't matter if I'm bad
at programming. It was extremely encouraging
because I realized it doesn't matter
if I'm really bad at programming.
Other programmers are even worse.
So my bad code is
not going to
make things worse on average.
So yeah, let's start programming.
And it was interesting.
That's how I became a programmer.
I actually found my passion for programming
when I was studying for computer science degree
In that first
programming class you did, what language
were you actually working with?
My first programming lessons were in C.
That's a sensible language, because I've heard some real weird answers from people before, where it's like Objective-C or, you know, just other random things that don't make any sense.
Right. So I learned a number of those languages; I think I'm still using many of them. So I learned C, C++, Ruby, Python. C# I learned by myself, and this was the first language I used professionally.
normal and sensible languages. I started with Java. Oh yeah, so all completely normal and sensible languages.
I started with Java.
Oh yeah, so when it comes to
Oh yeah, I learned some Java.
We learned Prolog.
But I remember
nothing about Prolog.
I don't remember that Prolog existed.
Yeah, it's fairly
interesting in its niche.
But I wouldn't be able to write anything in Prolog nowadays.
So now, is Zaluda your main focus at this point?
Yes, yes, yes.
And my journey has been, when it comes to strange languages, sort of C#, then F#, then assembly, then Zluda, which is Rust and C++. And in between there are OpenCL and C for Media, which you definitely haven't heard about, and none of your listeners have heard about, but it's a great language. Ironically, it's a great language for writing code for Intel GPUs. It's a dialect of C++ for writing GPU code, and it's really good if you want to do just this. It's mainly used internally at Intel.
Oh, okay, that makes sense then.
But I think they did some public releases.
How do you spell the name, by the way? CM, C for...
No, I think it's C for Metal nowadays, previously called C for Media.
Ah, yep, yep, okay.
Has nothing to do with Apple Metal.
Everyone asks this question, nothing to do.
C for Metal is a programming language that allows for the creation of high-performance compute and media kernels for Intel GPUs using an explicit single instruction, multiple data programming... blah, blah, blah. OpenCL and SYCL applications use an implicit SIMD programming model... blah, blah, blah. Okay, I don't want to read all this.
Yeah. So, by the way, if you have an Intel first-party library that has GPU code, check which language it uses.
If it's written in C for Metal, those people know what they are doing.
If it's written in something else, probably not.
But they're trying to include C for Metal in DPC++ as a different mode of programming. It's now called ESIMD, explicit SIMD.
Ah, okay, okay.
But CM is still more ergonomic if you just want Intel GPU code. ESIMD has the advantage that it interoperates much better with the normal DPC++ code.
So let's actually get back to the project itself. When you
wanted to sit down and actually start working on a project like this,
where do you even start with something like this? You see the API there, you see you have a GPU.
What do you even do to start getting anything, doing anything?
Start with the... So in this case, it was starting with trying to understand what CUDA is, because it was my first CUDA project. I mean, I knew what CUDA is, but... oh, this was also, I didn't mention, the secondary goal of this project was to actually understand how CUDA works, or how you write CUDA code. At this point I had years of experience writing Intel GPU code, but it's always worth it to understand what the competition is doing, how you write CUDA code. So firstly, understand how you even do the simplest thing with CUDA.
And the first goal has been an application that does the simplest thing possible: add two numbers on a GPU. But it still needs to do a fair bit of setup: get a context, load the source module, allocate memory, all that stuff. So you look with a tool like... on Linux you have a tool like ltrace, where you can log all the interactions of your application with certain libraries. So one of the first things to do was understand what sort of host functionality has to be implemented, and think really hard: can it be implemented in terms of an Intel GPU? So, all this host functionality. And don't think big. That's what worked for me. It was just a simple application, because it was a research project; I had no goal of having complex things running. If I had two and two returning four on the GPU, I would be successful. So I have a simple application, I know what the interactions are (obviously that was a lie, because I didn't know about the dark API at the time), and then I look at those calls, and there's like ten of them, and I know, okay, all of them can be expressed in terms of Intel Level Zero. And this is the host-code interaction.
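Those ten or so calls look roughly like the sketch below. This is a hedged reconstruction of a typical minimal driver-API program, not his actual test; the PTX text for the kernel is a placeholder, and error checking is omitted:

```c
#include <cuda.h>

// Minimal "add two numbers on the GPU" host sequence against the driver API.
// The PTX below is a stand-in; a real program would embed the actual kernel text.
static const char kernel_ptx[] = "/* PTX for an 'add' kernel goes here */";

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction add_fn;
    CUdeviceptr result;
    int host_result = 0;

    cuInit(0);                                        // 1. initialize the driver
    cuDeviceGet(&dev, 0);                             // 2. pick a device
    cuCtxCreate(&ctx, 0, dev);                        // 3. create a context
    cuModuleLoadData(&mod, kernel_ptx);               // 4. hand the PTX to the driver's compiler
    cuModuleGetFunction(&add_fn, mod, "add");         // 5. look up the kernel
    cuMemAlloc(&result, sizeof(int));                 // 6. allocate device memory

    void *args[] = { &result };
    cuLaunchKernel(add_fn, 1, 1, 1, 1, 1, 1,          // 7. one block, one thread
                   0, NULL, args, NULL);
    cuCtxSynchronize();                               // 8. wait for the kernel
    cuMemcpyDtoH(&host_result, result, sizeof(int));  // 9. copy the answer back

    cuMemFree(result);                                // 10. clean up
    cuCtxDestroy(ctx);
    return host_result == 4 ? 0 : 1;
}
```

Every one of these calls has to be answered by the replacement library, including cuModuleLoadData, which is where the PTX story he gets to next comes in.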
Since I have experience with writing GPU code, I know there's also the GPU-side code. So the next step was to try to understand what the format of the NVIDIA GPU code is.
Is it some kind of intermediate representation?
Or is it some kind of architecture-specific assembly?
I mean, depending on the level of complexity,
either can work.
And here I was lucky because, as I said,
NVIDIA is competent and they have a virtual assembly called PTX.
And this is a text format.
So I have a text format, and it's somewhat abstract, at least from my perspective. It's a virtual, abstract representation. It's not written for a specific architecture, but after compilation it's going to work with any CUDA GPU.
And I look at the instruction set,
and actually, good thing,
NVIDIA documents the PTX format,
and I look at those instructions,
and all those instructions I'm familiar with, and I know they're going to work on Intel GPUs: things like load memory, store memory, do multiplication, do addition. All of those are going to be supported on an Intel GPU.
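To make that concrete, here is roughly what the GPU side of the "add two numbers" experiment looks like, using the same hypothetical add_two kernel as the host sketch above. The PTX in the comment is an abbreviated, from-memory approximation of the flavour of code nvcc emits, not verbatim compiler output.

```cpp
// The GPU side of the "add two numbers" experiment: a trivial CUDA kernel
// (compile with `nvcc --ptx add.cu` to produce add.ptx for the host sketch
// above). The comment below is an abbreviated, from-memory sketch of the
// flavour of PTX such a kernel lowers to, not exact compiler output.
//
//   .visible .entry add_two(.param .u64 out)
//   {
//       ld.param.u64        %rd1, [out];    // load the output pointer
//       cvta.to.global.u64  %rd2, %rd1;     // move it into the global address space
//       mov.u32             %r1, 2;
//       add.s32             %r1, %r1, %r1;  // 2 + 2
//       st.global.u32       [%rd2], %r1;    // store the result
//       ret;
//   }
//
// Loads, stores, moves, adds: nothing here is conceptually NVIDIA-specific,
// which is why translating it for another vendor's GPU looked feasible.
extern "C" __global__ void add_two(int *out) {
    *out = 2 + 2;
}
```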
And I was still thinking small.
I have my very simple module in text format,
and I have my host code.
So it looks possible.
It looks possible enough that I started implementing, starting with the PTX compiler, because it looked relatively easy. But as I start implementing, there are complications. The way to implement this, sort of classical compiler design, is: you have your text code, and you start by parsing it using a parser generator. In my case it was a Rust library called LALRPOP.
That's the name.
Yeah, it comes from an abbreviation of the kind of grammars it can parse, LALR.
It's not super fun, but the library is good.
No, no, this is a solid library.
And I start parsing those things.
And then once you have things parsed,
you have a sort of in-memory representation of the source code.
You now start writing your transformation passes.
And every transformation pass makes it slightly closer to your target.
So I didn't even think about what my target is going to be.
At first, I thought maybe I can compile it straight to the Intel GPU instruction set.
But relatively early, I decided, no, no, no, it's not going to work.
I'm going to compile it to SPIR-V which is a bit more
abstract, much easier to compile
for, which was the right choice.
So I'm going to compile to SPIR-V
and in every compilation pass I try to make it a little closer to what SPIR-V is expecting.
And it sounds simple, but there's a fair bit of weird behavior in the PTX format.
For example, there are entirely too many address spaces. On a GPU you are generally going to have like four different address spaces: global memory, shared memory, private memory, generic memory, maybe constant memory.
And PTX has like seven or eight different address spaces. Some of them, I mean, if things were designed from the ground up, are really not necessary in the current world, but they exist, and you need to have some way to translate them to the spaces that are available in SPIR-V. Some instructions have no direct replacement in SPIR-V, so you have to replace them with a series of simpler instructions, this sort of thing.
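To give a flavour of the address-space problem, here is a simplified sketch of a mapping from PTX state spaces to the storage classes that OpenCL-style SPIR-V offers. This is my own illustration, not Zluda's actual pass (which is written in Rust and handles far more edge cases); the awkward entries at the bottom are exactly where the simple story breaks down.

```cpp
#include <cstdio>

// Illustrative only: a simplified mapping from PTX state spaces to
// OpenCL-flavoured SPIR-V storage classes, to show why seven or eight
// PTX spaces have to be squeezed into fewer SPIR-V ones.
enum class PtxStateSpace {
    Reg,      // virtual registers
    Sreg,     // special registers (%tid, %ntid, ...)
    Const,    // constant memory
    Global,   // global memory
    Local,    // per-thread private memory
    Param,    // kernel/function parameters
    Shared,   // per-block shared memory
    Generic,  // untyped/generic pointers
};

const char *spirv_storage_class(PtxStateSpace space) {
    switch (space) {
        case PtxStateSpace::Global:  return "CrossWorkgroup";
        case PtxStateSpace::Shared:  return "Workgroup";
        case PtxStateSpace::Local:   return "Function";
        case PtxStateSpace::Const:   return "UniformConstant";
        case PtxStateSpace::Generic: return "Generic";
        // The rest have no single SPIR-V equivalent: registers become
        // SSA values, special registers become built-in variables, and
        // parameters get lowered case by case.
        default:                     return "(lowered case by case)";
    }
}

int main() {
    std::printf(".shared maps to %s\n",
                spirv_storage_class(PtxStateSpace::Shared));
    return 0;
}
```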
So it was almost a year just to have a translation that actually
translated correctly and worked.
So if you are attempting something like this, I don't have any advice
other than start small
and never give up.
So I gave up like three or four times.
One of those times was, as I said previously, the dark API, the first time I encountered it. Another time was, and this is very technical, when I did one pass that was completely unnecessary: translation to SSA form. And I realized that, well, it's completely unnecessary. LLVM and SPIR-V are going to do this for you. You don't have to do this.
What is the SSA format?
Static single assignment. This is the sort of representation that optimizing compilers operate on. But this sort of compiler, the one inside Zluda, is not an optimizing compiler, it's just translation.
Right.
So it's completely unnecessary.
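For anyone who hasn't met the term, here is a toy example of what SSA means. It is my own illustration, nothing from Zluda's code.

```cpp
// Toy example of SSA (static single assignment): the same computation
// written normally and in the "every value is assigned exactly once"
// style that SSA enforces.
int not_ssa(int a, int b) {
    int x = a + b;  // x is assigned here...
    x = x * 2;      // ...and reassigned here, so this is not SSA
    return x + 1;
}

int ssa_style(int a, int b) {
    // In SSA a reassignment becomes a fresh name. LLVM and the SPIR-V
    // tooling build this form themselves, which is why a hand-written
    // SSA conversion pass turned out to be unnecessary.
    int x0 = a + b;
    int x1 = x0 * 2;
    int y0 = x1 + 1;
    return y0;
}

int main() { return not_ssa(1, 2) == ssa_style(1, 2) ? 0 : 1; }
```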
And there are two or three other things I don't remember. But I do remember just giving this project a rest for a week or two, several times, and still coming back to it.
When you said this was your first CUDA project, the first thing I instantly thought of was the Linus Torvalds Linux email: it's just a hobby project, it won't be anything serious like GNU.
Yeah, yeah. As I said, I didn't expect it to be anything big. The main reason I wanted to do this was that I wanted to learn about Level Zero, and mainly about Level Zero because it was new for all of us at Intel at the time. So it was, at least initially, just basically a hobby project to learn how this works.
Yeah.
But it sort of came along
to some extent
and became something
a lot more than that.
Yeah. Even when I was joining AMD to work on this project full time for the first time, we didn't know if it was going to work. So we sort of expected that either it's going to work and it's going to be great, or, just as likely, it's not going to work and we're going to run into some kind of impossible roadblock. And in that case we were going to release the updated Zluda for AMD that doesn't work, for some reason, but with an explanation of why it doesn't work, so anyone can read the source and learn from it.
I didn't expect, nobody expected, it to work, but it does. Performance is actually not bad.
Yeah, I didn't bring up the AMD stuff earlier
because I wasn't sure what you can and can't say
there, because I'm sure there's...
At least judging by what you have publicly said,
there certainly were issues that have
gone on.
Yeah, so I can only repeat what happened.
So I worked as a contractor for AMD for two years.
Towards the end, I released the source code.
AMD decided that they shouldn't have allowed me to do so. It's not legally binding. We rolled back the code to the pre-AMD version and we are starting again.
Yeah, that's...
I'm sure there's a lot more there that you would like to say
if there weren't issues there.
Maybe one day, but...
Yeah, I certainly won't push you on that, because I'm sure there are some legal issues there that you don't want to... You want to avoid getting sued by NVIDIA, you don't want to get sued by AMD.
So, what was interesting about the situation is something I learned: there are so many internet lawyers who want to prosecute the case in front of internet judges. So one reason, the main reason, why I'm trying to dodge this topic is that it's going to bring internet lawyers into the comment section.
I see, I see.
Sorry, it has already been handled by actual lawyers.
Yeah.
Another thing is that
there's
a certain demographic of people who
blame NVIDIA for
everything, even if
NVIDIA has nothing to do with things,
at least to my understanding
Yeah, I've definitely seen a lot of those internet lawyer comments myself. You go on Reddit, you'll see plenty of them. Even just in the GitHub discussion you had, there were people arguing things like: what was the position of the person who told you to take the thing down, are they at the correct level of the chain to do that? Even if you were in the right, I can't imagine you're in a position to go to war against AMD, or even want to do that.
No, no, no.
AMD is a billion-dollar company. I'm not going to fight them. They have lawyers, they have a PR department. It's not helping anyone.
Yeah, it's much better to just comply with whatever they reasonably want you to do, and then just go from there.
Yeah, focus on the technology, focus on the nice things we can build, all the applications that we can have running on your GPU.
So, what is the current state of
things?
So, the current state of things is: very little works. Currently, and this is the situation as of today, I don't think even the whole project builds, because it's being rewritten from the ground up, starting from the most basic things, like even the parsing of the PTX. So we have a new parser, and the way this new parser works has an effect on the rest of the compilation. So I rewrote the parser, rewrote the compilation passes to be nicer, simpler, and recently finished writing the code responsible for emitting LLVM bitcode.
So there's a number of unit tests inside Zluda.
And the unit tests are unit tests for the compiler. So the compiler takes a simple handwritten PTX module, compiles it, and we check if the result is as expected. And there are like 90 of them. So those compiler tests work, but we don't have the host code working. And that's the next step: have the host code working, have some other tooling around the project working. And once we have the host code and some of the necessary tools,
then we'll start focusing on the specific workloads.
The first one is going to be probably Geekbench
because it's relatively simple
and then work on those machine learning workloads.
So llm.c. llm.c is probably going to be the first machine learning code that works, with some caveats: flash attention support will have to wait a little bit. But yeah, llm.c is the first goal, the first milestone.
So now that you've done a lot of this work in the past and you're doing a rewrite, what are the other things you've learned from that experience that are maybe going to make things easier now? Because, as you said, when you started it, it was your first attempt at doing this. I assume there's a lot you've learned from doing it at least once.
Right.
So one big lesson, don't trust AMD things to work.
It might sound funny, but it was an extremely stressful situation.
One of the first workloads I did when I contracted for AMD
was a closed source application.
I don't remember which one, either 3DF Zephyr or RealityCapture.
And so I enable everything that is required for this application, both in the host code and in the compiler. But for some reason it doesn't work.
And since it's a closed source application, it's relatively tricky to debug, but eventually I found the kernel that is producing incorrect data. So I extract this kernel and the data it operates on, and I run it both on the CUDA stack and on Zluda. And it works correctly on both, every time. But for some reason the same kernel fails when run inside the application.
And at the same time, I'm starting to panic because it's, well, impossible situation,
impossible to debug. It should work. It just defies the logic and the laws of physics.
And I tried it both ways. So I run the whole workload, because I thought, well, maybe an earlier kernel under Zluda is computing things wrong. So I run it with both the data sourced from CUDA and the data sourced from Zluda: Zluda to CUDA, CUDA to Zluda, and it all works well, gives the right result.
And purely by luck, purely by luck, I realized that there is a bug in the AMD host code: for certain memory copies, it copies them incorrectly. That was my first revelation. Do not trust host code, even if it's a simple mapping, even if it looks simple. Do not trust that AMD host code is going to work correctly.
Always double check.
So, you know, it sounds pessimistic. There are not so many cases, but generally, if it's something that has to do with textures, then double check if you're using HIP. So that's what I learned. There are some other lessons that are very Zluda specific.
And it was actually somewhat frustrating, starting again from this point of two years ago, because I look at my code from before the rollback, and I look at a compilation pass and I think, well, that's wrong. There are better ways to do this. But I have to fix all that other code before I arrive at this. So you can see where the goal is, but there's just so much to get there. I remember I got rid of this part because there are better ways to do it, and now I have to live with it again.
But hopefully the second time around,
you can get there quicker.
Like now that you know what you're doing.
Yeah, yeah, it's much quicker right now.
Because, firstly, the workloads are different. Oh, and one thing I did not mention that we're getting rid of, sadly, and I have mixed feelings about this: we're not going to do ray tracing. Zluda has had an implementation of OptiX. OptiX is NVIDIA's framework for ray tracing, and it's very complex. During my contract work at AMD, I think OptiX took, I don't know, almost a year. It was very complex. Not only OptiX itself, but also the applications using OptiX. So the goal has been Arnold. It's a closed source, very complex rendering solution. Debugging was a lot of effort.
So it's going to be so much faster without OptiX, so much easier.
It would be cool to have it, you know, if you had infinite time.
Yeah, yeah, I know, I agree, I agree.
So, one of the things: as I left AMD and released Zluda, something that became really clear, really quickly, was that there's a lot of interest in machine learning workloads.
Relatively little
interest in sort of professional graphics
workloads. So I've been
focusing on...
So there was a time when I left AMD and had nothing else lined up.
So I was still working
on Zluda, but focusing on the workloads
for which there's no
commercial interest.
I almost got GameWorks... Well, I got it running in one application, but never merged the code. And then things similar to it. Like, there's this suite for 3D photogrammetry, I don't remember what it was called, that has also been requested, but there's no commercial interest in it.
Right. But most people would describe NVIDIA at this point not as a GPU company, they're an AI company. I like to call them a machine learning shovel company.
During a gold rush, don't dig for gold, sell the shovels.
That's what NVIDIA does.
Yeah, yeah, yeah, yeah.
That's true.
If you look at the machine learning market, the machine learning hardware market, there are two markets: there's the NVIDIA market and the non-NVIDIA market. And, you know, a lot of people want Zluda for machine learning because they want to take part of the NVIDIA market, not by means of hardware but by means of software, because the NVIDIA market is much bigger than the non-NVIDIA market.
Yeah, and that's just not going to change in the real world.
Like, no matter what AMD...
I mean, AMD is making an effort, and AMD is, I think, the closest, because, as I said, they made the right strategic call to make their APIs close to NVIDIA's APIs. Maybe the closest. The problem is that their execution is just not good. And that's purely talking about software; there are also some hardware choices, some design choices they made in the hardware, that make porting from CUDA to HIP more difficult than it should be.
Is there anything you want to get into there, or...?
Uh...
Yeah, so...
And it's no secret.
Do we have a drawing board? Because I need to draw it.
In Jitsi.
I don't know if Jitsi has one, actually.
I don't use it.
We had one.
Yeah, there's a whiteboard.
OK.
Excellent.
Yeah, beautiful.
OK.
So the core difference between a CPU and a GPU is that if you have a hardware GPU thread, it's going to operate on a vector of data. What that means is that you have your CPU, and the CPU...
How do I draw?
Okay, that's what I meant. Give me a second.
Okay, this one.
So,
okay, you have CPU.
Okay.
CPU operates on a
single element at a time.
Uh-huh.
So you add
two plus two,
you get four.
Single element at a time.
Right, right.
GPUs operate on a fixed vector of elements. So in this example, this is a GPU where the instruction set has a width of 4.
So maybe there's 1, 2, 3, 4, and another, I don't know.
Where's the eraser?
There's no eraser on this thing.
I think I have to, well, whatever.
There's 1 plus six, seven.
Yeah.
And the result,
we'll leave it as an exercise
for our readers, watchers,
listeners.
But it operates on a vector at a time.
This is a key difference. This is why CPU code is not going to be efficient when just taken and run on a GPU, because you're going to use only one element of the vector. And that's why only some workloads are efficient on a GPU: you need to be able to vectorize your things. In this example you have a vector width of four, and this is sort of the most basic parameter of your GPU. If you look at all the NVIDIA GPUs in history that are programmable with CUDA, and probably every NVIDIA GPU in the future, the width of the vector, the number of elements you operate on at a time, is 32.
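As a rough illustration of that lock-step vector execution, here is a toy CUDA example of my own (not from the episode): each of the 32 threads in a warp carries one lane of the vector, so scalar-looking source is really issuing warp-wide operations.

```cpp
#include <cstdio>

// Toy illustration of the "vector of 32" point: each CUDA thread is one
// lane of a warp, so this scalar-looking statement executes as 32-wide
// (warp-sized) vector operations on the hardware.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's lane/element
    if (i < n) {
        c[i] = a[i] + b[i];  // 32 of these complete together per warp
    }
}

int main() {
    const int n = 1024;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 2.0f; b[i] = 2.0f; }

    // 256 threads per block = 8 warps of 32 lanes each.
    vector_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    std::printf("c[0] = %f\n", c[0]);  // 4.0, computed 32 lanes at a time
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```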
And AMD, for some reason that is impossible to explain.
Okay, I'll explain in a second.
But for reasons that are...
Maybe I shouldn't explain.
It's going to sound silly.
But this is an explanation I received inside AMD.
So AMD really has two architectures. One architecture that is not meant for compute, that is meant for gaming. It has warp size 32, which makes it easy to port compute code to. And their compute architecture is difficult to port compute code to.
So are we done with the whiteboard, by the way?
Come again?
Are we done with the whiteboard?
Do we need that on the screen?
Yeah, yeah, yeah, yeah.
Just I don't know how to close it either.
Uh...
Unpin?
What?
Now the white...
Yeah, okay.
You suffer with the whiteboard, I'll continue my explanation.
Whatever, we've got a few minutes left of the show anyway.
It's going to just be a broken layout.
It's fine.
Okay.
Yeah, yeah.
So, on AMD compute GPUs you have a width of 64. And it's relatively difficult... well, it's not impossible, but it's extra difficulty to port CUDA code written for vector size 32 to AMD GPUs with vector size 64.
And if every AMD GPU had vector size, what's called warp size, 32, it would be much easier to port. Much easier. But what I learned at AMD is that there's a guy who has a spreadsheet, and according to his spreadsheet warp size 64 is better for certain workloads than warp size 32. And that's why AMD spends millions of dollars porting from warp size 32 to warp size 64.
Okay.
Otherwise you would just replace the names and it should work.
Mm-hmm.
If it's 32 to 32.
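To show the kind of code he's describing, here is a toy CUDA example of my own (not from Zluda): warp-level code bakes the number 32 and 32-bit lane masks into its types, which is exactly what makes a straight rename to 64-wide hardware painful.

```cpp
#include <cstdio>

// Toy example of why warp size leaks into CUDA source: the 32-bit mask
// type and the loop bounds both hard-code 32 lanes. On wave64 hardware
// a straight rename isn't enough: masks need 64 bits and the reduction
// tree needs an extra step.
__global__ void count_positive(const float *data, int *result) {
    unsigned mask = 0xffffffffu;                      // full warp = exactly 32 bits
    float v = data[threadIdx.x];

    unsigned votes = __ballot_sync(mask, v > 0.0f);   // one bit per lane
    int count = __popc(votes);                        // popcount over a 32-bit mask

    // Tree reduction over log2(32) = 5 steps; wave64 would need 6.
    float sum = v;
    for (int offset = 16; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(mask, sum, offset);
    }

    if (threadIdx.x == 0) {
        *result = count;
        printf("positive lanes: %d, warp sum: %f\n", count, sum);
    }
}

int main() {
    float *data;
    int *result;
    cudaMallocManaged(&data, 32 * sizeof(float));
    cudaMallocManaged(&result, sizeof(int));
    for (int i = 0; i < 32; ++i) data[i] = (i % 2 == 0) ? 1.0f : -1.0f;

    count_positive<<<1, 32>>>(data, result);          // exactly one 32-lane warp
    cudaDeviceSynchronize();

    cudaFree(data);
    cudaFree(result);
    return 0;
}
```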
You know, I don't think I ever had a
visual demo in the middle of a show before. That's
certainly a fir- Oh wait, no. No, I did have a game dev on who did one once.
This is the second time. I've not had a whiteboard though. A whiteboard's a new one
Yeah, but I think it's nice
It's a nice change, I like it
I've done like 200, 250 of these or so and finally there's something new and weird
So, by the way, previously Zluda had two modes to run in: the warp size 32 that CUDA expects, and the hardware warp size 64. But it was a time-consuming feature, because it sort of applies to every layer of Zluda, both the compiler and the host code. So we're leaving it out.
And AMD
announced that they're merging
those two architectures into one.
And I mean, I don't have any special knowledge, but I expect it's going to be warp size 32, to be similar to NVIDIA.
Otherwise, it would be self-sabotage, not using Warp32.
Even if it's not efficient in its hardware sense,
it's much more efficient when porting the software.
And porting the software is the bottleneck.
Right.
Even if it's faster, if no one's writing software...
Yes, yes, yes.
Sadly, we live in the sort of CUDA-shaped world when it comes to GPU computing.
That's the objective reality.
Well, on that note, I guess we could start wrapping things up.
So, let people know where they can find the project.
I know you've got a Discord server linked here as well.
Yeah, we do.
How active is that?
I haven't gone into it myself.
I mean, it's fairly active.
Can you put the link to GitHub and Discord in the description?
Yeah, I can do all that.
There's no nice and easy link to it.
Please do.
Okay, so we have a Discord.
So I was worried at the beginning
that there's either going to be,
I don't know, three people
and it's going to be totally empty
or it's going to be 3,000
and I will be spending time
moderating the Discord.
But there's, I think, like 100 or 200 people.
It has a healthy level of activity.
It's nice.
And I encourage you to join, unless you're one of those people from the comment sections who is going to write really ugly things about NVIDIA or AMD, or about Intel in this case. Please do not come. But if you're a normal person, please do. Like, when I have something working in Zluda, I'm going to share it first on Discord. And later, when I have more things batched together, then I'm going to write a news post. So if you want to be a bit closer to the development as it happens, or you have some questions, then please join the Discord.
So if somebody wants to get involved with the project,
head over there and head over to the GitHub and just go for it?
Yeah, if you want to get involved, look at GitHub, join Discord.
It's probably...
So we are not getting so many external contributors. I think one of the reasons is that the project is sort of not mainstream programming. If it's something web-related, you're going to have a much bigger pool of developers who can contribute; GPU development is much more niche. So it's not like we are overwhelmed by contributors.
If you want to add something, then
you will be a really special person in this project.
Oh, there's 12 contributors listed.
Yeah, yeah, you will be a very special person.
Yeah, and even if you cannot contribute code, then if you can contribute, I don't know, changes to the documentation, or documentation itself, that's very welcome. I'm not a native English speaker, so you might have an advantage over me there, and other ways to improve this project.
It is really normal for a project to have, you know,
a big drop between their core contributor and then the second top contributor.
I have not seen a distribution like this before,
where you're like 400,000 total lines of code added and removed.
The next person is two.
You will be a very special person if you write a serious patch for this project.
Yeah, yeah. If you contribute 20 lines of code, you can probably be the second biggest contributor.
Yeah. So is there anything else you want to direct people to, or is that pretty much it?
I think that's it.
Awesome.
Oh, yeah, one more thing: watch the next episode and the previous episode of this podcast.
Yes, do that. I'll do my outro and then we can sign off.
Okay.
So my main channel is Brodie Robertson. I do Linux videos there six-ish days a week. I did a video on Zluda coming back, so if you've not seen that one yet, go check it out. It'll be like three weeks old by the time this video comes out, so if you haven't heard about it coming back yet, go watch that video: I go over the blog post, go over the history of the project, all that fun stuff. If you want to see my gaming stuff, that is over on Brodie On Games; I stream there twice a week. I also have a reaction channel where clips from that stream go up, so if you want to watch just clips of the stream, do that as well. And if you listen to the audio version of this, you can find the video version on YouTube at Tech Over Tea. I will give you the final word. What do you want to say? How do you want to sign us off?
Well, I finished my tea. It's time for another tea.
That's a good plan, actually. That's it.