Tech Over Tea - Developer Of ZLUDA: CUDA For Non Nvidia GPUs | Andrzej Janik
Episode Date: November 15, 2024CUDA is one of the primary reasons people buy NVIDIA GPUs but what if there was a way to have this compute power on AMD and Intel GPUs as well. Well there is a project to make that happen and it's called ZLUDA. ==========Support The Channel========== ► Patreon: https://www.patreon.com/brodierobertson ► Paypal: https://www.paypal.me/BrodieRobertsonVideo ► Amazon USA: https://amzn.to/3d5gykF ► Other Methods: https://cointr.ee/brodierobertson ==========Guest Links========== YouTube: https://www.youtube.com/c/ericparker Twitter: https://x.com/atEricParker ==========Support The Show========== ► Patreon: https://www.patreon.com/brodierobertson ► Paypal: https://www.paypal.me/BrodieRobertsonVideo ► Amazon USA: https://amzn.to/3d5gykF ► Other Methods: https://cointr.ee/brodierobertson =========Video Platforms========== 🎥 YouTube: https://www.youtube.com/channel/UCBq5p-xOla8xhnrbhu8AIAg =========Audio Release========= 🎵 RSS: https://anchor.fm/s/149fd51c/podcast/rss 🎵 Apple Podcast:https://podcasts.apple.com/us/podcast/tech-over-tea/id1501727953 🎵 Spotify: https://open.spotify.com/show/3IfFpfzlLo7OPsEnl4gbdM 🎵 Google Podcast: https://www.google.com/podcasts?feed=aHR0cHM6Ly9hbmNob3IuZm0vcy8xNDlmZDUxYy9wb2RjYXN0L3Jzcw== 🎵 Anchor: https://anchor.fm/tech-over-tea ==========Social Media========== 🎤 Discord:https://discord.gg/PkMRVn9 🐦 Twitter: https://twitter.com/TechOverTeaShow 📷 Instagram: https://www.instagram.com/techovertea/ 🌐 Mastodon:https://mastodon.social/web/accounts/1093345 ==========Credits========== 🎨 Channel Art: All my art was created by Supercozman https://twitter.com/Supercozman https://www.instagram.com/supercozman_draws/ DISCLOSURE: Wherever possible I use referral links, which means if you click one of the links in this video or description and make a purchase we may receive a small commission or other compensation.
Transcript
Good morning, good day, and good evening. I am, as usual, your host Brodie Robertson, and today
we have the developer of a project that...
I don't know, I'm sure a lot of you have probably heard of it, but maybe haven't had a chance to use it just yet.
Is it pronounced Zaluda? Because that's how I've been saying it.
It's a very good pronunciation. It's not 100% correct, but I don't think I can give Polish lessons to the whole world, so let's stay with Zluda.
You can certainly try. How would you say it if you were trying to say it correctly?
Well, it's written incorrectly, but you would pronounce it like the Polish word, "złuda."
Ah, okay. I was never going to get to that from the way it's written.
Yeah.
So how about you just introduce yourself and we can go from there.
Yeah.
So I'm Andrzej Janik.
It's nice being here.
I'm a software engineer.
I wrote Zluda.
I'm still working on Zluda and hopefully I'll be still working on Zluda in the near and far future.
Hello.
Hello.
And I have my tea because I heard it's a requirement for this podcast.
I didn't even mention it.
It's in the name.
Yeah, yeah, yeah.
It works.
Hey, look, if you actually bring it, you are better than most people, and better than me most of the time as well.
So I'm going to try not to look really shiny right now, because I'm in Australia and it is, oh, 9:30 p.m. (what time is it? yes, it's 9:30 p.m.) and it's still 29 degrees Celsius. So...
Oh, wow.
Yeah.
If you notice me looking shiny, that's going to be why.
I've got my fan on, but we'll see how it goes.
Hopefully it cools down a bit later.
Anyway, enough of me complaining about the terrible weather here.
I guess we can just start with what the project actually is and why was it initially created?
Oh, sorry. Those are two difficult questions. Yeah, it's a long story. What it is is the easier one, I think. Hopefully I will get to the end of it before you collapse from the heat.
So basically, it started because I thought I will not get in trouble writing this project.
So it went wrong from day one.
So it started when I was working at Intel.
So I was working at Intel on a certain project, which...
Okay, it will remain unnamed.
Let's call it Project X.
Sure.
I worked on Project X with...
Those two guys are important for this story.
Alexander Lyashevsky and Alexey Titov.
They are the grandfathers of this project. So, work at Intel was... some details I can provide. It was a certain GPU library for a certain workload, for, obviously, Intel GPUs.
It was one of the first projects actually using Intel's SYCL implementation, called DPC++. So we can think of it as sort of Intel's answer to CUDA. This was a very, very early version of SYCL,
and we had a lot of discussion about how useful it is, how good it is. And in one of those discussions, or arguments, I don't remember who, Alexander or Alexey, told me something that stuck with me: okay, SYCL might be good, might not be good, but what our customers want, they want CUDA. They want CUDA on Intel GPUs. I don't remember what we discussed in detail, but that was really the gist of the discussion. He was right. I had no comeback to this, but the thought stuck with me.
So time passed. Project X was canceled, and I started working as a manager. As a manager, I had almost no opportunities to write code, and it started to become slightly frustrating. If you don't write code, every day you get slightly more divorced from the daily realities of writing code, especially if it's Intel GPU code, where at the time it was pre... what is the public name? Alchemist. It was pre-Alchemist, still in flux, still moving quickly. So I was getting a little bit more frustrated, and then 2020 came in. With COVID, we were all stuck inside our homes, me included.
And at the same time, Intel, when it comes to Intel GPU development, started moving to a different host code library.
So previously, most of the development used OpenCL, but OpenCL is relatively high level.
And if you want to have some kind of extensions to OpenCL, you have to, to some degree, negotiate with other vendors.
so that you don't create something that is useful to you, and another vendor comes up with something useful to them, and these are just very minor differences.
So Intel came up with something called Level Zero.
This was the new premier way to write host code.
And it looked fairly good.
And we were all just starting to use it, starting to learn it.
And I wanted to know more about it.
So in my mind, I had two choices.
Either resurrect Project X, which was canceled.
I was specifically asked not to do this, not to do this in the open source.
And I thought, well, OK, if I do this, I'll get in trouble.
And there's another project I thought would be interesting,
implementing CUDA on top of level zero.
And my thinking at the time was, okay,
what I really wanted to do was learn level 0,
understand how it works, what are the...
what I can do with this that I cannot do with OpenCL.
If it works, well, it's not going to work,
but if it works, it's going to be very cool.
So I started working on it,
and I worked on this through most of the 2020.
And well, that's how Zulda came to be.
Surprisingly, it worked.
Surprisingly, it reached decent performance.
And I released it sometime late 2020.
What would be decent performance in your mind?
Like if you were to take an equivalent Nvidia GPU and Intel GPU,
like what percentage of the performance were you sort of getting at?
Right, so this is a difficult question.
I guess it depends on the workload, yeah.
It's difficult to compare the performance because
you need a workload where you have a native CUDA implementation and an OpenCL implementation that works natively on Intel. So a good target was Geekbench; well, Geekbench was pretty much the only thing that was working. So in my mind, something like 80% of the native OpenCL performance would be really good, surprisingly good.
I don't remember the exact performance. It was something like 90%. So Geekbench is actually a suite of tests. Surprisingly, some tests were even faster than the native versions.
Wow.
Some were slower. But the performance was, I think, around 90% overall, which was much better than I expected.
Well, it's much better than the alternative, which was zero.
Yeah, yeah, yeah. So for anyone who may be unaware of what Zluda actually is, at a high level, can you just explain what the general purpose of the project is?
The general purpose of the project is: you have your software, and your software doesn't work on your hardware, so Zluda makes your software work on your hardware. That is it.
Okay, maybe you can level up on that.
Yeah, no, no, no. So you understand what is good about this project. It specifically targets applications, libraries, plugins, whatever, using CUDA. And the reality is, if you have a GPU, maybe you use your GPU for gaming, which is horrible. Why would you do this? GPUs were meant to multiply matrices. GPUs were meant to calculate chemistry and physics. So if you have a gaming workload, if you have a game, then you're going to use a specific API to do the things that games do. I don't know what they do. Generate polygons?
Yeah, yeah.
So things like DirectX, Vulkan, OpenGL. But if you want to do computations, so physics simulations, chemical simulations, machine learning, whatever, then you have specific APIs for compute, and by far the biggest one is CUDA. And the thing about CUDA is that it's written by NVIDIA; it's written and created for NVIDIA's GPUs.
But it turns out, and this is what Zluda does,
you can implement the runtime environment of CUDA and then run it on some other GPU, from another vendor. You can use your application on another, non-NVIDIA GPU. From the perspective of the application, it's a normal NVIDIA GPU. It's slightly strange, it has a little bit of a strange configuration, some things don't work, but from the perspective of the application, it's the same.
It's similar to how Wine works on Linux. You can run Windows
applications on Linux.
There's something I was going to say... I completely blanked on it. A lower-level explanation: if you have a CUDA application... so those of you who have written CUDA might be thinking, okay, how is this possible? Because if you write CUDA... CUDA has many things, but other than the runtime environment, it's also a programming language. So you write your program in CUDA. It's a dialect of C++, and you can mix GPU code and CPU code. But this is an illusion, because what happens at the end, at the sort of bottom level, is that you have a normal application which calls into nvcuda.dll or libcuda.so. It's a normal application which uses certain functions provided by a certain library. There's no magic. And if you implement this runtime driver library, then you have Zluda. Zluda implements this library.
And this library can be relatively complex, because you need things... So what the runtime library does is it allows you to control your GPU. You can query your system: okay, I have GPU number one with this name, with this much RAM, with this many flops. I have GPU number two or three. And on this GPU I can allocate memory; on this GPU I can execute a computation. The runtime also contains a compiler for the virtual assembly. This is a bit more complex, but you have to provide this as well. And Zluda does it.
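To give a sense of what that host-side surface looks like, here is a minimal sketch against the public CUDA driver API. This is ordinary application code of the kind Zluda has to answer, not Zluda's own source, and error checking is omitted:

```c
#include <cuda.h>
#include <stdio.h>

int main(void) {
    // Initialize the driver and ask the same questions the runtime answers:
    // how many GPUs, what are they called, how much memory do they have.
    cuInit(0);

    int count = 0;
    cuDeviceGetCount(&count);

    for (int i = 0; i < count; i++) {
        CUdevice dev;
        char name[256];
        size_t mem = 0;

        cuDeviceGet(&dev, i);
        cuDeviceGetName(name, sizeof(name), dev);
        cuDeviceTotalMem(&mem, dev);

        printf("GPU %d: %s, %zu bytes of memory\n", i, name, mem);
    }
    return 0;
}
```

A replacement library has to export each of these entry points, plus the allocation, kernel launch, and virtual-assembly compilation paths he just mentioned, and translate them to whatever backend it sits on.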
So obviously, NVIDIA likes CUDA being on NVIDIA GPUs. Were you ever concerned about NVIDIA having any issue with you doing this?
I was initially, but the case has been decided, Oracle versus Google, and I have been contacted by a fair number of organizations.
I mean, initially, including Intel.
So, and pretty much everyone tells me,
they checked with their lawyers and lawyers
think that after Oracle versus Google,
implementing APIs is fine.
You can just do this.
And actually, I saw this when I was at Intel.
So I released at the end of 2020, and the next version early 2021. I don't remember how much later, but I think three months later Oracle versus Google was decided. The judgment was made in the United States Supreme Court, and shortly after I was contacted by Intel, and they told me, hey, Oracle versus Google is decided, we think we can do this.
Yeah, I know about that case. I didn't realize it was only decided that recently.
Yeah, well, three years ago.
Yeah, still, it's still pretty recent. When did the case actually start, if you remember what year it was?
Oh, I don't know, maybe 2018, 2019. Those cases take a long time.
Sure, yeah. But basically you just effectively assume, okay, that case has gone well, now that it's actually done.
Yeah. By the way,
I'm not a lawyer, neither of us are lawyers here, so don't take anything we say...
Why is the camera following me? Don't do that, Jitsi. Why is...
Yeah, assume that nothing we say is legal advice here.
Oh no, now where is it even focusing?
Oh, it's off there.
Is it going to move?
No?
What is it?
I hate, I hate Jitsi.
I don't know.
On my screen, the camera is fixed.
Oh, okay.
I'll, sure.
Okay, I'll just fix it up on here then.
Fine.
Anyway, getting distracted by that. Technology's hard.
Yeah, there's probably a setting to turn that off; I just don't know where it is. So what is it about CUDA that makes it such a driving force in this space? Why does everybody want access to CUDA? Is it just the fact that NVIDIA has that market share now, or is there more to it than just that?
There's much more to it. So if you look at how the current market for server GPUs looks, for GPU compute: you have three vendors, right? NVIDIA, who is competent, and Intel and AMD, who are not. And I'm talking about software, because that's my expertise. I can't really speak much about hardware. Well, other than some embarrassing failures, the hardware looks fine, but the problem is with the software.
So, if you are a developer, you expect a certain level of quality in your software: your compiler will not miscompile your code, your profiler will give you profiling information, your debugger will work. And those things are, for some reason, possible in CUDA land. They're not totally possible in HIP land and DPC++ land, in whatever it's called, Intel land. And that's it.
Things just work on NVIDIA.
So it's not like NVIDIA stack is perfect.
For example, we run into a miscompilation
in the NVIDIA compiler.
So they have their own bugs,
but things work.
You have profiler, it works.
You have compiler, it works.
You have debugger.
So I can specifically talk more about AMD
because at Intel, I was working in pre-Silicon environment.
So there was expectation that things will not work.
So I didn't really use the public version as much.
But on the AMD side, imagine that in your runtime, for example, an operation on images doesn't work, like it silently gives you wrong results. And it's arbitrary, only for some formats. So you have an image with 32-bit floating point pixels: it works. If it's 16-bit floating point pixels, you get wrong results for no good reason. You have 8-bit unsigned integer: it works. 8-bit unsigned works, signed gives wrong results, for no reason. And it's that sort of thing: you just cannot live a life where you have to second-guess everything you do with your compiler. And some tools are just not usable. So if you're writing GPU code, you're not doing this for your deep love of GPUs.
You do it because you want to have performance.
And what you need, you need a profiler.
You need a profiler to tell you, okay, what are my performance problems?
And you're going to use it a lot when writing GPU code.
I think an average GPU programmer has a higher need for one than an average CPU programmer. So, one example: the AMD performance profiler. You profile a workload, and the workload will have, let's say, performance events.
And let's say coarse-grained performance events,
meaning copying data between CPU and GPU
and dispatch of a kernel
and some other things that take time on a GPU.
And in a normal workload,
you will have tens of thousands,
hundreds of thousands,
maybe low million number of those events.
And imagine this: you open an AMD GPU profiler, and, well, it has certain limitations. It can only capture a certain number of performance events. So I want you to guess how many it can capture.
If we're talking about dealing with millions, I would hope that you could handle that much. But if there's a problem here, maybe in the range of a few hundred thousand?
Fifty.
Oh.
Fifty.
Oh, okay.
That's the limit.
At least it was last year when I was using it.
So, you cannot measure performance.
I see.
That sounds like a problem.
It is a problem.
I think they improved it.
You can now capture low000, maybe low 100.
I mean, it's better, but it's still worse.
Or is the magnitude better?
Yeah, but still, if you have NVIDIA profiler,
you can just capture the things.
So I only work with smaller workloads
when it's maybe tens of thousands, and it just works.
So that's more of an AMD problem.
So, I mean, going back to a higher level: if you look at AMD, they have the right strategy and bad execution. Whereas Intel had a bad strategy, but, I think, decent execution of the strategy.
And could
that work?
Let's talk more about Intel.
I want to talk more about
Intel because right now they are poor.
There's no risk that they are going to sue me.
And Intel has the...
So if you look at what AMD is doing,
their compute stack is relatively similar to CUDA.
Meaning the goal here is you have your CUDA code,
existing CUDA code,
and porting to AMD world is,
if it works, it's relatively easy.
What is the AMD?
Yeah, I was going to ask,
what does the AMD stack actually look like?
They're following CUDA, basically.
So the APIs, the runtime, the performance libraries are relatively similar.
The tools might be different,
but the tools are not super important.
They're not part of your program, right?
So imagine you have a function to allocate memory. On the CUDA side, it would be called cudaMalloc, and it takes some arguments. On the AMD side, it's going to be pretty much exactly the same, but it's going to be named hipMalloc. And this is what programmers actually want.
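To make that concrete, here is a hedged sketch of the thin portability shim this naming makes possible. The gpuMalloc macro is invented for the example, but cudaMalloc and hipMalloc are the real entry points being mirrored:

```c
// The same allocation code targeting CUDA or HIP; the calls differ only by prefix.
// Build with -DUSE_HIP (and hipcc) for AMD, or without it (and nvcc) for NVIDIA.
#ifdef USE_HIP
#include <hip/hip_runtime_api.h>
#define gpuMalloc hipMalloc
#define gpuFree   hipFree
#else
#include <cuda_runtime_api.h>
#define gpuMalloc cudaMalloc
#define gpuFree   cudaFree
#endif

#include <stdio.h>

int main(void) {
    float *buf = NULL;
    if (gpuMalloc((void **)&buf, 1024 * sizeof(float)) != 0) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    gpuFree(buf);
    return 0;
}
```

This near one-to-one naming is why porting between CUDA and HIP can go in either direction, which comes up again below.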
Because the world we live in is a world where most of the GPU compute code is already written, and written using CUDA. So this is the objective reality.
Intel rejects this objective reality and lives in their own dream world, where there's no existing GPU code and every programmer wants to write code from scratch. If you look at their DPC++, or SYCL, in some ways it is actually better than CUDA if you look purely at the APIs they have. But the problem is the API is very different from CUDA, so every port you do is a one-way ticket to the Intel ghetto. And there's also the social aspect. It's somewhat difficult to trust Intel, who hasn't seen much success in the GPU compute world, enough to port to the Intel platform and just abandon your existing, working CUDA code.
With HIP, you can
always port back HIP to CUDA.
You can do this.
With DPC++, it's
much more difficult.
It's very different.
And another problem with DPC++ is it's very
strongly
C++ based. With CUDA
and HIP, you can have your application written in some other language
and it's going to interoperate
with your CUDA C++ code relatively easily.
So there's a C API, which you can use from any language.
With DPC++, it's not so easy.
Everything is C++.
There's no C API.
So using it from something like Python is a tad difficult. And even if you do the mapping, it's very difficult to have interoperability back and forth between C++ and Python. And we
are talking about C++ and Python, but there are other languages that also want to do some degree of GPU compute.
And they have the same difficulty.
And you're not going to be able to interoperate between all of them.
And you need to have strong C++ support.
So it's already a sort of losing position.
So this strategy is just not good. It has no future.
Unless... unless we port Zluda onto their GPUs.
Hopefully we'll do this soon and rescue them from their own
wrong ideas.
Yeah, when you bring up supporting things like Python: Python is a really popular language when it comes to the research space. You have a lot of people who are already doing a lot of their math computation with Python, and they're going to do a lot of their other stuff with Python if they can. So it just makes sense to make it easy to interact with other systems through that as well, where this Intel system sounds like it's a whole different way of approaching things that adds some extra challenge.
CUDA makes it easier.
Python is actually in a relatively good spot
because, as I understand,
they have some good ways to bind to C++ libraries.
Other languages are not in such a strong position.
Right, sure. That makes sense.
So it's slightly less of a problem from a Python perspective.
Okay, so we kind of went down.
I don't even know how we exactly got to this bit.
But let's shift focus a little bit.
So now that the project has been through its revival into what it is now, what are the current goals of the project? What sort of applications are intended to be supported? What would be a dream to support that maybe right now is a bit outside the scope? And then, what are things that you could support theoretically, but you're just not going to touch?
Right. So the main goal didn't change: total world domination is something we want to do. But more specifically, since our team is relatively small, just me and external contributors, for whom I'm very thankful, there's a limited amount of time, so we need to focus. Right now we are focusing on machine learning workloads, all the PyTorch, TensorFlow, starting with something smaller like llm.c. But we want whatever machine learning workload you have to run smoothly on Zluda.
And we're also making some other choices.
So we are focusing less on Windows, because Windows needs extra support. It's going to work, but it's not going to be as smooth. Well, it wasn't really smooth before, but it's going to be even less smooth. We are going to support fewer GPUs, only focusing on the GPUs that are sort of similar to NVIDIA GPUs, and this is specifically RDNA GPUs: RDNA 1, 2, 3, and future RDNA GPUs.
Yeah, these are the main areas of focus.
What is it about the Windows support
that actually makes it a challenge?
Right, so loading libraries.
It's tricky.
So in the perfect world,
how it would work?
You have your executable.
It doesn't matter what sort of executable.
It doesn't matter if it's Python or, I don't know,
or Blender or something else.
Language is not important
because at the bottom level
you are talking to a library.
You launch your EXE through some kind of Zluda launcher, and every time you load a library,
we check is it a CUDA library?
If it's a CUDA library,
we replace it with
Zluda library. So it's all
transparent and efficient.
And every time
you launch a new
subprocess, we
also insert ourselves into this sub-process
and if the sub-process loads a library,
we also replace it with Zluda library.
It turns out it's not really possible.
So what Zluda launcher settled on is...
And why it's not possible.
Firstly, surprisingly creative ways you can
load CUDA into your process.
So some applications, the main executable, will actually have a dependency on nvcuda.dll, and, well, we can replace it; it's fine. But some other applications, think of things like Python: Python actually doesn't have a dependency on nvcuda.dll. It's just that some Python script is going to load nvcuda.dll using the dynamic loading APIs, LoadLibrary.
So, okay, we can override your LoadLibrary and replace the libraries,
but there's also some applications
where you load DLL,
and this DLL has a strong dependency
on nvcuda.dll, and we cannot detect this.
In the Linux world, it's slightly cleaner, because any time you load a library, the system runtime loader will go through the public API; it will be a call to dlopen. On Windows, it's split. If you're doing it yourself, it's the public API, but the system loader will have its own function, and we don't want to overwrite system functions from kernel32.dll or whatever. It's not a robust way.
So what we settled on: you run your application under the Zluda launcher, and the launcher will inject our own nvcuda, whether you want it or not; you're getting it. That's already not nice, but presumably you want this. And then if you're explicitly loading a library, we explicitly replace it: if you're explicitly loading nvcuda, we're overriding it to explicitly load our own nvcuda. Just this part of implicitly loading Zluda into every process, whether it's necessary or not, is sort of not nice, but it's the most robust way. And you need to have this support for every DLL, and with those DLLs there's some degree of complication, because you also have sort of two kinds of DLLs on Windows.
When it comes to CUDA, you have a loader DLL, which lives in Windows\System32, and you have the real library, which lives in one of those driver paths. On Linux, it's simpler. You don't have this loader DLL, and Linux gives you an officially supported way to inject yourself into a process and all its children through environment variables. So it's less work. And it works.
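The environment-variable mechanism he's describing is the dynamic linker's preload hook, LD_PRELOAD, which child processes inherit. As a hedged illustration of the general technique (not Zluda's actual loader; the libzluda.so path is a placeholder), here is a tiny shared object that intercepts dlopen and redirects requests for the CUDA driver library:

```c
// interpose.c: build with
//   cc -shared -fPIC interpose.c -o libinterpose.so -ldl
// and run an application with
//   LD_PRELOAD=$PWD/libinterpose.so ./your_app
#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>

void *dlopen(const char *filename, int flags) {
    // Find the real dlopen so everything else is forwarded untouched.
    void *(*real_dlopen)(const char *, int) =
        (void *(*)(const char *, int))dlsym(RTLD_NEXT, "dlopen");

    // If the application asks for the CUDA driver library, hand it the
    // replacement instead (the replacement name is illustrative).
    if (filename && strstr(filename, "libcuda.so"))
        return real_dlopen("libzluda.so", flags);

    return real_dlopen(filename, flags);
}
```

Windows has no preload mechanism that covers both explicit LoadLibrary calls and static DLL imports, which is exactly the gap the Zluda launcher has to paper over by injecting itself into every process.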
If you could just suddenly force all the Windows developers
to do things in a certain way,
what would you want them to be doing?
I don't want to force them to do things my way
that is easy to me because they just have different...
It's a different system, different needs.
This is kind of baked into the system, the way the dependencies are resolved. It's more difficult for Zluda, but in some ways it's actually better than on Linux, because on Linux you have a dependency on a function. Say the simplest function there is, malloc, right? On Windows, what is baked into your executable or library is the name of the DLL and the name of the function. On the Linux side, it's just the name of the function; the name of the library is not included in the dependency resolution, and it quite often leads to conflicts.
So, I mean, for my purpose, the Linux way is easier and smoother,
but it has its own problems.
That's understandable. It does sound like, though, that the Windows side just makes things more complex.
For me, yes.
Because what we do is not a normal way of doing things. Alternatively, we could just throw away your official nvcuda.dll and install ourselves into Windows\System32. But I don't want to do this, because it can crash some applications. Maybe we don't support something, and even if you don't have an NVIDIA GPU, it's going to be more robust if you use the official library, because there are some hidden APIs, there's the dark API, we might come to this in our discussion. But if it's not supported on our side, your application is more likely to crash, and it's going to happen with games.
So suddenly you installed Zluda into your system and some of your games start crashing
for no reason.
And it doesn't show that it's the fault of Zluda; it just crashes, and you don't know why. We don't want to do this. We want you to launch your application with the Zluda launcher, so if it crashes, you know it's not something else.
No, that actually makes a lot of sense, because my understanding is there are certain parts that you don't, at least right now, want to bother touching, just because, as you said, there are only so many people working on the project, there's only so much time to work on things. So there are things that just are not going to be included, at least for now.
If you look at the NVIDIA API, if you look at the CUDA API, it's huge.
It's gigantic.
And, you know, it's an 80-20 thing. If you look at applications, they're going to use certain functionality much more than other functionality. Pretty much every application is going to want to allocate memory on a GPU. Every application wants to launch a kernel. Not every application will have multi-GPU support. Not every application will want to do runtime linking of kernels and some other niche functionality. So generally, how we approach things in Zluda: we don't track a CUDA version and add the APIs added in that CUDA version; we look at applications. If an application uses certain APIs, then we implement those APIs. And the next application, what APIs does it use? We implement those APIs. So sometimes I get the question: which version of CUDA is implemented in Zluda? I don't know. It's whatever applications are using; it's a mix. We want applications to work. We don't care about some artificial standard that doesn't exist.
Right. Actually, that's, I think, a really sensible way to approach it, because you could approach it as, I don't know how CUDA versioning works, let's just say version one: start at version one, get a complete, perfect implementation, but then nothing else, no modern applications are supported, because all you're supporting is that basic core functionality, and you're going to take a really long time to get to the point where things are actually working with real-world software.
Yeah, yeah. We care about the application. We want your stuff to work. That's the goal, whatever it takes.
Right, so you're not trying to pass, like, Vulkan conformance tests.
No.
So you bring up that term, dark API. You said that is one of the things you want to talk about. I've never heard that term before. What does that actually mean? And then we can go into what you're actually talking about with the CUDA dark API.
Yeah, it's a pain. So, something you should understand: CUDA is obviously not an open source API, but there's something more to it, which is that your code is always a second-class citizen on the CUDA platform. You may not realize it, and I will explain why. So, first thing: there are really two and a half APIs in CUDA. There are two APIs most CUDA engineers will be aware of: what they call the runtime API and the driver API. They're extremely similar. The runtime API is slightly higher level, the driver API is slightly lower level. We implement the driver API.
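For orientation, here is a hedged sketch of the same allocation done through each of the two public APIs; the driver API, the lower-level one, is what Zluda exposes. Error checking is omitted:

```c
#include <cuda.h>              // driver API, cu* functions
#include <cuda_runtime_api.h>  // runtime API, cuda* functions

int main(void) {
    // Runtime API: context management is implicit.
    float *rt_buf = NULL;
    cudaMalloc((void **)&rt_buf, 1024 * sizeof(float));
    cudaFree(rt_buf);

    // Driver API: you initialize, pick a device, and create the context yourself.
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr drv_buf;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&drv_buf, 1024 * sizeof(float));
    cuMemFree(drv_buf);
    cuCtxDestroy(ctx);
    return 0;
}
```

The runtime API is itself a client of the driver API, which is why implementing the driver layer (plus the hidden pieces discussed next) is enough to catch most applications.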
And there's also one more API hidden from a public view.
So we have an API. Typically, if you have an API, you have the name of a function, the number, type, and name of the parameters, the return value, and some documentation. With the dark API, you have nothing. Every function that is exposed by the dark API is some kind of unique identifier, a GUID, and an index. So you ask your CUDA driver: hey, give me the table of function pointers for this unique key.
And this is used by the CUDA runtime, really for no good reason, I don't know why they do this, and by some first-party libraries. Because there are some things NVIDIA doesn't want you to know, things they don't want you to be able to do. And a classic example is ray tracing.
So you might think, okay, I have my GPU, and I want to know if my GPU is capable of hardware ray tracing. And NVIDIA doesn't expose this information. At least, it doesn't expose it to you. It exposes this information to its own libraries, its own runtime, and it goes through this hidden API. It has no name, it has no documentation, so we call it the dark API. I don't know what its proper name is. Nobody knows.
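Mechanically, the entry point is an exported but undocumented driver call that returns a table of raw function pointers for a 16-byte key. Here is a hedged sketch of what calling it looks like; the key bytes below are placeholders, not a real identifier, and the meaning of the returned pointers is exactly what has to be reverse-engineered:

```c
#include <cuda.h>
#include <stdio.h>

int main(void) {
    cuInit(0);

    // cuGetExportTable is exported by the driver but carries no public
    // documentation: you pass an opaque 16-byte key and get back a table.
    CUuuid key = {{ 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
                    0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f }};  // placeholder key
    const void *table = NULL;

    if (cuGetExportTable(&table, &key) != CUDA_SUCCESS) {
        printf("no export table for this key\n");
        return 1;
    }

    // What the entries mean is undocumented; a reimplementation has to work
    // it out by watching how NVIDIA's own libraries use them.
    const void *const *entries = (const void *const *)table;
    printf("first entry: %p\n", (void *)entries[0]);
    return 0;
}
```

The real keys are just byte patterns observed in the traffic between NVIDIA's first-party libraries and the driver, which is why the only practical way to support them is the workload-by-workload approach he describes.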
And the dark API was a frustrating experience for something like Zluda, because we have to implement it. It actually almost killed Zluda, because the first time, I was enabling a very simple application that just adds two numbers. All it does is add two numbers on the GPU.
So I have this simple application, and I look at the interactions between the application and the CUDA driver, and everything goes as expected. Every function that I wrote in the source code is being called. And then there's one small addition I did not write: there's a call to this cuGetExportTable. And I look at it: it's some kind of unique key I've never seen before, and it gives me a table of pointers.
I don't know what those pointers are. OK, I said, set a breakpoint on each one of them. And it calls one of them with, for some reason, no arguments. I don't know what this function does. I decided that's too much for me. I give up. And I gave up for, I don't remember, two weeks. After two weeks, I thought, well, okay, maybe, well, I'm in the mood for some pain. Let's try it. It's probably not going to work. And I look at this function, I look at what functions it calls, and it's four instructions.
It calls an allocation: it calls malloc to allocate some memory and returns this memory to the driver. Okay, I thought, well, that's not so bad. Actually, it really motivated me to keep going, and it was pure luck, because it was by far the simplest and easiest function in the dark API. All the other ones have been slightly more difficult. And generally, it was the wrong approach to look at what the function does, because most of the time it's not super useful.
Nowadays what I do is look at what the inputs are and what the outputs are, because usually that's your first clue. If a function returns some kind of pointer, and you see across the rest of the application that this pointer is used as a context, then you know: okay, this dark API function is creating a context, with some internal bits being set or unset. And those bits are not really important. We might as well create a normal context in Zluda, because it probably does something, but across the applications we had, there's no observable effect of this function that is different from the public API.
So that's how it goes. And we implemented only those parts of the dark API that are necessary to run your applications, mainly interactions between the high-level runtime API and the driver API. The runtime API doesn't always talk to the driver API through the public APIs; it also talks through the dark API. And then I noticed that the first-party libraries use the dark API for various reasons, which I don't care about.
So it's basically just those internal functions that there's no point in documenting, because you're never actually supposed to call them yourself?
I mean, I think they should document them.
Well, they should.
Because it would be nice to know how I can use everything that I paid for with this GPU, but they decided not to.
There might be some thinking why they do this.
It's obfuscation.
Maybe some of those things
they just don't think are useful.
Maybe some things
they just want to hide.
Well, when it comes to ray tracing,
they definitely want to hide those things
because OptiX also doesn't expose those things.
Whatever the reason for it,
it makes your life a lot more complex.
Yeah, yeah.
It probably also makes the life of CUDA engineers more complex, because some things they do in this dark API, I don't know why they are doing them. For example, there's something they added recently, and my suspicion is they wanted to make my life more difficult, but the compiler optimized out everything they wanted to do in this function. Because what does this function do? Let's call this function foobar. It returns whether the function foobar, so itself, starts at an even or odd byte in memory. And this is completely, totally pointless, because your compiler will pretty much always align your function, not only to an even address, but most likely to your natural size, to 64 bits.
It's just a mystery to me.
And literally two instructions.
Or three.
I don't know. I don't know.
I don't.
I haven't answered for you either.
But my expectation is there was more complex body inside.
They wanted me to implement,
but optimizer optimized everything out.
Who knows what they are thinking.
It's a mystery.
I only care about the applications running not about their creative ideas.
Right, right. So with these dark API functions, the way you're basically approaching them
is effectively like a test suite where you're throwing data into it, you're looking at the
data that comes out of it, and you're hoping that whatever you're doing in between
is getting you the result
that would happen on an actual
NVIDIA GPU with CUDA.
Yeah, most of the time
it's sufficient. Sometimes you have to look
what it does internally.
But it's usually too complex to do this
because
if you have a function,
it's going to call any number of internal functions.
So observable properties is what matters.
If it sets a flag in some kind of object, it tells me nothing.
We're going to do the same thing without
setting this flag.
So you said earlier that
CUDA is a really big API.
How big actually
is it?
I can open my... if you give me a minute, I can open my IDE and tell you.
Too good?
Just give me a second. So, it's not 100% accurate, it might also count some function pointers, but just the driver API is 575 functions exported. We don't implement all of them. Well, currently we implement nothing, but the old Zluda implemented below 100, or maybe slightly above 100, and it was enough to run a whole lot of applications, most of them. Okay, so this is the driver API, and there are also the performance libraries, which have their own APIs, and they also have a lot of functions.
So those hundred are a lot of the main functions that sort of everything is going to need to deal with, as opposed to a lot of these little functions here and there that are important for particular workloads but may not necessarily be something that most applications are using?
Right, if you have... we implement, as I said, we do enablement workload by workload, so application by application.
And we can see that there's a core of operations, functions that everyone is going to use, and then it gets less and less common.
Some of them might be used by nobody, because they were added for completeness: some kind of getters for properties, that sort of stuff.
One thing I didn't really ask earlier is, what does the name of the project actually mean?
Why is that the name? I kind of get the "uda", as in CUDA, but why Zluda?
Yeah, so, okay.
I have to give an excuse for myself.
So, the name of the project,
I literally came up with it the day before release.
Initially, and for the whole of 2020,
and you can say it if you go back enough in history,
it was simply called not CUDA.
And, you know, I'm not a lawyer,
but I'm relatively certain that I cannot release a project name like this.
And the day before release, I said,
well, let's go with something that sounds slightly like CUDA.
There's no such words in English, I think, really.
Let's try Polish.
I'm Polish.
And this sounded nice.
So "złuda" means something like mirage, illusion.
OK.
That's actually, you know, that sounds kind of cool, actually.
There's no hidden second meaning. I thought it was going to be nice: use Polish, have a word that sounds nice. And it's not going to get used to it.
Yeah. But I was aware that it was going to be impossible to pronounce, so it's spelled differently; I simplified it a little bit.
Yeah, as I said, I was never going to pronounce this word correctly, but I don't think anyone was going to. If it was spelled correctly, how would it actually be spelled?
I can write it in the chat.
Oh, yeah. Let's see if I can...
I don't even know what that's...
Yeah, yeah. This L is
spelled differently.
Okay.
I don't even know what that symbol is.
It's like English W, the pronunciation.
Oh.
For anyone who
is just listening,
it's an L with like a slash through it.
Yeah.
Yeah.
I have Polish listeners who will be like,
you're a moron.
I can barely speak English in the best of days.
Don't get me started on that.
Don't get me started on Polish.
That's what I get for being Australian. Yeah. So, hmm. Where do we go from here? Oh, we haven't talked about it being in Rust yet, have we? I don't think so. Yeah, I get this question a lot.
So Rust is perceived as still sort of an exotic alternative to the more mainstream languages in this area, for solving this sort of problem. The mainstream solution would be C or C++. But the thing is, I have known Rust for a long time, for over a decade; I learned Rust before version 1.0.
Yeah, I was gonna say, if you've known Rust over a decade...
Yeah, okay, over a decade. So I've been interested in it since early on. Professionally, I've been writing C# and then F#, but I always had a certain level of interest in lower-level development.
And I always had a suspicion that lower-level development, systems-level programming, is difficult not only because it's maybe less mainstream and tricky, but also because C++ is just not a good language. So the first time I learned about Rust and saw the sort of semantics it has, I realized: wow, that's something I always wanted for system-level programming.
And I would never write Zluda in C++. It's just too much pain, too difficult. And I knew Rust, and I always wanted to do a project in Rust. The things I wrote in Rust before were sort of small projects that were never released, to try some features or try something out with the language. So I think this is the first big mainstream project I did in Rust.
And I'm really happy with the language.
So language is relatively good for writing system level code.
Well, there are some problems with it, but I'm in a very small minority here, in my opinion at least, so maybe don't listen to me: the build system is really anemic. It works well if you have a relatively simple project that is all in Rust, but we have things that are slightly more complex. There's interop between languages, there's codegen, and Cargo is just not good enough for this purpose. So we have our own solution, but it would be better if Cargo had a real build system. I actually prefer CMake. I'm one of the few people, probably the only person, who prefers to have CMake or MSBuild or anything else other than Cargo for building. Cargo does package management, and it's relatively good at package management, but the build system part is just not it.
But
when it comes to language semantics,
the availability
of libraries, it's
good. That's what I wanted.
Why would C++
be painful? What about the language
is a problem?
It doesn't have the features I want in the language. So I want to have enums, discriminated unions. I have professional experience writing F#, and in F# you are going to use discriminated unions a lot. And a lot of really fundamental types in Rust are expressed as enums, F#-style or Rust-style enums, where you have not only the value but also data associated with it: discriminated unions. And so things like an instruction, this is an enum. Statements, directives, pretty much everything in the compiler. And C++ doesn't really have good support for this.
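For anyone who hasn't met discriminated unions: below is a hedged sketch, in C, of the tag-plus-union pattern you end up hand-rolling in C or C++ for something like a compiler's instruction type. The variants are invented for illustration; Rust's enum expresses the same shape directly, and match forces you to handle every variant.

```c
#include <stdint.h>

// A hand-rolled discriminated union for a toy instruction type. In Rust this
// would be one `enum Instruction { ... }`, and the compiler would check that
// every `match` covers all of the variants.
typedef enum { INST_LOAD, INST_STORE, INST_ADD } InstKind;

typedef struct {
    InstKind kind;                                // tag: which variant is active
    union {
        struct { uint32_t dst, addr; } load;
        struct { uint32_t src, addr; } store;
        struct { uint32_t dst, lhs, rhs; } add;
    } as;                                         // payload: data attached to the variant
} Instruction;

static uint32_t destination(const Instruction *i) {
    switch (i->kind) {                            // nothing stops you from forgetting a case
    case INST_LOAD:  return i->as.load.dst;
    case INST_ADD:   return i->as.add.dst;
    case INST_STORE: return UINT32_MAX;           // stores write no register
    }
    return UINT32_MAX;
}

int main(void) {
    Instruction add = { .kind = INST_ADD, .as.add = { .dst = 0, .lhs = 1, .rhs = 2 } };
    return (int)destination(&add);                // returns 0
}
```

That built-in exhaustiveness checking is a big part of why the compiler-heavy parts of a project like this are less painful in Rust.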
Other things, like memory management: once you learn how to deal with the borrow checker, it's much easier to do in Rust than in C++. And obviously not the whole world subscribes to Rust's ideas about memory management and code safety. So if you look at the places where we have to interact with the outside world, like emitting LLVM assembly, the code to emit LLVM assembly is just every second line is unsafe. But it's okay; unsafe in Rust has the same semantics, or even stricter semantics, than C++. So we're still coming out ahead with Rust. That's what I want, basically.
So Rust is basically just,
at least for you, the better tool for the job.
Yeah, for me it's a better C++. And as I said, Rust is not perfect. There are some areas where it's just unusable, like writing GPU-side code. C++, and specifically Clang, has all those little attributes that are useful for me, and things like support for address spaces. These are very niche features, but I want them for writing GPU code, because if you look at the Zluda compiler, certain functions are implemented as calls to full-blown functions in an LLVM module which we link in during compilation. And this LLVM module is written in C++, right? C++ compiled to LLVM. And you cannot really write this sort of code in Rust, because it doesn't have good support for writing GPU code.
Fair enough.
I thought you were going to say more there,
but no, no.
The delay sometimes
throws me off when people are going to stop talking.
Yeah, you're literally on the other end of the world,
so there must be a fair bit of delay.
Yeah, I can make it work, though.
I've made it work so far, so, you know,
it hasn't gone horribly bad.
One thing about the project itself,
I don't think you really talk about this anywhere,
but the choice of license on the project,
and again, neither of us are lawyers,
so don't get on our case about any specific points. But it seems to have two licenses attached to it, the Apache and MIT licenses.
Yes.
Why are there two licenses, and why those licenses in particular?
So I want my software to be used by everyone.
Fair enough.
For any purpose you want.
So this has been sort of... actually, it comes from the Rust community. It's just been a popular solution in the Rust community to dual-license under the MIT and Apache licenses; you pick whichever one you want. And, as we agreed, we are not lawyers, but my understanding is that this is the path to being the most compatible with all other open source and closed source licenses. That's why.
So your goal is basically just making it so people can actually use it, rather than, you know, ensuring some free software perfection about the software, where everyone that uses it has to also be free software, all that sort of stuff.
Yes, yes, yes.
Actually, while we're
down this route,
what is your general stance
when it comes to open source and free software?
Clearly, from this,
you're more in favor of the open source side,
but do you have a position you stand on more generally?
So my thinking is, and this comes from working at relatively big companies: if you want your software to be used, the way to go is to use the MIT or Apache license. If you're licensing under GPL, big companies will not touch it, unless it's something that is extremely critical, like the Linux kernel. Those companies have explicit policies that, hey, if it's MIT or BSD or Apache, then you can use it, link it into your software; if it's GPL, not really. Before release, and this is for external stuff, or rather stuff that is being released from within the company, there's going to be a legal review, and they will check if you're using something that uses GPL. And if it's using GPL, then it's probably not going to be released with GPL. So maybe that's your goal, that corporations are not using your software; then I would even recommend using GPL. That's not my stance or my politics, but this is the reality I observed.
Yeah, the Linux kernel was in a weird position when it came along, because it was there to replace the proprietary Unix systems that were coming up. The BSD world did exist, but it was this weird mix of proprietary BSD, and then 386BSD was there as well. And we could just turn this into me ranting about the early history of Linux, because this is one of the topics I really, really enjoy: why the whole GNU Hurd thing just didn't work, and why it should have been based on 386BSD, but then they didn't end up wanting to do that and chucked away that entire project to wait on this mythical kernel that was never going to come around anyway. And then Linux came along before they even started the project, so no one cared about Hurd after that. Anyway.
Yeah, but just be aware that GPL-licensed stuff has a sort of special position when it comes to licensing in corporations. They're going to use it, they're going to contribute, but it has to be big enough. So they're going to use the Linux kernel, they're going to have their own fork of GDB, which is also GPL-licensed, and stuff like this. But if it's something smaller, then the lawyer who is giving you a review is probably not going to give you a special exception for it. If your goal is to make a library, just don't even bother with a GPL-style license. I mean, I'm not going to tell you how you should live your life.
Sure, sure.
If you want maximum usage, then GPL is probably not the solution.
So what is your background in programming?
I don't just mean your corporate background.
When did you actually start doing programming?
How did you actually get yourself interested in it?
Well, it was during my first programming lesson at university. I didn't program before going to university. So I hope it's not going to be a letdown, but I always used computers; I just wasn't really interested in programming as such.
I went into computer science degree because I was broadly aware that you can have a good
life if you have a computer science degree.
Whatever you do, programming, security, databases, it's going to be fine
Those were different times. That was...
Oh wow. So long ago.
A long time ago. Still, I understood that it was going to be great. And I was actually, how do you say, overwhelmed? Maybe not the best word, but I did not see myself
as a possible programmer, because my thinking was
that programming, extremely difficult.
Every programmer has this sort of galactic brain
and the tools they are using are really high technology
and everyone is excellent.
And I started programming, and what I realized is that all the programming languages are old garbage from 30 years ago.
Programmers,
they cannot program a FizzBuzz.
So,
you know, it doesn't matter if I'm bad
at programming. It was extremely encouraging
because I realized it doesn't matter
if I'm really bad at programming.
Other programmers are even worse.
So my bad code is
not going to
make things worse on average.
So yeah, let's start programming.
And it was interesting.
That's how I became a programmer.
I actually found my passion for programming
when I was studying for computer science degree
In that first
programming class you did, what language
were you actually working with?
My first programming lessons were in C.
That's a sensible language, because I've heard some real weird answers from people before, where it's like Objective-C or, you know, just other random things that don't make any sense.
Right. So I learned a number of those languages; I think I'm still using many of them. So I learned C, C++, Ruby, Python. C# I learned by myself, and this was the first language I used professionally.
normal and sensible languages. I started with Java. Oh yeah, so all completely normal and sensible languages.
I started with Java.
Oh yeah, so when it comes to
Oh yeah, I learned some Java.
We learned Prolog.
But I remember
nothing about Prolog.
I don't remember that Prolog existed.
Yeah, it's fairly
interesting in its niche.
But I wouldn't be able to write anything in Prolog nowadays.
So now, is Zaluda your main focus at this point?
Yes, yes, yes.
And my journey has been, when it comes to strange languages, sort of C#, then F#, then assembly, then Zluda, which is Rust and C++. And in between there are OpenCL and C for Media, which you definitely haven't heard about, and none of your listeners have heard about, but it's a great language. Ironically, it's a great language for writing code for Intel GPUs. It's a dialect of C++ for writing GPU code, and it's really good if you want to do just this. It's mainly used internally at Intel.
Oh, okay, that makes sense then.
But I think they did some public releases.
How do you spell the name, by the way? CM, C for...
No, I think it's C for Metal nowadays, previously called C for Media.
Ah, yep, yep, okay.
Has nothing to do with Apple Metal.
Everyone asks this question, nothing to do.
C for Metal is a programming language that allows for the creation of high-performance compute and media kernels for Intel GPUs using an explicit single instruction, multiple data programming... blah, blah, blah. OpenCL and SYCL applications use an implicit SIMD programming model... blah, blah, blah. Okay, I don't want to read all this.
Yeah. So, by the way, if you have an Intel first-party library that has GPU code, check which language it uses.
If it's written in C for Metal, those people know what they are doing.
If it's written in something else, probably not.
But they're trying to include C for Metal in DPC++ as a different mode of programming. It's now called ESIMD, explicit SIMD.
Ah, okay, okay.
But CM is still more ergonomic if you just want Intel GPU code. ESIMD has the advantage that it interoperates much better with the normal DPC++ code.
So let's actually get back to the project itself. When you
wanted to sit down and actually start working on a project like this,
where do you even start with something like this? You see the API there, you see you have a GPU.
What do you even do to start getting anything, doing anything?
Start with the... So in this case, it was starting with trying to understand what CUDA is, because it was my first CUDA project. I mean, I knew what CUDA is, but... oh, this was also, I didn't mention, the secondary goal of this project was to actually understand how CUDA works, or how you write CUDA code. At this point I had years of experience writing Intel GPU code, but it's always worth it to understand what the competition is doing, how you write CUDA code. So firstly, understand how you even do the simplest thing with CUDA.
And the first goal has been an application that does the simplest thing possible: add two numbers on a GPU. But it still needs to do a fair bit of setup: get a context, load the source module, allocate memory, all that stuff. So you look with a tool like... on Linux you have a tool like ltrace, where you can log all the interactions of your application with certain libraries. So one of the first things to do was understand what sort of host functionality has to be implemented, and think really hard: can it be implemented in terms of an Intel GPU? So, all this host functionality. And don't think big. That's what worked for me. It was just a simple application, because it was a research project; I had no goal of having complex things running. If I had two and two returning four on the GPU, I would be successful. So I have a simple application, I know what the interactions are (obviously that was a lie, because I didn't know about the dark API at the time), and then I look at those calls, and there's like ten of them, and I know, okay, all of them can be expressed in terms of Intel Level Zero. And this is the host-code interaction.
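Those ten or so calls look roughly like the sketch below. This is a hedged reconstruction of a typical minimal driver-API program, not his actual test; the PTX text for the kernel is a placeholder, and error checking is omitted:

```c
#include <cuda.h>

// Minimal "add two numbers on the GPU" host sequence against the driver API.
// The PTX below is a stand-in; a real program would embed the actual kernel text.
static const char kernel_ptx[] = "/* PTX for an 'add' kernel goes here */";

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction add_fn;
    CUdeviceptr result;
    int host_result = 0;

    cuInit(0);                                        // 1. initialize the driver
    cuDeviceGet(&dev, 0);                             // 2. pick a device
    cuCtxCreate(&ctx, 0, dev);                        // 3. create a context
    cuModuleLoadData(&mod, kernel_ptx);               // 4. hand the PTX to the driver's compiler
    cuModuleGetFunction(&add_fn, mod, "add");         // 5. look up the kernel
    cuMemAlloc(&result, sizeof(int));                 // 6. allocate device memory

    void *args[] = { &result };
    cuLaunchKernel(add_fn, 1, 1, 1, 1, 1, 1,          // 7. one block, one thread
                   0, NULL, args, NULL);
    cuCtxSynchronize();                               // 8. wait for the kernel
    cuMemcpyDtoH(&host_result, result, sizeof(int));  // 9. copy the answer back

    cuMemFree(result);                                // 10. clean up
    cuCtxDestroy(ctx);
    return host_result == 4 ? 0 : 1;
}
```

Every one of these calls has to be answered by the replacement library, including cuModuleLoadData, which is where the PTX story he gets to next comes in.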
Since I have experience with writing GPU code, I know there's also the GPU-side code. So the next step was to try to understand what the format of the NVIDIA GPU code is.
Is it some kind of intermediate representation?
Or is it some kind of architecture-specific assembly?
I mean, depending on the level of complexity,
either can work.
And here I was lucky because, as I said,
NVIDIA is competent and they have a virtual assembly called PTX.
And this is a text format.
So I have a text format, and it's somewhat abstract, at least from my perspective. It's a virtual, abstract representation. It's not written for a specific architecture, but after compilation it's going to work with any CUDA GPU.
And I look at the instruction set,
and actually, good thing,
NVIDIA documents the PTX format,
and I look at those instructions,
and all those instructions I'm familiar with, and I know they're going to work on Intel GPUs: things like load memory, store memory, do multiplication, do addition. All of those are going to be supported on an Intel GPU.
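To make that concrete, here is roughly what the GPU side of the "add two numbers" experiment looks like, using the same hypothetical add_two kernel as the host sketch above. The PTX in the comment is an abbreviated, from-memory approximation of the flavour of code nvcc emits, not verbatim compiler output.

```cpp
// The GPU side of the "add two numbers" experiment: a trivial CUDA kernel
// (compile with `nvcc --ptx add.cu` to produce add.ptx for the host sketch
// above). The comment below is an abbreviated, from-memory sketch of the
// flavour of PTX such a kernel lowers to, not exact compiler output.
//
//   .visible .entry add_two(.param .u64 out)
//   {
//       ld.param.u64        %rd1, [out];    // load the output pointer
//       cvta.to.global.u64  %rd2, %rd1;     // move it into the global address space
//       mov.u32             %r1, 2;
//       add.s32             %r1, %r1, %r1;  // 2 + 2
//       st.global.u32       [%rd2], %r1;    // store the result
//       ret;
//   }
//
// Loads, stores, moves, adds: nothing here is conceptually NVIDIA-specific,
// which is why translating it for another vendor's GPU looked feasible.
extern "C" __global__ void add_two(int *out) {
    *out = 2 + 2;
}
```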
And I was still thinking small.
I have my very simple module in text format,
and I have my host code.
So it looks possible.
It looks possible enough that I started implementing, starting with the PTX compiler, because it looked relatively easy. But as I start implementing, there are complications. The way to implement this, sort of classical compiler design, is: you have your text code, and you start by parsing it using a parser generator. In my case it was a Rust library called LALRPOP.
That's the name.
Yeah, it comes from an abbreviation of the kind of grammars it can parse, LALR.
It's not super fun, but the library is good.
No, no, this is a solid library.
And I start parsing those things.
And then once you have things parsed,
you have a sort of in-memory representation of the source code.
You now start writing your transformation passes.
And every transformation pass makes it slightly closer to your target.
So I didn't even think about what my target is going to be.
At first, I thought maybe I can compile it straight to the Intel GPU instruction set.
But relatively early, I decided, no, no, no, it's not going to work.
I'm going to compile it to SPIR-V which is a bit more
abstract, much easier to compile
for, which was the right choice.
So I'm going to compile to SPIR-V
and in every compilation pass I try to make it a little closer to what SPIR-V is expecting.
And it sounds simple, but there's a fair bit of weird behavior in the PTX format.
For example, there are entirely too many address spaces. On a GPU you are generally going to have like four different address spaces: global memory, shared memory, private memory, generic memory, maybe constant memory.
And PTX has like seven or eight different address spaces. Some of them, I mean, if things were designed from the ground up, are really not necessary in the current world, but they exist, and you need to have some way to translate them to the spaces that are available in SPIR-V. Some instructions have no direct replacement in SPIR-V, so you have to replace them with a series of simpler instructions, this sort of thing.
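To give a flavour of the address-space problem, here is a simplified sketch of a mapping from PTX state spaces to the storage classes that OpenCL-style SPIR-V offers. This is my own illustration, not Zluda's actual pass (which is written in Rust and handles far more edge cases); the awkward entries at the bottom are exactly where the simple story breaks down.

```cpp
#include <cstdio>

// Illustrative only: a simplified mapping from PTX state spaces to
// OpenCL-flavoured SPIR-V storage classes, to show why seven or eight
// PTX spaces have to be squeezed into fewer SPIR-V ones.
enum class PtxStateSpace {
    Reg,      // virtual registers
    Sreg,     // special registers (%tid, %ntid, ...)
    Const,    // constant memory
    Global,   // global memory
    Local,    // per-thread private memory
    Param,    // kernel/function parameters
    Shared,   // per-block shared memory
    Generic,  // untyped/generic pointers
};

const char *spirv_storage_class(PtxStateSpace space) {
    switch (space) {
        case PtxStateSpace::Global:  return "CrossWorkgroup";
        case PtxStateSpace::Shared:  return "Workgroup";
        case PtxStateSpace::Local:   return "Function";
        case PtxStateSpace::Const:   return "UniformConstant";
        case PtxStateSpace::Generic: return "Generic";
        // The rest have no single SPIR-V equivalent: registers become
        // SSA values, special registers become built-in variables, and
        // parameters get lowered case by case.
        default:                     return "(lowered case by case)";
    }
}

int main() {
    std::printf(".shared maps to %s\n",
                spirv_storage_class(PtxStateSpace::Shared));
    return 0;
}
```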
So it was almost a year just to have a translation that actually
translated correctly and worked.
So if you are attempting something like this, I don't have any advice
other than start small
and never give up.
So I gave up like three or four times.
One of those times was, as I said previously, the dark API, the first time I encountered it. Another time was, and this is very technical, when I did one pass that was completely unnecessary: translation to SSA form. And I realized that, well, it's completely unnecessary. LLVM and SPIR-V are going to do this for you. You don't have to do this.
What is the SSA format?
Static single assignment. This is the sort of representation that optimizing compilers operate on. But this sort of compiler, the one inside Zluda, is not an optimizing compiler, it's just translation.
Right.
So it's completely unnecessary.
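For anyone who hasn't met the term, here is a toy example of what SSA means. It is my own illustration, nothing from Zluda's code.

```cpp
// Toy example of SSA (static single assignment): the same computation
// written normally and in the "every value is assigned exactly once"
// style that SSA enforces.
int not_ssa(int a, int b) {
    int x = a + b;  // x is assigned here...
    x = x * 2;      // ...and reassigned here, so this is not SSA
    return x + 1;
}

int ssa_style(int a, int b) {
    // In SSA a reassignment becomes a fresh name. LLVM and the SPIR-V
    // tooling build this form themselves, which is why a hand-written
    // SSA conversion pass turned out to be unnecessary.
    int x0 = a + b;
    int x1 = x0 * 2;
    int y0 = x1 + 1;
    return y0;
}

int main() { return not_ssa(1, 2) == ssa_style(1, 2) ? 0 : 1; }
```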
And there are two or three other things I don't remember. But I do remember just giving this project a rest for a week or two, several times, and still coming back to it.
When you said this was your first CUDA project, the first thing I instantly thought of was the Linus Torvalds Linux email: it's just a hobby project, it won't be anything serious like GNU.
Yeah, yeah. As I said, I didn't expect it to be anything big. The main reason I wanted to do this was that I wanted to learn about Level Zero, and mainly about Level Zero because it was new for all of us at Intel at the time. So it was, at least initially, just basically a hobby project to learn how this works.
Yeah.
But it sort of came along
to some extent
and became something
a lot more than that.
Yeah. Even when I was joining AMD to work on this project full time for the first time, we didn't know if it was going to work. So we sort of expected that either it's going to work and it's going to be great, or, just as likely, it's not going to work and we're going to run into some kind of impossible roadblock. And in that case we were going to release the updated Zluda for AMD that doesn't work, for some reason, but with an explanation of why it doesn't work, so anyone can read the source and learn from it.
I didn't expect, nobody expected, it to work, but it does. Performance is actually not bad.
Yeah, I didn't bring up the AMD stuff earlier
because I wasn't sure what you can and can't say
there, because I'm sure there's...
At least judging by what you have publicly said,
there certainly were issues that have
gone on.
Yeah, so I can only repeat what happened.
So I worked as a contractor for AMD for two years.
Towards the end, I released the source code.
AMD decided that they shouldn't have allowed me to do so. It's not legally binding. We rolled back the code to the pre-AMD version and we are starting again.
Yeah, that's...
I'm sure there's a lot more there that you would like to say
if there weren't issues there.
Maybe one day, but...
Yeah, I certainly won't push you on that, because I'm sure there are some legal issues there that you don't want to... You want to avoid getting sued by NVIDIA, you don't want to get sued by AMD.
So, what was interesting about the situation is something I learned: there are so many internet lawyers who want to prosecute the case in front of internet judges. So one reason, the main reason, why I'm trying to dodge this topic is that it's going to bring internet lawyers into the comment section.
I see, I see.
Sorry, it has already been handled by actual lawyers.
Yeah.
Another thing is that
there's
a certain demographic of people who
blame NVIDIA for
everything, even if
NVIDIA has nothing to do with things,
at least to my understanding
Yeah, I've definitely seen a lot of those internet lawyer comments myself. You go on Reddit, you'll see plenty of them. Even just in the GitHub discussion you had, there were people arguing things like: what was the position of the person who told you to take the thing down, are they at the correct level of the chain to do that? Even if you were in the right, I can't imagine you're in a position to go to war against AMD, or even want to do that.
No, no, no.
AMD is a billion-dollar company. I'm not going to fight them. They have lawyers, they have a PR department. It's not helping anyone.
Yeah, it's much better to just comply with whatever they reasonably want you to do, and then just go from there.
Yeah, focus on the technology, focus on the nice things we can build, all the applications that we can have running on your GPU.
So, what is the current state of
things?
So, the current state of things is: very little works. Currently, and this is the situation as of today, I don't think even the whole project builds, because it's being rewritten from the ground up, starting from the most basic things, like even the parsing of the PTX. So we have a new parser, and the way this new parser works has an effect on the rest of the compilation. So I rewrote the parser, rewrote the compilation passes to be nicer, simpler, and recently finished writing the code responsible for emitting LLVM bitcode.
So there's a number of unit tests inside Zluda.
And the unit tests are unit tests for the compiler. So the compiler takes a simple handwritten PTX module, compiles it, and we check if the result is as expected. And there are like 90 of them. So those compiler tests work, but we don't have the host code working. And that's the next step: have the host code working, have some other tooling around the project working. And once we have the host code and some of the necessary tools,
then we'll start focusing on the specific workloads.
The first one is going to be probably Geekbench
because it's relatively simple
and then work on those machine learning workloads.
So llm.c. llm.c is probably going to be the first machine learning code that works, with some caveats: flash attention support will have to wait a little bit. But yeah, llm.c is the first goal, the first milestone.
So now that you've done a lot of this work in the past and you're doing a rewrite, what are the other things you've learned from that experience that are maybe going to make things easier now? Because, as you said, when you started it, it was your first attempt at doing this. I assume there's a lot you've learned from doing it at least once.
Right.
So one big lesson, don't trust AMD things to work.
It might sound funny, but it was an extremely stressful situation.
One of the first workloads I did when I contracted for AMD
was a closed source application.
I don't remember which one, either 3DF Zephyr or RealityCapture.
And so I enable everything that is required for this application, both in the host code and in the compiler. But for some reason it doesn't work.
And since it's a closed source application, it's relatively tricky to debug, but eventually I found the kernel that is producing incorrect data. So I extract this kernel and the data it operates on, and I run it both on the CUDA stack and on Zluda. And it works correctly on both, every time. But for some reason the same kernel fails when run inside the application.
And at the same time, I'm starting to panic because it's, well, impossible situation,
impossible to debug. It should work. It just defies the logic and the laws of physics.
And I tried it both ways. So I run the whole workload, because I thought, well, maybe an earlier kernel under Zluda is computing things wrong. So I run it with both the data sourced from CUDA and the data sourced from Zluda: Zluda to CUDA, CUDA to Zluda, and it all works well, gives the right result.
And purely by luck, purely by luck, I realized that there is a bug in the AMD host code: for certain memory copies, it copies them incorrectly. That was my first revelation. Do not trust host code, even if it's a simple mapping, even if it looks simple. Do not trust that AMD host code is going to work correctly.
Always double check.
So, you know, it sounds pessimistic. There are not so many cases, but generally, if it's something that has to do with textures, then double check if you're using HIP. So that's what I learned. There are some other lessons that are very Zluda specific.
And it was actually somewhat frustrating, starting again from this point of two years ago, because I look at my code from before the rollback, and I look at a compilation pass and I think, well, that's wrong. There are better ways to do this. But I have to fix all that other code before I arrive at this. So you can see where the goal is, but there's just so much to get there. I remember I got rid of this part because there are better ways to do it, and now I have to live with it again.
But hopefully the second time around,
you can get there quicker.
Like now that you know what you're doing.
Yeah, yeah, it's much quicker right now.
Because, firstly, the workloads are different. Oh, and one thing I did not mention that we're getting rid of, sadly, and I have mixed feelings about this: we're not going to do ray tracing. Zluda has had an implementation of OptiX. OptiX is NVIDIA's framework for ray tracing, and it's very complex. During my contract work at AMD, I think OptiX took, I don't know, almost a year. It was very complex. Not only OptiX itself, but also the applications using OptiX. So the goal has been Arnold. It's a closed source, very complex rendering solution. Debugging was a lot of effort.
So it's going to be so much faster without OptiX, so much easier.
It would be cool to have it, you know, if you had infinite time.
Yeah, yeah, I know, I agree, I agree.
So, one of the things: as I left AMD and released Zluda, something that became really clear, really quickly, was that there's a lot of interest in machine learning workloads.
Relatively little
interest in sort of professional graphics
workloads. So I've been
focusing on...
So there was a time when I left AMD and had nothing else lined up.
So I was still working
on Zluda, but focusing on the workloads
for which there's no
commercial interest.
I almost got GameWorks... Well, I got it running in one application, but never merged the code. And then things similar to it. Like, there's this suite for 3D photogrammetry, I don't remember what it was called, that has also been requested, but there's no commercial interest in it.
Right. But most people would describe NVIDIA at this point not as a GPU company, they're an AI company. I like to call them a machine learning shovel company.
During a gold rush, don't dig for gold, sell the shovels.
That's what NVIDIA does.
Yeah, yeah, yeah, yeah.
That's true.
If you look at the machine learning market, the machine learning hardware market, there are two markets: there's the NVIDIA market and the non-NVIDIA market. And, you know, a lot of people want Zluda for machine learning because they want to take part of the NVIDIA market, not by means of hardware but by means of software, because the NVIDIA market is much bigger than the non-NVIDIA market.
Yeah, and that's just not going to change in the real world.
Like, no matter what AMD...
I mean, AMD is making an effort, and AMD is, I think, the closest, because, as I said, they made the right strategic call to make their APIs close to NVIDIA's APIs. Maybe the closest. The problem is that their execution is just not good. And that's purely talking about software; there are also some hardware choices, some design choices they made in the hardware, that make porting from CUDA to HIP more difficult than it should be.
Is there anything you want to get into there, or...?
Uh...
Yeah, so...
And it's no secret.
Do we have a drawing board? Because I need to draw it.
In Jitsi.
I don't know if Jitsi has one, actually.
I don't use it.
We had one.
Yeah, there's a whiteboard.
OK.
Excellent.
Yeah, beautiful.
OK.
So the core difference between a CPU and a GPU is that if you have a hardware GPU thread, it's going to operate on a vector of data. What that means is that you have your CPU, and the CPU...
How do I draw?
Okay, that's what I meant. Give me a second.
Okay, this one.
So,
okay, you have CPU.
Okay.
CPU operates on a
single element at a time.
Uh-huh.
So you add
two plus two,
you get four.
Single element at a time.
Right, right.
GPUs operate on a fixed vector of elements. So in this example, this is a GPU where the instruction set has a width of 4.
So maybe there's 1, 2, 3, 4, and another, I don't know.
Where's the eraser?
There's no eraser on this thing.
I think I have to, well, whatever.
There's 1 plus six, seven.
Yeah.
And the result,
we'll leave it as an exercise
for our readers, watchers,
listeners.
But it operates on a vector at a time.
This is a key difference. This is why CPU code is not going to be efficient when just taken and run on a GPU, because you're going to use only one element of the vector. And that's why only some workloads are efficient on a GPU: you need to be able to vectorize your things. In this example you have a vector width of four, and this is sort of the most basic parameter of your GPU. If you look at all the NVIDIA GPUs in history that are programmable with CUDA, and probably every NVIDIA GPU in the future, the width of the vector, the number of elements you operate on at a time, is 32.
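As a rough illustration of that lock-step vector execution, here is a toy CUDA example of my own (not from the episode): each of the 32 threads in a warp carries one lane of the vector, so scalar-looking source is really issuing warp-wide operations.

```cpp
#include <cstdio>

// Toy illustration of the "vector of 32" point: each CUDA thread is one
// lane of a warp, so this scalar-looking statement executes as 32-wide
// (warp-sized) vector operations on the hardware.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's lane/element
    if (i < n) {
        c[i] = a[i] + b[i];  // 32 of these complete together per warp
    }
}

int main() {
    const int n = 1024;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 2.0f; b[i] = 2.0f; }

    // 256 threads per block = 8 warps of 32 lanes each.
    vector_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    std::printf("c[0] = %f\n", c[0]);  // 4.0, computed 32 lanes at a time
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```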
And AMD, for some reason that is impossible to explain.
Okay, I'll explain in a second.
But for reasons that are...
Maybe I shouldn't explain.
It's going to sound silly.
But this is an explanation I received inside AMD.
So AMD really has two architectures. One architecture that is not meant for compute, that is meant for gaming. It has warp size 32, which makes it easy to port compute code to. And their compute architecture is difficult to port compute code to.
So are we done with the whiteboard, by the way?
Come again?
Are we done with the whiteboard?
Do we need that on the screen?
Yeah, yeah, yeah, yeah.
Just I don't know how to close it either.
Uh...
Unpin?
What?
Now the white...
Yeah, okay.
You suffer with the whiteboard, I'll continue my explanation.
Whatever, we've got a few minutes left of the show anyway.
It's going to just be a broken layout.
It's fine.
Okay.
Yeah, yeah.
So, on AMD compute GPUs you have a width of 64. And it's relatively difficult... well, it's not impossible, but it's extra difficulty to port CUDA code written for vector size 32 to AMD GPUs with vector size 64.
And if every AMD GPU had vector size, what's called warp size, 32, it would be much easier to port. Much easier. But what I learned at AMD is that there's a guy who has a spreadsheet, and according to his spreadsheet warp size 64 is better for certain workloads than warp size 32. And that's why AMD spends millions of dollars porting from warp size 32 to warp size 64.
Okay.
Otherwise you would just replace the names and it should work.
Mm-hmm.
If it's 32 to 32.
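To show the kind of code he's describing, here is a toy CUDA example of my own (not from Zluda): warp-level code bakes the number 32 and 32-bit lane masks into its types, which is exactly what makes a straight rename to 64-wide hardware painful.

```cpp
#include <cstdio>

// Toy example of why warp size leaks into CUDA source: the 32-bit mask
// type and the loop bounds both hard-code 32 lanes. On wave64 hardware
// a straight rename isn't enough: masks need 64 bits and the reduction
// tree needs an extra step.
__global__ void count_positive(const float *data, int *result) {
    unsigned mask = 0xffffffffu;                      // full warp = exactly 32 bits
    float v = data[threadIdx.x];

    unsigned votes = __ballot_sync(mask, v > 0.0f);   // one bit per lane
    int count = __popc(votes);                        // popcount over a 32-bit mask

    // Tree reduction over log2(32) = 5 steps; wave64 would need 6.
    float sum = v;
    for (int offset = 16; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(mask, sum, offset);
    }

    if (threadIdx.x == 0) {
        *result = count;
        printf("positive lanes: %d, warp sum: %f\n", count, sum);
    }
}

int main() {
    float *data;
    int *result;
    cudaMallocManaged(&data, 32 * sizeof(float));
    cudaMallocManaged(&result, sizeof(int));
    for (int i = 0; i < 32; ++i) data[i] = (i % 2 == 0) ? 1.0f : -1.0f;

    count_positive<<<1, 32>>>(data, result);          // exactly one 32-lane warp
    cudaDeviceSynchronize();

    cudaFree(data);
    cudaFree(result);
    return 0;
}
```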
You know, I don't think I ever had a
visual demo in the middle of a show before. That's
certainly a fir- Oh wait, no. No, I did have a game dev on who did one once.
This is the second time. I've not had a whiteboard though. A whiteboard's a new one
Yeah, but I think it's nice
It's a nice change, I like it
I've done like 200, 250 of these or so and finally there's something new and weird
So, by the way, previously Zluda had two modes to run in: the warp size 32 that CUDA expects, and the hardware warp size 64. But it was a time-consuming feature, because it sort of applies to every layer of Zluda, both the compiler and the host code. So we're leaving it out.
And AMD
announced that they're merging
those two architectures into one.
And I mean, I don't have any special knowledge, but I expect it's going to be warp size 32, to be similar to NVIDIA.
Otherwise, it would be self-sabotage, not using Warp32.
Even if it's not efficient in its hardware sense,
it's much more efficient when porting the software.
And porting the software is the bottleneck.
Right.
Even if it's faster, if no one's writing software...
Yes, yes, yes.
Sadly, we live in the sort of CUDA-shaped world when it comes to GPU computing.
That's the objective reality.
Well, on that note, I guess we could start wrapping things up.
So, let people know where they can find the project.
I know you've got a Discord server linked here as well.
Yeah, we do.
How active is that?
I haven't gone into it myself.
I mean, it's fairly active.
Can you put the link to GitHub and Discord in the description?
Yeah, I can do all that.
There's no nice and easy link to it.
Please do.
Okay, so we have a Discord.
So I was worried at the beginning
that there's either going to be,
I don't know, three people
and it's going to be totally empty
or it's going to be 3,000
and I will be spending time
moderating the Discord.
But there's, I think, like 100 or 200 people.
It has a healthy level of activity.
It's nice.
And I encourage you to join, unless you're one of those people from the comment sections who is going to write really ugly things about NVIDIA or AMD, or about Intel in this case. Please do not come. But if you're a normal person, please do. Like, when I have something working in Zluda, I'm going to share it first on Discord. And later, when I have more things batched together, then I'm going to write a news post. So if you want to be a bit closer to the development as it happens, or you have some questions, then please join the Discord.
So if somebody wants to get involved with the project,
head over there and head over to the GitHub and just go for it?
Yeah, if you want to get involved, look at GitHub, join Discord.
It's probably...
So we are not getting so many external contributors. I think one of the reasons is that the project is sort of not mainstream programming. If it's something web-related, you're going to have a much bigger pool of developers who can contribute; GPU development is much more niche. So it's not like we are overwhelmed by contributors.
If you want to add something, then
you will be a really special person in this project.
Oh, there's 12 contributors listed.
Yeah, yeah, you will be a very special person.
Yeah, and even if you cannot contribute code, then if you can contribute, I don't know, changes to the documentation, or documentation itself, that's very welcome. I'm not a native English speaker, so you might have an advantage over me there, and other ways to improve this project.
It is really normal for a project to have, you know,
a big drop between their core contributor and then the second top contributor.
I have not seen a distribution like this before,
where you're like 400,000 total lines of code added and removed.
The next person is two.
You will be a very special person if you write a serious patch for this project.
Yeah, yeah. If you contribute 20 lines of code, you can probably be the second biggest contributor.
Yeah. So is there anything else you want to direct people to, or is that pretty much it?
I think that's it.
Awesome.
Oh, yeah, one more thing: watch the next episode and the previous episode of this podcast.
Yes, do that. I'll do my outro and then we can sign off.
Okay.
So my main channel is Brodie Robertson. I do Linux videos there six-ish days a week. I did a video on Zluda coming back, so if you've not seen that one yet, go check it out. It'll be like three weeks old by the time this video comes out, so if you haven't heard about it coming back yet, go watch that video: I go over the blog post, go over the history of the project, all that fun stuff. If you want to see my gaming stuff, that is over on Brodie On Games; I stream there twice a week. I also have a reaction channel where clips from that stream go up, so if you want to watch just clips of the stream, do that as well. And if you listen to the audio version of this, you can find the video version on YouTube at Tech Over Tea. I will give you the final word. What do you want to say? How do you want to sign us off?
Well, I finished my tea. It's time for another tea.
That's a good plan, actually. That's it.