In The Arena by TechArena - A Mathematical Revolution in AI with Lemurian Labs CEO Jay Dawani
Episode Date: October 17, 2023
TechArena host Allyson Klein chats with Lemurian Labs co-founder and CEO Jay Dawani about his vision for a mathematical revolution in AI and how his company plans to change the game on AI performance....
Transcript
Welcome to the Tech Arena,
featuring authentic discussions between
tech's leading innovators and our host, Allyson Klein.
Now, let's step into the arena.
Welcome to the Tech Arena.
My name is Allyson Klein, and I am really delighted to have
Jay Dawani, co-founder and CEO of Lemurian Labs, with me. Welcome, Jay. Hi, thank you for having
me, Allyson. Delighted to be here. So Jay, I've had a lot of people on the show who, I would say,
have written the book on things, but you've actually written the book on AI and ML. So tell
me a little bit about your background and how you came to be the CEO of Lemurian Labs.
Yeah.
So I can't go all the way into the background because I don't think we have enough time for all the bizarre things that I've attempted.
But a few years ago, I was working as director of AI at a company called Geometric Energy.
And I was trying to build a team to allow us to solve a lot of the problems we were facing.
And I interviewed a few hundred people for the roles.
And out of those few hundred, maybe 10 could actually answer the questions that I had.
And only two could actually write any of the code that I was hoping they would.
And it was really disappointing to see that, because we need more ML engineers who actually understand the fundamentals,
who understand them well enough to manipulate and create new workloads and new algorithms.
Because I don't believe we just stop here.
With just the workloads we have that are open source, you can't keep pulling those and working
with the same models and the same data sets.
And it was sort of out of that frustration that I decided, you know what?
I understand the math well enough.
Maybe I'll write a book that is accessible
and builds on high school mathematics foundations
so that more people can get into the field.
It's not as daunting as people think.
And I wanted to fix the misconception
that you need a PhD to be an ML engineer,
because I don't have a PhD in ML.
I dropped out of university.
And if I can do it, they can too.
Now, you and I have talked off the air about the importance in this moment in the industry
of development of AI technology, and we're at the AI Hardware Summit. This journey that you've had
into ML and looking at various fields like robotics and autonomous transport and how
they're using machine learning has led you to a company that is actually developing hardware.
How does this challenge inform underlying hardware requirements?
And how do you see the interplay between hardware and software today?
Well, that's a loaded question. So when I was working in a lot of these areas,
I was really excited that computer vision seemed
to be more or less solved.
And that's something that's been holding us back
for a very long time.
And you no longer have to hand-engineer features
and try to make a classifier.
You can learn all those features.
And the larger your model gets, the less of a prior you need,
so your model can figure it out.
And it can learn features that you may not have thought of.
And then you also had reinforcement learning
that was showing some really, really great promise
so that you could couple the two
and have a vision-based decision-making system.
And that is essentially what brought people to research
a lot of the robotics ideas.
Autonomous vehicles were born out of the observation that computer vision was almost
solved. With enough compute, we could get there. With enough data, we can get there.
So the data we collected, we made a lot of progress in the industry, but we still have
a massive problem ahead of us because the workload isn't done evolving yet. It needs a lot more. And more
importantly, we don't have enough compute in those environments at the edge to make autonomous
robots work. The model has to grow, and the more it grows, the more memory it needs,
the more flop requirements it has, and hardware is not progressing at the rate that it needs to
to enable that, especially safely. Because if you start quantizing models too aggressively to try and make them fit, you're creating more errors inside the model that you can't account for.
So I had a case at a self-driving company where I had a one-in-a-million probability that my model would misclassify and react wrongly to a woman pushing a baby in a stroller across the street.
That's not a scenario I want to go wrong.
But because of compute constraints, I had to quantize this model and deploy it.
And that probability ended up being 1 in 10,000.
And I had no idea how to fix this. Did I need to fix my training data set?
What was changing inside?
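To make that concern concrete, here is a minimal, generic sketch (NumPy only, not any self-driving stack or Lemurian tooling) of how rounding a layer's weights to int8 perturbs its outputs. In a deep network those small per-layer perturbations compound, which is how a tail-probability failure rate can shift the way Jay describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy layer: float32 weights and one input activation vector.
w = rng.normal(0.0, 0.5, size=(256, 256)).astype(np.float32)
x = rng.normal(0.0, 1.0, size=256).astype(np.float32)

# Symmetric per-tensor int8 quantization of the weights.
scale = np.abs(w).max() / 127.0
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dq = w_q.astype(np.float32) * scale  # dequantized weights carry the rounding error

y_fp32 = w @ x      # reference output
y_int8 = w_dq @ x   # output computed with quantized weights

rel_err = np.linalg.norm(y_fp32 - y_int8) / np.linalg.norm(y_fp32)
print(f"relative output error from int8 weights: {rel_err:.4%}")
```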
And that's where I started to think about the importance of numerics and hardware and
how that needs to be evolving for these workloads. It was really, again, I guess most things to me come
down from I was frustrated at some point, so I started doing something. I started thinking more
about the hardware and looking at why hardware wasn't evolving in the direction that we needed.
And I learned that we kind of made a lot of choices a few decades ago that we've been continuing to ride along with.
And it wasn't anyone's fault.
They were the best architectures at the time for the workloads we looked at.
But when the workload changes, hardware has to evolve with it to make it work.
Because hardware normally leads: when we saw the switch from single-core to multi-core, software took a decade to catch up.
And it's still somewhat catching up, especially as hardware continues to evolve.
But now, software has gone off in its own direction,
because we had that switch from software 1.0 to 2.0,
and hardware never really fully evolved with it.
We continue to use the same platform.
And now that we're ripping out bits trying to go for smaller number formats,
there's a diminishing return from going to more efficient number formats, because the architecture
can't benefit. It spends more energy on the instruction decode and the data movement side
than it does in the math. So if you save a couple of percentage points of energy just by changing the math,
it doesn't matter. You're still dominated by movement. And it was really those observations together that started making us think that maybe
there's an opportunity for introducing a new kind of computer architecture.
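A quick Amdahl's-Law-style illustration of that point, with made-up but plausible proportions (the 10%/90% split below is an assumption for illustration, not a measured figure for any chip):

```python
# Illustrative figures only: suppose 10% of energy goes to arithmetic and
# 90% to instruction decode plus data movement. Even a 20x cheaper number
# format barely moves the total.
math_share = 0.10
overhead_share = 0.90
math_improvement = 20.0

new_total = overhead_share + math_share / math_improvement
print(f"energy saved overall: {1.0 - new_total:.1%}")  # ~9.5% -- movement still dominates
```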
So we made the transition from scalar to vector.
Let's make the geometry a matrix now, because that's what this workload is asking for.
So instead of going for SIMD, we went for spatial.
And we've been working very,
very diligently on our software stack, because to me, accelerated computing needs to be
about enabling a workload and the engineers writing those workloads. We can't make their
lives harder. So let's look at how the workload wants to run, how the workload may evolve.
And Jeff Bezos said a long time ago, when times are changing, it's more important
to look at what stays the same. Because only when things are somewhat static can you actually
start creating value. Because if the workload is changing, you don't know where it's going
to end up. Everyone is kind of guessing. So that's how we kind of approach this. What
are the kernels that are going to stay the same? What is going to change? Data flows
are constantly changing. The underlying kernel
isn't really changing all that much. Right.
Numerical requirements are not changing that much. And that gave us a sense of this is how
we should start thinking about the system. And then, of course, thinking about the edge,
power is important. And I was a little bit optimistic around timing around the edge. So
we kind of brought ourselves up to the data center
because a lot of folks had come to us and said,
what you're doing seems really great,
but I think we want to try and use this for this application.
So we rethought a lot of the architecture side.
But to answer your question,
we need to be more software-centric right now.
And we need to build around engineers and their needs
and share the workload.
We can't go chasing really cool tops per watt kind of numbers
because they don't translate to real life benefit.
Right.
You know, when you're talking, it makes me think about so many things
that I've experienced in my career in the Silicon arena
and knowing that software developers have been constrained
by whatever architecture they're writing towards, all the time.
That has always been the case.
And I think that we are hitting a peak with AI moving so quickly.
The architectures that we are utilizing are providing constraints.
You've introduced a new concept at Lemurian around a new number system for AI
that I want to talk to you about and why that's so important.
I know your team is very hard at work developing silicon. When you look at introducing a new architecture,
you've talked a little bit about it, but why now and why a new number system?
Yeah. So I mentioned earlier that I had a problem with quantization of these models.
Now the entire industry a few years ago got really, really excited. And they're like, hey, let's just jump to Int8 because it'll work.
And I don't think any AI engineer was like, yeah, that's a great idea.
It was mostly hardware engineers being like, that's how we get performance and efficiency.
Because we save, like, what, 20x on the math, 4x on data movement, and so on.
And that's great.
But you still need to run the workload.
So convolutions could get quantized fairly well,
but that's a small subset of the workloads.
Fully connected networks could get quantized fairly well.
And then you have transformers, which are largely matmuls,
but they have other components too that are not very quantizable.
And these workloads are still going to evolve.
They're going to get more complicated.
They're very sensitive to perturbations, which is exactly what changing the numerics introduces.
And you can think of it as a dynamical system.
If something is constantly flowing and slight perturbations here and there
will lead you astray, you want the system to be as stable as possible.
And the DSP world has been doing this for decades.
They've looked at a workload.
They've exploited the numerics of it very, very smartly.
And then they built the right architecture that would solve it
and be the right tool for the job.
And it was really that thinking that we brought in.
We've never done that in general purpose computing.
We're starting to see that a little bit.
GPUs, they give 1,000x in 10 years, they claim.
Most of it was from number formats and specialization.
Very little came from node scaling.
Interesting.
Yeah.
So now we have this workload that
is demanding so much, and it's growing
at 5x the rate of hardware: training FLOPs demand
is growing at 10x a year.
Assume hardware grows 2x a
year. We've gone from a 5x difference in year one to more than a 15,000x difference
in year six.
Right.
Right? So that's an alarming rate of growth. We've never experienced something like that
as an industry. And we can't use the same bag of tricks to continue going on.
It's not sustainable. Yeah.
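As a quick check on those figures, using only the growth rates quoted above: if demand grows 10x a year and hardware 2x a year, the gap compounds at 5x a year.

```python
# Compounding gap between compute demand (10x/year) and hardware (2x/year),
# using the growth rates quoted above.
demand_growth, hardware_growth = 10.0, 2.0
for year in range(1, 7):
    gap = (demand_growth / hardware_growth) ** year
    print(f"year {year}: demand is {gap:,.0f}x ahead of hardware")
# year 1: 5x ... year 6: 15,625x -- the "more than 15,000x" figure quoted above.
```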
So there was a saying that the status quo is Latin for challenge.
It's something you should be throwing down.
It's something to be challenged.
Oh, interesting.
Yeah. Why are we continuously stuck in certain ideologies because they're comfortable?
The world is changing.
We should too.
Change isn't comfortable,
which is probably why people push back.
But we need to rethink computer architecture
for these workloads.
And a big part of that is number systems.
At the end of the day,
you can think of a car, a motorcycle, or an airplane.
Each is a transportation vehicle.
They get you from point A to point B.
But the engine is different.
The body is different.
You're not going to take the engine from a motorcycle
and put that in an airplane.
That just doesn't work.
Or the other way around.
You have to right-size it for the task that you're trying to solve.
Because right now, the architectures we have, to them, everything looks like a nail and
they're a hammer.
But that's not what you want all the time.
There's a lot of problems where you just want a good tool for the job.
So number systems, they're an engine.
They need to get fed.
At the end of the day, compute is computation.
It's math.
So we thought about how much precision and dynamic range
do these workloads need?
And when thinking about that
and the trend that people were going into,
it kind of became obvious
that we were going to run out of bits very soon.
And that was going to make software engineers' job really hard
because we can't deploy these systems safely.
And we went back to a really old idea,
log number systems.
Log number systems are great
because multiplications become additions
and divisions become subtractions,
and those are very cheap operations.
But there's a caveat.
Addition and subtraction in a log format are very hard.
So what people did early on
was they used a lookup table
to convert from the log domain to the linear domain
and do the addition, subtraction in the linear domain
where it was easier to do.
But those lookup tables end up growing very, very big.
So any return that you get from that benefit
of turning multiplication to addition
gets kind of mitigated away.
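To make the trade-off concrete, here is a tiny generic sketch of log-domain arithmetic (standard-library Python, illustrating the textbook idea rather than Lemurian's actual format). The multiply really is just one add, while every add needs a correction term, which is the piece the old lookup-table designs paid for.

```python
import math

# Represent positive reals by their base-2 logarithms (a generic LNS sketch).
def to_log(x: float) -> float:
    return math.log2(x)

def from_log(lx: float) -> float:
    return 2.0 ** lx

# Multiplication and division become addition and subtraction of exponents.
def log_mul(lx: float, ly: float) -> float:
    return lx + ly

# Addition is the hard part: log2(x + y) = max + log2(1 + 2^(min - max)).
# The correction term log2(1 + 2^d) is what classic designs approximated
# with a lookup table, and that table grows quickly with precision.
def log_add(lx: float, ly: float) -> float:
    hi, lo = max(lx, ly), min(lx, ly)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))

a, b = 6.0, 10.0
la, lb = to_log(a), to_log(b)
print(from_log(log_mul(la, lb)))  # ~60.0  (a * b via one addition)
print(from_log(log_add(la, lb)))  # ~16.0  (a + b via the correction term)
```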
And there's sort of a holy grail of number systems: making a purely logarithmic
format work.
It's like a 350-year-old math problem.
Right.
And it's been 50 years in the VLSI world.
But my co-founder is one of the experts in arithmetic and logarithmic number
systems, and we came up with a way of actually making addition in a log
format work. Wow.
And we've developed multiple types, all the way from 2 bits to 64 bits, to stack up against
floating point. And where floating point scales linearly and gets worse and worse as you add more
bits, ours kind of tapers off. So pound for pound, or bit for bit, this thing has better precision,
dynamic range, and signal-to-noise ratio. So instead of a PPA perspective, we call it a PPP
perspective: performance, power, precision. This thing wins all day long. And the interesting thing is,
number systems can have a Moore's-Law-like effect, where they shrink your
architecture, opening up space in your silicon so you can attempt more. Right? That's what
gave rise to multi-core. And that's always been holding back a lot of different architectures
that could have been very promising. Because if you just change the architecture alone and keep
everything else the same, like if you have the same IP as every one of your competitors,
you have the same workload and constraints on software,
and all you change is the architecture,
you're bound to, at best, a 30% improvement.
And if that's the case,
you'll just stay with whoever the market leader is
because they have the best software stack.
You know it's going to work.
But if you really want to break free from that,
you want to deliver something that's going to create value,
you have to rethink number formats
and software for what the
application's asking for and design
and architecture around that.
And that's what we did.
You threw out some numbers there, but
what are the performance deltas that you're
looking at?
We're targeting a 300 watt PCIe form factor.
It's got a crap ton of HBM on it.
And that's probably because the workload is very, very memory-centric
and it needs a lot of data movement, especially with transformers.
And this thing can do three and a half petaops of dense compute.
With sparsity, obviously, we can get a 2x.
But that is massive.
And this isn't a five nanometer process.
And
we'll be releasing a lot more
down the line about the architecture, more specifics,
how our software stack works, as well as
the number system, showing how
to do quantization in it,
as well as training in our
number format and then quantizing down, or going from floats to ours.
And we'll also be sharing MLPerf numbers on our architecture in our simulation in the
name of openness.
We want to be open.
We don't want to hide it.
We want to give engineers more freedom, more control of the software stack and the architecture,
so that in the event the workload
changes and you have a new kernel
that you need to run, you don't have to
wait for the company to decide it's worthy
enough to pack into their software stack and make
work. You as an engineer
want to have control enough to be able to create your own
kernels and fuse them and then compose them
to do whatever you want.
And that's how we've thought about this.
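For a sense of what "creating your own kernels and fusing them" refers to, here is a generic NumPy-level toy (an illustration of the general idea, not Lemurian's software stack): composing separate elementwise ops allocates a temporary per step, while a fused version reuses one buffer; a real fused kernel on an accelerator would go further and keep intermediates on-chip in a single pass.

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)

# Unfused: scale, add bias, ReLU as three separate "kernels",
# each materializing a full-size temporary array.
def composed(x, scale=0.5, bias=1.0):
    t1 = x * scale
    t2 = t1 + bias
    return np.maximum(t2, 0.0)

# "Fused": the same math written to reuse one output buffer in place.
def fused(x, scale=0.5, bias=1.0):
    out = np.multiply(x, scale)
    np.add(out, bias, out=out)
    np.maximum(out, 0.0, out=out)
    return out

assert np.allclose(composed(x), fused(x))
```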
And we have results that we're going to be sharing again very soon.
In fact, I'll probably be sharing that on Thursday.
So stay tuned for that.
And I think we'll surprise a few people.
I'm looking forward to it.
The theme at this conference is all about this incredible demand for choice.
I think that everyone is looking at this and saying,
we can't have a single supplier fueling something as important as AI.
Who are you aiming the introduction of these platforms at, and why?
So I mentioned earlier that I was really frustrated
that I couldn't get to train and deploy the workloads that I was looking at.
Around this time, I was venting to a lot of my friends
who were in the industry, and a lot of them worked
at one of the bigger companies, a lot of startups,
and they weirdly felt similar pains to the ones I did,
despite them being much more well-capitalized,
having very large compute clusters available.
And that's because there's too many people, too many ideas,
and there's not enough compute to go
around.
Very recently,
a very large company
deployed a very popular workload
and it took up enough compute
to the point where
90% of the AI projects in the company were killed.
Because they ran out.
That's bizarre to think about.
A single workload being deployed, taking up all the compute in many, many large-scale
data centers.
And this is the beginning.
Think about the chasm that is getting created between the growth of model size, if
you want to stay at the frontier, and the rate of compute, the cost of compute, and the power consumption. Very few companies
are going to be able to train these models.
I was with a company a few days ago, and they just trained their latest model.
And this thing took almost 10 yottaFLOPs to train.
Wow.
That is ridiculous.
We just built the first exascale computer last year.
You're talking about yottaFLOPs now, not even zettaFLOPs.
And if you build a Zettascale computer today, that's going to take like 20-ish nuclear reactors.
Right.
That's not sustainable.
And even with the current trend of scaling, it doesn't get that much better.
10 years from now, you'll still need a few nuclear reactors to power this.
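A rough back-of-the-envelope check on that reactor figure (the efficiency and reactor-size constants below are assumptions for illustration, not numbers from the conversation):

```python
# Back-of-the-envelope: power draw of a hypothetical zettascale machine.
# Assumptions: ~50 GFLOPS/W, roughly the ballpark of today's most efficient
# supercomputers, and ~1 GW per large nuclear reactor.
zetta_flops_per_s = 1e21
flops_per_watt = 50e9
watts_per_reactor = 1e9

power_watts = zetta_flops_per_s / flops_per_watt          # 2e10 W = 20 GW
reactors = power_watts / watts_per_reactor                # ~20 reactors
print(f"~{power_watts / 1e9:.0f} GW, or about {reactors:.0f} reactors")
```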
So we need to rethink and reimagine accelerated computing for this workload.
And we need to make computing accessible so that more people can come in, more architectures can get trained.
And it's not just five companies in the world that have the compute to train this.
Right. Because when you have that, the incentive to progress the workload and deliver something great isn't there.
A lot of the learnings, the failures of these models don't surface.
And that is partly what I want to fix.
I want all the startups out there, all the midsize companies, all the tier-two clouds that are starved for compute, I want them to be able to have a voice in this and have enough compute to train AI models and
deploy them at scale without, you know, breaking the planet. When you think about that, it almost
takes me back to that line in Jurassic Park where, you know, they were so focused on if they could
build it, they didn't think about if they should.
And I feel like we're at that moment with AI.
You know, you talk about the power, and I read a statistic, you know, forecasting AI growing data center energy draw to 20% of what the world consumes.
Yeah, that is entirely possible.
I think it's going to happen within the next
10 years, if not
sooner.
There's a lot of talk about AI
doomerism, that AI is going to
do a lot of bad in the world.
And I've
been saying pretty jokingly that
it's unlikely that's going to be the case
because our pursuit of these models and the rate
these models are growing and the carbon emissions resulting from that are probably going to get us first.
And that's something we need to fix.
And then we can figure out, you know, the killer robots scenario later on.
Right, exactly.
But let's paint a rosier picture.
Sure.
Let's say that you and other innovators in the AI space are successful in delivering some breakthroughs
in terms of performance, efficiency,
and core capability,
and in democratizing access.
What does the future look like?
What do you want to see from this industry?
And what do you want to foment
for the broader developer community?
So when I was a kid,
I got into technology
because I grew up in a third world country.
And I didn't have access to a lot of what everybody else did
in, say, Canada and other places.
And I remember the first time I saw a computer and what it could do, it blew my mind. I've always had this view that any technology, once it's built and once it
gets to a price point where it gets proliferated, is going to be an amplifier. And every single
platform shift, a technological shift, has brought along parts of the old and enabled
it much more. So to me, the internet democratized access to information.
AI turns that information into knowledge and enables you to have access to that.
That's a force multiplier.
A person who didn't know anything or couldn't read all of those books can now go to a large
language model, interact with it, ask it questions, and learn better.
Imagine the impact that could have.
It's profound.
And there are other applications that are also stifled by this.
When the cost of compute comes down and the algorithms progress,
we can have autonomous robots, safer roads, better drug discovery,
better diagnostics for people.
Instead of addressing things after the fact, we can start being preventative, understanding
more about our bodies.
That excites me.
That's what I want to enable.
But most of the people working on those problems are compute-starved.
We can't attempt it.
Right now, the best we can do is build models that interact with us, give some decent looking answers,
create some images,
which are all really cool
because that impacts creators
and a lot of other industries a lot.
But there's more important work
that needs to get done.
We're not there yet.
So I can't wait to hear more
about what you and the Lemurian team deliver.
I can't wait to see your MLPerf numbers
later this week.
That's exciting.
I'm sure that folks online are intrigued and want to engage with you and your team.
Where will you send them for more information and to engage?
The simplest way is to go to our website. There is a contact-us page, and you will get a hold of us pretty quickly.
If not, you can always reach out to us on our LinkedIn page as well.
And we're pretty responsive.
So if you have genuine requests or you want to learn more, we're more than happy to share more.
Well, Jay, thank you so much for the time today.
It's been a real pleasure getting to know you. Have a fantastic time.
Thank you.
Thank you as well.
Thanks for joining the Tech Arena.
Subscribe and engage at our website, thetecharena.net.
All content is copyright by the Tech Arena.