Computer Architecture Podcast - Ep 15: The Hardware Startup Experience from Business Case to Software with Dr. Karu Sankaralingam, University of Wisconsin-Madison/Nvidia
Episode Date: March 28, 2024
Dr. Karu Sankaralingam is a Professor at the University of Wisconsin-Madison, an entrepreneur, an inventor, as well as a Principal Research Scientist at NVIDIA. His work has been featured in industry forums of Mentor and Synopsys, and has been covered by the New York Times, Wired, and IEEE Spectrum. He founded the hardware startup SimpleMachines in 2017, which developed chip designs applying dataflow computing to push the limits of AI generality in hardware and built the Mozart chip. In his career, he has led three chip projects: Mozart (a 16nm, HBM2-based design), MIAOW (an open-source GPU on FPGA), and the TRIPS chip as a student during his PhD. In his research he has pioneered the principles of dataflow computing, focusing on the role of architecture, microarchitecture, and the compiler. He has published over 100 research papers, has graduated 9 PhD students, is an inventor on 21 patents, and has 9 award papers. He is a Fellow of the IEEE.
Transcript
Hi, and welcome to the Computer Architecture Podcast, a show that brings you closer to
cutting-edge work in computer architecture and the remarkable people behind it.
We are your hosts.
I'm Suvinay Subramanian.
And I'm Lisa Hsu.
Our guest for this episode was Karu Sankaralingam, who is a professor at the University of Wisconsin-Madison,
an entrepreneur, an inventor, and also a principal research scientist at
NVIDIA.
His work has been featured in industry forums of Mentor and Synopsys and has been covered
by the New York Times, Wired, and IEEE Spectrum.
He founded the startup Simple Machines in 2017, which developed chip designs applying
data flow computing to push the limits of AI generality in hardware and built the Mozart
chip.
In his career, he has led three chip projects: Mozart, a 16-nanometer, HBM2-based design;
MIAOW, an open-source GPU on FPGA; and the TRIPS chip as a student during his PhD.
In his research, he has pioneered the principles of data flow computing,
focusing on the role of architecture, microarchitecture, and the compiler. He has published over 100 research papers,
has graduated nine PhD students, is an inventor on 21 patents, and has published nine award papers.
He's a fellow of the IEEE. Now, Simple Machines was founded before the huge boom in AI chip startups and has since folded.
And so we really wanted to pick Karu's brain about his thoughts on running a chip startup.
And he was kind enough to join us and share his thoughts on building a business case,
understanding the user experience, and how much software is necessary to build a hardware startup.
Now, before we get to the interview, a quick disclaimer that all views shared on this show are the opinions of individuals and do not reflect the views of the organizations they work for. With that, let's get right to it.
Karu, welcome to the podcast. I'm thrilled to be here. Thank you so much for talking to me today.
We're so excited to have you. And as listeners know, our first question is always,
what's getting you up in the morning these days?
Well, everybody's life is different. So mine is, I have a young family. I have an 11-year-old
and a seven-year-old. So what gets me up is making sure they both go to school; the job, until they have reached their
school, is getting them out the door. And in the past few years, I have become an all-season biker,
so summer, winter, I don't have a parking permit anymore. What gets me up is getting them there,
and, man, my very, very short bike ride to school.
And I'm not even a real biker.
I go on an e-bike.
So real bikers, don't be offended by me calling myself a biker.
So that's what gets me up: getting my kids out the door and my very enjoyable all-weather bike ride to work.
That's really amazing. I think, you bike to work? In Wisconsin? And
there was another guy, Philip Wells. I remember Philip would ride his bike every day to work, and I
was like, you guys, I mean, you live in Wisconsin, like, what are you talking about, you bike every
day?
That's my, yeah, my inspiration actually, Ben. Ben lived there. I got here in 2007.
Ben grew up, I think, in Pennsylvania or something.
Then he went to grad school at Stanford, and then he moved here.
So he would bike.
I never got it until I myself started biking.
I'm like, yeah, actually, like with all the gear,
it doesn't really matter what the temperature is.
It's very enjoyable, that 15 minutes of crisp air in your face.
Oh, that's amazing.
That's amazing.
When I was at Michigan, I remember walking to the office and some days it was so cold that I wouldn't have described it as crisp air.
I remember telling people it felt like someone was slapping me the whole way.
And so I don't find that fun. But yeah, okay. So that gets you in the morning. And now you're in
your office at Wisconsin. So Simple Machines, that's sort of the lead here. What's going on
with Simple Machines now? For those of our listeners who don't know much about it, maybe you tell us a little bit about it.
Yeah, so Simple Machines was a startup I founded with work done as part of my research and a handful of my graduate students.
What we were trying to do is kind of becoming more relevant now;
the product-market fit was maybe a couple of years too far ahead.
We were trying to go after small-batch inference
and trying to build a chip that would run at extremely high utilization,
60 to 70% of hardware utilization,
even when you didn't have a lot of batch parallelism.
The second angle we were going after was this observation
that algorithms were changing at a rate much
faster than the chip design cycle. So our goal was how do I build something general?
And more recently, we've coined this term, saying we are in the era of efficient generalization, not really the era of specialization.
Now specialization implies I have a thing, I'm going to build silicon for that thing.
That thing is changing at a rate that is really, really fast.
And if it takes you 18 months to build silicon, by the time you build your specialized thing, the thing has changed.
So we were putting in behavioral principles such that, with software, the hardware could really be modified to run very, very efficiently.
So that was our value prop.
And we built out, we'll get into this more, the hardest part.
I mean, there were many hard parts. The business part was probably
absolutely the hardest.
In my opinion,
until you do it, nothing prepares you for it.
Doing it the first time
prepares you for doing it the second time.
So what do you mean
specifically by the business part?
Do you mean coming up with a value proposition
or do you mean the actual running
of the business? What do you mean by that?
It's, people sometimes use this term,
it's a little cliche,
the whole concept of a product market fit.
Eventually you have a technology,
it does well on benchmarks and so on and so forth.
Ultimately, a customer has to feel,
I'm going to pay you money to buy this thing.
I'm going to pay you instead of somebody else.
And the value you bring is so much better than something else that's already mainstream.
And especially in deep learning, obviously, NVIDIA had absolute dominance there.
And the barrier to entry was, this was our biggest learning, and it was what drove the company.
And even as we had more and more conversations, we kept realizing how important it was.
Basically, the user experience expectation. And NVIDIA has raised the bar so high,
which is: you show up with a model
written in a high-level language and it will run.
That's it.
You write your 50 lines of PyTorch,
you run it on the hardware, that's it.
That's all you need.
You don't need to do anything else.
And the user expectation for every customer
was pretty much that. Which is, yes, you have a great compiler, a great technology, but unless that level
of user experience is there, it becomes very, very hard for them to get convinced. The 5x, 10x
speedups, all the other metrics, kind of become secondary after the user experience.
That's number one. And second is, the user experience needs to then match with
something that the customer is not able to do with whatever is available currently.
And that was a very intriguing, in retrospect somewhat obvious, lesson. When we started, which was
the 2017 timeframe, and we kind of had our first product in the 2019 timeframe, it was still the beginning
of AI adoption. So everybody would be coming up with a use case that generated value. That's the only reason why any company adopts anything, pick a company, right?
Home Depot, Bank of America, an insurance company, Geico, whoever, right?
They would be internally building a product or a feature
that was providing value from adopting AI, because it actually worked.
And their internal platform would be a GPU
because those are the only things they have access to.
So there was a very kind of intriguing,
what is it called?
Searching under the streetlight is not the right metaphor,
but it's more of
they were building nails
because what was available
was a hammer.
Which totally makes sense
because why would you do anything else?
And those of you who are into woodwork
know screws are better.
You show up with a screw, and they're like, I don't know, I don't
have a screwdriver, what am I going to do with this? The screw is better, it can hold more strength every
which way I look at it, this is a better thing. But I have built for a nail, because available to me was a
hammer, right? And so that was kind of one of the things.
And then, kind of like how people talk about this
in political conversations and so on,
the framing becomes very hard.
Which is: you built a value prop
going after something the state of the art is not good at,
but the customer is designing products optimized for what the
state of the art is good at.
So the framing takes time to help.
You have to have more and more conversations to say, here are the places where things can
go, and when things go there, it could bring value, right?
So that was the most difficult part, just kind of getting customers there. And as a startup, you need to start generating revenue; having conversations helps you raise money at some point.
If you pop all the way up, there's a lot of discussion about how modern companies these days are all very short-term sighted, right? Because you need to hit
your quarterly estimates or whatever. And so whether you're a startup or whether you're a
humongous company, I would hear this discussion at various companies that I've worked at as well,
where you're talking about how, well, you know, if we do this long-term investment,
the value that it could create in 10 years is really, really large. But if we keep doing what we're doing with this, you know, I don't want to say the words cash cow exactly, because that's
never a term that I heard, but my interpretation is, we've got this thing that works.
To do that migration, that's a huge investment. And then, particularly if the vendor is a small,
unproven startup, it's like, well, what
happens if they go away? So that becomes a very, very difficult conversation, I
can imagine that being the thing. And then, you know, to our listeners who are students and professors in
the field, you know, we're all trying to publish papers, right? And in the published paper,
you want to have the graph that shows the 10%, and you're like, that should be enough, right?
10%.
I mean, back in the heyday, 10% IPC increase.
Let's just do it.
I know.
I know.
And there's another aspect to it, which is,
fundamentally, something I've learned a lot about and I
think is super, super important
for the chip industry and the semiconductor industry as a whole, which is: how do we think about
completely scalable business models?
Because the easiest thing as a chip startup or whatever is to say, I'm going to build
a thing, I'm going to sell it to somebody, and then they'll buy it.
And then I'll build another thing three years later and sell it to my customers. In general, the software
world has kind of moved on from that. There is no software startup like that anymore, and most VCs are not used to that
type of conversation, where your revenue is
selling things. There's even a term, annual recurring revenue, and so on. Fundamentally,
what they want is subscription-based services, where you don't have to go through this cyclical
thing; every year you're locking in newer customers, rather than not knowing whether your older
customers are going away.
Right. So most fundamentally, everybody wants to be in some kind of services-based business model.
And that takes a lot of kind of like technology, product, business model conversations to try and figure out how do you move there.
And that was one of the directions we were taking the product and
trying to get it into market, which is essentially show up as a cloud service.
And everything else kind of becomes secondary, which has its own challenges.
Right. So you touched upon a few different themes there.
In particular, you talked about user expectations on the seamless adoption for whatever workload they have on a new technology. And I want to expand on a couple of themes. One is,
you were talking about how do you connect the value proposition of an underlying technology,
let's say in a chip or a new architecture to the user's context. And specifically,
you talked about the software stack that you need in order to enable that and make that value
connection, right? So can you talk a little bit about your experiences and how do you think about
bootstrapping the software architecture, when you have a chip technology or an architecture technology, so that it makes sense for a user who's trying to adopt it
or play around with it and see what value it can actually deliver?
Yeah, I think to me, there were two aspects to it.
One is, how do you create something that matches enough customers, so there are enough people who
can use it? And our default at the time was, anyway, that we will support TensorFlow and PyTorch. And then
we had some mechanisms by which, if we didn't support a set of operators, we would have a relatively clean
fallback. So we wouldn't crash and burn; we would still be able to run the application to
completion. So that was a very kind of engineering solution, so you were able to run stuff, as
opposed to, like, okay, I can't run this anymore.
So what does that mean from a practical standpoint, right?
It just means a huge, like a massive, amount of money you need to spend as a company to
have a software team
that is just writing, like, a boatload of software.
And if we remove, forget papers, forget ideas,
forget technology, right, it's just a boatload of software engineering. And,
like with all these studies, right, there's some number like 50, 60 lines of production software can be written
in a day, whatever. So for a startup, it's time and money.
You just need to hire the people and build up that stack. And we were doing all this four, five years ago,
when many of the things that are there
even in Torch 2.0 today didn't quite exist,
in terms of an easy way to get the operator graph and so on.
So, to give kind of a concrete example: we actually had to build, like, a basic
memory allocator. We have tensors that show up on the host side, and every tensor needs memory
allocated on the device side, right? And the simplistic way to do this is a one-to-one
allocation. But when you have very large networks,
you end up sometimes with extremely high demands
on memory on the device side.
Because operator one needed a tensor,
or produced an activation,
and by the time operator 10 came along,
that activation is dead from a data-liveness standpoint. So you need a way to reclaim
that memory, so you can allocate it to some other tensor when operator 10 comes along. These are
all things we used to write in, like, the 80s and 90s, right? Basic memory allocators and so on.
I mean, we're a chip
company, and we just had to write software
doing stuff like this, right?
And it's time and money,
right? And unless you do those things,
you don't have
a turnkey,
user-experience-compliant
thing that
you can put in the hands of a customer
so they can try it. Because you
can show up with kernels and so on, but it doesn't quite move the needle.
And I think this is, at least as far as I can tell, whether big companies or
small companies, everybody's experience, right?
Which is: the user experience has to be flawless. And this other paper we have
upcoming, our ASPLOS work, also shows how diverse that operator stack is.
Just the sheer number of operators across many applications is too numerous.
It's just a lot of engineering work.
And there doesn't yet exist an auto-compiler
that'll just produce all of that when you push a button.
Yeah, that's very interesting.
And you mentioned a few themes here, a couple of things.
The first one was, you know, kernels and the sheer diversity of the operator graph.
And you briefly talked about compilers as well in that exchange.
So how do you think about the trade-offs between hand-optimized
or human-optimized kernels, which a lot of people write, versus
building out a compiler stack? And obviously there are trade-offs
here. With a human-written kernel library, you can
quickly write something, and someone can optimize it for that specific operator or specific operating point of an application.
But a compiler stack is more general, though it also takes time to build out those abstractions and make it robust enough.
So how do you think about the trade-offs between these two things?
Do you think there's one right investment for a company that's building out a new architecture?
So, I don't know. I think I'll be super candid in my view here, right? Which is,
let's say there are at least two pieces, or three pieces, to a compiler. One is
the front-end parsing that takes a set of operators and decides, here are the operators
that can be fused, and once
fused, here's the back-end code I need to generate, right? And then there's the actual generation of
that back-end code. When we started Simple Machines, our dream was that the back end would be automatic,
semi-automatic, things would all just work, right?
And as we made more and more progress toward it,
it dawned on us that building
that completely automatic compiler would take more time,
and would not be as robust as an intermediate approach,
which is: build a library of the most common things,
and then have kind of an application-specific compiler,
or a dynamic-programming search, to figure out which
is the best library call. Which is basically what the CUTLASS and cuDNN
approaches are: a set of specialized libraries, implementations rather, for particular GEMM shapes,
which are generated partially automatically, partially by humans tuning those libraries.
of time and money required to build out a massive compiler team that was a more productive approach. And fast forwarding to now, something really fascinating is happening, which is why I keep saying this efficient generalization.
To get better performance, the microarchitecture is fundamentally getting, let's say, more and more complex. And as we all know, inevitably,
this is not getting properly exposed
in the architecture.
The architecture is very opaque,
fundamentally obfuscating
what is needed to extract performance.
And tools like performance counters
and so on are good,
but not quite at the level that even a PhD-level
programmer, let alone an average programmer, right? Even for a PhD computer architect, it's very hard to go and
fine-tune and extract the most out of the underlying hardware. I have a weird metaphor for
this; it's not a great metaphor. I view it as kind of where coherence and consistency protocols were in the mid-80s,
where people in industry understood them, they were implementing them.
But I think academics took a step back and came up with really rich formalisms
that helped kind of nail some of the correctness and performance issues.
So I think this performance portability, or performance exposure, is a really fundamental thing. It's not a compiler thing, it's not an architecture
thing, it's like, how do we deal with better ways for programmers to understand
what the hardware is doing without expecting everyone to just become a ninja
programmer and ninja architect, right?
Yeah, so it sounds like essentially what you're saying is there are a lot of
different ways to write software right now in the world, right? You can have
the ninjas that are super low-level, understand and extract every last bit
of performance, and then there's sort of the mass-market case, which is: we have to make it easy.
We have to make it easy, and it just has to work. You know, forget about, again, extracting an extra
50 percent of performance; it just has to work. And even for the ninja-level stuff, with this kind of
new world of stuff that we're building, you can't, there's not enough information, and there's not
enough broad concepts of how to expose the information so that we may produce more ninjas.
Because I think we have all accepted through our careers that we were never going to make
everybody a ninja. That's never going to happen. You're always going to have some, but the way the world is right now, we're cruising toward a world of none.
Yeah.
And I think, plus, there's the other aspect.
I don't know how much of this is fundamental to the way computer
hardware is, because of patent protection and IP and so on:
we don't want to expose the microarchitecture.
Yeah.
Right?
Yeah.
And if your fundamental thing is,
I don't want anybody to know the secret sauce I put in,
then you're kind of left with: libraries are the way,
and the libraries will be closed source.
And if somebody wanted to write,
and this is where the AI stuff comes in, right,
if somebody came up with a new operator that's very different from a transformer, then who's going to write its bare-metal
code? The data scientist simply doesn't have the skill. And if we want microarchitecture to be,
let's say, less and less exposed, which, yeah, we want it to be less and less exposed,
how do we create that co-evolution? That, I think, is a huge tension.
Yeah, I think I completely understand the tension that you're talking about, because
in the era of specialization or even, you know, efficient generalization, as you called
it, the microarchitecture is a key part of the secret sauce for a lot of
companies.
But there's a huge software stack that runs on top of that. There's the very-close-to-bare-metal,
you know, intrinsics, libraries, and so on.
But then once you write that, there's an entire stack that sits on top as well.
In the ecosystem that we have for software today, what are places where you think we can
make reasonable progress
as a community, right? Where we don't have to expose the lower innards, but you still think
there are gaps in the ecosystem that could be filled with sufficiently general frameworks and
so on? For example, in the recent past, we've had frameworks like MLIR that are taking off to help
make code generation for domain-specific accelerators
a lot easier. And I'm sure, you know, if you wind the clock back 20 years, the compiler stack
was also equally opaque, and then we had LLVM come up, and it went through a similar transition.
So in the entire stack, where are the places that you think we can make meaningful progress?
It would be great if people actually took a step and tried to build out this ecosystem, so that you have a way to abstract away the
innards: the lower level is still under the control of a particular
company or a particular startup, but the upper layers of the stack are still relatively amenable.
You can bring in things from the overall ecosystem without having to reinvent the wheel or write things all the way from scratch.
So I have a little bit of a controversial view on this, right?
Which is, I think as a community, at least with deep learning, we may have unnecessarily caught ourselves,
as an academic community, in a little bit of quicksand. And again, these are my own
prejudices and things I've learned from doing my startup
and now being at NVIDIA Research as well.
So I view it as, fundamentally,
what is academic research best at?
It's taking an idea
and, in my view,
doing the absolute bare minimum
to demonstrate the idea is actually meaningful.
Building out full products and the whole ecosystem is great if you want to do it, but I actually
think it's not necessary.
So, if I go back to this whole compiler thing and software stack thing, what exists
right now is not super composable.
So if I come up with a clever technique to do fusion,
as an academic, I should not be responsible for showing it works with TF, it works with Torch,
works with JAX, works with this network, that network, this whole plethora of operators and all of
that stuff. Because if it was that all-encompassing, right, that thing would probably be there already, right?
And, kind of looking at the generative AI hype now, right,
if the stuff doesn't work for it, isn't applicable for LLMs, so what?
LLMs have been special-case optimized because they have to be; they're such an important
workload. So I think I'm almost going to make a case
for: identify the one thing,
and then, without having to boil the ocean
for composability and end-to-end in every way,
can we demonstrate that technical value prop
in a way that companies and research labs, who are arguably more equipped
to go and push it into their turnkey composable flow, will then pay attention.
So to give you a simple example, TensorRT is great. It's a bit of a challenge to use,
depending on the workload you're using. And if I come up with a fusion technique, I don't know
if I completely agree that it's my responsibility to have a stacked
bar that shows this technique does X with TensorRT and my technique is better or worse.
How do we come up with a way that I can take my insight, and
then the widget or the artifact I build
demonstrates the intellectual contribution of the insight, without doing all of the composing?
Again, that's kind of where I landed. I can see how others have a completely different view,
which is: unless you do the whole thing, your stuff is useless, right?
Yeah, sometimes it seems like it always depends on what reviewer you get, right? Some reviewers want you to boil the ocean, and some are like, this is insightful, who cares, we've
got to get this information out there. So Suvinay's original question was, where in the software stack can we maybe make some ground? And I think
what you just said was not quite an answer to that, but also very, very interesting, which is, you know,
as an academic community, let's not necessarily think about the whole software stack
end to end, because that shouldn't be anybody's one job. The job of the academic community is to figure out
where there is something interesting, so that the companies or people who are actually looking for
customers and actually looking for users can go there. I think the thing with the end-to-end stack that
you were talking about before that I find really interesting is, there are a lot of chip startups out
there right now. And from a chip person's and a chip
startup's perspective, you always think that you have to have some sort of special sauce in the
chip, and that's, you know, whatever the microarchitecture is, or some sort of selling point.
But as you point out, the reality of trying to get someone to pay you money for your stuff
is that all of these chip startups probably have to end up being way more vertical
than they had intended to be, right? Because the idea starts, in some ways, at the bottom
of the value stack; it's the very first step, and in the end you have to take it all the way to the
top. So I guess what I wonder is, from your experience of running Simple Machines, do you think anybody has a real chance of displacing NVIDIA? Because you have to take it from some
sort of special-sauce thing without the resources of a now $2 trillion company, and without, you know,
the historical stack-building that has gone on for a very, very long time. And now you're asking
something that has maybe just gotten Series A or just gotten Series B, with a few people whose primary reason for working on this startup is a hardware idea, to say, okay, now we've got to do the whole stack.
You know, whether it is libraries or a hybrid or whatever, that whole middle section up to the top has to be great.
So, you know, what are the chances of the, I think a couple of years ago, there was a
New York Times article that said there were 55 AI chip startups, and this was a couple
of years ago.
So what are the chances?
One flippant view of this: I think many companies wanted to be, so, NVIDIA was the NVIDIA
of deep learning.
They wanted to be the, or, AMD was the second to Intel,
they wanted to be the AMD of deep learning. But, I mean, guess what?
AMD is now the AMD of deep learning.
I mean, MI300X is amazing silicon.
It is awesome.
I think the ROCm stack is great.
So I actually think the answer to your question is,
I kind of go back to,
I think the opportunity comes from taking a step back
and looking at business value.
Like what would, let's take a use case like automotive,
for example, right?
Or maybe healthcare is another example. Which is: what are the places where there is business value? Experiment first.
You don't even need to build your chip, right?
You have business value.
Can you validate the business value
with existing hardware, existing software, and so on?
And that should be the first kind of intellectual,
internal trial run, right?
Instead of saying, I'm 10x better, 5x better than somebody, can you prove
that business value? Because with the whole 5x, 10x better, the problem with that is it'll take you
some time to get your product out the door, and in that time the competitors also get better, and the
gap will close, and all of that stuff, right? So I think, in a weird way, I actually believe that because Moore's Law has ended, the opportunities are potentially higher now than even five years ago.
The business model of, I am going to sell a better product to a hyperscaler than what NVIDIA or AMD can? I'm very skeptical about that type of business model
and how it could lead to a viable product.
I think that type of business model
is looking for an exit very quickly at some point
because you have something,
but now all the hyperscalers
have bootstrapped their own hardware teams.
The exit strategy doesn't seem viable either.
Yeah, that
makes sense. So I think there
was a book, The Innovator's
Dilemma. Have you
read this? Yeah. And so
what he talks about is in order to be
truly disruptive, you sort of have to come in
from the bottom on the side
of something. I mean, I'm
totally paraphrasing the idea,
but like how Arm got started is like,
they're doing tiny little chips somewhere on the side
and then suddenly, boom, wow,
they're selling, you know, a billion chips a year.
And then they move up from there, right?
And so it sounds like what you're saying is like,
somebody should figure out, okay,
there's some side market that's not the stuff
that's in the center of tech.
It's not, you know, selling
Xeons to Microsoft or whatever, but it's like, yeah, the stuff that's going into the medical devices that
are scanning whether or not your spinal cord is good or your knee is good, or whatever laparoscopic-surgery-type
stuff, and then build something from there, because there's more opportunity.
And then from there, you, you know, build up.
I mean, I think the model and innovators dilemma is like,
you gain market share there, you build up your revenue,
you use that revenue to actually fund moving up.
And there's another thing, which was,
I remember having this conversation.
What do they call it?
They call it the business model canvas.
It's a pretty standard template where they say,
what's your unfair advantage?
What are your channels?
And then what's your revenue model and so on.
And this totally applies for software startups.
What they say is you should at least fill out 50 versions
of a business model canvas.
Or like a very large number before you decide, this is what my
company is going to do.
Right.
And then, yeah, so that leads to, you know, how did you decide to start Simple
Machines?
I've been curious about that.
Like, how does one decide to start a company?
I think, honestly, it was a little bit of luck, timing, and overconfidence about where my lab
and my students were.
So at the time, this was 2015, 2016, some of the ideas of
specializing for CNNs had just come out, and people had had papers on production trees
and so on. We had just gotten our stream-dataflow design working, which is a dataflow array, and then we put a
stream engine on it, and we were like, oh, holy crap, we solved everything. And I became convinced that
this is a solution, and our biggest advantage is nobody will believe
how good this is.
So we should go build it.
And then, at the time we were trying to raise money, Google had just made public their TPU chip.
So I remember I would go give these talks to people, and they would initially be
quizzical: what are you talking about? Like,
I don't understand, why do we need new chips? What do you mean we don't have enough compute?
And then I'd point to the Google thing, and they were like, holy crap, what this guy is
saying makes sense. And so that kind of helped us raise some money initially to get stuff going. But it was really that we had this belief
that you needed a way to programmatically, this sounds a little bit cliched, but in a programmatic
way, move data in and out, and simple loads and stores were not enough. And once you did that,
you needed a dataflow array that could morph into what the operators were.
So we built those two, and we were able to show some kernel execution and so on.
And then, this is a little bit flippant and a little bit serious: because I'm a tenured professor,
I call myself the entrepreneur without any risk, which is a total oxymoron.
It's like, worst-case
scenario, go back and become a professor, right? Right. My students were very, very excited.
So I wanted to circle back to another thing that you talked about: what is academia really good at?
It is taking a step back, you know, figuring out what are some common patterns and how to formalize them into useful concepts and idioms. So from
your vantage point right now, are there certain gaps that you see in this formalisms landscape,
or in this principles landscape? I really enjoyed reading your papers that, you
know, distill the principles around specialization, starting from your HPCA 2016 paper
and so on. In a similar vein, fast-forwarding a few years since that paper, are there places
in the overall ecosystem where you see some formalism gaps? You talked about the analogy
with coherence and consistency, you know, maybe 20 years back. Are there similar flavors
of problems that you see today? Things where you feel people know sort of the mechanics
of certain solutions, but there still isn't a very elegant sort of, uh, set of idioms
that people can understand and then build on top of, and boil ideas down into: okay, here
are the key principles that we're building on.
I mean, this is completely my own intellectual prejudice, and my own biases on where I think industry is,
from whatever leadership vantage point I've had in industry, and where academia can help.
The thing I feel we can and should do more of in academic research is, almost,
and I'm not saying this because I've done a lot of clean-slate work in my research, right? I don't want this to come off as: don't do clean-slate research.
For better or worse, DL architectures seem to be coalescing toward a core, a vector
processor, a matrix engine, and some kind of thing I'm going to call a stream engine, since that's the
name we used in our research, right? Other people have different names for it, right? MI300X, Ampere,
Hopper, whatever has been disclosed about MTIA, TPU, at some level they all look like this. Different
companies have different trade-offs on, should I have a big matrix engine or a small matrix engine, and so on. So where I'm getting to is: this seems pretty good. And there
has been lots of other work that I've looked at. If I just take the efficiency of an ALU,
and everything else went away, right, the TOPS per watt of just that thing, compared to silicon you can buy today,
the gap's not huge. It's about 5x or so. Which teaches us something, which is that 5x is simultaneously
very large and simultaneously very small. If I don't change the algorithm, I'm not going to be able to build some new silicon
that's 20x better than what
NVIDIA is shipping today or AMD
is shipping today.
So, going back
to your question,
I think going back and looking at
principles that say, if I do this,
I can improve
stuff by 20%, 30%,
is something
that I would love to see more of. Because everybody is trying
to write the 10x, 20x-better papers, which are all, I mean, they're great, I'm not criticizing
any of them. But many of them are bringing a sledgehammer, where the insights are very hard to extract. And even if there's
an insight, I have to do so much work as a reader to figure out what is there, when I set aside all
of your clean-slate stuff, that I can then extract out and put back into what looks like a pretty
robust thing people have converged on, right? So I think that's kind of what I'd love to see more of. Whether you do
clean slate or do something that's directly applicable, I'd love to see every paper almost have a
subsection that says: how might I take these insights and apply them to silicon as it exists today?
It sounds kind of boring, but I think, in a weird way, that might be one of the most impactful things we can do, particularly because Moore's Law has ended and ideas matter like crazy, right?
If you have a good idea, people will adopt it.
Yeah. I think, Suvinay, is this similar to what Norm was saying in our interview with him?
So we did this special episode for the 50th anniversary of SIGARCH and ISCA and all that a while back,
and one of our guests was Norm Jouppi.
And I think we talked a little bit about what academics versus industry should be looking at, and I'm almost certain that
he said something very similar to this, which is that academics should take sort of, almost,
I don't think he used the words small ideas, that's not the intention, but more
like something very well contained, and just figure the heck out of it. Because who is more equipped to actually figure out what's going on
end to end and make a full product? It's industry, that's for sure. You know, academics don't
have access to reams of data. They don't have access to, especially when you're talking about
hyperscaler-type research, the same number of machines. They don't
have access to, you know, proprietary stuff. And then in the end, there's so much, as experience tells anybody who
works in this industry, so much that has to go into actually making something work,
that is behind the scenes, or special sauce that's in the company that's not exposed.
And so.
No, I think the sentiment has been echoed by multiple people.
Although we have seen about five years of development of different accelerators,
plus a plethora of papers in this
particular space, I think it's a good time to sit back and as a community maybe develop a pedagogy
around this. Like you said, for example, today most of these accelerators for deep learning in
particular have converged around having a vector engine, a matrix engine, and a few different data
flow primitives as well. So can we distill those things and say clearly,
here are the kind of ideas that could move the needle
and here's your upper bound on what you could achieve
over a reasonably well-defined
or well-engineered architecture, right?
So that we understand where are the rooms for improvement
and where are gaps where we don't know
how to tackle certain kinds of problems.
There are certain problems that you already know
how to tackle reasonably well,
and you might be within 2x, maybe 5x
of a well-engineered solution that's there in industry.
But it would be good to have that crystallized
and say, here are the set of ideas.
Here's where you can move the needle.
Here's the maximum upper bound of these things.
And I think that would be incredibly valuable
now that we have the benefit of hindsight
and looking back on all of these papers.
I think a good survey paper, not just a survey, but a survey combined with some quantitative
analysis or a lens through which to view all of these different ideas, I think would be
very valuable.
Yeah, I mean, we've been thinking about this.
My student and I have been writing something up; we don't know what to do with it.
It's currently called The Top 10 Myths of Deep Learning Hardware.
And it's intentionally written that way.
And the first myth is GPUs are inefficient for deep learning.
I mean, that's a catchy title, that's for sure.
I mean, it sounds like you should keep going with it.
I mean, those are the kinds of things that people remember, right?
I'll quickly say that I've generally enjoyed your unrestrained and colorful critiques of
various ideas, you know, starting from your favorite-simulator-considered-harmful,
or the power struggles between RISC and CISC, and so on.
So I think those are very catchy titles, but with a kernel of truth in them as well.
Yeah.
Please continue.
Yeah, no, no, I was just saying,
I think one of the things you were trying to distill there
is I'm glad that at least from both of you,
I'm hearing some validation for that thread.
It originated a little bit from my disgruntlement
with, not even the reviews I was seeing, just
the core of how people were approaching papers in reviews.
I was like, I think we need to do something else.
So that was kind of what triggered this.
And yeah, so we'll see what shape it'll take and try to figure out what exactly to do with it
So it sounds cool. Um, maybe, maybe a SIGARCH blog post?
Yeah, that's probably the best venue, because, uh, I think it gets a lot of, a lot of, uh, hype, even more than an actual paper. Right, right, exactly.
The end goal is for people to read it, not necessarily for me to have a paper published, right?
So maybe this is a good time to wind the clocks back and talk a little bit about your journey.
So what got you interested in computer architecture?
How did you end up at the University of Wisconsin-Madison?
Maybe any interesting inflection points along the way during that journey?
So my journey is probably
a slightly odd one, and some of you may know this, some may not. So in the early 80s,
among the original wave of personal computers, there was something called the BBC Micro.
And this was announced, I think, in the mid-80s or so.
And I grew up in India.
So in my school, I remember very, very vividly
that when I was in fourth grade,
nobody knew how to teach computer science.
They had this whole other organization
come and teach computer science with the BBC Micro. And I remember it just had BASIC; I think,
yeah, it was just BASIC at the time. I was just hooked. At that time I was
like, oh wow, I just love this, right? And that's how it started. And then I was incredibly lucky to get to work with
Steve Keckler for my PhD, who is amazing. And then the timing worked out: they were starting the
TRIPS project at the time, so I got to work on that, which was great, to be able to build a chip as a PhD student.
And then after that, I got to Wisconsin.
It's a great place to be.
And I actually even started enjoying all the four seasons there.
After I had kids, they got into skiing, then I got into skiing.
So that's my journey.
And then, after about 10 years at Wisconsin, is when my Simple Machines journey came along, to say, well, these ideas look pretty interesting.
I should do a startup and see what I learn.
So, I know Lisa asked about this.
The company itself: at a certain point, we were contemplating, we had gotten our first product done, and we had to raise money for the second one.
We didn't quite see customer revenue and customer uptake. So we were like, this is not quite going anywhere.
So we decided to fold the company. And, I mean, like I said,
I think it was four years.
That's all I did.
I went on full-time leave.
I wasn't really publishing.
Every minute of it, I loved.
It was such an incredible journey,
very different from being an academic.
Lots of gray hair,
lots of pressure that I didn't have as an academic.
But now you're
responsible for people's paychecks, right?
So do you think you would ever start another company?
I'm actually going to say yes, because I absolutely love the journey.
And even though there's a lot of heartbreak and pain, my wife sometimes
jokes, after Simple Machines, and I don't want this to come off as
kind of rude, but she was like, dude, the amount of highs and lows you had in a day
running Simple Machines, you're not going to have in a year as a professor.
Yeah.
I think it's mostly the thing that's very different.
I've tried to recreate it as best as I can in my group: just this intensity of, like, 10,
30 incredibly talented people all working on this one thing, and they all completely believe in
it, absolutely, right? It's almost like endorphins are in
your brain all the time, right? You just love going to work. You forget how long you've been working.
There's all this stress, but it's very different.
It's almost, it's not quite us-against-them,
but you completely believe, and there's no finger-pointing.
Everybody's just working on this thing.
And you completely believe you've got the answer.
And I think that's the reason why I'd do it again.
That sounds cool.
It does sound cool.
Yeah, it's hard to get that sort of feeling sometimes
in a large company
because you need massive levels of alignment, right?
Yeah, yeah, yeah.
And also because I was very, very lucky with the investors. I remember,
after we shut down, I was having ice cream with my kids somewhere here, and one of
my investors was walking by. I was like, oh god, what's he going to say? And it was really funny. He was
like, how are you doing? Good, good. I'm glad you're with your family. Next startup, call me. I'm in.
Well, that's an endorsement.
That's great.
Yeah.
I recently read a book about venture capital in the United States.
I forget what it was called.
Oh, shoot.
It was really, oh, The Power Law.
It's called The Power Law.
Oh, yeah, yeah, yeah.
Have you read it? Yeah, oh, The Power Law. It's called The Power Law. Oh, yeah, yeah, yeah. Have you read it?
Yeah, yeah.
And I mean, one of the main premises of it is that they fully expect
many of their ventures to not work out, right?
That's the name of the game.
But some will.
And but as they look for that some,
they're also just looking for talent, right?
Yeah.
People who they would bet on again.
So it sounds like you're one of them.
Yeah.
Well, we'll see.
Well, cool.
So I think maybe one of our final questions would be like,
you know, what's coming up next for you?
You know, it sounds like you would start another company,
but there's nothing on the horizon,
or maybe there is,
or what's exciting you these days?
You know, maybe you're not having the same kind of highs and lows as a
professor, but, you know,
presumably something is getting you to take your bike through the Wisconsin
winters to get to the office.
I mean, I'm working on a couple of things.
One, I'm at NVIDIA Research part-time.
So there are many aspects of that I can't talk about,
but the work is super, super exciting.
I'm very thrilled to have that opportunity
so I can talk more broadly about my thing
I'm working on at the university.
So I'm beginning to look mostly at the intersection
with, really, economics.
What happens when Moore's Law ends?
And it has ended.
How should architects think about it?
And what should fundamentally the chip design cycle look like?
And we have this new project.
We're basically trying, inspired by other fields
that are hugely data-driven.
We don't have a lot of data on how we design an architecture.
So we're trying to, kind of,
and this is based on some of the learnings
from my startup and other things,
figure out: how would we,
how do we react to this new economics,
and how do we kind of extract more data?
How do we do it in a privacy-preserving way, and so on?
So those are the things I'm
spending a lot of my time on, to see, almost, what is post-Moore architecture,
right? What should we as a field be thinking about?
Just expanding on that: like many academics, I
know that you're also very passionate about teaching.
So, for the coming crop of students, especially in this changing landscape, with the demise of Moore's Law as well as exciting new architectures and new application paradigms and so on,
what are things that might be different that you would want to teach the students, compared to our standard pedagogy of the computer architecture curriculum?
No, I mean, that's a great question. And that's actually something we've been working on.
What I would like to do is to find a way to,
in a weird way, teach more,
and also find a way to teach more efficiently.
So, without students having to take three courses,
how do I teach them enough about the basic in-order
processor, and also expose them to something beyond that, right?
And some of this is inspired by, I think, things that have happened in
software engineering and programming languages, where the productivity of
programming languages has just exploded, right?
What you can do with 50 lines of Python is crazy.
So can we find a way to introduce, not quite new languages,
but how do we just get them quickly to be able to have, like, a turnkey flow?
One of the things we've been looking at is how to deploy a
turnkey RISC-V processor they can build, put it on an FPGA, run a program
like full AlexNet on it, and do all that in about two-thirds of a semester. So they kind of
build out an entire RISC-V processor that's running real stuff and running a very large
regression, and then add more to it that exposes them to ideas beyond that, right?
So it's been a thing I've been thinking about a lot,
which is: for undergrads who take maybe one or two architecture courses,
what's the maximum amount of things we can teach them,
and how do we also teach it more efficiently so they can get to more depth, right?
The grad one, honestly, is the harder one, because what I've struggled with is,
it's unclear to me, for the classic papers from, like, the 90s and so on, on processor design and so on,
what principles exactly they are teaching, and how do we motivate
students to extract out the right principles from those papers.
And the reaction to that is: just read more recent papers.
So I tried out an experiment, trying to do something in between, where everybody would
be required to pick one topic and articulate it.
And I gave them five, ten papers on that topic.
And the goal of the course is
they have to write a one pager
that identifies a new sub problem in that topic.
They don't need to do any experiments,
no simulation, no nothing.
All they need to do is to identify a new research problem.
They don't even need to have a solution, right?
This is a problem that exists, right?
In my mind, that was a way of kind of teaching them the skills of becoming a researcher in
a different way than reading the canon of what is architecture.
So I'm still figuring out what I learned from
that experiment. So students seem to love it, but I'm still trying to figure out how to improve it.
Yeah, that sounds like a fascinating direction of inquiry, both into pedagogy and also how
you teach students. I like that. I know one other professor, at Stanford, who conducts a similar exercise with students, where the essay, or one-page paper, that you write is not describing a solution, but just describing a problem.
You don't need to have a solution; you're just pointing out that, hey, here's an interesting problem that I've noticed,
and I don't think I've seen any good solutions for it,
or here's what I've just observed so far.
And students really enjoy that entire process. So I'll give a plus one to that particular approach.
Yeah, it was very interesting, because they all said, universally, this was much, much harder than anything they've done,
but less time-consuming.
That sounds efficient.
Efficient generalization of hardware architecture,
efficient generalization for students. Yeah, that's wonderful.
So maybe before we close,
you can ask any words of wisdom to our listeners,
any advice that you would give to students,
researchers, professionals in this particular space.
Oh, wow.
Words of wisdom.
I'll tell you something flippant.
Be contrarian and try to be right.
That's very efficient as well.
Be contrarian and try to be right.
I mean, those are two key sides, right?
Like be contrarian and just be a jerk
and just like say no.
Yeah, you need to be contrarian, right?
Don't work on deep learning.
Cool.
Well, thanks so much, Karu.
We're so glad to have had you on the show today.
Awesome.
It has been a wonderful conversation. I think, you know, given that our audience is pretty much all computer architects, and we all love hardware, we love hardware architecture, to think about what it takes to translate some of the ideas that are published in the academic world into something like a hardware startup has been really, really fun.
And to talk about some of these business aspects too
has been really fun. And I think in the end, you know,
I remember once I went on vacation in Costa Rica and we were walking around
these coffee farms and they were saying,
cause there's a lot of coffee beans that are sourced in Costa Rica and that
the farms were, like, at the
bottom of the value-add chain. So they would say, you know, here's this giant bushel of beans,
which somebody has grown and picked and whatever, and it was, like, two dollars for 20
pounds of beans or something like that. And I was like, wow, it totally stinks to be at the bottom of
the value-add chain. You can't even buy a cup of coffee at Starbucks for $2.
And yet all these beans, so much work, so much infrastructure, so much effort.
And somehow the value gets added up and up and up the chain.
And I remember thinking that hardware was like that, right?
So much investment, so much infrastructure.
And then in the end, in some ways, it's a commodity.
And then the final value add is this user experience that you talked about. And I don't
think we necessarily think about that as a community. Yeah. So thanks.
Awesome. This is great. Thank you. That's a very nice metaphor. I'm going to steal it
next time. Coffee beans and hardware. Yeah, sounds good. Sounds good. Thank you so much for joining us today.
Great. Thank you, Lisa.
Yeah, I like Lisa's sentiment.
It was wonderful talking to you. Thank you for your
candid thoughts as always.
And to our listeners, thank you for being
with us on the Computer Architecture Podcast.
Until next time, it's goodbye from us.