Microarch Club - 10: Thomas Sohmers
Episode Date: February 28, 2024

Thomas Sohmers joins to discuss dropping out of high school at age 17 to start a chip company, lessons from the successes and failures of past processor architectures, the history of VLIW, and the new AI hardware appliances he and his team are building at Positron AI.

Thomas on X: https://twitter.com/trsohmers
Thomas' Site: https://www.trsohmers.com/

Show Notes
Welcome Thomas Sohmers (00:01:22)
Growing Up Around Computers (00:03:13)
Digging Beneath the Software (00:05:56)
Learning Python, C, and Arduino C (00:07:05) https://www.arduino.cc/reference/en/
Learning About the Thiel Fellowship (00:07:44) https://thielfellowship.org/
Starting Research at MIT at Age 14 (00:09:24)
Dropping Out of High School and Starting the Thiel Fellowship at Age 17 (00:10:36)
MIT ISN Lab (00:11:09) https://isn.mit.edu/
Evaluating ARM Processors for High Performance Computing (00:11:28) https://en.wikipedia.org/wiki/ARM_architecture_family
ARM Calxeda Processor (00:11:38) https://en.wikipedia.org/wiki/Calxeda https://www.zdnet.com/article/what-the-death-of-calxeda-means-for-the-future-of-microservers/
Scaling Out Low Power Processors for Data Center Compute (00:12:27)
Incorporating REX Computing (00:13:42) http://rexcomputing.com/ https://fortune.com/2015/07/21/rex-computing/
Facebook and the Open Compute Project (00:14:18) https://www.opencompute.org/
Deciding Against Arm (00:14:49)
ARMv8 (00:15:12) https://en.wikichip.org/wiki/arm/armv8
Deciding to Design a New Architecture (00:16:26)
Multiflow (00:18:23) https://en.wikipedia.org/wiki/Multiflow
Good Architecture Ideas from the Past (00:18:35)
Thomas' Talk at Stanford (00:18:59) https://youtu.be/ki6jVXZM2XU
RISC vs. CISC Debate (00:19:37) https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/risccisc/
SPARC Instruction Set (00:20:04) https://en.wikipedia.org/wiki/SPARC
The Importance of History (00:20:58)
RISC Came Before CISC (00:23:08)
CDC 6600 (00:23:20) https://en.wikipedia.org/wiki/CDC_6600
Load-Store Architecture (00:23:53) https://en.wikipedia.org/wiki/Load–store_architecture
IBM System/360 (00:24:02) https://en.wikipedia.org/wiki/IBM_System/360
PowerPC (00:24:29) https://en.wikipedia.org/wiki/PowerPC
VLIW (00:25:02) https://en.wikipedia.org/wiki/Very_long_instruction_word
ELI-512 and Josh Fisher (00:25:05) https://dl.acm.org/doi/pdf/10.1145/800046.801649 https://en.wikipedia.org/wiki/Josh_Fisher
Floating Point Systems, Inc. (FPS) (00:26:45) https://en.wikipedia.org/wiki/Floating_Point_Systems
Multiflow Compiler (00:26:52) https://www.cs.yale.edu/publications/techreports/tr364.pdf
Instruction Level Parallelism (00:27:33) https://en.wikipedia.org/wiki/Instruction-level_parallelism
Intel Itanium (00:28:20) https://en.wikipedia.org/wiki/Itanium
Itanium Is Not a VLIW Architecture (00:29:04)
Explicitly Parallel Instruction Computing (EPIC) (00:29:22) https://en.wikipedia.org/wiki/Explicitly_parallel_instruction_computing
x86 and Pentium (00:30:18) https://en.wikipedia.org/wiki/X86 https://en.wikipedia.org/wiki/Pentium
Impact of Branch Prediction and Caching on Determinism (00:31:34) https://en.wikipedia.org/wiki/Branch_predictor https://en.wikipedia.org/wiki/CPU_cache
Why Itanium Failed (00:32:27)
REX's NEO Architecture (00:35:29) http://rexcomputing.com/#neoarch
Hard Real-Time Determinism (00:35:41)
Scratchpad Memory (00:35:54) https://en.wikipedia.org/wiki/Scratchpad_memory
Removing Memory Management (TLB, MMU, etc.) (00:36:18) https://en.wikipedia.org/wiki/Translation_lookaside_buffer https://en.wikipedia.org/wiki/Memory_management_unit
ALU, FPU, and Register Files (00:37:14) https://en.wikipedia.org/wiki/Arithmetic_logic_unit https://en.wikipedia.org/wiki/Floating-point_unit https://en.wikipedia.org/wiki/Register_file
Benefits of Removing Implicit Caching Layers (00:38:30)
VLIW in Signal Processing (00:39:51) https://en.wikipedia.org/wiki/Digital_signal_processor
VLIW Won in a Silent Way (00:40:49)
Original Reason for Hardware-Managed Caching (00:41:26)
Impact of VLIW and Software-Managed Memory on Compile Times (00:42:41) http://www.ai.mit.edu/projects/aries/Documents/vliw.pdf
LLVM and Sufficiently Advanced Open Source Compilers (00:42:49) https://llvm.org/
Apple Transition from PowerPC to x86 to Arm (00:43:31) https://en.wikipedia.org/wi...
Transcript
Hey folks, Dan here. Today on the MicroArchClub podcast, I am joined by Thomas Sohmers.
Thomas is currently the founder and CEO of Positron AI, which recently emerged from stealth with its first product, Atlas,
which is a transformer inference appliance.
We talk about what makes Atlas different than traditional inference hardware,
and how Positron is able to deliver significantly higher performance per dollar compared to current GPU solutions.
However, we start out at the beginning of Thomas' fascinating journey, which began with dropping out
of high school to start a chip company at age 17. We cover his experience in the Thiel Fellowship
program, why there's a need for a new instruction set architecture, and how he and a team of three
other engineers were able to tape out and deliver a new chip in a highly compressed time frame. Thomas is an ardent student of computing history, which made this discussion
a ton of fun as we connect the dots from research papers in the 70s and 80s to the cutting edge
processors of today. Before we jump in, I want to thank Michael, who is the co-founder and CTO at
Lambda, for initially connecting Thomas and I a few months ago. With that,
let's get into the conversation. All right. Hey, Thomas, welcome to the show.
Hey, thanks for having me.
Absolutely. For folks who don't have context, which I guess is everyone,
Thomas and I spoke a few months back. Michael over at Lambda was nice enough to connect me
with Thomas. I know you all have worked together in the past. And I was just trying to talk to folks
who had experience that I was interested in as I was learning more about processor design and
trying to learn more about the industry. And Thomas, you were nice enough to spend some time
chatting with me. And then you have a new venture that you're working on, which I'm sure we'll talk
about at some point in this show. So now is kind of a great time as you all have come out of stealth
to circle back around and have you on the show. Yeah, well, thanks again for inviting me. And
yeah, always excited to be able to talk about computer architecture and computer history.
Absolutely. Well, you know, it's interesting. As a little behind-the-scenes here, I guess, I haven't released any episodes yet.
So I've just been kind of doing recordings of this show.
And most of the folks I've talked to worked in the industry kind of in the 70s and 80s
and are now retired and are very happy to talk about their experiences.
But I'm also trying to get folks who are hands-on today and really get a sense for, you know, what has inspired the work that's happening today, as well as what, you know, modern work on processors looks like. You have a very, very interesting background,
though, that I think I'll probably not have anyone else on the show who has something similar, but
you, I believe, dropped out of high school, right? After getting the
Thiel Fellowship. Do you want to talk a little bit about, you know, how you got interested in
computer architecture and then how that whole process came about? Yeah. So I think some of my
earliest memories, like three, four years old, were when I was in front of computers. So I've
been a computer user and very interested in them since then.
So this is right at the turn of the millennium.
So, yeah, back, you know, still like early memories of dial-up tones and such.
But, you know, the fact that my parents let me on the Internet in, you know, 2001 when I was five.
Right.
Maybe, you know, good or bad idea.
Not sure.
But I just had to, you know, it was my obsession to be playing with computers.
And I started to connect, like, more than just using it as a tool, like, the interest in, okay, I could play games, but I can, you know, actually learn things online. Throughout elementary school I was very into trying to research as much as I could and just diving into Wikipedia for three-plus hours at a time. But yeah, my dad also purchased electronics kits from the 60s and 70s, the old, you know, 100-projects-in-one things. And so I also then got very interested in, okay, not only can computers do all these amazing things, but I can actually take, you know, a little piece of cardboard with a whole bunch of components and follow this instruction booklet, which I think I actually started using those booklets before I actually knew how to read, and can make a, you know, light turn on, and then, you know, graduate to a crystal radio, and so on and so forth. So, yeah, it was really those early formative years that got me interested in electronics.
Um, and then by the time, you know, late elementary school, middle school, I just
would take apart everything in the house. And thankfully had parents that were okay with me
taking things apart as long as I convinced them that I could put it back together. And I had a
pretty good success rate of that. But even when I couldn't put something back together
if I showed that I actually learned something about it, then they weren't too mad.
That's awesome. We've talked a little bit, you know, about kind of like digging through abstraction layers. And, you know, I've heard a lot of folks who have, you know,
somewhat of a similar early interest in computing who end up being software engineers,
but you kind of pushed down and maybe it was some of that, you know,
electronic kit influence and that sort of thing.
But what was it that kind of, you know, brought you from using computers
to want to dig down and understand, you know, mechanically how they work?
Yeah.
I think with those early experiences with just really basic electronics kits, I really appreciated putting my hands on things, and especially experimenting, like, okay, are there other ways that I can get this to function other than what the instructions tell me? And I think it's, you know, similar, probably unsurprisingly: I loved Lego sets, but I would go buy a Lego set,
put together what it said, at least maybe half of it,
and then have more fun just making something entirely new.
And so that hands-on aspect, I think, was just always with me.
And then by the time I actually started learning, you know, computer programming, thankfully, in middle school, I was able to get into a public charter school called the Advanced Math and Science Academy.
And they actually, you know, were starting teaching Python in sixth grade. And so I actually got much more interested when they were teaching Python, you know, learning that. But then on the side, you know, I had Arduinos and a few, you know, BASIC Stamps and stuff like that, and started more on the C side,
but also learning Python. But I guess even though I learned and could appreciate and build some things on Python and higher level software side,
there was nothing like writing C code
and then getting the very simple Arduino version of C
and being able to get that to run and actually blink an LED.
So it was really just always on that, on that hardware side for me.
Gotcha. And so then you get into high school, and I assume that's when you applied to the
Thiel Fellowship. How did you learn about that? And what was kind of the process for getting
into that program? Yeah, so when I was in eighth grade, so still middle school,
the director of the computer science program at my high school
recommended that I go to this event at the Google Cambridge office. And I grew up in central Massachusetts. And so there was a day, you know, encouraging, you know, middle and high school students that they should, you know, study computers.
And I was already pretty hard set on that.
But there was a woman there who was, I believe, a manager, director or something at iRobot, which is a local Boston area company. I had met her, and she thought I was smart. And then a couple
weeks later, she emailed me out of the blue through my school and sent a link to the announcement of
the Thiel Fellowship. This was back in 2011. And she said, you know, she knew one of the organizers of the Thiel Fellowship.
And she said that it looked like something interesting for me.
You know, the kind of crazy thing about it, I was, you know, 14 at the time. I was, yeah, a bit young for it. So around the same time, in 2010, I started actually being a research affiliate at a research lab at MIT, and worked there for three years. And then finally, in 2013, I got accepted into the Thiel Fellowship. That's incredible. I mean, you obviously showed quite
a lot of promise, but also really, really great of her to reach out. And, you know, I think a lot
of times we look at folks, you know, maybe being in eighth grade or something like that and assume, well, maybe they can do that later. But that's really cool that someone was looking out for that and, you know, thought it's never too early. So that obviously had a really positive impact.
I think the good thing was, the director of our CS program at our school, I don't think she realized that dropping out of school was, like, the encouraged thing. And so I'm sort of doubting that, if she knew that, she would have made that connection. But, right. But
yeah, it was, you know, pretty serendipitous. Absolutely. So you get into the Thiel
Fellowship and, um, I believe there, um, is kind of like a period of time where you're kind of
figuring out what to work on, right?
How long was that period of time?
And then obviously you and some other folks founded REX Computing. What was kind of like the process of getting into the Thiel Fellowship and then getting to that point of starting the company?
Yeah.
So the application for the Thiel Fellowship at that point closed, you know, December 31st, 2012, so I applied then.
But for basically the past year and a half, roughly before that, the main project I was working on at MIT was maintaining and doing tests on the cluster computing system at the ISN, the lab I was at.
And the particular focus,
I was evaluating a whole bunch of new ARM-based processors
for use for cluster and high-performance computing.
And we were the first to benchmark the Calxeda,
if anyone remembers them,
the first ARM server processor back in 2011, 2012
timeframe. And so, you know, those were really my formative years learning about distributed computing, and aspects really at a much higher level than the actual processor architecture. But I was very interested in the idea that processors that were designed for embedded systems,
ARM at the time was really just embedded in industrial and mobile,
and really smartphones were really only just starting to really kick off,
and ARM was getting some real steam with that. But the key thing there when
applying for the Thiel Fellowship was this idea that I wanted to bring that sort of research work
and my belief that, okay, these low power processors are going to be able to actually
scale out, and utilize them for general, you know, server, industrial, you know, web server processing. And that was, I would say, a pretty risky, novel idea; you know, Calxeda at the time was basically the only company pursuing it. And so that was my Thiel Fellowship application: utilizing and continuing that work, but actually starting a company around it.
So there was a whole multi-round interview process for the fellowship and then got selected in May of 2013 and moved out to the Bay Area in June.
So basically that start was, like, really just getting the lay of the land in the Bay Area. You know, I was 17, had just moved out,
never having been away from home for more than two weeks.
And so getting to meet the other folks in the Thiel Fellowship,
there's 20 selected per year, all under 20 years old.
And I spent basically that summer getting the lay of the land, and actually incorporated REX in August of 2013.
Yeah.
So pretty quick from the beginning.
And when you incorporated, you already had kind of the idea for what you wanted to do
as well?
Yeah.
Well, the idea originally was generically, you know, go after low power, high performance computing.
And the target market for that was big data, that being the big buzzword at the time, data centers.
And the earliest folks that I was talking with were at Facebook.
So I got involved in the Open Compute Project pretty early on.
Later became a coordinator for the high-performance computing group within OCP.
But yeah, effectively, the really high-level outline was there.
The first about six months after that, basically through the end of 2013,
was really focused on utilizing ARM and was after having a lot of conversations
with all of the ARM chip vendors, ARM themselves, looking at what IP
and core designs they had in their roadmap, et cetera, that
came to the sad realization that it's not quite there yet.
So ARMv8 specification just came out that summer.
All of the chips out there were still 32-bit ARMv7.
And so ARMv8 was still right around the corner, but even that first generation of them, I didn't think was going to be really competitive in the market against what was already out there. So that said, okay, well, if ARM isn't there... I think the general thought was, okay,
with ARMv8, they're actually embracing a lot of the CISC elements of x86. And they're still not actually providing a real competitive advantage. And the real root of the idea, like end of 2013,
early 2014 was, I have this Thiel Fellowship; you know, I'm getting $50,000 a year, which is nothing, you know, to live on in the Bay Area, but I was okay living with, you know, five, six other people. So basically what the Thiel Fellowship really
gave me was that freedom to think a really crazy idea, like, why don't I make my own processor architecture? And to actually take the time to explore that in a way that didn't feel like the immediate time pressure of, I need to support myself beyond the fellowship.
Right. That makes a lot of sense. What about the ARM IP that was available? You know, you're kind of saying that you have that freedom. So maybe
this wasn't as much of a consideration. But was the cost of licensing ARM IP, did that come into
play? Or was it mostly just like the capabilities weren't there at that point?
It was really the capabilities. Thankfully, had some really great connections with folks at ARM. And they were trying to be, you know, extremely supportive.
Early on, from, you know, the connections made at MIT, the folks at ARM that wanted to go up to, you know, higher performance and capabilities, you know, were trying to be as
helpful as possible.
But as we've seen in this past decade,
ARM's only actually gotten competitive in the past three or four years.
So it really took time for all of that to evolve.
And I should say that we also evaluated other architectures. And I think a large part of the drive to think about designing, you know, a new architecture from scratch was seeing that there were plenty of what I thought were good, interesting ideas, you know, be it a decade before that with, you know, MIT RAW and Tilera, and then, you know, later, as I learned about Multiflow two decades before, etc.,
that my thinking at the time, especially being in the Bay Area
with the entrepreneurial vibe and a lot of people's thinking
being around, there were plenty of good ideas in the past
that failed, not because they were bad ideas,
but it wasn't the right time,
you know, being a big, big thing,
kind of lent credence and belief on my part
that, you know, this could actually be the right time
to do something new.
Absolutely.
That's one of the things, so, you know,
most of the research I did for our discussion today
is from this talk you gave at Stanford where you really go through what y'all built at REX,
essentially. And I'll link that in the show notes and would encourage folks to watch.
But one of the things that I really appreciate, and we'll talk about some of these concepts here
in a second, is the callbacks you do to talk about these designs that have maybe been used in the past or concepts that have been
used in the past, and in some cases are kind of like mocked, right? They're regarded as not working out, and you kind of go through it and debunk some of those. That's something I find really interesting, looking at even like the RISC versus CISC debate, right? Because that happened in the 70s and 80s and things like that.
And then we seem to go the way of CISC architectures with x86.
And then obviously, right, with what we're talking about with ARM and RISC-V and, you
know, some of the stuff y'all were doing, obviously RISC has come back into vogue here.
Actually, tomorrow, as another kind of behind-the-scenes note, I'm recording with Robert Garner, who designed the SPARC instruction set, and we're focusing on register windows and talking about, you know, why they succeeded and why they didn't. So a lot of times the reason things didn't succeed is not that they just weren't good ideas, right? There was some other factor. And maybe we can use that to kind of jump into what y'all were starting to build there. So I think my main takeaway was
like two big concepts from the architecture that y'all designed. And that was VLIW, so very long instruction word, and scratchpad memory were kind of like the
two things that I took away from that talk.
Do you want to maybe start with VLIW and talk about the history of that and why y'all chose
to go with that architecture?
Yeah.
So I think a big reason I gravitate towards the history is, you know, other than my love of electronics as a kid, I was very bored in school in general.
But the one class I was always very actively involved in throughout school was history. And so I just loved learning about, you know, everything from, you know,
ancient Egypt and Rome to, uh, World War II and, and, uh, really getting an idea of, you know, the,
the causes and effects and, you know, how, how, you know, everything has branched out, uh, from,
you know, over, you know, only a few thousand years of, you know, modern history.
So that always fascinated me. And then, you know, computer history and how, like the fact that really, you know, electronic
or if I expand a little bit, mechanical computers have a much, you know, shorter, you know,
less than 150, 200 year time frame.
And the evolution there has been remarkable.
And I just, me learning any of the technical details of things,
it just becomes so much easier.
And I think I make deeper connections,
both from a just personal enjoyment standpoint,
but also like being able to analyze things
in I think a different way than others
by actually learning the real, you know, historical line of things. Like if you understand
why, you know, in general, smart people made certain decisions that in retrospect,
people ridicule. If you actually look back and look at the context of how they came to that decision, it gives a lot more insight and usually suggests that, no, they weren't dumb.
They actually made a pretty reasonable choice given their constraints and such. An interesting thing, just since we brought up the RISC versus CISC wars of the 80s and beyond: a lot of people don't actually recognize that RISC was actually a thing before CISC, back in the 60s. It's kind of overlooked, but a lot of people, you know, that are nerdy about this sort of thing would argue the CDC 6600 was the first RISC machine.
That was, you know, the first, you know, real supercomputer, designed by Seymour Cray, that had mass market adoption.
And fundamentally, like, what is RISC?
A lot of people, probably because it's the easy thing for more lay people to understand, just think, oh, well, you have a smaller set of instructions that you build out, and do pipelining and things like that.
But fundamentally, it's a load store architecture.
The fact that you're having to actually move all of your data into registers and you're operating on registers is a very different paradigm from, you know, early CISC machines,
you know, going, you know, System/360 onwards. And so, if we go back to what I think of as being like the first real RISC versus CISC war, it was between, you know, Control Data Corp, which Seymour left and then formed Cray, you know, the first real RISC architectures there, versus System/360, which Power...
I actually haven't looked into Power 10 recently, exactly where they have compatibility break off.
But a huge constraint on the power architecture for a very long time was IBM's insistence on being able to have binary compatibility all the way back to 1968.
I think they still have it on the Z machines, which isn't power.
There's other processing elements there.
But yeah, on the evolution of RISC to VLIW: I would say the first true VLIW was the ELI-512, which was Josh Fisher's, and which then led to Multiflow that I mentioned earlier. But a lot of people don't actually know about Floating Point Systems and their supercomputers from the mid-70s, which didn't call their thing VLIW at all but have all of the key characteristics. The name is misleading in saying "very long instruction word." Everyone just thinks, okay, that must be what defines a VLIW. But in the same way that I would say RISC is fundamentally that load-store architecture,
you're doing all of your operations on registers,
and you're needing to move that in and out of whatever other parts of your memory
before and after operating.
The core thing with VLIW is this idea that you can actually be doing concurrent operations
that is very distinct and different from just parallel operations.
And that you can encode in such a way
that your instruction stream
and that very long instruction,
all of the operations that are being given
in that single word are things that
the program, or whatever is emitting that instruction word, guarantees to the system can all operate concurrently without hazards.
Yeah, absolutely. That was an extremely powerful idea for the mid-70s, with FPS in the supercomputer space.
And then it was really, I would say, Multiflow that tried to take that to a much grander level
and be actually saying that we're going to build a compiler that can do this for you,
that it's not the programmer, a really brilliant person drawing
things on paper in the 70s with FPS and trying to manually schedule this, but that this can
actually be a control flow graph and you can algorithmically determine what things can be
done in parallel. And that, frankly, has been the holy grail problem
for 30, 40 years now as it relates to VLIW systems.
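To give a flavor of the scheduling problem being described here, the sketch below is a deliberately tiny greedy list scheduler in C++. It is not Multiflow's trace scheduling or REX's compiler; the operation names, the dependence graph, and the two-slot bundle width are all made up for illustration. It just shows the core move: pack operations whose inputs are already computed into the same issue bundle.

    #include <cstdio>
    #include <vector>

    // Toy list scheduling: pack ready operations into fixed-width bundles.
    // Everything here (op names, bundle width) is invented for illustration.
    struct Op {
        const char* text;          // human-readable operation
        std::vector<int> deps;     // indices of ops whose results this op consumes
    };

    int main() {
        // Compute a*b + c*d: the two multiplies are independent, the add is not.
        std::vector<Op> ops = {
            {"mul t0, a, b", {}},
            {"mul t1, c, d", {}},
            {"add t2, t0, t1", {0, 1}},
        };
        const std::size_t slots_per_bundle = 2;   // pretend two issue slots per bundle

        std::vector<bool> done(ops.size(), false);
        std::size_t scheduled = 0;
        int cycle = 0;
        while (scheduled < ops.size()) {
            std::vector<int> bundle;
            for (std::size_t i = 0; i < ops.size() && bundle.size() < slots_per_bundle; ++i) {
                if (done[i]) continue;
                bool ready = true;
                for (int d : ops[i].deps) ready = ready && done[d];
                if (ready) bundle.push_back(static_cast<int>(i));
            }
            std::printf("cycle %d:", cycle++);
            for (int i : bundle) std::printf("  [%s]", ops[i].text);
            std::printf("\n");
            for (int i : bundle) done[i] = true;  // results become visible next cycle
            scheduled += bundle.size();
        }
        return 0;
    }

Run on this tiny graph it prints the two multiplies in cycle 0 and the add in cycle 1; the hard part that real VLIW compilers wrestle with is doing this across branches, memory aliasing, and variable latencies.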
Absolutely.
And when you talk about different ways
to get instruction-level parallelism,
what we're kind of talking about,
and this is with RISC and CISC as well, perhaps,
is do you do more work in the hardware, pushing more complexity into the hardware, or do you push it into the compiler? And, you know, the way computing systems have
evolved have changed that trade off. So some of the discussions, you know, looking back at things
in the 70s and 80s, you know, assumptions about, for instance, you mentioned a human, you know,
writing a program, it's very different when you're comparing that to a compiler generating
machine code.
So that's, you know, one of the things that I frequently see with some of the research from
back in that time and then applying it today. That's just one vector where the trade-offs may
have changed, you know, throughout history. The architecture, the processor that you
referenced in that talk that kind of gave VLIW a bad name was Itanium,
which I'm sure a number of folks are familiar with. So with VLIW, we can potentially reduce
the complexity of the processor, right, because we're pushing more of that into the compiler.
Can you talk about why Itanium was not successful in doing that or why
it's maybe mocked today? Yeah. Well, as I mentioned in that talk, and I'll stick to my stance here,
I do not consider, you know, this is trying to like disown a, not even my own child, but something
that gets associated with the thing that I love. But I do not consider Itanium, Itanic, to be a VLIW architecture.
And my best evidence or thing I can point to support that
is Intel's own marketing from the time,
which they were very explicitly trying to say that it wasn't VLIW back then.
They coined this Epic name,
so it's the explicitly parallel, you know,
instruction set computer. And I appreciate whoever in the marketing department at that time,
you know, was doing that. But even the, you know, technical reasoning for doing that
was their belief that they could actually take the really
powerful, amazing thing about VLIW of being able to have these explicitly encoded parallel or
concurrent operations that could be done. And I would say they had a bit of, and I've heard
different stories from different people involved at the time. But in general, I think everyone would agree that there were strong battles internally in the both Itanium team itself.
But, you know, being heavily, heavily constrained and influenced by other parties within Intel that were really trying to heavily push the, you know, x86 Pentium continuation elements. And what I think the real downfall for Itanium in diverging from sort of the pure VLIW roots,
which have ended up being successful in many other products that we'll talk about, but
was them trying to still have support and some level of compatibility with x86. And all the elements and kitchen sink pieces
that were getting moved in from other parts of Intel
and being bolted onto Itanium
that kind of turned it into not being VLIW
and I would say very much being Epic
and turning out to be an epic fail.
But yeah, I just, you know, to, I guess, expand a little bit on it.
It fundamentally comes down to what I was saying before, where with VLIW, it's the programmer, the compiler, whoever, whatever is giving that instruction.
It's its job to actually be saying that we know at this cycle that you're
going to be having no control or data hazards with the data provided. And when you start to
try to add a lot of advanced processor features like branch prediction, any sort of caching
systems, and especially if you want to actually have some instruction compatibility with x86,
which was a big goal that Intel never actually even delivered on with Itanium.
You add in so much requirements of non-deterministic features
that that guarantee can't be made.
It's impossible. The delusion and
the reason I think that the Itanium failure is so well known by people that aren't compute architects
but that they still associate it as VLIW bad because of Itanium, don't realize that it
wasn't that they just weren't able to make a sufficiently smart compiler.
That was an impossible task.
I would say probably not provable as impossible, like P is or is not equal to NP. I would say it's similarly non-determinable, but any rational
person, I think, would be saying that P is not equal to NP. And if it was, then the whole world
collapsed and none of us, nothing we say about it matters if it was. And so I would make a similar
assertion that Itanium's mythical magical compiler was impossible due to all of the
non-VLIW cruft and design directions that they tried to do it to make it an appealing
product from Intel's internal perspective.
And fundamentally, I think that those decisions were rooted in Intel didn't want to
actually diverge from x86. Right, right. Absolutely. That's kind of a,
once again, kind of referencing back to like where you put the complexity,
you have to make sure wherever you push the complexity that you give sufficient information
to make the decisions in that place. And it sounds like in this case, right, the, um, it, you know, compiler authors
or human, you know, humans writing a machine code or assembly, um, you know, they didn't have
sufficient information to reason about performance and maybe some of those hazards as well. Um,
so that makes a lot of sense. And then if you have to take a full pipeline flush of 40-plus cycles, which is what it was at that time, if I remember correctly, you're going to have a really bad time.
So I think the key thing to remember is code worked on Itanium.
You could run programs. They just performed horribly, because those guarantees, that correct level of abstraction, and where the handoffs in complexity are, weren't there. I think this is the other thing, and why I don't think a lot of the engineers and people associated with Itanium like to talk about it and don't like to correct the record on a broader scale: it became, within the broader computer science culture, like a known failure beyond just among computer architects, and people don't give credit to the engineers.
The engineers actually want to make a good product.
They actually want to do the right thing.
They made a lot of good decisions.
But when you have the wrong requirements going in
and the wrong request demands of a system,
you're going to end up, you know, not making the right
thing. Right, right. Absolutely. Well, so in your NEO architecture that y'all designed at REX,
y'all took a different approach. And you talked about in that talk, having hard real time
determinism. What was kind of the strategy?
Are there any kind of like things you could point out specifically in the processor architecture
that allowed you to be able to do that in contrast to something like Itanium?
Yeah.
So probably the biggest element that stands out but also vastly simplifies everything, is the fact that we went to a purely scratchpad-based memory system.
So in terms of what would normally be L1, L2 caches in traditional processors,
rather than having the TLBs and MMU and all of these extra pieces of logic,
which take up a lot of area, a lot of power,
a lot of design time and complexity, and add latency and such. Basically, think of you've
got your L1 cache, but we just rip out all of that, and you're just directly accessing that
memory, the physical addresses for those pieces. So at the individual
core level, it was, you know, and I think a lot of how we talked about it to most people,
just think of it as being a simple RISC core to start off with. In reality, it's VLIW, but VLIW got such a bad rap, and you can still say that; it still meets the definition of RISC. And, you know, most people
don't actually care how instructions are generated for it if they're not responsible for doing
anything related to it. But, you know, each core had, you know, a 64-bit ALU, an IEEE 754-2008 compatible,
64-bit floating point unit. And then, you know, two separate register files for each of those,
and then the scratchpad memories, which were divided into multiple banks.
And really, the VLIW, the instruction word that is used to control these elements, enabled you to
be doing an ALU op, a floating point op, as well as two load and store operations simultaneously,
all in a single cycle.
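As a concrete picture of that instruction word, here is a toy C++ sketch. The field widths, slot names, and layout are invented for illustration and are not the actual NEO encoding; the point is the contract that whoever emits a bundle has already guaranteed its four operations are hazard-free.

    #include <cstdint>

    // Toy VLIW bundle. Field widths and slot layout are placeholders, not the
    // real NEO encoding. The contract: the emitter guarantees the four slots
    // have no data or control hazards among them, so the hardware can issue
    // them all in one cycle with no dependency checking or reordering.
    struct Slot {
        std::uint8_t opcode;   // operation for this functional unit
        std::uint8_t dst;      // destination register
        std::uint8_t src0;     // first source register
        std::uint8_t src1;     // second source register (or immediate selector)
    };

    struct VliwBundle {
        Slot alu;    // integer ALU slot
        Slot fpu;    // floating-point slot
        Slot mem0;   // first load/store slot (e.g. a scratchpad bank access)
        Slot mem1;   // second load/store slot
    };

    // Issue is then trivially parallel each cycle, conceptually:
    //   issue(bundle.alu); issue(bundle.fpu); issue(bundle.mem0); issue(bundle.mem1);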
And at that basic core level, that sounds pretty simple.
It's no more difficult than, and in a lot of ways a lot easier than, programming computers pre-1985-ish, before the proliferation of caches.
Like you just directly access the memory system,
and it happens that a single one of those cores had more, you know,
main memory just embedded within that core than a lot of the early computers
that probably a lot of people listening to this podcast grew up with
in the early 80s, 90s.
So, but yeah, we expanded.
So the core difference is that just having that direct access to memory drastically simplifies things, and the benefit for us on the software layer is that you have exact deterministic operation: every single instruction you're doing, all the way through memory accesses, takes a predetermined number of cycles, a predetermined number of nanoseconds.
When you're actually generating, when you have your compiler and the compiler is trying to
generate a control flow graph for your program,
it knows exactly when data is going to be where, and that, you know, there's not
the potential that the hardware is actually going to be doing something in secret and there's going to be a stall that needs to be inserted for God knows how many cycles. Right. And when you were
describing that, it kind of made me think, it feels like a similar kind of pushing the complexity, or maybe even you could say pushing the capability, onto the compiler, not just in the instruction and control process, but also in the memory system now. Which is kind of interesting, to lean fully into that kind of model.
Yeah. And it's not like no one was doing that between Itanium and us. There are plenty of VLIW architectures that had great success, primarily in the signal processing world. And so, you know, between, you know, 2000 and
at this point, 2014, 2015, for REX, you know, they were in basically every single cell phone.
If you think about the number of devices there, maybe by 2015, I may just, I think it would be believable that there are more VLIW processor cores out in the world than x86 processor cores, because it would be in every single baseband for your cell phone.
It would be in every DSP that's doing any ADC/DAC work with anything that has a microphone connected to it. So, you know, the
VLIW, what a lot of people don't realize is VLIW won in a very silent way in application segments
where you needed that determinism. Like in most of those DSP applications, it was a requirement
going into the design of that processor and the overall product that those processors would be going into to have some hard guarantees on timing.
And the easiest thing for those designers to do was ensure that the hardware behaved exactly as designed and not to have these elements that were originally designed to make general programmers' lives easier.
The reason that hardware-managed caching took over was back in the late 80s through 90s,
processor architects were getting free transistors every two years.
And so if you just keep getting free transistors and you almost don't know what to do with it,
you're going to start adding complexity to your design in order to make what is thought of as the,
you know, your end customers, the people that actually need a program and utilize these
machines, make their lives easier, allow them to address, you know, a virtual memory space
and not have to understand all of the increasing complexity
that's being introduced. And I think fundamentally, you know, our thesis for REX was we wanted to do much grander programs and applications than what DSPs were restricted to.
And we were targeting, you know, a higher level of performance,
et cetera. And our belief, and I think, you know, we did prove out on this, is you can have a sufficiently advanced compiler to be able to do that sort of scheduling, as long as you make those
guarantees, and if you're willing to actually spend the time with compilations. Like, our compiler wasn't fast. It was developed off of LLVM. So I think the big advancements that only really came in the early 2010s was sufficiently advanced open source compilers were there. So the LLVM project was the biggest advancement to actually allow
architecture independence and Apple's investment of actually making that open is
one of the best saving graces for the entire industry in the past decade. I don't think we can thank Apple enough and Chris Lattner and everyone involved in that project for not only
doing a good job and making it possible for Apple to do the portability that they wanted
for Mac going from PowerPC to x86,
and then they wanted to also enable those developers to develop on an x86 Mac
but for an ARM iPhone,
and then eventually them now moving fully to Apple Silicon based on ARM.
They could have kept all of that closed and to themselves,
but they realized that that was also a really big problem for Google and Microsoft,
and actually making that a whole community was one of the developments of like the past 10, 15 years,
in my opinion.
But, you know, it directly enabled us at REX to be able to start off with the really advanced, you know, SSA form, the static single assignment intermediate representation of LLVM IR,
and then be able to do our specific optimizations and that control graph work
to be able to emit, you know, efficient performance assembly for the architecture.
And even though that compilation time took longer than if we just went with a RISC and had all that complexity in the hardware, compared to the late 80s or early 90s or 2000 with Itanium, just base computers, just the desktop computers that we're doing compilation jobs on,
were so much faster.
So I think VLIW, to actually have VLIW be broadly successful
and usable for larger applications than what they,
in the embedded spaces where they were successful,
really needed you to be willing to spend
a lot more horsepower on that compilation stage
and just the natural evolution of computing performance
enabled that to actually be practical.
Absolutely.
You talked about last time when we chatted a few months ago
about the kind of like software side of things
sometimes being the hardest part of developing new hardware. So you mentioned there, you know, that y'all leveraged LLVM, which was obviously hugely advantageous. Other than that, and obviously there's porting of operating systems and things like that, what was kind of top of mind for y'all when tackling things other than the compiler that you thought were necessary to enable the NEO architecture to get adoption?
Yeah.
So our real focus on the compiler side was getting base LLVM and Clang working.
So Clang is the C and C++ front end for LLVM.
At a high level, LLVM is kind of broken into three pieces.
There's LLVM front ends for basically any high level language you can think of.
Those front ends then actually output this LLVM IR, the intermediate representation, which is a static single assignment representation of the program.
But basically there's no mapping to actual registers or anything that is hardware-specific at all.
But you have a control flow graph, and you can get a lot of insight and do a lot of optimizations just on that SSA form.
And then finally, there's the backend that takes in that LLVM IR and actually generates assembly and machine code for the target architecture.
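For anyone who hasn't touched LLVM, here is a minimal, generic sketch of that middle layer using the LLVM C++ API; it is not REX's toolchain code. It builds a trivial add function and prints its IR, which is in SSA form and knows nothing about any target's registers; a backend like the one REX wrote is what lowers that IR to real machine code.

    // Minimal sketch of building target-independent LLVM IR with the C++ API.
    // Build roughly like:
    //   clang++ ir_demo.cpp $(llvm-config --cxxflags --ldflags --libs core)
    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/raw_ostream.h"

    int main() {
        llvm::LLVMContext ctx;
        llvm::Module mod("demo", ctx);
        llvm::IRBuilder<> b(ctx);

        // Equivalent of: int add(int x, int y) { return x + y; }
        auto* fnTy = llvm::FunctionType::get(
            b.getInt32Ty(), {b.getInt32Ty(), b.getInt32Ty()}, /*isVarArg=*/false);
        auto* fn = llvm::Function::Create(
            fnTy, llvm::Function::ExternalLinkage, "add", &mod);
        auto* entry = llvm::BasicBlock::Create(ctx, "entry", fn);
        b.SetInsertPoint(entry);
        llvm::Value* sum = b.CreateAdd(fn->getArg(0), fn->getArg(1), "sum");
        b.CreateRet(sum);

        // The printed IR is SSA and target-independent; an architecture
        // backend (x86, ARM, Hexagon, NEO, ...) lowers it from here.
        mod.print(llvm::outs(), nullptr);
        return 0;
    }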
So with LLVM, we were standing on the shoulders of giants there for those front ends and a lot of the IR optimizations.
But we had to build a custom backend. Thankfully, there already was a pretty good VLIW backend that Qualcomm
released for their Hexagon DSP. And if I remember correctly, it wasn't actually mainline LLVM yet at
that time. They had their own branch. Then I would say we more learned from that in developing our own backend than like,
there are plenty of differences with, with hexagon, but yeah,
that, that was the biggest, biggest lift, you know,
but even once you have a compiler, even if you've got a C front end,
that's kind of useless if you don't have like a lib C that is, you know,
actually targeted for your architecture so
You know, my co-founder, Paul Sebexen, was really the software brains behind REX, doing all of that work basically single-handedly. I also didn't really talk about the founding of REX or anything, but, you know, I'm very, very proud of what we did with, you know, basically four employees, going from, you know, raising our seed round in 2015 to taping out in less than a year from raising money, basically six months from actually getting started on
real RTL that ended up in the final thing. So having all of that, and even just getting to
that early level of, you know, a C compiler that can actually take, you know, basic C
was a really great starting point. But on the software side, like, our plans were not, in that reasonable time frame, to have an operating system running on there, but to actually have a good CUDA... or sorry, not CUDA, ah, thinking of modern things now, but to have a good BLAS library,
so the basic linear algebra subroutines, be able to have those optimized for our design.
Like the first benchmarks, like what we showed at that Stanford talk, were the DGEMM kernel and FFT.
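For context, DGEMM is the double-precision general matrix multiply routine from BLAS. A deliberately naive reference version, with no blocking or vectorization and nothing like the tuned kernel you would actually write for a scratchpad machine, is just the three loops below.

    #include <vector>

    // Naive reference DGEMM: C = alpha * A * B + beta * C, row-major,
    // A is MxK, B is KxN, C is MxN. Real BLAS kernels tile for the memory
    // hierarchy (or for scratchpad banks); this shows only the math.
    void dgemm_naive(int M, int N, int K, double alpha,
                     const std::vector<double>& A,
                     const std::vector<double>& B,
                     double beta, std::vector<double>& C) {
        for (int i = 0; i < M; ++i) {
            for (int j = 0; j < N; ++j) {
                double acc = 0.0;
                for (int k = 0; k < K; ++k)
                    acc += A[i * K + k] * B[k * N + j];
                C[i * N + j] = alpha * acc + beta * C[i * N + j];
            }
        }
    }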
So getting those to actually compile and run through was an accomplishment, but obviously the industry as a whole has much higher demands of supporting much larger packages and libraries than what we could tackle with our team size. With the kind of like accelerated timeline you had because of funding, but then also,
you know, the time it takes to get a chip taped out and get it fabricated and that sort of thing,
you obviously had to be working on software in parallel with the hardware. You talked a little
bit about this in that Stanford talk as well. But what kind of strategies did y'all use? Did
y'all have simulators that y'all built
for the architecture? What was kind of your process there as a team? Yeah, in large part,
because we didn't have much time just from a like base burn rate situation. And because we didn't,
we only raised $2 million, which sounded like a lot when I was 18. And, you know, but yeah, it doesn't last long.
And, you know, we couldn't afford any fancy simulation or emulation tools. And so our
software development and all of the verification that we did on the design, basically,
we started, you know, RTL and hired our head of engineering in November
of 2015. Between then and when we taped out in June of 2016, was all done with Verilator,
which is an open source project. I should clarify that. was, you know, waveform simulation done just like for individual
blocks that was done just with incisive, the cadence simulation tool.
But anything that was like a full core and greater was done using Verilator, which a lot of people told us was very crazy,
you know, using an open source tool to be the thing that we're trusting to
actually tape out a test chip on. But, you know, we did have to make a lot of modifications to increase performance with Verilator to get to the level we wanted.
You know, we started off with very crazy ambition of being able to do a full core level synthesis.
So going from RTL to having a netlist, you know, every single day.
And we did get there basically by January of 2016.
So, like, we were doing very rapid iterations on our cores and, you know, at least weekly having full top level synthesized and having, you know, our golden model was that RTL that was being synthesized and taking that exact RTL, running it through Verilator to generate
that C++ simulator of that RTL and be able to actually then use Verilator to generate waveforms
to compare with waveforms of individual blocks. And so that whole development flow was
completely foreign to every single person we talked to in the industry.
But if we didn't do that, there was no way we were going to be able to build a chip on
that schedule.
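For anyone unfamiliar with the flow being described, Verilator compiles your Verilog into a C++ model that you drive from an ordinary program. A minimal harness looks roughly like the sketch below, where Vtop and the clk/rst port names are placeholders for whatever your top-level module defines, not REX's actual design.

    // Minimal Verilator harness sketch. "Vtop" is the class Verilator generates
    // from a top-level module named "top", e.g. built with:
    //   verilator --cc --trace top.v --exe harness.cpp && make -C obj_dir -f Vtop.mk
    // The clk/rst port names below are placeholders.
    #include <verilated.h>
    #include <verilated_vcd_c.h>
    #include "Vtop.h"

    int main(int argc, char** argv) {
        Verilated::commandArgs(argc, argv);
        Verilated::traceEverOn(true);        // enable waveform tracing support

        Vtop dut;
        VerilatedVcdC trace;
        dut.trace(&trace, 99);               // record 99 levels of hierarchy
        trace.open("dump.vcd");

        vluint64_t t = 0;
        dut.rst = 1;                         // hold reset for a few cycles
        for (int cycle = 0; cycle < 1000; ++cycle) {
            if (cycle > 4) dut.rst = 0;
            dut.clk = 0; dut.eval(); trace.dump(t++);
            dut.clk = 1; dut.eval(); trace.dump(t++);
        }

        trace.close();
        dut.final();
        return 0;
    }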
And yeah, it's funny now having spent the past six-ish months working on some new hardware,
haven't been able to replicate that level of...
What we're doing now is a bit more complex and everything,
but it's crazy to think back that we got that done purely out of necessity,
and it worked without major problems.
Absolutely. Yeah, I've used Verilator before myself, and it's excellent. And it's another thing that's improved a lot in the last maybe decade; a lot of the problems you'd have had back in 2015 have been fixed. I'd imagine so. I'd imagine so. One of the things that, so
I've mostly worked on FPGAs, and, you know, it's a similar process of going through synthesis and netlist generation and that sort of thing, but then you, you know, eventually generate your bitstream to load onto the FPGA. And
one of the things that's been really nice, um, you know, being an individual who's doing much less than y'all were as a four person team, but is the open source tooling around
synthesis and that sort of thing, which has really in the last decade, maybe even five years, you
could say, you know, grown up quite a lot. But everyone I talked to who works hands on in the
industry still indicates that there's like a pretty significant
gap between what's available open source and the proprietary tools. Yeah. And yeah, I have played around with the open source synthesis tools, and I guess I feel like I'm a spoiled brat in saying that, yeah, I just can't. That's it. Verilator, I think, still does the specific things it does really well,
which is enabling you to have very, very fast simulation. Like we were doing full-chip simulation, generated from the RTL, at like one and a quarter megahertz. And this was simulating the full 16-core chip on, you know,
2015 era, you know, low-cost Xeons.
So, like, you know, our Incisive simulations were running in the hundreds of kilohertz. No, wait, sorry, way less than that. Yeah, I think it was somewhere in the order of 10 to 50 kilohertz.
So being able to actually do,
and I think that was like single core.
Like we couldn't even,
I don't think we actually did anything larger
than a single core with Incisive. So yeah, Verilator later enabled us to do, you know, constrained random test generation and actually do, you know, nightly regressions and everything; I'm sure we could have bought a very expensive tool from Cadence or someone else to do that.
But I'm pretty sure we actually ended up with something that worked better for us, just with that free open source tooling.
So, and then like when it comes to Yosys and everything today, yeah, I think in large part it's because there's not really a practical way to use all open source tooling to get to building anything close to a leading edge chip.
I know there's the open, I'm forgetting the name of it, the Google sponsored project for doing shuttle runs.
Right, OpenMPW. Yeah. And so
like, I think that's great for, for students and I'm glad that something like that exists,
but I guess I've gotten spoiled there where, you know, I'm never going to make a chip at, you know... I already made a chip at 28 nanometer. I need to be going smaller and smaller for the next tape out. Right. So, but yeah.
Oh, and finally, just on the FPGA note, we actually didn't start doing FPGA simulation, you know, testing of our chip, until after we taped out. We were really so focused on getting the chip working. We actually did buy a pretty high-end Xilinx UltraScale FPGA to do simulation with.
And we tried doing some things before tape out,
and it was taking too much time,
and we weren't making progress with it.
So we just said, hold off on it.
And we didn't actually start using the FPGA until we were doing some physical testing of connecting to the development boards that we were designing, before we got the chips back. We wanted to test some of, like, okay, can we have
this on the FPGA and actually connect the microcontroller developer board that was going to be put onto our chip. Can we take that evaluation kit we got from STMicro and connect that to the FPGA and actually
have that communication work?
And then we found bugs in our design post-tape out on that FPGA, which was sad.
But because we had that time of knowing those bugs existed, we figured out ways to get around them by the time the real chips came back.
Right. And, obviously, you know, the company kind of wound down a little bit after y'all got the chips fabricated. I do have to ask: where are the chips today? Do you still have them on hand? And, you know, what are the interesting use cases that they've been used for so far?
They make nice display pieces, but they're not very good heat generators. On some cold nights I'm like, I have 181 of these chips, you know, in my closet; it would be nice, but we made it too power efficient, so it just can't be a heater.
Yeah.
The long story short on that is
2017, I think we officially ran out of money
two or three months after the Stanford talk.
So April, May of 2017.
A lot of reasons for that, which we could go into.
But yeah, I think the key thing is, what can the chips do? We had, you know, amazing floating point performance and efficiency for, you know, what we had built, this little dinky 16-core test chip. You know, the measured, validated double precision floating point performance per watt on 28 nanometer was better; it was basically equivalent to an NVIDIA A100 on, what, TSMC 7.
So, like, we were able to demonstrate that worked.
I would also say, even though the software was really early,
it had the components there to show that, you know, with more people,
more time, the software stack can function.
The core problem with the test chip was we, in the early fundraising we did,
and kind of our belief primarily as technologists when starting the company,
was that we needed to build a chip.
That was what we had to do to show potential customers, investors, et cetera, that, you know, we're serious. And, you know, a large part of that is also, we were both really young, you know,
18, my co-founder was 20. And so we kind of had blinders on of, we just need to get this chip.
And once we have the chip, that will be what unlocks us to get to the next step. And,
well, I should say chip plus the basic software tools around there.
And then, you know, all of these investors that to our faces had told us that, you know,
they didn't believe we could actually make a chip when they decided not to invest in
us.
Then they would come around once we demonstrated we could do that.
And it wasn't until, you know, after we had the chip, and we were very proud in demonstrating it, that it became very clear that, oh, this thing we built greatly demonstrates the technical potential, but it doesn't actually demonstrate more than that. It was a little, you know, 12 square millimeter test chip, and it was never designed to actually do the thing that the 100 square millimeter version would, you know, actually be useful for. So, right, that focus on just getting to have silicon and proving that would have probably actually worked out fine if we
had raised three million or five million, if we had some more buffer,
because we were running on fumes by February of 2017
when giving that talk.
Basically, by the time when we started talking
to investors and customers,
and they realized our cash situation,
everything turned into
acquisition conversations. Right. So it still worked out in some ways. But the other, you know, big problem was, you know, we gave up a very, very large amount of the company in our seed round.
So for that $2 million initial investment, it was on a $3 million pre-money valuation.
So we gave up 40% of the company at the very beginning. And we actually did have two different investment offers that would have recapped the company.
But it turns out
investors that have the ability to block, you know, new investment coming in and such don't like having themselves diluted, you know, a large amount.
So yeah, I think
we kind of almost like signed our death certificate from the very beginning where we didn't raise enough to ensure that we could get the chip and still have enough viability that we could have more extended conversations and make that real proof point with the technology demonstrator that we had.
And then secondly, we just gave up way too much of the company to make an appealing investment to anyone else. When it came to the acquisition talks, folks looked at the cap table, saw that we owned such a small part of our own company, and asked, why would they want to pay the investors and have them get all of the benefits? After that deal is over, the primary reason they do an acqui-hire is to get the people and some valuable technology; they frankly don't care about the investors and would like them to get as little as possible. So that was a very tricky situation.
Yeah, yeah, absolutely. Well, so you went and did a few different things after that. We're going to fast forward through those. Maybe in a future episode, we'll have you back to talk about some of those, but you recently started a new company and just kind of came out of stealth, as we mentioned at the beginning of the show. Do you want to talk a little bit about what y'all are doing at Positron, and maybe also some of your recent announcements?
Yeah. So back in April of 2023, I started a new company, Positron AI, developing new hardware for accelerating machine learning models in the general sense.
But like 95% of our focus is on transformers and specifically like the large language model and large multimodal network types.
And so we haven't announced too much.
We did exhibit at the NeurIPS conference last month and sort of had a soft launch,
but we're going to be making some more real product and shipping announcements this spring. But at a high level, what we did show was a working demo of our PCIe card-based accelerators having about a 5x performance per dollar advantage over NVIDIA H100 for Llama inference.
And some of the upcoming announcements are going to be significantly greater than that
for some of the newer, exciting model types
that are out there.
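As a rough illustration of what a performance-per-dollar comparison like that measures, here is a minimal sketch; the throughput and cost numbers below are hypothetical placeholders, not figures from the episode.

    # Performance per dollar for LLM inference: sustained tokens/second
    # divided by system cost. All numbers are hypothetical placeholders,
    # purely to show the shape of the comparison.
    def perf_per_dollar(tokens_per_second: float, system_cost_usd: float) -> float:
        return tokens_per_second / system_cost_usd

    baseline = perf_per_dollar(tokens_per_second=1_000, system_cost_usd=300_000)   # hypothetical GPU server
    candidate = perf_per_dollar(tokens_per_second=2_500, system_cost_usd=150_000)  # hypothetical accelerator appliance
    print(f"relative advantage: {candidate / baseline:.1f}x")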
So yeah, our core belief is that
what are considered large language models today
are going to be, you know, puny
in the not so distant future. And we're making a very strong bet on having, you know, very large
amounts of memory capacity and having the right balance of memory bandwidth to compute. And like
the core thesis of, you know, why we started Positron was that the sort of first generation
of AI accelerator chips that, you
know, all the companies that were founded basically 2016, 17 till, you know, 2020 at
least, and even a number after that, really based their designs on the ratio of compute
to memory that existed for convolutional neural networks and other neural network architectures
where you could throw as many flops as you want and it improves your actual realized performance.
And fundamentally, what's different with transformer models, especially on the inference side, is that if you want good throughput and latency for the model, a vastly different ratio of memory bandwidth and memory capacity is required.
And so we're targeting,
like with this first generation product,
having an order of magnitude greater memory capacity
than what is possible on NVIDIA GPUs,
be it today or in their upcoming product releases, and having a much higher
ratio of memory bandwidth to the actual flops.
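To make the bandwidth-versus-flops point concrete, here is a minimal back-of-the-envelope sketch of batch-1 autoregressive decoding, where each generated token has to stream essentially all of the weights from memory; the model size and bandwidth figures are illustrative assumptions, not Positron specs.

    # Rough ceiling on batch-1 decode speed for a dense transformer:
    # every generated token reads ~all weights once, so throughput is
    # bounded by memory bandwidth / weight bytes, regardless of spare FLOPs.
    params = 70e9          # assumed parameter count
    bytes_per_param = 2    # FP16/BF16 weights
    weight_bytes = params * bytes_per_param          # ~140 GB of weights
    mem_bandwidth = 3.35e12                          # assumed ~3.35 TB/s memory bandwidth
    max_tokens_per_s = mem_bandwidth / weight_bytes  # upper bound at batch 1
    print(f"bandwidth-bound ceiling: about {max_tokens_per_s:.0f} tokens/s")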
What does the software integration look like for a processor that's specifically focused on accelerating inference? Obviously, NVIDIA has CUDA. There's a couple of others; I think it's AMD that has ROCm. There's a couple of different models there. Does y'all's architecture fit into some of those software stacks, or are y'all developing your own?
So the big thing, and what got a lot of interest and people looking at us very differently from, I would say, all of the other people trying to compete with NVIDIA, is the fact that our software strategy is that we actually don't have a compiler. Or rather, there's no compiler that any user of our stuff would ever know about or care about. And so really, the operating model for users of this first generation product is that we ingest the trained model file for these transformer models directly. So .pt is the PyTorch output file that you get when you do training, primarily on NVIDIA GPUs, and there's also safetensors and a couple of other formats. But taking that trained model file, that is directly what we ingest and put on the device.
And so there is no porting. There's none of the story that AMD and Intel and many others have tried to tell, with varying levels of success at actually doing anything close to it in reality: oh, well, we support PyTorch, or we support TensorFlow, or we have a process where you can go through 16 steps to convert your PyTorch model to ONNX. And, oh, well, actually, we don't have that op set supported. And ONNX is always six to twelve months out of date on new features being supported, or they just decided they're never going to support that particular op. Rather than dealing with all of that, which has been the bane of the existence of anyone trying to move off of NVIDIA, we're able to take your config.json for your Hugging Face Transformers compatible model and that model file directly, and load them onto the device.
So there's zero compatibility process. Or rather, our compatibility process is: if you trained it, presumably on NVIDIA, but really if you trained it on anything, and it's a model architecture supported by Hugging Face Transformers, we're able to ingest it directly.
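For context on the artifacts being described, here is a minimal sketch of what a Hugging Face Transformers checkpoint looks like on the user side, using standard Hugging Face and safetensors APIs; the checkpoint path is a placeholder, and the final hand-off call to the accelerator is purely hypothetical, not Positron's actual API.

    # A Hugging Face Transformers checkpoint is just a config.json describing the
    # architecture plus one or more weight files (.safetensors or PyTorch .pt/.bin).
    from transformers import AutoConfig
    from safetensors.torch import load_file

    repo_dir = "./my-llama-checkpoint"                    # hypothetical local checkpoint
    config = AutoConfig.from_pretrained(repo_dir)         # parses config.json
    weights = load_file(f"{repo_dir}/model.safetensors")  # dict of tensor name -> tensor

    print(config.model_type, config.num_hidden_layers, config.hidden_size)
    # A vendor ingesting this directly would map those tensors onto its hardware
    # here; the call below is illustrative only, not a real API.
    # accelerator.load(config, weights)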
Wow, that's definitely a huge benefit there. I could also imagine, not only is that really nice from an integration perspective, but if you are, you know, a startup, a smaller company, removing the need to build all of that software support is also nice from an organizational perspective.
Yeah. Yeah, it's drastically simpler for the customers and users,
which is the number one thing, but anything that can also make our lives easier. And,
you know, we're very clear in our stated mission and goals. If you're using, say, some 3D U-Net type model, yeah, technically our hardware can support it, technically we could. And actually, in one case we've gotten a very simple version of that running. But at least in the foreseeable future, we're not going to be offering that to customers as an option.
We want this to be the simplest process for us to be able to get a product
out to market and people actually having a viable alternative to NVIDIA.
And, you know, I'm knocking on wood a little bit. We're approaching one year from the company being founded, and it was our goal from the beginning that we would actually have real product in customers' hands within a year, and it's looking like we will. The big lesson from REX and other experiences that I've had since then is identifying what it is that the customer is actually craving, what they're actually making their decision on, narrowly focusing on that, and keeping your hand on the pulse of the customer throughout the entire development process. Be it myself with REX or otherwise, I think the failure point for a lot of companies, new processors or not, probably one of the main reasons companies fail, is that they're too focused internally on building something that they themselves want rather than really deeply understanding and being obsessed with making something that the customer wants. There is a line, and you have to use a lot of judgment, and there's a process to it, but, you know, the customer is not always right; they're just usually more right than you are about what is going to cause them to part with their hard-earned dollars for your thing.
So you obviously don't want to just build a faster horse when you could have made a car.
But there's a balance which I'm still trying to find, and I think all entrepreneurs need to work on that. But it's definitely a really hard problem and an exciting one.
It's the fundamentals of how do you actually build a successful business.
Right.
On that set of customers that are interested in a processor that's going to accelerate inference, what are those key pain points, or the key things you think people are willing to switch for? I mean, some that come to mind based on how you've described your product: the interface feels like it's really important, right? Having a really simple, maybe even you might say high-level, interface to your architecture could go a long way. I know availability is also just a big thing today. In terms of other aspects of y'all's product, what are things that you've heard from customers and then tried to incorporate as key tenets?
Yeah.
I think, fundamentally, for the vast majority of purchasing decisions, if you assume that it's functional for the application they care about and that they can use it in whatever method they desire, then the next step, where things boil down to, is performance per dollar. So it's: is this actually going to be a net savings?
And when going in against an incumbent, it can't be something that's 10% or 20% better. Ideally, everyone says, it needs to be an order of magnitude. So somewhere between those two extremes is a great point. And the nice thing about this particular market and how we're
trying to address this is that what we're selling is actually not the chips; we're selling an appliance, a full system. And we're really treating that as a black box for LLM inference. So you can either use a publicly available model, or, you know,
we're aiming to work with partners that have their own closed source
proprietary models that would be able to be run on our hardware and accessible to others,
or you can bring your own model. But from that point on, it's no different than the interaction
method with any NVIDIA system, where we have an OpenAI-compatible API front end. We've built a load balancing system that is actually hardware agnostic. So requests can hit that load balancer and be distributed to a Positron box, to an NVIDIA box, to an AMD box, as long as that API interface is the same as what those customers are used to. And most of the industry has kind of gravitated to the spec that OpenAI never officially defined as a standard but created for their own services. And if you support that, there's really no change from that user's perspective.
They're simply sending tokens in as a prompt and they're getting tokens back as a response.
And so that token-in, token-out model is really one of the biggest enablers to making this easy from an actual usage and adoption standpoint.
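As an illustration of that token-in, token-out interface, here is a minimal sketch of a request against an OpenAI-compatible chat completions endpoint; the base URL and model name are hypothetical placeholders, and nothing here is Positron-specific.

    # Any OpenAI-compatible server exposes the same /v1/chat/completions shape,
    # so client code is identical whichever hardware sits behind the load balancer.
    import requests

    BASE_URL = "http://inference.example.com/v1"   # hypothetical endpoint behind a load balancer
    payload = {
        "model": "llama-2-70b-chat",               # hypothetical deployed model name
        "messages": [{"role": "user", "content": "Summarize VLIW in one sentence."}],
        "max_tokens": 64,
    }
    resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=60)
    print(resp.json()["choices"][0]["message"]["content"])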
Once you get beyond that point of,
okay, is the model that I want to actually run
capable of being run on this
hardware?
Absolutely.
Absolutely.
Obviously, things are moving at a pretty rapid pace in this area of the software and hardware industry. Are there any kind of concerns or future-proofing y'all do around building this hardware that adheres to that interface, you know, to ensure that your product will be viable for an extended period of time?
Yeah, a key thing is that even though we call what we've built principally a transformer accelerator, really it's a linear algebra accelerator that has a very clever method and connections, both for the memory system and for scalable networking of many of these things put together.
So from the perspective of new model advancements, just in the past few months there have been the first, I would say, real viable competitors to transformers, in the case of state space models. There's Hyena and there's Mamba, two things that caught a lot of attention last month at NeurIPS. And while it's still, I think, too early to tell if they are going to actually scale out to wider usage in their current state, the fact that we're fundamentally a linear algebra accelerator means the hardware maps just as well to those different ways of solving similar problems as it does to multi-headed attention and transformers.
So we feel really good about having that flexibility
of changing as model architectures do.
And I think the biggest advantage we have, the thing that I don't think the other hardware providers are really equipped for right now, is being able to scale to larger model sizes, and not necessarily dense models: the whole emergence of mixture of experts with GPT-4, Bard, and now the Gemini version of Bard, and now Mixtral in the open source space.
Having many experts, with only a small number of them used for any given token, actually increases the total amount of memory you need to hold the entire model. And so that's perfect for this architecture we've designed. Current things are capping out at around eight experts; I'd be happy if things moved to 32, 64, or more experts, because we actually have the memory that can hold that, and we can do it affordably. I think that even the next generations of all the new AI accelerators
and things that are heavily embracing HBM as a memory technology are sort of going in the wrong
direction. Yeah, HBM is great from a bandwidth perspective, it's in the name, but you're going to be paying at least four times more per gigabyte. And you have hard caps on how much memory can actually be supported per HBM stack and how many stacks you can actually put onto a CoWoS package.
And so we're embracing commodity memory for our scaling story, which is what's really giving us that massive
amount of capacity and having the right balance of bandwidth to compute.
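To make the mixture-of-experts memory point concrete, here is a minimal sketch comparing total resident weights to the weights actually touched per token; the parameter counts and expert sizes are hypothetical placeholders, not any specific model.

    # Mixture of experts: every expert's weights must be resident in memory,
    # but only top_k experts are computed per token, so memory capacity grows
    # much faster than per-token compute as you add experts.
    bytes_per_param = 2          # FP16/BF16
    shared_params = 10e9         # hypothetical attention + embedding parameters
    expert_params = 5e9          # hypothetical parameters per expert
    top_k = 2                    # experts activated per token

    for num_experts in (8, 32, 64):
        total = shared_params + num_experts * expert_params
        active = shared_params + top_k * expert_params
        print(f"{num_experts:3d} experts: "
              f"{total * bytes_per_param / 1e9:6.0f} GB resident, "
              f"{active * bytes_per_param / 1e9:4.0f} GB touched per token")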
Absolutely. That makes a ton of sense. Well, Thomas, I know we're getting close to the end
of the time we allocated here, and I don't want to take up too much of your time. But I do hope
to have you on again in the future
as you all roll out some new products.
What are some ways that myself and others
can keep up with you and Positron
as you all continue to roll stuff out?
Yeah.
So I would say the best ways right now are on Twitter, or X. There's my personal trsohmers account and the Positron account, I think it's underscore AI. But you can find that from my Twitter, my X account, at least. And then our website, positron.ai.
Awesome. Well, we'll make sure to link those in the show notes as well.
But it sounds like there's a lot of really exciting stuff coming up. So definitely we'll keep an eye on that. And yeah, thank you again for joining for one of the early episodes of this show. I'm very appreciative of the folks who are willing to jump on for an interview before they've heard any of the other ones. But I'm sure folks will get a lot out of this one. So really appreciate
you stopping by. All right. Well, thank you, Dan. Absolutely.