Microarch Club - 1: Philip Freidin
Episode Date: February 14, 2024

Philip Freidin joins to talk about developing a passion for electronics and computer architecture while growing up in Australia, getting started on the PDP-8, his grand plan to work on AMD bit-slice processors, and plenty more.

Philip on X: https://twitter.com/PhilipFreidin
Philip’s Site: http://www.fliptronics.com/

Show Notes
Welcome Philip Freidin (00:01:02)
Growing up in Australia (00:03:25)
Teletype Model 33 ASR (00:07:10)
https://en.wikipedia.org/wiki/Teletype_Model_33
Kilocore Ticks (00:09:15)
General Electric GE-235 (00:11:50)
https://en.wikipedia.org/wiki/GE-200_series
https://www.computerhistory.org/revolution/mainframe-computers/7/178/720
Learning Fortran and Algol (00:16:03)
https://en.wikipedia.org/wiki/Fortran
https://en.wikipedia.org/wiki/ALGOL
Peeling Back Abstractions (00:19:02)
Working on Hospital Electronics (00:19:51)
Making a Digital Clock at Age 14 (00:24:31)
DEC PDP-8 (00:26:26)
https://en.wikipedia.org/wiki/PDP-8
Why DEC Used the PDP Name (00:29:40)
https://en.wikipedia.org/wiki/Programmed_Data_Processor
Glass Teletypes (00:31:01)
Programming in FOCAL and Fortran (00:31:31)
https://en.wikipedia.org/wiki/FOCAL_(programming_language)
Linking and Loading with Paper Tape (00:33:27)
https://en.wikipedia.org/wiki/Punched_tape
DECtape (00:35:57)
https://en.wikipedia.org/wiki/DECtape
Designing a Floppy Disk Drive System for PDP-8 (00:37:01)
PDP-8 OMNIBUS Backplane (00:37:38)
https://gunkies.org/wiki/OMNIBUS
Software Support for Floppy Disk Drive (00:39:42)
OS/8 Operating System (00:40:26)
https://en.wikipedia.org/wiki/OS/8
DEC Manuals (00:43:53)
https://bitsavers.org/pdf/dec/
The Onion Model for Abstraction (00:45:21)
Understanding Computer Architecture (00:48:29)
Moving to the PDP-11 (00:52:31)
https://en.wikipedia.org/wiki/PDP-11
PDP-11/34 and Microcode (00:54:36)
https://gunkies.org/wiki/PDP-11/34
74181 ALU Chip (00:54:49)
https://en.wikipedia.org/wiki/74181
DEC VAX 11/780 (00:55:29)
https://gunkies.org/wiki/VAX-11/780
74182 Chip (00:57:55)
https://www.ti.com/lit/ds/symlink/sn54s182.pdf
Performance Optimization by Understanding Dependencies (01:00:01)
DSP and FPGAs (01:01:06)
https://en.wikipedia.org/wiki/Field-programmable_gate_array
https://en.wikipedia.org/wiki/Digital_signal_processing
FIR Filter (01:05:12)
https://en.wikipedia.org/wiki/Finite_impulse_response
TMS320 (01:06:16)
https://en.wikipedia.org/wiki/TMS320
Tradeoffs Between DSP Chips and FPGAs (01:11:46)
Applications of FIR Filters (01:13:38)
FPGAs in Communication Systems (01:15:28)
Optimization Starts with Algorithms (01:16:20)
Misuse of Floating Point (01:16:55)
https://en.wikipedia.org/wiki/Floating-point_unit
Joining AMD (01:18:57)
Bit Slice (01:19:53)
https://en.wikipedia.org/wiki/Bit_slicing
Intel 3002 (01:20:52)
https://www.cpu-zone.com/3002/intel3002.pdf
MMI 6701 (01:21:00)
https://www.cpushack.com/2011/03/31/cpu-of-the-day-mmi-6701-bit-slice/
AMD Am2901 (01:22:16)
https://www.righto.com/2020/04/inside-am2901-amds-1970s-bit-slice.html
Data General Eclipse MV/8000 (01:23:24)
https://en.wikipedia.org/wiki/Data_General_Eclipse_MV/8000
Mini Supercomputers (01:24:13)
https://en.wikipedia.org/wiki/Minisupercomputer
Designing first chip at age 12 (01:25:11)
RS Latch (01:28:03)
https://www.allaboutcircuits.com/textbook/digital/chpt-10/s-r-latch/
74LS279 (01:28:39)
https://www.ti.com/lit/ds/symlink/sn74ls279a.pdf
Learning about Bit Slice (01:30:00)
R&D Electronics (01:30:53)
Internal and External Applications Engineers (01:32:45)
Becoming Australia’s First Field Applications Engineer (01:36:11)
MMI Programmable Array Logic (PAL) (01:37:08)
https://en.wikipedia.org/wiki/Programmable_Array_Logic
Meeting the Bit Slice Designers (01:38:03)
S-100 Bus (01:39:01)
https://en.wikipedia.org/wiki/S-100_bus
Teaching at University (01:39:50)
Sending Resume to AMD (01:42:27)
AMD Interview (01:43:16)
Moving to the U.S. (01:45:40)
AMD’s Secret RISC CPU (01:46:19)
Am29000 (01:50:19)
https://en.wikipedia.org/wiki/AMD_Am29000
Why RISC over CISC? (01:51:38)
https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/risccisc/
Memory is free (01:52:40)
Compiler Optimizations (01:56:36)
Mapping Instructions to Opcodes (02:00:15)
RISC-V and Fixed-Position Operands (02:01:16)
CISC Became RISC (02:03:47)
Register Windows on Am29000 (02:05:22)
https://danielmangum.com/posts/retrospective-sparc-register-windows/
Texas Instruments TMS9900 (02:07:04)
https://en.wikipedia....
Transcript
Hey folks, Dan here. I'm excited to share the first interview on the MicroArch Club
podcast. Today I am joined by Philip Freidin. Philip has a great backstory, growing up in
Australia and getting involved in electronics, programming, and computer architecture at
a young age. He went on to work on a number of well-known products, such as the Am2900
family of bit-slice logic chips and the Am29000 RISC processor line at AMD, before moving to Xilinx to work on FPGAs.
We only covered a small fraction of the experience and wisdom that Philip has to offer,
so I hope to have him back again in the future. I also should extend a special thank you to Philip
for being willing to be the first guest on the podcast and to Jan Gray for initially connecting
us. With that, let's get into the conversation.
All right, well, Philip, welcome to the MicroArch Club podcast, and thank you for being the first guest.
Thank you very much for inviting me. Really looking forward to the discussion.
Absolutely. Absolutely. Well, definitely honored to have you here. I think we had an interesting introduction to one another.
I had written a blog post about lookup table RAM, or LUT RAM,
and had posted it on Twitter. And then we had a mutual connection who mentioned,
hey, maybe I should talk to you because of your background in the area. And that might be
something we're able to touch on in the interview today. But I always appreciate getting connected
to folks like yourself, just kind of randomly or by happenstance like that.
Yeah.
You're referring to Jan Gray?
That's correct.
Yeah.
I've known him for about 31 years.
Okay.
And we met while discussing register files and LUT RAMs and doing CPUs in FPGAs, which has been very much an active hobby for both of us.
Absolutely.
I've mostly come into contact with him through his current work; he now leads or co-leads the RISC-V soft CPU special interest group.
So I know he's doing some work on that front,
but he's definitely someone else who I'd like to have on the podcast in the future.
Oh, absolutely.
He's a very smart guy.
Certainly worth talking to.
Absolutely.
With regard to soft CPUs, we are both pretty much together the pioneers of doing that.
And it was actually through him asking some questions and then presenting his ideas that we discovered that almost concurrently
we had architected CPUs that looked extremely similar using Xilinx FPGAs.
That's awesome.
Well, maybe we'll get into that when we get through your story a little bit.
Going back to the beginning: I know when we were chatting off the air last week,
you mentioned that you didn't grow up in the United States, and that part of your plan for getting
into the industry, and you actually had a plan, right, was this process of moving to the US.
So I wonder if you could just take us back to growing up and your education,
and then what your plan was to enter the industry.
Yeah.
So I grew up in Melbourne, Australia.
And I was doing electronics and then computing from an extremely young age.
I probably started soldering components around eight years old with my mother's help, primarily by holding a screwdriver over the gas stove and getting it hot enough to melt solder, and then eventually someone bought me a soldering iron. You know, initially it was just making things like crystal radios and one- and two-transistor radios, that sort of stuff. Along the way... I don't know if you're aware of it, but there were these kits, like the 101 Projects kits that Radio Shack and other companies made. And Philips, which you've probably heard of, also used to make such kits. Mine was the Philips Electronic Engineer number eight kit, and so my earliest transistor constructions used that kit. The Philips kits probably weren't that prolific here in the US, because they really existed primarily in Europe, and Australia was probably more aligned with Europe than the US at the time. So here it would have been Radio Shack kits; in Australia, the UK, etc., it would have been these kits made by Philips.
The lasting memory is that they used these really strong springs, and for someone aged eight or nine years old, it really hurt my fingers putting them in. Basically, whereas the American kits had these fairly floppy cylindrical springs that you just pushed over to the side so you could slide something in, the Philips kits were built on a sheet of masonite with a one inch by one inch grid of holes. You pushed a pin from the underside, which couldn't go all the way through, and then you pushed this very strong cylindrical spring on from above, and if you pushed it down really hard, you'd have an exposed loop where you'd put your wires in, and the spring would then come up and hold them. But getting those springs in place, and pulling them off when you finished a project and wanted to do something different... I mean, it didn't quite bring your fingers to bleeding, but it was close. So that's what I remember most about the kit.
The school I went to, and we're jumping forward from that by about four years, so around age 12: there was a maths teacher there who got a computer, what's the right word,
a timeshare computer terminal.
So we're talking 1972.
That dates me a little bit.
This probably does it as well.
So around 1972, the school got a computer terminal.
And back in those days, computer terminals were something called an ASR-33,
which is also called a teletype.
So it transmits and receives at 10 or 11 characters a second.
And the mass storage available is called paper tape.
And so you would, with the ASR33 terminal not connected to the computer,
so you're not paying for services,
you would type in your program and punch out a paper tape.
And then when you'd finish creating your program, you'd then use a phone to dial in to the computer.
So this is a little bit like the bulletin boards of 20 years ago, in terms of dial-up type stuff, except the data rate was 110 baud, and the modem was the size of a PC mini tower case.
Okay.
Right, and all it did was 110 baud. And so what you'd do is quickly go through the login process, turn the paper tape reader on, and read in the tape that you'd typed. So it was, you know, banging away at 10 characters a second. And as soon as it finished reading the tape, you'd type RUN. The program would run for some amount of time, or it would loop forever or whatever,
and it would print out results at 10 characters a second.
And then unless there was a good reason,
you would then disconnect from the service. So it was how short a time can you stay connected?
And the way services back then were measured and charged
was with a unit you've never heard of before, called a kilo-core-tick.
Kilo was the amount of memory that your program was using, in kilobytes, so like five kilobytes or ten kilobytes. Core was core memory, and a tick was a second. So if you used 10K of memory for 50 seconds,
you'd be charged 500 kilo-core-ticks.
Right.
Common currency there.
Everyone's walking around with some kilo-core-ticks in their pocket.
Right.
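The kilo-core-tick unit described above is just memory footprint multiplied by connect time; here's a minimal sketch of the billing arithmetic (the function name and units are mine, not Honeywell's):

```python
def kilo_core_ticks(memory_kilobytes: int, seconds: int) -> int:
    """Charge described above: kilobytes of core held, times seconds of use."""
    return memory_kilobytes * seconds

# 10K of memory for 50 seconds:
print(kilo_core_ticks(10, 50))  # 500 kilo-core-ticks
```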
So anyway...
What kind of operations were you doing on the computer at that time?
So it was mostly math problems.
So this is, let's see, this would have been approximately eighth grade.
Okay.
In Australia, that's called Form 2.
The naming of this stuff is quite different in Australia from here. In Australia it's primary school, years one through six, and then secondary school is Form One through to Form Six. So it's still 12 years of schooling, but for secondary school it's just Form One, Two, Three, Four, Five, and Six. So it would have been Form Two. And we were doing some calculus-type stuff, some linear programming-type stuff, some simple algebra, some things like finding minima of functions, that sort of thing, and maybe writing some computer games as well. But because of the budget, which was for the whole school, you really couldn't get too carried away with what you were doing with this one terminal, shared across probably 60 or 70 students, although most of them weren't into it.
But, you know, there was a small group of us who were. For me, I needed much more. And so
it turns out the service was provided initially by General Electric, and then by Honeywell.
So the computer was actually a General Electric 235, just called GE 235.
But they sold that division to Honeywell.
And so it started off being GE Timeshare, then it became Honeywell Timeshare.
The computer center was located in central Melbourne
and that for me was a half hour trip on public transport and so after school one day I headed
into the city and found the GE building and went in there.
And, you know, I'm like a 12-year-old, right?
Right.
And I surprisingly know something about this service.
So I came up to the reception desk and I said, I've been using your service, but I really want to see the computer that we're connected to. Is there some way I can get to see it?
And the receptionist, when I came in, she was chatting with what turned out to be one of the computer system technicians.
Okay.
And so she handed me off to the technician.
And whether because he was interested in helping me
or maybe to impress the receptionist,
either of which is possible,
he took me in the elevator up to the computer center.
And there was a big glass window
where you could look in at all these machines and the people in white coats loading and unloading listings from the line printer, mag tapes, storage. Like, back in those days a disk drive would be a pack that was maybe 14 inches in diameter and maybe six to eight inches high, and that might be 20 megabytes.
Wow.
And that's removable media, right?
Right. So there was a plastic cake tin, as I refer to it, over the top, and a cover at the bottom.
And there was a way to hold the handle in the top, rotate a latch on the underside, which would remove the bottom.
You then place it into the disk drive, which looked like a washing machine.
Right. And then again, an appropriate twist of the wrist, lock the pack into the drive and you took the cake tin off.
And then the heads then came in from the side to access 20 megabytes of data.
I mean, I have text files that are a thousand times the size of that now.
So anyway, a big glass window to show me, you know, there's the computer.
And he pointed to different things and told me what they were,
and I asked appropriately intelligent questions.
So then it turns out that, as someone
who actually did maintenance on the machine,
he had an unlimited account, right?
No kilo-core-tick limits for him.
Right. I bet you were excited to hear this at 12 years old.
Yeah, it was like, wow. So it turns out Honeywell Timeshare had a training room with 10 teletypes and a desk and a big blackboard at the front. And so they did training for corporate customers. And he said, you know, let me set you up with some stuff you probably can't do from the school terminal. And it was some game programs, right? Nothing particularly exciting, and, you know, still at 10 characters a second.
Right.
So he left me there to play for, I don't know, half an hour or so. When he came back, I said, I've played enough, right? How do I get to really learn more about using it? And he said, well, for our corporate customers we have training programs. And so instead of just programming in BASIC, you could, if you wanted to spend the time, learn Fortran and ALGOL. And I was all for it. And so that started what was probably six months of me going into Honeywell Timeshare,
Wow.
three or four nights a week, getting there usually around four or 5 p.m. and staying till about 10 p.m.
And as I said, I did it for about six months, until my parents blew the whistle on me
because my school grades had fallen off a cliff.
But I was now writing three or four hundred line Fortran and ALGOL programs at age 12.
Oh, and BASIC as well, right? So basically this Honeywell Timeshare technician, I don't know his name, and I don't know how I would ever thank him, but he gave me a resource that no other 12-year-old in Australia would have had.
Right, right.
It was just phenomenal. And so there I am, and I've already learned... The only languages I didn't learn, because that machine didn't have them, were COBOL and assembler. So I hadn't yet done assembler.
What instruction set architecture was that machine?
I do not know.
I'll look it up and put it in the show notes.
Yeah, it almost certainly does get a reference in Wikipedia.
Okay. And there were several machines.
It was GE215, 235, and 245.
And the machine we had was a 235.
And, you know, these machines were 19-inch cabinets, maybe 16 to 20 of them in a row, where something like 4K of memory was a 19-inch rack: six foot high, three foot deep, and 25 inches wide, or whatever a 19-inch rack is by the time you put the wrapper on it.
So anyway, that was my introduction to programming, learning multiple languages, and already getting to do some comparative work.
The damage that did to my educational path was pretty severe and lasted me through to the end of school, because there was a lot of stuff that I just didn't learn. On the other hand, what I did learn has lasted me a lifetime. But, for instance, I'd be lost without a spell checker, because that was one of the things that went by the wayside.
Well, that's so interesting.
You know, in a lot of ways it's a more physical, real-world example of diving through the abstraction layers, right? You're at the end of this teletype, and you have to physically take public transportation to go see the thing. And it's interesting to compare that to experiences today. One of the things I'm trying to do with this podcast, for example, is pull back some of those abstraction layers and see what happens in the machine. But for me, a lot of times that looks like going to YouTube and watching an interview, or listening to another podcast.
Well, or listening to a graybeard on YouTube telling you this type of story.
Right.
Yeah, and I didn't mention that, of course,
that technician got me the appropriate credentials,
no idea how,
so that I didn't have to look for him every evening when I came in to play on the system. Basically, no one in management either knew about it or cared about it, but it was just a wonderful learning experience for me.
Yeah. So a few years later, around 1974, I started working on, well, playing with, playing and some work on a PDP-8.
So the DEC computers really dominated my early hands-on computing. So how did this happen? I was working a summer job. In Australia, the end of the school year is, you know, the end of November, and then we have like eight to ten weeks off, which covers the end of the school year and Christmas, and it's the middle of summer in Australia. And so we can get pretty long summer jobs, right?
Right.
Significantly longer than what you can get here in the U.S.
And so I got a summer job working in the electronics department of a hospital, right?
Interesting.
And this was arranged by my father, who was the head surgeon at that hospital. And he put a nice word into the head
of the electronics department. And so I don't know if they do it now, but at least back in those days,
hospitals had, as I said, an electronics department with maybe four or five electronic engineers, and primarily they were doing equipment repair. But they also did some design work, because the large hospitals, and this was a large hospital,
tended to be linked to a university.
And so you'd have doctors who taught at the university,
practiced at the hospital, and might be doing some research.
And if that research needed special instrumentation,
then the hospital had an electronics department who often helped along.
So I got this 10-week or so Christmas holiday job,
and the head of the department had me primarily doing,
go pick up this ECG machine on the eighth floor of the hospital and bring it back for repair,
or go see if they, you know, forgot to plug it in. There were always these jokes of nurses, you know, failing to plug in something, or getting terminals around the wrong way. Once there was some piece of equipment that was somewhat portable, in a box about that big by that big.
Whoops.
Right.
And someone said, disinfect it.
And what they probably meant was get some appropriate cleaning agent and wipe down the front panel or whatever. They put it into a bucket of disinfectant.
Right.
So, you know, sometimes some things would be in for repair. Anyway, because they had design capability,
the head of the department said, you know, as well as doing this repair stuff... Once I'd actually shown that I understood what was going on, I was actually doing repair of equipment. Some of it was fairly simple, you know, things like broken battery terminals, or wires to a potentiometer, or a frayed terminal to an electrode for an ECG machine. Some of it was simple, but some of it was bug tracking.
And he said, you know, we have the parts, why don't you make a digital clock? Right. And back then, you know, it was 7400 series ICs and Nixie tubes. And so I made a four-digit clock, and he let me just basically work through the data book, and gave me guidance when I asked for it. So as well as the experience of working in a real engineering department, and this is, so now I'm, I guess, around 14, I actually got to design a digital clock. And back then, you were having to put a lot of components together, right, to actually build different modulo counters, for 59 rolling over to 00, and 12 rolling over to 1, that sort of stuff. So I ended up with a digital clock, and I ended up with some experience. But as I said, this was a teaching hospital, and one of the departments that I served had a PDP-8 computer.
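The modulo counters he describes, 59 rolling over to 00 and 12 rolling over to 1, can be sketched in a few lines of software; this is only an illustration of the counting behavior, not the 7400-series hardware design:

```python
def tick(h: int, m: int, s: int):
    """Advance a 12-hour clock by one second using cascaded modulo counters."""
    s = (s + 1) % 60          # seconds counter: 59 rolls over to 00
    if s == 0:
        m = (m + 1) % 60      # carry into minutes: 59 rolls over to 00
        if m == 0:
            h = h % 12 + 1    # hours count 1..12, so 12 rolls over to 1
    return h, m, s

print(tick(11, 59, 59))  # (12, 0, 0)
print(tick(12, 59, 59))  # (1, 0, 0)
```

In the hardware version each modulo stage is a counter whose carry-out clocks the next stage, which is roughly why it took a lot of parts.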
And so I requested the ability to use it in the evenings when no one was using it.
And the medical researcher knew me not by my correct name, Philip Freidin; to him I was Joe Freidin's son, as in, my father's name is Joe, and I'm Joe Freidin's son.
Fair enough, fair enough.
And that carried insane cred in that hospital.
Right.
So I was allowed to play with the PDP-8. Now, the PDP-8 in this case had 8K of 12-bit memory, and it executed what today we would call pretty much a RISC instruction set. The PDP-8 family dates back to the mid-'60s and extends through to the end of the 1970s, across six or seven different models of machine, and the 8/E was probably by far the most prolific machine DEC ever sold. It was priced down around 10 to 12K.
It had an interpreted language similar to BASIC called FOCAL, which looks not that different from modem line noise: a little bit more readable than APL and a little bit less readable than BASIC. But it ran FOCAL. And so you'd have... well, actually, let me think. The interpreter was actually 2K, 2K words. And so on a 4K PDP-8, which is the smallest configuration, you've got 2K words left, which was approximately 4K half-words of six bits each. If you don't need lower case, because your teletype only does upper case, you can pack two characters per 12-bit word, and you put three 12-bit words together for floating point.
So for integers, it was 12-bit integers, so roughly plus or minus 2,000: minus 2048 through plus 2047. But floating point was a 12-bit exponent and a 24-bit mantissa, so its range was actually better than IEEE single precision floating point, which is 32 bits split as 8 and 24. There's four more bits of exponent, and so it actually had better range than IEEE, which didn't exist back then; that didn't come along for another 15 or 20 years. But you had this PDP-8 with only 4K of memory, and it had very good floating point, and you could write programs that fitted into that 2K words of memory,
including both your program and any array data.
And the PDP-8s could go up to a maximum of 32K.
Gotcha.
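Packing two upper-case-only characters per 12-bit word, as he describes, is just shifts and masks; a sketch (the specific six-bit codes here are stand-ins, not DEC's actual SIXBIT table):

```python
def pack(c1: int, c2: int) -> int:
    """Pack two 6-bit character codes into one 12-bit word."""
    assert 0 <= c1 < 64 and 0 <= c2 < 64
    return (c1 << 6) | c2

def unpack(word: int):
    """Recover the two 6-bit codes from a 12-bit word."""
    return (word >> 6) & 0o77, word & 0o77

word = pack(0o01, 0o02)  # two hypothetical character codes
print(oct(word))         # 0o102
print(unpack(word))      # (1, 2)
```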
And yeah, the PDP-8 has come up a few times in chatting with folks. And why did DEC call their machines PDPs rather than computers?
Right, they weren't called computers. They were called Programmed Data Processors. Customers often had a computer department that was in charge of buying all computers and keeping them behind big glass-walled rooms, with the high priests in their white coats. And so for researchers who wanted to get a computer in their lab, they could order it as a PDP.
Right.
And the computer department...
Didn't go through the red tape.
Right, bypass all the red tape.
So, and they were also very prolific in universities
because DEC made some large systems.
The PDP-10 and the PDP-20 were campus-sized machines
that might have 100 terminals connected to them.
So the terminals were still these ASR-33s, right?
These dominated the 70s.
They were eventually supplanted by what they called glass teletypes,
which we would now refer to as dumb terminals.
Okay.
Right, so 24 lines of 80 characters and a keyboard, and probably, depending on when you did this, there may still be a paper tape reader and punch associated with it.
Gotcha. Okay.
Yeah. So anyway, there was a department there that had a PDP-8,
and they didn't have anyone to program it.
And I said, I can program.
And so one of the researchers who had got the computer for his department
but hadn't managed to get the funding to get a programmer,
had me doing programming.
And so I was writing simple data acquisition programs,
absolute real-time stuff written either in Focal or in Fortran 2.
That's a much earlier version of Fortran than,
I mean, most people who learn Fortran learn Fortran 4
or something more recent,
but there was an earlier version called Fortran 2
that was sufficient for what we were doing.
And so this was programming on the PDP-8,
still everything with paper tapes,
but now with a Fortran compiler,
and that meant the Fortran compiler was on a paper tape.
So you'd load the Fortran compiler from a paper tape,
then load your program,
and it would then read it in. And you'd read in the paper tape several times, because it would first of all build a symbol table, then read it again. And so, you know, compiling might take two or three hours of feeding in, first of all, the Fortran compiler, then loading the linker, and then loading the libraries, and then loading the intermediate binary. And then, at the very end of three or four hours, you'd get a new paper tape, which is the binary version of what was your Fortran program.
Right. And were you, like, the linker and the libraries, you're manually loading these into the machine?
Mm-hmm.
Yeah.
So it's interesting.
It's like steps we still take today, right,
to compile and link a program,
but you're physically participating in the process.
Right, right.
In fact, so for the Fortran compiler,
it was like a two or three pass compiler.
And so you'd load, so the first paper tape you'd load would be Fortran compiler pass one,
and then put in your paper tape. It would build a symbol table in memory. You would then load in Fortran compiler pass two, which would overwrite pass one,
but it would now have the data left over from pass one.
Then you read in your program source a second time, right?
And now it's starting to build, I guess, the data structures and the code tree, and figuring out branch distances and that sort of stuff.
And then there was a third pass.
Oh, the third pass output was the one that took whatever data structures had been built and turned them into a relocatable binary, right? So now you've got a relocatable binary version of your program,
but no libraries yet.
Right.
Then you load the linker. And if you wrote your program in multiple modules, you'd have a relocatable binary paper tape for each one. And so you'd load in the linker program, then you'd load in all of your modules, and it would link between them wherever symbols matched between your different modules. And then it would see, here's the list of symbols I don't yet know, and it would tell you which library tapes it needed to merge into your program. So things like the floating point library, the sine and cosine library, that sort of thing.
Right. And then at the end, it would do all of that and compile a statically compiled binary? There's no, like, runtime dependencies?
Yeah. And it was a bare metal thing; there's no operating system in this environment.
So, yeah, in the paper tape version of this stuff,
there's no operating system.
Gotcha.
Okay.
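The pass structure he walks through, first building a symbol table and then re-reading the source to resolve references, is the same trick assemblers and linkers still use; here's a toy sketch of the idea, nothing like DEC's actual tools:

```python
# Toy two-pass resolver: pass 1 records where each label lands,
# pass 2 emits code with every reference patched to a real address.
program = [
    ("JMP", "end"),   # forward reference, unknown on the first read
    ("ADD", None),
    ("label", "end"),
    ("HLT", None),
]

symbols = {}
addr = 0
for op, arg in program:            # pass 1: build the symbol table
    if op == "label":
        symbols[arg] = addr
    else:
        addr += 1

binary = []
for op, arg in program:            # pass 2: resolve references
    if op != "label":
        binary.append((op, symbols.get(arg, arg)))

print(binary)  # [('JMP', 2), ('ADD', None), ('HLT', None)]
```

With paper tape, each "read of the list" was literally another feed of the source tape through the reader.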
At some point, you can add mass storage. DEC's primary thing was a tape drive that had a one-inch wide, or three-quarter-inch wide, tape called DECtape, D-E-C-T-A-P-E, and it was a randomly accessed tape that was pre-formatted into blocks. And so there was an operating system at the beginning of the tape, and you could have multiple files, and the BASIC compiler and the Fortran compiler lived on tape. And so all of that paper tape stuff went away, right, if you could afford to buy that operating system and have the mag tape drives.
Right. So a lot of tape spinning around, right?
Right. These are on little hubs that are only about four and a half, five inches in diameter, and the tape just goes off a reel, over the head,
and down onto the other reel.
Right.
So anyway, I managed to talk the researcher that I was doing work for into finding the money to let me design a floppy disk drive system. And so this was my first big hardware design, and that was to design a board to plug into the PDP-8 that could talk to a disk drive. And the disk drive was from Memorex. It used 8-inch floppies and could hold about a quarter of a megabyte.
And what was the interface that you'd be plugging into on the PDP-8?
Okay, so the PDP-8 has a backplane called the OMNIBUS. And actually, probably in terms of a sea change in how computers were built, I believe the PDP-8 was the first bus-based computer,
Okay.
where any slot in the backplane could take any card. Everything prior to that, every place that you plugged a card in, there was only one card that went into that position, right? But the PDP-8 used these boards that are, I don't know, about 10 inches wide and 8 inches high, and a single board was the clock, another one was the registers, another one was the ALU. Three of them together was 4K of memory: the middle board was the actual core plane, one of the boards was the read/write amplifiers, and the other one was the bus interface.
Gotcha.
The model I'm talking about is the PDP-8/E, and the bus basically ticked away at about 1.2 or 1.4 microseconds per cycle. Typical instructions took three clock cycles, so around four or five microseconds per instruction, so around 200K operations a second.
And so, you know, it's 12 bits of address, 12 bits of data,
and a bunch of timing signals that tell you whether you're doing reads or writes, etc.
And so not only did I have to design the board,
I had to write the device driver,
And then I had to get the operating system onto the floppy for the very first time. And the operating system was distributed
as about 20 paper tapes. So now, instead of using the Fortran compiler, I'm using the assembler, right?
And the assembler had similar multiple passes to do an assembly and a linking process, to end up with basically the binary version of my device driver, right? Right. And so the operating system was called OS/8, and there are characteristics of OS/8
that you will still see today in the DOS boxes on a Windows 11 PC, right? The syntax on
that command line all goes back to OS/8. So everything in OS/8: when you get to the PDP-11s, it was something called
RT-11. When you get into PCs, it was called DOS, MS-DOS, right? They all use the same basic
command line syntax. And, you know, that still exists today in the DOS box or the command box that you see
on your latest Windows computers. I mean, there are differences, but it basically looks very, very
similar. Right, right. So you're working on this disk drive, right? And... yeah, go ahead. And then I had to load the operating system. And DEC built
the operating system to be distributed as paper tapes, knowing that some people would have their
own device drivers rather than DEC's. And so there was a build phase for the OS/8 operating system
where you aren't actually running the operating system; you're running a program called BUILD, and you're telling
it which devices you want, right, and which services you want OS/8 to provide,
and you get to build a custom image of the operating system. And the very last
stage is you discover if the device driver you wrote actually works, right,
which is when it now has to write it to the media.
Right.
And if that fails, of course, you know, it's back to the assembler.
Right.
To figure out what went wrong.
So, you know, all these machines had front panels
with toggle switches and LEDs, right?
And, you know, you might want to insert an image.
There are lots of PDP-8 images available on the web that you could insert a pop-up of, right?
So, you know, basically you can single step through a program, examine any memory location,
change the value of any memory location,
all from the front panel.
So, you know, you can certainly just load the device driver, right,
and then create a little five-line program that calls it
and tells it to transfer 128 bytes from this place in memory
to that place on disk, right?
Right. Right. So anyway, this is me, aged 14, writing operating system device
drivers and then building operating systems. Right, right. And so this was now my path,
really getting into the lower levels, not just of writing assembler code,
but of understanding operating system principles.
And again, there weren't any other kids around
that were doing the same stuff as what I was doing.
Right. I can imagine it was fairly unique.
Yeah.
So, again, a great education. I would say the manuals that Digital Equipment made available for the PDP-8, and then later for the PDP-11...
they had some introductory programming books that started at the very low levels of assembler that were absolutely excellent,
for someone who had no detailed knowledge or university training. Their books are all available on Bitsavers. Right. Um, you know, I mean, it's kind of historical and, you know, not really relevant, but if
you want to have a read, there's a book that came out in 1969 called Introduction to Programming,
by Digital Equipment, for the PDP-8. And it takes you through assembler, and explains... they didn't call them pointers back then,
but indirect addresses and memory references, and a superb coverage of two's complement arithmetic,
and, you know, just understanding at the very lowest level. You mentioned sort of levels of abstraction. This has been very much something that I've thought a lot about, from back in those days, because
already I could see it. It's layers upon layers. And I've strongly held that the quality of the work
that you do at a given level of abstraction
is improved by understanding the level below
that you're not using,
and maybe the level above that you're providing a service to, right?
Right, yeah. So I heard a quote pretty recently that, you know, when programming, a lot of times
we try to be very respectful of the abstraction boundaries on either side, right, because that's
what allows things to interface well together. But from the programmer perspective, right, it's our job to actually venture across those boundaries
to understand them, to understand what happens when we interface with them. So I think that's a
key thing. Absolutely. In fact, really, there are two things that come from
understanding the level below where you're working. So let's say you're writing a C program or a BASIC program.
Well, actually, C and BASIC are different enough, right?
Because C gets compiled to Assembler.
Right.
BASIC, at least traditionally, is interpreted.
So there's another layer for BASIC that doesn't exist.
Well, it's a different layer. For C, it's the compiler; for BASIC, it's the interpreter. But then eventually you're executing
assembler to implement the desired effect, right? The better you understand that layer below,
whether it's BASIC or C, that is, the assembler level, the better you write your code, right?
When you're writing C code, if you know how it's likely to be compiled, right, there's a reason
that you might write A equals B left shift one, as opposed to A equals B plus B, right, or A equals B times two. I mean, they all achieve the
same thing, but if you don't have a lot of optimization going on, the A equals B left shift
one is by far the fastest, right? Right. Unless the machine happens to not optimize the execution time for that.
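The three equivalent forms sketched in C (illustrative only; function names are made up, and on a modern optimizing compiler all three typically produce the same instruction):

```c
#include <stdint.h>

/* Three ways to double a value. With little or no optimization, the
   shift often compiles to the cheapest instruction; modern optimizers
   usually compile all three identically. */
uint32_t double_by_shift(uint32_t b) { return b << 1; }
uint32_t double_by_add(uint32_t b)   { return b + b; }
uint32_t double_by_mul(uint32_t b)   { return b * 2; }
```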
Right. You know, in my experience, that kind of understanding... let's say this is
not always the whole story, right? There are many things that impact performance at the physical level. But
the number of instructions, right, that your code is going to be translated to, that's kind of
like the first level of understanding how, you know, the higher-level code you write is going to be running.
But then even beyond that, right, understanding, and it varies in complexity across different
instruction sets, but what's the performance impact of executing a given instruction, right?
And that goes into how the processor is designed and that
sort of thing. And it seems like you ventured down, kind of, you know, you kept going into
that level of understanding that. What was kind of like the impetus? Was it just continuing to
peel back the layers of the onion? Absolutely. Well, I mean, designing a disk controller,
having never done a tough design before... and back then, there weren't
disk controller chips, right? It was, you know, probably 60 or 70 TTL chips, right,
connecting it all. Right. So I had to learn bus timing and, you know, how the machine executed instructions. And, oh, that was the other thing: DEC provided full service manuals for
these machines, and detailed schematics and theory of operation. And so it was expected that some
percentage of customers would do all the maintenance of their computers, down to the chip level. Wow.
Right?
And so whereas IBM kind of did the exact opposite, right?
You couldn't buy an IBM, you had to lease it, and they sent out service people, right,
who maintained them. And that was true for most of the mainframe companies. For DEC, and then the other similar companies, Data General
and Naked Mini and a few others, they all provided enough service information that a
competent EE could actually maintain these machines.
So I actually did repair work on the PDP-8 as well.
And so that meant I was getting into the partitioning of the computer
into different sections and actually like,
hey, it doesn't add correctly anymore.
And the way you would nail it down to that specific thing is: DEC made diagnostic programs
on paper tape, right? And you would load them into memory, and if memory worked, the program would then
run, and maybe indicate, you know, hey, I'm trying to do this operation, it doesn't work.
And then what you could do is write little five- or ten-line programs and button them in from
the front panel, after you hand-assemble them, and then exercise a specific instruction. Say, okay, I've
got this operand here and this one here, and it's doing this instruction, and the result is wrong,
right? Right. And you stick that in a loop, and then you go in with an oscilloscope and you trace out the circuit while it's cycling
that one faulty instruction, right, or that one failing memory address or whatever.
Right, right. So, yeah, so anyway, we are really miles away from what you wanted to interview me about.
Oh no, I think you've actually intuitively
gone the direction that I'd like to. But yeah, let's keep going through your
journey. Fine. Okay, fine. Well, do you want me to keep it at this level of detail and pace?
Sure, that's great. I'll go as long as you'll stay on. So, well, we may be here for a
few hours before we get to lots... Yeah, well, we'll have to do a part
two if that's the case. Well, actually, you can just carve... you know, if we go too long, make it into two
episodes or whatever. Right. Okay. So I did, basically... so this started off
as a summer job, but I ended up doing, again, after-school work for this researcher.
And so for about two years, I was doing programming in the evening and building a disk drive controller and, you know, learning about the internals of PDP-8s.
And that continued up until around 1976, so two years later, when that researcher changed
from one hospital to another. And I followed him and he said, what are we going to get?
Are we going to get a PDP-8, or do you want to try one of these newfangled PDP-11s?
And the PDP-8 was what I knew.
And I said, well, I really know this one, but if you don't mind me taking time coming up to speed, let's go for a PDP-11.
Now, the PDP-11s had been around all through this time, but we couldn't, well, the doctor couldn't afford them.
But now we had more research money, and so we got probably one of the first PDP-11/34s. So there were four or five models before this,
and I won't go through the models, but they basically all execute almost the same instruction set.
That is, there were instruction forms that you would never normally write, right,
but they would execute differently. And so there was a set of different little tests you could do to figure out what machine you were running on.
Ah, I see.
Right.
Okay.
So I think the earliest machines were just a strict pile-of-gates type: decode the instruction and run it. By the time you get to the PDP-11/34,
it was a microcoded machine. Okay. Okay, so microcoded machines... and this is actually worth going into, because
this actually moves into a lot of other stuff. In that time frame, so from the late 60s onwards, there was a chip made
initially by Fairchild called the 74181. And it is a 4-bit ALU with an internal 4-bit carry chain,
and carry-in and carry-out, that let you put multiple of these 74181s side by side
to build arbitrary-width arithmetic paths.
And so for a PDP-8,
there would be three of those chips side by side.
For a PDP-11, there'd be four of them.
And when eventually you get to the VAX-11/780,
which was DEC's first 32-bit computer, there are eight of them, right? And so that's an ALU-only
chip. But it also introduced something special. So basically, the 7400 series is mostly just building-block chips, but there
were a few very function-specific parts. And so one of the things the 74181 ALU chip did was it
didn't just handle carry, carry-in and carry-out. It also had additional pins called generate and propagate.
Okay.
And generate and propagate are a fast analysis of the operands and the carry logic that bypasses the process of rippling the carry from one bit to the next, right? It
doesn't create an answer, but broadside, very, very quickly, it can say, for the four-bit
operation that that chip is doing: if there's a carry-in, it always will generate a carry-out. Ah, okay. So, sorry, regardless of the carry-in, right, it'll
always generate a carry-out. So that's a generate, right? So the generate pin of a 181 says: I know
I'm going to generate a carry, no matter what the carry-in is, right? The propagate signal coming out of a 74181 says: I will propagate a carry if there's a carry-in.
Gotcha. So that means, outside of the 181, you can look at the signal you're providing to the
carry-in pin, and the propagate signal and the generate signal, which are calculated independently of actually calculating the add or the subtract, right?
And you can pre-calculate, or concurrently calculate, the carry into the next
four bits, right? Right. So that lets you get a very fast carry across four bits. Right.
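The generate/propagate idea can be sketched behaviorally in C (a sketch only; the real 74181's pin polarities and logic levels differ, and the names here are made up):

```c
#include <stdint.h>
#include <stdbool.h>

/* Group generate/propagate for a 4-bit slice, in the spirit of the
   74181's G and P outputs (behavioral sketch; real-chip polarities differ). */
void slice_gp(uint8_t a, uint8_t b, bool *g, bool *p) {
    bool gen = false;   /* slice generates a carry regardless of carry-in? */
    bool prop = true;   /* slice propagates a carry-in to carry-out?       */
    for (int i = 0; i < 4; i++) {
        bool gi = (a >> i) & (b >> i) & 1;    /* bit i generates a carry  */
        bool pi = ((a >> i) | (b >> i)) & 1;  /* bit i propagates a carry */
        gen = gi || (pi && gen);
        prop = prop && pi;
    }
    *g = gen;
    *p = prop;
}

/* Carry out of the slice from G, P, and carry-in alone: no ripple needed. */
bool slice_carry_out(bool g, bool p, bool cin) {
    return g || (p && cin);
}
```

A 74182-style lookahead unit combines four such (G, P) pairs the same way, one level up the tree.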
If you're putting four of those chips together,
Fairchild designed a part called the 74182,
and it had generate and propagate inputs that would come from all four of those chips.
It would look at the carry-in at the very beginning of a 16-bit thing, and it
would generate a generate and a propagate across all 16 bits. Okay. Right, very fast. Right. And so,
depending on the size of the processor you're building, you built a tree of these 182 chips,
right? So the 181s are doing the arithmetic, having to deal with the actual carry ripples
going through the arithmetic path, right? But you've got these
generate/propagate signals that are predicting right ahead, right? So the domain of adding two numbers in two's complement, or one's complement
for that matter, has been an area of research for decades, right? And so if you go and look at how you
add two numbers together, right, there's the simplest stuff at one end, which is: you add the first two
bits, and you look at whether there's a carry into the next position, right?
Right, right. And then you do the next two bits, and then the next two bits, and the next...
you know, two... well, one bit, two operands, right? Right. And you keep doing that, right? Right. Then you have
things like the generate/propagate tree. But that isn't the fastest way to go.
There are other things.
There's things called carry select adders,
where you calculate both answers concurrently,
one with carry in and one without carry in,
and then you use the carry in signal to drive a MUX
to select between the two answers.
Right.
And you still use those generate/propagate 182s
to control your carry-select muxes.
Right.
It has the overhead of muxes.
It has the bonus that it's faster.
Right.
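A behavioral sketch of the carry-select idea in C, on an 8-bit add split into two 4-bit halves (the name and width are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* Carry-select for the upper half of an 8-bit add: compute both possible
   upper sums (carry-in 0 and carry-in 1), then let the real carry out of
   the low half drive the mux. In hardware all three adds run in parallel;
   here they are sequential, so only the structure is illustrated. */
uint8_t carry_select_add8(uint8_t a, uint8_t b) {
    uint8_t lo  = (a & 0xF) + (b & 0xF);    /* low nibble plus its carry */
    bool carry  = lo >> 4;
    uint8_t hi0 = (a >> 4) + (b >> 4);      /* upper sum if carry-in = 0 */
    uint8_t hi1 = (a >> 4) + (b >> 4) + 1;  /* upper sum if carry-in = 1 */
    uint8_t hi  = carry ? hi1 : hi0;        /* the mux */
    return (uint8_t)((hi << 4) | (lo & 0xF));
}
```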
I love examples, even if they are, you know,
not the ultimate optimization of something, like the generate and propagate,
because they make it abundantly clear that a big part of performance improvement is understanding what dependencies
do exist and what things are false dependencies, right? And if you can look at something that's a
false dependency and say, oh, we can actually do this at the same time, then sometimes you get an
incremental speed-up, sometimes you get big performance changes. Right, yeah. So, in fact, as your adders get bigger,
this generate propagate stuff actually improves performance
on an exponential curve, right?
Because you end up... I mean, if you were doing a 64-bit add,
you're doing it in the same time that it takes a five-bit add to occur, right? Right. Because most
of the stuff that takes time is now the generate/propagates, because all the other
stuff runs sort of concurrently. Right, right. So we actually see some of this in DSP, for why FPGAs dominate the very high end of the DSP marketplace.
So I'm going to take a little side trip to expand on this thought.
So one of the most common DSP functions that is insanely computationally heavy is digital filters
And... but they're simple, right? They're simple. You take a vector of coefficients and a vector of
data, and you multiply each element of the data vector by the appropriate coefficient, right?
And so it's actually... the term is sum of products.
You may have heard that term.
So the product is the multiply of the coefficient by the operand,
repeated for N different elements.
Right. And DSP chips have vector instructions where you point at the coefficient array and the data array, which
might have come in from an A to D converter, or it might be a scan line from a video image
or whatever.
And then it, as fast as possible, fetches two operands at once, multiplies them together,
and adds the result.
So if you start with two 16-bit operands, you multiply them together, the worst case
is a 32-bit result.
If you add two of those together, the worst case thing is 33 bits, right?
If both of those 32-bit numbers were near max, right, you'll end up with a 33-bit result.
So you need one extra bit if
you're only adding two of them. If you add four of them, you need one more bit. If you add eight of
them, you need one more bit. 16, 32... so depending on how many elements you're adding together, you need more
bits in that accumulator. Right, right. So now we look at: how big should that accumulator be?
And that, in fact, will limit how big the array can be before you have to do something special, right?
Right. So if you're going to, let's say... well, multiply 4,096 operands with 4,096 coefficients and add them,
if that's what you want to do,
then your accumulator needs 12 extra bits.
So it will be 32 bits plus 12.
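The accumulator-width rule described here can be written down directly (a sketch; the function name is made up):

```c
/* Accumulator width needed to sum n products of two w-bit operands
   without overflow: the product is 2*w bits worst case, and summing
   n of them adds ceil(log2(n)) guard bits. */
int accumulator_bits(int operand_bits, int n_terms) {
    int guard = 0;
    while ((1 << guard) < n_terms)  /* guard = ceil(log2(n_terms)) */
        guard++;
    return 2 * operand_bits + guard;
}
```

For 16-bit operands and 4,096 terms this gives 32 + 12 = 44 bits, matching the example above.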
And if you're doing floating point
as a way to avoid needing those extra bits,
then you have to consider whether you're really doing the arithmetic you think you're doing,
because you're going to start throwing away all your low-order bits: whether you add them
or multiply them or whatever, they're going to get pushed out by all the other arithmetic that you're doing.
Right.
Right?
Right.
So there's a balance of when does...
Well, first of all, floating point
typically is massively slower than integer.
Mm-hmm.
Right?
Typically.
Except for some chips
where they just throw an insane amount of hardware at it
and they bring floating point
not up to the performance of pure integer, but maybe to a factor of three to five times slower.
Right. It depends on the processor and how much silicon you want to throw at it. Okay. So that's
how DSP is done in a DSP processor. And the common task which I've described to you
has a name.
It's called an FIR filter, right?
And it's finite impulse response.
And I could go into what that means, but it's irrelevant to this discussion.
What's important is that if I have a, let's say I have a data vector of 1,024 elements, right,
then I've got to do 1,024 multiplies and 1,023 adds.
Right.
Right?
And so that's going to take me... even if I overlap the adds with the multiplies,
and the multiply-accumulate section
always does that, right?
You're up for approximately 1,024 system cycles.
Right.
Plus, you've got to fetch 2,048 things out of memory.
Right.
Right.
And so if you look at DSP chips,
you will find that different regions of memory can be accessed concurrently. The on-chip memory, which might be a quarter of a megabyte, is broken up into four blocks, right, of 128K,
and the processor can fetch four operands in one clock.
Right, right.
Right?
So that lets them, you know, have two pointers,
one fetching from the operand list,
one from the coefficient list,
pushing them through the multiplier, into the adder, into the accumulator,
and then keep doing that.
It still takes order N, right?
It takes 1,024 cycles.
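The order-N loop a DSP's multiply-accumulate unit performs, sketched in C (illustrative; a real DSP does this with dedicated MAC hardware and dual memory fetches, not a C loop):

```c
#include <stdint.h>
#include <stddef.h>

/* Sequential sum of products for an n-tap FIR: one multiply-accumulate
   per element, so roughly n cycles on a single-MAC DSP. */
int64_t fir_mac(const int16_t *coeff, const int16_t *data, size_t n) {
    int64_t acc = 0;                         /* wide accumulator with guard bits */
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)coeff[i] * data[i];  /* 16x16 -> 32-bit product */
    return acc;
}
```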
How would you ever improve that?
And then along come FPGAs. And FPGAs say: I've got 1,024 operations. I can do the first multiply-accumulate and the second multiply-accumulate
concurrently, right, and store the results, if I can fetch enough operands. Right, right. And some of
the DSP chips will do that too. Some of them have dual multiply-accumulators, right? But there's a
limit to how many they have. Right, right. In an FPGA... there are FPGAs that have tens of thousands of multiplier-accumulators in them.
That's current technology.
The early stuff, they were measured in tens to low hundreds.
But even so, let's say you've got a chip
that has 1,024 multipliers, right? If you set things up right,
maybe you can fetch all 1,024 coefficients all at once,
and you can take your data that was coming from an A-to-D converter
and have it going through a shift register that is a word wide.
Right.
And when all 1,024 words are in exactly the right place,
fire off all 1,024 multipliers all at once.
Right.
And you get 1,024 multiplies in one cycle.
Right. And you combine the results of two of those multipliers... so you
actually don't care about the intermediate adds, right? You don't have to store them. All you want
is to know the result. So you have 512 adders, right, that combine two different multipliers,
which on a DSP chip would have been two consecutive multiplies.
On the FPGA, it's a pair of concurrent multipliers, and you get 512 additions,
which all happen in one cycle. You follow those 512 adders with 256 adders, followed by 128, 64, 32, whatever.
You end up with about 11 stages deep.
Right.
Right.
And so in 11 clock cycles, plus the one clock to do all the multiplies, you have the result of 1,024 multiplies and 1,024 adds.
You have it in 11 clocks.
Right.
But each of those stages was only used for one eleventh of that time.
Right.
So take that data that was in that shift register
and shift it one position
and now do another 1 another 1024 multiplies.
Right.
And have those results marching down behind, through that adder tree.
You've got a pipeline situation going.
Now you've got something that took 1,024 clocks to fill the pipeline.
So that's your startup latency, right? But after you've paid that latency of 1,024 clocks, which is a millisecond, right, you're now delivering a result every single clock cycle.
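The adder tree can be simulated in C to show the log2 depth (a software sketch of the structure only; in the FPGA each stage is one pipelined clock and all stages run concurrently):

```c
#include <stdint.h>
#include <stddef.h>

/* Pairwise reduction of n products (n a power of two): the count halves
   each stage, so n = 1024 collapses in log2(1024) = 10 add stages; the
   "about 11 stages" above includes the multiply stage. Modifies the
   array in place and reports the stage count. */
int64_t adder_tree(int64_t *products, size_t n, int *stages) {
    *stages = 0;
    while (n > 1) {
        for (size_t i = 0; i < n / 2; i++)                    /* one stage of */
            products[i] = products[2*i] + products[2*i + 1];  /* pairwise adds */
        n /= 2;
        (*stages)++;
    }
    return products[0];
}
```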
Yep. So now... so here's how an FIR filter works. You set up the coefficients to be a pattern that you're looking for, right?
And you set up your data so it's streaming,
and somewhere in this data is the pattern I want to find.
And now, in a single clock cycle... whereas with a DSP chip,
it will take 1,024 clocks before you can shift the shift register once,
in the FPGA, you shift at every clock.
So the FPGA is running 1,024 times faster than a DSP chip.
And they both had the same startup latency, right?
Right.
They both couldn't start until you had your first 1,024 data items.
But the DSP chip, after it has shifted everything one position, it's another 1,024 cycles until you
get an answer, right? In the FPGA, it's one clock, because they were coming down through the pipeline of
adders. So the DSP chip could, in theory, also have 1,024 multipliers and then the
add tree as well, right? It's just that most of the time they do not. And you couldn't
control expanding it or contracting it as you needed, right? You couldn't make it application-specific,
and it wouldn't be able to do anything else. Right, right. Right?
And it would be,
instead of being the relatively cheap price
that a DSP chip is,
which might be in the $10 to $12 range,
it might end up being like these high-end FPGAs,
which are $2,000 to $3,000.
But for the FPGA,
if what you really needed was that 1,024-position FIR filter running at 100 megahertz, delivering a new result every 10 nanoseconds,
right, so every 10 nanoseconds it does 1,024 multiplies and 1,023 adds and delivers a result, that FPGA might cost
you a few thousand dollars, right? Right. Maybe... I don't know what the current prices are;
maybe it's a few hundred dollars. It doesn't matter, right? The point is, no amount of banging on the side of your DSP chip will hit that performance target.
Right.
Right.
And the cool thing for the FPGA is, if you only need to do that until the radar says,
I've now found the target, now switch from search mode to tracking mode, right? That's: load a different bitstream,
and now track the target that you found with that FIR filter, right? Or now go
into an image recognition algorithm, or, you know, whatever else, right? By the
way, there is, of course, a two-dimensional version. So what I've described is a single-dimensional FIR.
There are two-dimensional FIRs which get applied to images
and do things like edge sharpening and blurring
or feature recognition, et cetera.
The FIR filter is a workhorse of the DSP domain,
which covers image, audio, radar,
and almost everywhere where you're dealing
with real-world signals,
other than on and off,
there are DSP algorithms often involved.
Right, right.
So, and, you know, I mean, an FIR filter can also, you know,
boost the bass and trim the treble or whatever else.
I mean, it's a general-purpose algorithm,
and it's functionally easy to implement,
but it is computationally heavy.
Right.
And again, it's the example of looking at a DSP chip,
which is inherently sequential,
and the FPGA, which offers but doesn't enforce
a parallel solution.
Right.
If you're working only with DSP chips,
that parallel tree-based solution
just never occurs to you, because it's not part of the architecture. For the FPGA, it takes a while
to reshape into thinking in terms of the extreme parallelism that FPGAs are capable of, right? But that's the reason that, you know,
high-end communication systems,
satellite communications, wireless, 3G,
next-generation TV distribution...
FPGAs are used in all of these systems,
because they can implement these highly parallel solutions,
and you don't have to commit to a custom chip, which, when they go and change the standard,
will now be unusable.
Right, right.
Absolutely.
I definitely heard that.
For the FPGA, yeah.
Yeah, when the algorithm changes or the standard changes, the FPGA says, well, okay, send me a new bitstream.
Right, exactly.
Right.
So, an interesting view: understanding your architecture is really important.
You know, when you want to do optimization, I generally say look at your algorithms first before you start looking at boosting the clock speed or throwing more chips at it or putting
a water cooler on it and overclocking it.
Right.
Right.
Before you do those things, go look at your algorithm and see if there's a better way, right? In a very... well, I won't say cynical, but...
okay, it's cynical: I've been of the general impression that when someone uses floating point for an algorithm, they
don't understand their algorithm, right?
Floating point is the crutch for not knowing what's going on in your data, right?
If you really understand what your data looks like and how you're
manipulating it at each stage, for most applications, not all, but for most applications,
there is a pure integer solution,
or at worst, an integer solution
with an invisible binary point somewhere.
Right.
Breaking it into the integer part
and the fractional part of a binary number.
Right.
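The invisible-binary-point idea in C, using a Q8.8 format as an arbitrary example (the format choice and names are illustrative, not from the episode):

```c
#include <stdint.h>

/* Fixed point with an invisible binary point: Q8.8, i.e. 8 integer bits
   and 8 fractional bits in a 16-bit word. The radix point is purely a
   convention the programmer tracks; the hardware does plain integer math. */
typedef int16_t q8_8;

q8_8 q_from_int(int x)     { return (q8_8)(x << 8); }
q8_8 q_add(q8_8 a, q8_8 b) { return (q8_8)(a + b); }  /* plain integer add */
q8_8 q_mul(q8_8 a, q8_8 b) {
    /* the product has 16 fraction bits; shift back down to 8 */
    return (q8_8)(((int32_t)a * b) >> 8);
}
```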
And I hope you noticed I was very explicit that it's not the decimal point.
Right, right.
I get angry when people refer to that as a decimal point when they're looking at binary
or octal or hex data.
Right.
And by the way, all of this
comes from when I worked at AMD, where I was designing these types of chips.
That is, the bit-slice products and the RISC CPUs. Right.
I resolved that by giving it a generic name: the radix point.
Okay, yep.
So rather than the decimal point,
for whatever radix you're working,
whether it's decimal, hex, octal, or binary,
that dot is the radix point.
Right, right.
And then I get less frustrated and angry.
Right.
Well, to encourage you to continue listening to the podcast
after this episode, I'll try to ensure that other guests adhere to that
practice as well. But you mentioned AMD, right? And this is, I guess, fast-forwarding a few years
from where we last left you, but I think you joined AMD in... Is that right? That's correct.
So that's five.
We've jumped over about six or seven years there.
Okay.
And that's fine.
We could do it later or not do it. It doesn't matter.
You did ask on the notes that you gave me
how much planning was there.
And so let me at least talk a little bit about that because...
Yeah, please do.
Yeah, one of the things I guess I would say permeates my approach to stuff is planning, right?
I mean, I ended up in a department as a manager of product planning at AMD and then later at Xilinx. But my planning stuff dated back six years
before I ended up at AMD.
So I'm going to jump us back to that 74181.
Okay.
I'm going to jump over everything else that's relevant,
but I'm going to come back to the 181.
That approach to building the ALU as a slice of four bits
of what is a larger data path,
that ended up getting a name called bit slice.
And its origin was... and I don't remember who had it first,
but either Intel had a two-bit-slice ALU, or Monolithic Memories
had a four-bit-slice ALU, right? And these were more than the 74181: they added the register file.
So rather than having the two operands that feed into the ALU be in external chips, the bit-slice processors
were basically a 181 with a register file, right?
There was a two-bit version called the 3002, I think, from Intel.
And then there was the 6701, I think,
was the part number, from Monolithic Memories.
Okay.
And it was a 4-bit slice.
And the ALUs in both of them were not as good as the 74181.
Okay.
Right.
But they did do add, subtract, AND, OR, XOR, and shift left and shift right by one bit.
And actually, the shift left and shift right
meant you needed additional communication bits.
So if you're only doing add and subtract,
you just need a carry/borrow line.
For shift left,
you can use the carry chain.
But for shift right,
you need something that's pointing the
other way, right? Right. Okay. So, you know, that added an additional across-the-ALU type
control signal. Um, so MMI wasn't particularly successful with their bit-slice ALUs. I haven't ever heard of a system that was built with them.
But one of the managers at MMI moved, and this would have been probably in the
mid-70s, from MMI to AMD. And at AMD, the 2901 was born. And the 2901 was far and away the most successful bit-slice processor.
And it was a 4-bit ALU that was more functional than the 6701.
It had a better register file.
It had generate and propagate lines, which I don't think the 6701 had, so you could actually
use these bit-slice chips from AMD with the Fairchild 182 chip, right, and take
advantage of that generate/propagate logic. Right. And so lots of computers, both minicomputers and mainframe computers,
were built with 2901s.
So, for instance, the Data General answer to the DEC VAX:
when DEC went from the PDP-11 to the VAX-11/780,
which happened in 1978, I believe, right,
that was still using 181s. Data General had a competitive machine
called the MV/8000, and it used 2901s from AMD. And then you started seeing 2901s
in mainframe computers from pretty much everyone except IBM, right? And many computers, so later models of
PDP-11s, probably used 2901s. And there were a lot of other companies that looked at the
success of Data General and DEC and were lesser players, but still building CPUs. And then in the early 80s through to the late 80s,
there was a new range of computers built by a bunch of startups
which were called mini supercomputers.
So they weren't quite the Cray or CDC 6700 type real super-duper computers,
but they were machines that were bigger than the biggest mini computers,
but they could handle workloads and compute that typically used
RISC-type instruction sets and bit slice ALUs.
Yeah.
So anyway, how did I get down on that path? I don't remember, other than to say, you know, the style of building stuff got more integrated. So anyway, I was aware of all this stuff in the late 70s. And so by this time I'd already designed my first chip. I hadn't mentioned that. I designed my first chip in 1972, when I was 12 years old.
Okay.
At that time frame, Fairchild and National Semiconductor
actually had factories in Australia, in Melbourne.
Only an hour's travel by public transport for a 12-year-old.
And those, they were primarily, they were test and packaging.
Okay.
Right.
So the chips came in as wafers, already fully processed. They were then tested, diced, and packaged, right, some going to export and some feeding into the minuscule Australian electronics industry. So I went and visited one of them. Just as I went off to Honeywell's timeshare building, I went off to Fairchild, right, and asked for a tour. And so I got a tour by one of the sales engineers through the manufacturing floor, and it wasn't a particularly big facility. You know, there was probably less than 100 people working there, right? But I got to see all that, and I asked about, you know, so how do these chips get designed? The guy says, well, some people in America in the product planning group, and this is Fairchild, right, they look at the features that people want, or the salespeople tell them these are the features that customers are asking for, and then they plan the products. And this guy really didn't understand how that then turned into a product. So then, you know, stuff happens, and the chips end up here, and here the exciting stuff happens: we cut them up, we test them, and we package them, right?
Right.
So he skipped over the whole, you know, design the chip part
because he really didn't have any handle on that.
He just talked about what was locally there.
I said, so what if I had an idea for a chip
because I keep wanting this function,
but it isn't in the 7400 catalog?
And he says, I have no idea.
Yeah.
I said, well, if I wrote a datasheet for the chip, right, would you send it to those people? And he said, sure. And so I wrote a datasheet and laid out the pinout and the logic function and the timing table.
So basically all the stuff, you know, and I typed it up on a typewriter.
I didn't have a word processor back then, right?
And I gave him this four- or five-sheet datasheet for a chip that would have something that I was using often, which was an RS latch: a pair of cross-coupled NAND gates, which is used, among other things, to debounce switches, right, so that you get nice clean transitions. I said, you've got all these pins. You could have four of these in one chip, rather than me going through all of these 7400 NAND chips, right? You could have four of these circuits in one package. That chip is still built today, and it's called the 74LS279.
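For listeners who haven't met the circuit, here is a behavioral Python model (not an electrical one) of one such cross-coupled NAND latch with active-low inputs, showing why a bouncing switch contact still produces one clean output transition:

```python
class SRLatchNAND:
    """Behavioral model of a pair of cross-coupled NAND gates (one quarter
    of a 74LS279-style part). Inputs are active low: set_n=0 forces Q=1,
    reset_n=0 forces Q=0, both high holds the previous state."""
    def __init__(self):
        self.q = 0

    def update(self, set_n, reset_n):
        if set_n == 0 and reset_n == 1:
            self.q = 1
        elif reset_n == 0 and set_n == 1:
            self.q = 0
        # set_n == reset_n == 1: hold -- this is what absorbs the bounce
        return self.q

# A changeover switch thrown from the "reset" contact to the "set" contact,
# with contact bounce on arrival:
latch = SRLatchNAND()
outputs = []
bouncy_inputs = [
    (1, 0),  # resting on the reset contact
    (1, 1),  # in flight: neither contact touched -> hold
    (0, 1),  # touches the set contact -> Q goes 1
    (1, 1),  # bounces off -> hold, Q stays 1
    (0, 1),  # lands again -> still 1
]
for s_n, r_n in bouncy_inputs:
    outputs.append(latch.update(s_n, r_n))
print(outputs)  # a single clean 0 -> 1 transition despite the bounce
```

Because a bounce only ever breaks contact (both inputs high) rather than touching the opposite contact, the latch never sees a reason to change back, and the output stays glitch-free.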
So that was my first product planning experience.
Did you get any attribution in this process at all?
Never. Never. But I claim, and there's no one still alive to deny it, I claim I designed the 74LS279.
Well, that is probably not something many 12-year-olds can say.
Yeah, well, and who will believe it at this point, right? It doesn't matter. The point was, it got me into understanding, right, the process of: you need a datasheet, you need to have thought through the functionality, you need to make sure it doesn't need more pins than the package has, et cetera, et cetera.
Okay, so late 70s, I'm now programming PDP-11s. I've got a real job, actually working in Fortran in a company. We're going to skip over all of that, unless you want to go back at it later. And I'm looking at what are people using for building processors, and the answer is they're using AMD 2901s. And so I start learning about bit slice. I mean, I already had a handle on it from using the 181s, but now I'm looking at what replaced them, and I'm looking at how it changed. I said, damn it, I want to end up somewhere near that. So the only company I want to work for is AMD.
Right, right.
So I want to work for AMD. I figured there was no way I'd get to work on bit slice directly, so my goal was, how do I at least get to make coffee for those guys?
Right, right.
That was my plan: get to make coffee for the guys who worked in the bit slice department so I could talk with them.
Right.
From Australia, the closest I could get to AMD in Australia was AMD had a distributor.
So AMD didn't have any physical presence in Australia at all, but there was a distributor
called R&D Electronics, and they distributed
AMD's parts. They also handled Intersil, Monolithic Memories, Zilog, and a bunch of
other companies. So in Australia, the common thing is, since none of the silicon companies
had a presence in the country, because Fairchild and Nat Semi's packaging division
had decamped years ago due to some political silliness, right? So Australia had no silicon
industry whatsoever, but there was a bunch of distributor companies. These days, you'd talk
about Digi-Key and Mouser, et cetera, right?
So these are much smaller versions of that for the much smaller Australian market,
and they represented multiple product lines,
typically trying to avoid conflicts of interest, right?
So if you look at the product lines from AMD, Monolithic Memories, Intersil, Varo, and Zilog, there's almost no overlap
between them, right?
And so they were presented to customers.
If a customer says, I want this, we would look through our catalog of parts, right?
And say, well, we have a solution for you and it uses these MMI parts or whatever. So anyway, in Australia, all of these distributor-type companies
were just basically sales and shipping organizations
and a little bit of marketing.
The equivalent in the US, though,
they typically had something called an applications engineer.
In fact, if you look at a company like AMD,
it actually has two categories of application engineers,
internal and external.
So the internal application engineers answer the telephones,
solve problems, write the data sheets,
probably write the application notes,
and they take the calls from the other application engineers who are out in the field,
who are called field application engineers.
Right.
Big surprise.
So a field application engineer who has a problem with a specific customer would work with the customer to refine the problem description, right, whatever that might be, and then approach the applications department inside the company to have someone take ownership of it and either resolve it using the internal resources, which might include going off and talking to the chip
designer. Right. Right. Depends what was, you know, what level of skill is needed to resolve
the issue. So, in the US, the distributors all had application engineers, right, who were external.
Not all of them worked for AMD, right, or Intersil or whatever, right? They actually worked for the distributor.
So you had two categories of external FAEs.
You have field application engineers who are distributor FAEs, second-class citizens,
and then you have first-class citizens who worked in AMD's sales offices spread across the country. Okay.
Right?
Who primarily only ever talk to the really big customers.
Right.
Right?
For all the little customers, they sent them off to the distributors.
Right.
Right? But once a year, AMD held their international sales conference and FAE conference. And they upgraded the distributor FAEs,
and they were invited as well. And so they had a conference that would run for a week,
right? And they had all the distributor FAEs, all the company external FAEs, and all the internal application engineers,
all in a huge conference facility, in AMD's case, usually in Hawaii.
And then factory people from the engineering department would make presentations
about what is in the pipeline and what to look out for in terms of opportunities,
plus marketing and salespeople would say, here's the products we already have, right? Here's how
you go about, you know, here's the types of customers you look for, here's the products
you offer them, right? So, it's a sales and engineering conference. AMD had them once a year.
That included external FAEs that work for distributors. In Australia, none of the
distributors had any FAEs. So I approached Australia's AMD distributor, a company called
R&D Electronics, and said, have you considered having an application
engineer across all your products?
And the owner of the company said, we've been thinking about it, but we didn't know who
we'd go get.
Right.
And I put my hand up and said, I'll do it.
I would love to be Australia's first field application engineer.
Right.
And so that happened at the end of 1979.
I joined as an FAE.
And while my primary goal was eventually to make coffee for the bit slice people in California, the first thing was to become competent as an application engineer. And I did it across all of that company's
product lines that it represented. So I didn't just become an AMD FAE, I became an MMI FAE, and I became an Intersil FAE, and an FAE for several other companies.
The MMI FAE position turned out to be, in the short term, the most useful, because I got to see PALs before they got released. And so I was basically training people how to use PALs when they first became available from MMI.
Right.
Which was a wonderful experience.
Wonderful.
But I also became an FAE for AMD,
and so within my first year of joining this company,
they sent me to America.
Awesome.
Right.
And I got to attend the application,
the sales and application conference.
And I got to meet, it wasn't my expectation,
but I actually got to meet some of the people who were the actual
chip designers in the bit slice group, right, and had interesting discussions. So I stayed with that company for about three years, attending conferences on an annual basis and building up my contacts at AMD and at MMI and at Intersil and various other
companies. And I had a load of fun in Australia, talking to lots of people, lots of different
engineering teams, all with different problems. So I got to see a very broad range of electronic design and different levels of
competence of teams and whatever. Eventually I left that company, because I wanted to actually do some engineering, right? And I joined a pair of small companies, one after the other, both of which had serious management problems.
Okay.
And they didn't last very long.
Any Australian companies?
Yeah, yeah, these are Australian companies with under 20 employees doing primarily S-100-based stuff, both designing boards and building systems
and selling accounting software and business software
and that sort of stuff.
That didn't last more than a year.
So it was basically two companies, six months each,
and it was like, this is not for me.
Then all through that period,
I'd also been a lecturer at one of the local universities
teaching final year system design.
So fourth year EE degree students,
I was teaching system design and balancing trade-offs and
writing low-level code and a little bit of DSP and a little bit of making Ethernet work, et cetera, et cetera.
So after the two small companies that really didn't suit me, I asked the university if I could become a full-time
lecturer. And so I hadn't lost sight of wanting to work in BitSlice, but I didn't feel I was
quite ready to make the jump. And I didn't actually have any opportunity for that. So I took on a role in Australia. The job title was senior lecturer.
The US equivalent would be associate professor.
That is responsible for course creation and syllabus and examination and evaluation, et cetera. So, it was reasonably autonomous, but, you know,
not professor level, you know, high status. Right, right. Fair enough.
Right, right. But, you know, I basically created a new, you know, digital course that, you know,
was fun to teach. Unfortunately, when I moved from being a part-time lecturer where I was just
teaching two subjects a week, so all the time that I was being an application engineer, I was also
taking off half a day per week to teach at that university. So they already knew me. And that
company that I was working for was incredibly nice to let me have half a day off a week to teach at the university.
They figured I'm training their next generation of customers.
Right.
Because guess what?
The data books that I handed out for them to do their assignments, the data books were AMD and MMI and Intersil.
Right.
Right. So there was clearly some linkage there. Anyway, the politics when I became a full-time member of staff was unbearable. The trivial politics between the various academics just drove me nuts. So I sent my resume off to someone in the applications department at AMD, right, to see if I could become an AMD internal applications engineer.
So that was my path.
I figured with my credentials,
having been a field application engineer in Australia
and having multiple years of them seeing me,
that it would be pretty easy.
Unfortunately, it took three or four resumes being sent, because the guy kept throwing them in the trash, which was very sad. But eventually, instead of trashing it, he gave my resume to the head of the bit slice group.
Oh, wow.
The product planning group for bit slice.
Okay.
And so he called me, so a long-distance phone call.
He called me and interviewed me for about an hour.
Okay.
And I told him about how I had, with a bunch of friends of mine, so we've skipped over all of this, but starting in 1980 when I became an employee at this distributor company, I also bought parts at a discount.
And I bought a pile of bit slice processor chips and all the stuff that goes around it.
And a bunch of friends and I had actually built a 32-bit mainframe.
Oh, wow.
Right?
So we actually had some very interesting stuff.
It's not for this discussion, but as to, do I know anything about bit slice,
the answer was I have a 32-bit mainframe that my friends and I designed and built,
and it's four times the performance of DEC's VAX.
Right. I'm sure the interview went pretty well after that.
Yeah, it did, really well. And he said, we'll probably get back to you. And later that day, I got a call from HR with a job offer. And unfortunately, I had just started that semester.
I said, so all of this is great.
I didn't even care how much they were offering me.
I wasn't going to be making coffee for this group.
I would actually become part of that group as my entrance into AMD.
So I just started the semester.
I said, I'm going to have to delay by six months.
Otherwise, I'm going to be stranding, right, a whole class of students.
Right.
And they agreed to it.
You know what?
I half regret it, but I don't have any guilt of having stranded students, which if I'd instead jumped at it and stranded
those students, I'd probably have felt bad about it for at least a while.
But instead, I taught that semester and then packed up everything I owned and headed off
to America.
And so by this time, I'm now late 20s, and I joined AMD not as a junior in that department, but one of two managers.
So there was a guy in charge of that department, and then under him, he had slots for two managers, and then under those two managers were then the rest of the department.
And so he brought me in as a,
AMD's term was a section manager.
And so I was a section manager
in the programmable processors division.
Okay.
Right, responsible for, I thought, bit slice products.
But it turns out, nah-ah.
It turns out the department had a secret project
that was unannounced to the outside world,
which was a RISC CPU.
And so they had just finished a round of next-generation bit-sliced parts,
and rather than doing another round of it, they decided, for whatever reason, that RISC made more sense as the next thing for their skill set. So the rest of AMD was busy copying Intel instruction sets, right, whereas this group, the bit slice group, was designing instruction sets. And the RISC CPU would be designing the instruction set as well, so that's where it belonged, at least initially. And so they brought me in because of my bit slice experience, of actually having built stuff. So bit slice in general, microcoded processors, is not RISC at all. It's CISC, right? You're building interpreters that are below, that implement the assembler language.
So would the bit slice components, would they fit into the CISC processors as like a target, essentially, for the microcode inside of the larger system?
Very much.
Okay.
Yeah, right.
So pretty much all of these bit-sliced parts
were implementing CISC CPUs.
Right?
So the bit-sliced parts are pretty much inherently
not driven by assembler.
They're driven by micro-assembler code, or microcode. And so you have an instruction that might be 64 bits wide, or a lot wider, that's controlling all these different microcoded chips concurrently. In fact, the machine that we built in Australia had a microcode instruction that was 128 bits wide.
Right.
And so it was controlling a lot of stuff concurrently, which is part of what made it fast.
Right.
Right.
But it was still cycling around an interpreter loop, interpreting an upper level assembler.
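A sketch of what a wide horizontal microword buys you, with field names and widths invented purely for illustration: each clock, one word drives many units at once with no decoding in between.

```python
# A toy 32-bit horizontal microword, sliced into fields that each drive a
# different unit in the same clock. Real machines like the one described
# were 64 to 128 bits wide; this field layout is purely illustrative.
FIELDS = {              # name: (low bit, width)
    "alu_op": (0, 4),   # which ALU function this cycle
    "a_addr": (4, 6),   # register file read port A
    "b_addr": (10, 6),  # register file read port B
    "dest":   (16, 6),  # write-back address
    "shift":  (22, 2),  # shifter control
    "branch": (24, 8),  # next-microaddress / sequencing control
}

def decode(microword):
    """Split one wide word into per-unit control signals. There is no decode
    logic to speak of: every unit just sees its own bits concurrently."""
    return {name: (microword >> lo) & ((1 << width) - 1)
            for name, (lo, width) in FIELDS.items()}

word = (0x3 << 0) | (0x05 << 4) | (0x0A << 10) | (0x11 << 16) | (0x1 << 22) | (0x42 << 24)
print(decode(word))
```

Widening the word just adds more fields, which is exactly why a "memory is free" attitude made very wide microcode attractive.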
Right.
So anyway, I was coming in to work in the RISC group, but I'm the guy that doesn't believe in RISC, right? I have immersed myself for at least the last five or six years in CISC-type stuff. And this RISC group was built from the bit slice group, which was also of CISC architecture origin. So the manager, when he brought me in, said, I want you to spend a few weeks going through all of their documents and analyzing it, and tell me if we're heading in the wrong direction.
Right?
Because I'm coming in as a very solid CISC-type viewpoint
and experience of actually building CISC CPUs.
Is this RISC thing really better, right?
So I worked on it for two weeks, went through all the documentation they had, all their estimates of timing, the instruction flows, whatever else. And at the end, I said to the boss, this is insane. It's crazy to keep me doing this task. Put me on the RISC team. This is the future.
Wow.
Right. And so basically I became one of the co-architects. I mean, in total there were eight engineers who architected the 29000, right? Everybody contributed. So I wasn't like a lead architect or anything.
I was one of a team.
Although I had three other engineers under me,
which, by the way, I'd never had reports before.
And I'm not a great people person.
You could have called me.
Well, I'm not sure I did my best, but I'm much better at pushing gates than pushing people.
Fair enough.
So, I did my best and I really enjoyed working.
We really worked more as colleagues than as manager and underlings. So, for instance, myself and one of my underlings wrote the cross assembler for the 29000, right? And I worked with another person on various other tasks, right? So there was a lot of real work where, as a manager,
I still got my hands totally dirty along with everybody else in the team.
Right.
In that two-week window, though, that you were talking about
where you kind of became convinced, if you will,
I've read quite a bit about the RISC versus CISC academic debates in that time period.
What was it that you saw in those two weeks
that made you a convert, if you will?
So one of the things that I carried from my Australian project, when I convinced a bunch of friends to help me build this machine, which was all done as wire wrap, was I said, I think we need some guiding ideas
that are forward-looking, right?
Rather than working with what we have,
let's predict where something is going and design for that target, right?
And I said, among the things that we've seen
is that chips get more and more complex,
but the other thing is that the cost of memory keeps dropping
and the size of memory keeps going up.
Let's use as a guiding principle
for this CISC machine that we were building,
let's use as a guiding principle that memory is free.
So wherever we're making decisions
where memory might be involved,
don't treat it as a scarce resource.
Treat it as something that's free.
So the fact that the microcode memory was 128 bits wide,
not an issue.
The fact that it was a quarter of a megaword deep, when most companies were looking at like a kiloword or two kilowords, we were doing 256K-deep microcode memory, totally outside of any norm. But it let us do things, right? In terms of the way we cracked instructions in the interpretive loop, we could just have separate routines for groups of instructions that are very, very similar but would have an overhead to sort out the minor differences. With that much microcode, we just have unique microcode for each one and avoid a branch and a compare, right, as part of the decode loop.
Right.
As an example.
Late in the project, we virtualized it.
Okay.
So we had demand-paged microcode, right, because memory is free. The register file was 4096 registers, right? And the instructions that the microcode engine could execute could fetch any two registers, perform an ALU op, and write back to any third register every clock.
Right?
Because memory is free.
I mean, it wasn't.
And so we didn't implement everything.
Right.
But we used that as a guiding principle.
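That fetch-two, write-one-every-clock datapath can be sketched behaviorally like this (names invented; a model of the idea, not the actual machine):

```python
class ThreePortRegFile:
    """Behavioral model of a 4096-entry, two-read-one-write register file:
    every clock it services two reads, one ALU op, and one write-back,
    as described for the Australian machine."""
    def __init__(self, n=4096):
        self.regs = [0] * n

    def cycle(self, a_addr, b_addr, dest, alu_op):
        a, b = self.regs[a_addr], self.regs[b_addr]   # two fetches
        result = alu_op(a, b)                          # one ALU op
        self.regs[dest] = result                       # one write-back
        return result

rf = ThreePortRegFile()
rf.regs[1], rf.regs[2] = 10, 32
print(rf.cycle(1, 2, 3, lambda a, b: a + b))  # r3 = r1 + r2 in a single "clock"
```

In hardware this means a three-port memory (two read ports, one write port), which is expensive in area; under the "memory is free" principle, that cost was simply accepted.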
So, memory is free. I brought that with me when I started looking at the AMD project. And the number one thing that people pointed at for RISC is, because the instructions are so simple, you're going to have to have so many more of them, and memory's expensive. And I said, no, it isn't. Memory's free, right? Memory's free. And so that's no longer an issue for RISC, because in time, that memory that you're worried about how much it costs, it's not going to matter, right? The performance is going to matter. How fast can I fetch instructions? And with RISC, there were two things. One was that the memory-is-free thing got rid of one of the anti-RISC sort of arguments. But the other one, I didn't see it directly.
It was an education that I got from one of the guys that was...
Okay, so by the way, that team of architects on the 29000 at AMD: of the eight people, five of them were fresh out of college.
Oh, wow.
Right. So five of them were new grads who hadn't worked anywhere prior to AMD, and the rest of them, there was me, a guy from National Semiconductor, and a guy from IBM. The IBM guy had worked on IBM's RISC processor that eventually became PowerPC. So one of the guys who was fresh out of college educated me on compiler optimizations. And he was very strong on how, because instead of having microcode where everything is set in concrete for how each instruction is executed, on a RISC CPU, you can take a step up, a layer out on the onion, to the next layer out.
So you're not trying to execute single assembler instructions.
You're trying to do some algorithmic piece.
And there are optimizations you can do.
The best example is, let's say, instruction A has to store to a register and instruction B has to fetch from the same register, right, and they happen one after the other. Well, in a RISC CPU, maybe that's a direct path, and you're not touching the register file at all. There's an intermediate register that's otherwise invisible, right, but lets you, over two clocks, do something that would have taken three or four clocks and blown away a register along the way, which could have been used for holding a useful coefficient or something. So for the RISC CPUs, there were lots of opportunities for optimizations in the compiler that would help you do a better job with the limited resources that the RISC CPU provides. It basically allowed finer-grain optimizations, and I really liked that. So it was the finer-grain optimizations.
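The direct path Philip describes is what's now usually called result forwarding or bypassing. A minimal Python sketch of the idea, with all the structure invented for illustration:

```python
# Minimal sketch of result forwarding: instruction B reads the value that
# instruction A just produced from a bypass register, not the register file.
# Everything here is illustrative, not the 29000's actual pipeline.
class Pipeline:
    def __init__(self):
        self.regs = [0] * 8
        self.bypass = None  # (dest_reg, value) produced last cycle

    def execute(self, op, dest, src1, src2):
        def read(r):
            # Forwarding: if the previous instruction wrote this register,
            # take the value off the bypass path instead of the register file.
            if self.bypass and self.bypass[0] == r:
                return self.bypass[1]
            return self.regs[r]

        a, b = read(src1), read(src2)
        result = a + b if op == "add" else a - b
        self.regs[dest] = result
        self.bypass = (dest, result)
        return result

p = Pipeline()
p.regs[1], p.regs[2] = 5, 7
p.execute("add", 3, 1, 2)         # A: r3 = r1 + r2
print(p.execute("add", 4, 3, 2))  # B: r4 = r3 + r2, r3 arrives via the bypass
```

A microcoded interpreter that rigidly executes one assembler instruction at a time cannot see across the A/B boundary to exploit this; a compiler targeting a RISC pipeline can.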
And it was the memory-is-free answer to the memory-is-expensive argument. I mean, we had benchmarks. That same guy had written a GCC back end for the sketched-out architecture. The architecture that AMD had for the 29000 was not yet set in concrete.
The broad strokes were done, but the detailed instruction set hadn't been done.
We knew what types of instructions there'd be.
We knew the approximate layout of the bit fields in the instructions. Some of that surprisingly is important in that you want to be able to take instruction
bits and route them directly to where they're needed, like maybe controlling a MUX or directly
going into an ALU to select which ALU function, rather than having to be decoded and then
generate a control word, which is a different bit pattern.
You want to avoid that.
And so some of that is literally the artistry
of figuring out the bits of an opcode.
So the 29,000 is a 32-bit instruction,
eight bits of opcode, and three 8-bit fields.
And sometimes two of them are joined to create a 16-bit field. But the eight-bit instruction field, certain of those bits
are routed directly to the ALU. Certain bits go off other places. But suffice to say,
there's a lot of stuff there where you avoid having to put a mux in a critical path because you just lay out the bit fields.
So that was stuff I knew about.
And in fact, I was one that did the detailed design, right?
The instruction set had already been figured out by other people, but they hadn't yet figured out how does that get mapped to the op codes.
Right.
And so one of the tasks, I said, it's a tedious job, but I'm willing to do it. And so I actually figured out the exact bit patterns for every instruction and optimized them so that it would work. In fact, even things like: there's fetch two registers and write back to the third. Do you really want those three fields
to be in some arbitrary combination
of three 8-bit fields?
Or is there an optimal one?
And the answer is,
well, if I'm going to have 16-bit constants
and I want to write them to a register,
then maybe the 16-bit constant
should be the bottom 16 bits, and the write-back register should be the next eight bits, and the opcode should be the top eight bits, right?
Right.
As an example.
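A sketch of that layout as described in the conversation, with the opcode value invented and the details treated as illustrative rather than as a datasheet: the 16-bit constant already sits in the bottom 16 bits, so no shifting or muxing is needed to route it.

```python
# Sketch of a 32-bit instruction carved into an 8-bit opcode and three
# 8-bit fields, where the two low fields can be joined into a 16-bit
# constant -- the layout Philip describes; details are illustrative.
def fields(insn):
    return {
        "opcode": (insn >> 24) & 0xFF,   # top eight bits
        "dest":   (insn >> 16) & 0xFF,   # write-back register
        "src_a":  (insn >> 8) & 0xFF,
        "src_b":  insn & 0xFF,
        # Two low fields joined: a 16-bit constant already sitting in the
        # bottom 16 bits, so it routes to the datapath with no shifter.
        "const16": insn & 0xFFFF,
    }

# "load r5 with constant 0xBEEF" under this layout (opcode value invented):
insn = (0x60 << 24) | (5 << 16) | 0xBEEF
f = fields(insn)
print(hex(f["opcode"]), f["dest"], hex(f["const16"]))
```

In hardware, each of these slices is just a bundle of wires peeled off the instruction register, which is exactly the "route bits directly, avoid a decode step" point.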
Yeah, the simplest version of this, like, just considering where bits are in the instruction, that I've experienced at least, is implementing a RISC-V CPU. RISC-V, one of the nice things about it is arguments are always in the same place. There's lots of different instruction formats, but if there's RS1 and RS2, they're always at the same place in the instruction, if they're present. And once you actually try to implement the logic, right, for a CPU, you understand how important that is.
But, you know, if you haven't done that before, then it's not obvious.
If all you've ever done is programmed in Assembler,
that's opaque to you.
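Concretely, in the RISC-V base encoding, rs1 always sits in bits 19:15 and rs2 in bits 24:20 whenever they are present, across the R-, I-, S-, and B-type formats, so the register file read ports can be wired straight to those bits:

```python
# RISC-V base ISA: source register fields live at fixed bit positions in
# every format that has them, so decode can route them without muxes.
def rs1(insn): return (insn >> 15) & 0x1F   # bits 19:15
def rs2(insn): return (insn >> 20) & 0x1F   # bits 24:20
def rd(insn):  return (insn >> 7) & 0x1F    # bits 11:7

add_x3_x1_x2 = 0x002081B3  # R-type: add x3, x1, x2
sw_x2_0_x1   = 0x0020A023  # S-type: sw x2, 0(x1) -- no rd, but rs1/rs2 unchanged

print(rs1(add_x3_x1_x2), rs2(add_x3_x1_x2), rd(add_x3_x1_x2))
print(rs1(sw_x2_0_x1), rs2(sw_x2_0_x1))
```

The same extractors work on both formats, which is the whole point: the register file can start its reads before the opcode has even been classified.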
And what you're describing, literally the 29000 does that too, right?
So there are instructions that only have a destination register.
There are instructions that have a destination and one operand and a short immediate, right,
of eight bits.
Right.
And then there's instructions which fetch two registers and write back to a third.
And because there is a place in the data path where you have to choose between a register being fetched and that short immediate, or two registers being fetched and the long immediate, right, the 16-bit immediate, those mean there are muxes somewhere, right, that have to decide: am I taking the bits from the instruction, or am I taking the bits that the instruction is indexing into the register file? And there's a register file access that is pipelined, so I won't actually get that register for another clock. So there's a bunch of juggling there, right? And it wasn't like something I'd knocked out in an afternoon, right?
Right.
I probably spent two or three weeks designing the bit patterns
for the instruction set, right?
So anyway, there was just a bunch of things that I saw
in what the 29000 was capable of doing, and what I knew it was competing against in CISC would always require multiple instructions. And that interpretation process was blocking optimizations that were open to well-written compilers.
Mm-hmm.
Right.
Right.
You know, it may surprise some of the listeners that, if you know what's going on behind the scenes, RISC won, a hundred percent. And you'd say, oh, but x86 dominates the market. x86 on the package dominates the market. Inside, both AMD's and Intel's x86s are RISC CPUs, right? And they all have an on-the-fly translation engine and caches that translate the x86 instructions into RISC operations. AMD calls them ROPs, R-O-P-s. And they then go into the ROPs cache, right? And the actual compute engine is executing from the ROPs cache, running RISC instructions, right?
Right.
So, you know, some of those are doing optimizations on the fly as well, although, again, there are opportunities in the compiler to help things along. Okay, where would you like me to talk next?
Well, one of the other things, about the, I don't know if it was the 29000 or another chip that you worked on, but we also briefly, I think in the past, talked about register windows. Were those on the 29000?
Yeah, that was on the 29000.
Okay.
So it's interesting. I've wondered, so the comparison is,
so there were two processors that used a rolling window of registers
that the execution stream is busy modifying.
And one way, the term we used,
and you'll have to forgive me a little bit because we're talking 36, 37 years ago.
It's been a little while since I worked on this.
Someone commented to me that as you get older, you talk less about the great things that you'll do in the future and more about the great things you've done in the past.
So I'm clearly in the older category.
And some of it, the memory fades because it's back far enough.
But so one of the things that was part of the 29000's architecture, well before, so, I mean, a lot of work had been done before I arrived, a lot of work.
And one of them was that there'd be this register window.
And so basically what a register window is,
is it says that all the registers live in memory, right?
And there were processors before the 29000 that did that. Among them was the Texas Instruments 9900, which dates from the 70s, except it really had all the registers sitting in memory, right? So kind of slow.
Right, right.
The register file in the 29000, we referred to it as a stack cache. And there is, well, for the purpose of this discussion, we'll say there's 128 registers. There isn't. There's actually 192. It was originally 256, and we ran out of room.
So originally, because we had 8-bit fields,
it was intended to index into the register file
with 256 registers.
Well, it turns out the chip design guys
couldn't build the chip
because there were too many registers.
And so something had to go.
And we ended up throwing away 64 registers.
And so we have a lower half of 128 registers and an upper half of 64, which used to be 128, right? So there's a negative optimization, because we ran out of silicon, right? And that then propagated to every member of the family afterwards.
Okay. So let's talk about the half of the register file that was still there, right, which was 128 registers. Nominally, the stack pointer is a pointer somewhere into those 128
registers, and all the registers above that pointer position are parts of the stack that haven't been touched yet or are stale from previous execution, and the stuff below the stack pointer is active stack frames that we plan to look at as we unroll the stack. Okay. So the register file, 128 registers, is like a barrel, right? You have a pointer that says, here's where I am in the 128 registers.
There's a marker that says,
this is the first of the valid registers.
And so there up to the stack pointer
is the top of stack with N registers.
And then you have, from there, you wrap around to that same boundary register pointer, which are the registers that you could use, because this cache is sitting there; you just haven't dug deep enough in your subroutines yet, right? Okay. So when you want to do a subroutine call, you compare the current stack pointer, bottom seven bits,
to the boundary register, bottom seven bits, and you say, do I have enough room to allocate the
registers that I need for this function, right? So the compiler already knows: I'm passing three parameters plus a return address plus a frame pointer or something. And they're all going to get pushed on the stack. And then we're
going to jump off to the subroutine. So I need seven registers. Are seven registers available
above the current stack pointer? Right. And you look at it and say, yes, there are.
So you just update the frame register,
you update the stack pointer,
you put your parameters into the appropriate places
in the stack cache, and you go to your subroutine,
and it can then do indirect or offset references
to the new stack pointer to get its parameters off the stack.
And if it wants to call another routine, it does the same calling convention.
Right.
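As an aside for readers: the allocation check described above (compare the bottom seven bits of the stack pointer against the boundary register, around a 128-register barrel) might be sketched roughly like this. The function name and the exact wrap-around convention are illustrative assumptions, not the actual Am29000 hardware logic.

```python
# Rough sketch of the register-window allocation check on a call.
# The 128-register half of the file behaves like a barrel, so only the
# bottom seven bits (values mod 128) of the two pointers matter.

WINDOW = 128  # registers in the visible half of the file

def room_for_call(stack_ptr: int, boundary: int, needed: int) -> bool:
    """True if `needed` registers fit between the stack pointer and the
    spill boundary, with wrap-around on the 128-register barrel."""
    free = (boundary - stack_ptr) % WINDOW
    if free == 0:          # pointers coincide: treat the window as empty
        free = WINDOW
    return needed <= free
```

If the check fails, registers have to be written out to memory first, which is the spill case discussed below.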
Eventually, if you just keep digging down, right, deeper and deeper into a subroutine call tree,
eventually there won't be enough registers.
Right.
Right.
The registers that are furthest away,
which are right at that boundary register,
but looking at it from going backwards,
those are the ones you're not going to want to look at for a long time until you unwind the stack.
So you flush them out to real memory.
Right.
Okay.
That's called a spill, right? Right. And you just move the reference pointer by however many registers you threw away, and now, magically, you've got more stack. Right. So the cost of doing a spill, which happens when you're digging down through the subroutine tree, is the cost of writing the registers out and updating the pointer. And the choices are: spill only as many as you need, spill everything, or maybe spill half. Right. And I don't remember exactly, but we certainly didn't spill only as many as you need, and we certainly didn't spill everything. I think the default was you just always spill half.
Right. So you spilled 64 registers out to memory, and then you continued executing code. Now, eventually you start executing return instructions, and eventually the frame pointer is pointing back before the reference pointer, which says the registers you want aren't in the cache, right? Right. And so now you do a fill, right? You're unwinding the stack calls, so you do a fill, and the fill is: bring in 64 registers. Right. And so nominally, the stack pointer is always in the middle, with 64 registers behind it and 64 registers after it. And it turned out that whatever the spill and fill quantities were, they basically meant that a very high percentage of subroutine calls triggered neither a spill nor a fill. So you had something like 95%, 97% of subroutine calls didn't trigger a spill or a fill.
So the efficiency was outstanding, right?
So you always had the top of stack, right,
which had all your subroutine parameters
or return values or whatever,
are just always available
in the fastest memory the chip has, right?
Right.
The register file.
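To make the spill-half/fill-half behavior concrete, here is a toy model. The sizes, the policy, and the class name are illustrative assumptions based on the description above, not the real Am29000 implementation.

```python
# Toy model of a register stack cache with a spill-half / fill-half policy.
# Sizes and policy are assumptions taken from the discussion above.

class StackCache:
    def __init__(self, size=128, chunk=64):
        self.size = size    # registers in the visible window
        self.chunk = chunk  # registers moved per spill or fill
        self.used = 0       # registers currently live on-chip
        self.spills = 0
        self.fills = 0

    def call(self, frame):
        """Allocate `frame` registers, spilling a chunk to memory if full."""
        while self.used + frame > self.size:
            self.used -= self.chunk  # oldest chunk written out to memory
            self.spills += 1
        self.used += frame

    def ret(self, frame):
        """Free `frame` registers, filling a chunk back in on underflow."""
        self.used -= frame
        while self.used < 0:
            self.used += self.chunk  # chunk read back from memory
            self.fills += 1

# A 30-deep chain of calls needing 8 registers each, then the matching
# returns: only a handful of the 60 events touch memory at all.
cache = StackCache()
for _ in range(30):
    cache.call(8)
for _ in range(30):
    cache.ret(8)
```

In this toy run only 4 of the 60 calls and returns cause a spill or a fill, in the spirit of the high hit rates mentioned above.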
And so that's an optimization versus some other register window strategies. But just to compare to not using register windows at all: if you have registers you need to preserve across a subroutine call, you have to have a function prologue, or prelude, where you're going to push them onto the stack. And so one of the things you're saving is all of those operations where you're manually pushing and popping things to the stack. In x86 code, right, or ARM code for that matter, because ARM has a similar problem, you'll see, particularly for things that might end up being interrupt service routines, that they end up having to flush all the registers out to memory, because you need a working set. On the 29000, you say, my interrupt service routine needs X registers.
Can I just move the pointer and give it X of its own registers?
And when we're done, we'll return them all.
So absolute register numbers, they don't exist in 29,000 code.
They're all, what is the offset from the top of stack?
And the top of stack is always on chip, not out somewhere in main memory.
Right?
It's somewhere within those 128 registers.
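The "offset from the top of stack" register naming described here might be sketched as follows; the function and the exact modulo arithmetic are assumptions for illustration.

```python
# Illustrative sketch: a register name in an instruction is an offset from
# the current stack pointer, wrapped around the 128-register barrel, rather
# than an absolute register number.

WINDOW = 128  # registers in the visible half of the file

def physical_reg(stack_ptr: int, local_num: int) -> int:
    """Map an offset-from-top-of-stack name to a physical register index."""
    return (stack_ptr + local_num) % WINDOW
```

Because all names are relative, moving the stack pointer (for an interrupt handler, say) transparently renames the whole working set.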
I did some analysis. I basically heard about register windows from learning about SPARC. And I was looking at x86 and ARM and now RISC-V. And I was curious why none of these newer ones, if you will, right? I mean, some of these trace back to before. But why they don't use register windows.
And I did a bunch of reading.
I ended up writing a blog post
that kind of went through some of the trade-offs. But I had briefly mentioned to you seeing a comment that SPARC register windows were perhaps not implemented quite as efficiently. Do you think that contributed to register windows not being included in subsequent architectures? I don't think so. I think that the SPARC decision, so SPARC also has register windows that spill and fill. The difference between it and the 29000 is the pool of registers is smaller and the allocation size is fixed. So when you enter a subroutine, the movement of the frame register is always by the same amount, whereas for the 29000, it's optimized on a per-function basis. So if you only need three registers, you only consume three registers, whereas SPARC always consumes 12 or something. I don't remember the number, but whatever it is. On the 29000 that worked because of the call instruction that said, here's how many registers I need to allocate.
Right.
Right.
So in the 29000, part of that subroutine call is the frame increment.
Mm-hmm.
Right?
So, you know, that was because of the way that those bit fields were laid out: we had an 8-bit immediate field that you weren't needing for the subroutine call. Right. Actually, how was it done? It was actually not the immediate field; it was the destination register field, right, the one that was always available. And that left you with a 16-bit relative offset for how far away the function could be. So you could be up to 64K words of offset, right, which was split 32K before and after the current PC value. And if you needed more than that, there was a different call instruction that said call indirect through a register, right, which would have a 32-bit absolute address or a 32-bit offset from the current program counter.
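The 16-bit PC-relative reach described here (64K words, split 32K before and 32K after the PC) can be sketched as follows; the sign-extension handling and the 4-byte word size are assumptions for illustration.

```python
# Sketch of decoding a 16-bit PC-relative call: the 16-bit field is a
# signed word offset, giving a reach of 32K words either side of the PC.
# Treating a word as 4 bytes (32-bit machine) is an assumption here.

def call_target(pc: int, offset_field: int) -> int:
    """Sign-extend the 16-bit word offset and apply it to the PC."""
    assert 0 <= offset_field < (1 << 16)
    offset = offset_field - (1 << 16) if offset_field & 0x8000 else offset_field
    return pc + offset * 4
```

Targets beyond this reach would need the indirect form with a full 32-bit address in a register, as described above.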
Right, makes sense.
So, you know, I suspect it was just a function
of how the bits were laid out.
And, you know, there was also an overhead
that it required an extra 8-bit adder in a somewhat critical register
access path.
Mm-hmm.
Right?
So, you know, it wasn't without its own costs in the silicon implementation.
Right.
Yeah.
Right.
That makes sense.
Yeah. Maybe as a kind of wrap-up of where we've gone so far, and we haven't even gotten to Xilinx and FPGAs and everything, so I am absolutely going to have to have you back. But from your time at AMD, maybe specifically on the 29000 or other work that you did there: were there any other kind of lessons or takeaways that you had,
both from the perspective of things we've been talking about of,
you know,
specific to processor architecture,
but also just,
you know,
general organizational takeaways as well.
Oh,
I have,
I have a career worth of organizational takeaways.
Be really careful picking your boss. And when you do, make sure you're on their team and working to meet whatever their goals are. So that's really important. And understand what the hot buttons are of whoever is managing you, and make sure you don't do anything to shoot yourself in the foot. Right, I did that a few too many times. It helps if you have the ability to get a boss who really understands. I guess I've had this attitude that the function of a manager slash boss is to protect the people below from the crap above.
Right. So you have upper-level people, above your boss's level, who are busy jerking chains and changing schedules and, you know, doing things that would really upset what the team is trying to get done on what is probably a tight schedule. And so a good boss insulates the workers, the worker bees, from the politics that really shouldn't affect them. Right. And I've had bosses both who do that well and who don't do it well at all, and it really affects you. I have a fairly complex, I guess, architectural observation. It's not mine; this is from when I joined AMD. So, AMD's product planning, and I know it's in your question list, which is how different was AMD product
planning from Xilinx. And there's an Australian term which I think we should propagate to America: it's like chalk and cheese. Okay. Right. They might look the same, sort of, you know, just a piece of chalk, a piece of cheese. Look, take a bite, you'll see there's a difference, right? So anyway, here the equivalent is it's like apples and oranges, or something and pickup trucks, or whatever. But the Australian term is it's like chalk and cheese. It's just really different.
So anyway, one of the things that AMD had was actually a whole product planning department.
So the group I was part of, the processor products group, was one of several teams. Our team was like eight or nine people total, including its section manager. I think product planning was maybe 90 engineers. Oh, wow. Right, because AMD back in those days had a very broad range of proprietary products. They sold all of that stuff off, but there was a PLD group that did PALs. There was a networking group that built Ethernet chips and communication chips and RS-485 stuff.
There was a group that was doing video graphics chips
These are all 1980s-type stuff, right? So not, like, PC-type stuff, right; the PC wasn't really much on the horizon. It was basically monochrome monitors at that point, right? Right. So there was a graphics group.
There was a group doing disk controllers.
There was a group doing microcontrollers, sort of 8051 type stuff.
There was a memories group.
I used to joke with other people in my group that the memory group's product planners had the easiest job in the company: they needed to come in once a year, take a data sheet, mark it up by doubling all the numbers, yeah, throw it back into the thing, and then go away for a year. Right, right. So anyway, there were about 90 engineers in product planning. Almost all of them, not all, but almost all, had industry experience building systems. Okay. That is almost unique to AMD. Well, I think Intel has at least something similar,
in that their processor architects, I think,
have all worked on building motherboards, right?
Whereas AMD's, like the guys that ended up, you know, for instance, in AMD's product planning group for graphics chips
had worked somewhere else on graphic systems.
The disk controller guys had worked for some big disk manufacturer, right?
So the engineers in product planning at AMD
all had, not all, almost all had
actual domain-specific industry experience.
Right.
Usually when I'm being cynical (hard to believe I could be cynical) about chip companies in general, my common comment, and it's every time I see something stupid in some chip that I didn't design, right, is: these people couldn't design a board with two chips to save their life.
And I will say that I've seen that as true of most chip designers.
They go through the Carver Mead playing-with-rectangles path, with lambdas and whatever else. But actually building a system with 50 chips on a board, and making the clock tree work, and the race conditions work, and the logic work the first time, you know, actual system design? Most chip designers have made it through college and out with a degree saying, I know how to run Allegro or Cadence tools or whatever, and I can design you a billion-dollar chip. But give them, you know, an 8-bit microprocessor chip and an LED and an EPROM and ask them to put it together, and they have no idea which end of the soldering iron to pick up.
Right.
Right?
So that was certainly a difference with AMD's product planners.
At Xilinx, which is where I worked later, I was the first product planner, so I couldn't point to other people who, you know, didn't have that. But certainly the IC designers there, you know, I mean, they were great IC designers. But the common thing you see, and not just at AMD (I saw it at AMD, at Xilinx, I saw it at MMI, and I've seen it in data sheets from other companies), is chip designers will have a problem, and they'll push it out to the pins and say, the chip is perfect.
Right.
And they've left something that's now pushed out to the pins, right, which may be impossible to deal with. Right. I've certainly seen chips where, you know, literally there's stuff happening on the pins where there's no external way to make the chip work reliably. Right, right. Very sad. The products end up failing terribly, because everyone who tries to build with them either has to put enormous amounts of effort into working around the problem, or they don't spot the problem and they go into production, where 20% of the boards don't work, because the chip has this inherent bug and no amount of patch wires is going to make it suddenly better. Right, right. Interesting. Yeah, I think the analog that we have in the
software engineering world is you always want to work with product managers who
used to be engineers. That's frequently something we look for. Yeah, it's been a good rule to live by. Yeah. So anyway, the reason I talked about the fact that there were about 90 engineers in product planning at AMD was that there was a guiding light overall for AMD in that time frame, which was that AMD was building,
the products were building blocks, right?
So all of the bit slice was the building blocks
to build arbitrary size computers
from minicomputers and mini mainframes to mainframes.
The graphics chips were building block chips.
So there were things like chips that did texturing, chips that did the line rendering, right?
They were basically building blocks; different combinations might let you build different types of graphics systems.
And that permeated all of AMD's products in the 70s and the 80s. The guiding light within AMD's product planning group, which, you know, I was educated about from an early time after joining, was that the term was mechanisms, not policies. So create the mechanisms to let an end customer build the system he wants. Don't do something in the chip that sets a policy which he or she has no way of working around, or the chip is just not a good fit, because you've already decided on a policy that might not match what they need for their system, right? So when you look at all of
those types of products, that was kind of a guiding light. I mean, there were some products where,
you know, if it was a 2400 baud modem chip (AMD made a 2400 baud modem chip), then guess what, that one was policy from beginning to end, because there was a spec for what a 2400 baud modem chip does: these are the frequencies it has to transmit and receive on, these are the signal levels, here's where the serial comes in and out at RS-232 levels. Build a chip, right? Right. So there were some chips where, you know, policy was set by specs. But in general, it was that if you're doing building blocks, you build general functionality and flexibility, right?
And leave it to the end user to figure out
which combination of building blocks
is the best way to build his castle.
Right.
Awesome.
Well, like I said,
we're definitely going to have to have you back,
but this was an incredibly enjoyable conversation for me.
I hope you enjoyed it.
Awesome.
That's great to hear.
Well, Philip, thank you for joining us, and we'll look forward to next time having you on again.
Yeah.
Okay, Dan, thank you very, very much.