Microarch Club - 1: Philip Freidin
Episode Date: February 14, 2024

Philip Freidin joins to talk about developing a passion for electronics and computer architecture while growing up in Australia, getting started on the PDP-8, his grand plan to work on AMD bit-slice processors, and plenty more.

Philip on X: https://twitter.com/PhilipFreidin
Philip’s Site: http://www.fliptronics.com/

Show Notes
Welcome Philip Freidin (00:01:02)
Growing up in Australia (00:03:25)
Teletype Model 33 ASR (00:07:10)
https://en.wikipedia.org/wiki/Teletype_Model_33
Kilocore Ticks (00:09:15)
General Electric GE-235 (00:11:50)
https://en.wikipedia.org/wiki/GE-200_series
https://www.computerhistory.org/revolution/mainframe-computers/7/178/720
Learning Fortran and Algol (00:16:03)
https://en.wikipedia.org/wiki/Fortran
https://en.wikipedia.org/wiki/ALGOL
Peeling Back Abstractions (00:19:02)
Working on Hospital Electronics (00:19:51)
Making a Digital Clock at Age 14 (00:24:31)
DEC PDP-8 (00:26:26)
https://en.wikipedia.org/wiki/PDP-8
Why DEC Used the PDP Name (00:29:40)
https://en.wikipedia.org/wiki/Programmed_Data_Processor
Glass Teletypes (00:31:01)
Programming in FOCAL and Fortran (00:31:31)
https://en.wikipedia.org/wiki/FOCAL_(programming_language)
Linking and Loading with Paper Tape (00:33:27)
https://en.wikipedia.org/wiki/Punched_tape
DECtape (00:35:57)
https://en.wikipedia.org/wiki/DECtape
Designing a Floppy Disk Drive System for PDP-8 (00:37:01)
PDP-8 OMNIBUS Backplane (00:37:38)
https://gunkies.org/wiki/OMNIBUS
Software Support for Floppy Disk Drive (00:39:42)
OS/8 Operating System (00:40:26)
https://en.wikipedia.org/wiki/OS/8
DEC Manuals (00:43:53)
https://bitsavers.org/pdf/dec/
The Onion Model for Abstraction (00:45:21)
Understanding Computer Architecture (00:48:29)
Moving to the PDP-11 (00:52:31)
https://en.wikipedia.org/wiki/PDP-11
PDP-11/34 and Microcode (00:54:36)
https://gunkies.org/wiki/PDP-11/34
74181 ALU Chip (00:54:49)
https://en.wikipedia.org/wiki/74181
DEC VAX 11/780 (00:55:29)
https://gunkies.org/wiki/VAX-11/780
74182 Chip (00:57:55)
https://www.ti.com/lit/ds/symlink/sn54s182.pdf
Performance Optimization by Understanding Dependencies (01:00:01)
DSP and FPGAs (01:01:06)
https://en.wikipedia.org/wiki/Field-programmable_gate_array
https://en.wikipedia.org/wiki/Digital_signal_processing
FIR Filter (01:05:12)
https://en.wikipedia.org/wiki/Finite_impulse_response
TMS320 (01:06:16)
https://en.wikipedia.org/wiki/TMS320
Tradeoffs Between DSP Chips and FPGAs (01:11:46)
Applications of FIR Filters (01:13:38)
FPGAs in Communication Systems (01:15:28)
Optimization Starts with Algorithms (01:16:20)
Misuse of Floating Point (01:16:55)
https://en.wikipedia.org/wiki/Floating-point_unit
Joining AMD (01:18:57)
Bit Slice (01:19:53)
https://en.wikipedia.org/wiki/Bit_slicing
Intel 3002 (01:20:52)
https://www.cpu-zone.com/3002/intel3002.pdf
MMI 6701 (01:21:00)
https://www.cpushack.com/2011/03/31/cpu-of-the-day-mmi-6701-bit-slice/
AMD Am2901 (01:22:16)
https://www.righto.com/2020/04/inside-am2901-amds-1970s-bit-slice.html
Data General Eclipse MV/8000 (01:23:24)
https://en.wikipedia.org/wiki/Data_General_Eclipse_MV/8000
Mini Supercomputers (01:24:13)
https://en.wikipedia.org/wiki/Minisupercomputer
Designing first chip at age 12 (01:25:11)
RS Latch (01:28:03)
https://www.allaboutcircuits.com/textbook/digital/chpt-10/s-r-latch/
74LS279 (01:28:39)
https://www.ti.com/lit/ds/symlink/sn74ls279a.pdf
Learning about Bit Slice (01:30:00)
R&D Electronics (01:30:53)
Internal and External Applications Engineers (01:32:45)
Becoming Australia’s First Field Applications Engineer (01:36:11)
MMI Programmable Array Logic (PAL) (01:37:08)
https://en.wikipedia.org/wiki/Programmable_Array_Logic
Meeting the Bit Slice Designers (01:38:03)
S-100 Bus (01:39:01)
https://en.wikipedia.org/wiki/S-100_bus
Teaching at University (01:39:50)
Sending Resume to AMD (01:42:27)
AMD Interview (01:43:16)
Moving to the U.S. (01:45:40)
AMD’s Secret RISC CPU (01:46:19)
Am29000 (01:50:19)
https://en.wikipedia.org/wiki/AMD_Am29000
Why RISC over CISC? (01:51:38)
https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/risccisc/
Memory is free (01:52:40)
Compiler Optimizations (01:56:36)
Mapping Instructions to Opcodes (02:00:15)
RISC-V and Fixed-Position Operands (02:01:16)
CISC Became RISC (02:03:47)
Register Windows on Am29000 (02:05:22)
https://danielmangum.com/posts/retrospective-sparc-register-windows/
Texas Instruments TMS9900 (02:07:04)
https://en.wikipedia....
Transcript
Hey folks, Dan here. I'm excited to share the first interview on the MicroArch Club
podcast. Today I am joined by Philip Freidin. Philip has a great backstory, growing up in
Australia and getting involved in electronics, programming, and computer architecture at
a young age. He went on to work on a number of well-known products, such as the Am2900
family of bit-slice logic chips and the Am29000 RISC processor line at AMD, before moving to Xilinx to work on FPGAs.
We only covered a small fraction of the experience and wisdom that Philip has to offer,
so I hope to have him back again in the future. I also should extend a special thank you to Philip
for being willing to be the first guest on the podcast and to Jan Gray for initially connecting
us. With that, let's get into the conversation.
All right, well, Philip, welcome to the MicroArch Club podcast, and thank you for being the first guest.
Thank you very much for inviting me. Really looking forward to the discussion.
Absolutely. Absolutely. Well, definitely honored to have you here. I think we had an interesting introduction to one another.
I had written a blog post about lookup table RAM, or LUT RAM,
and had posted it on Twitter. And then we had a mutual connection who mentioned,
hey, maybe I should talk to you because of your background in the area. And that might be
something we're able to touch on in the interview today. But I always appreciate getting connected
to folks like yourself, just kind of randomly or by happenstance like that.
Yeah.
You're referring to Jan Gray?
That's correct.
Yeah.
I've known him for about 31 years.
Okay.
And we met while discussing register files and LUT RAMs and doing CPUs in FPGAs, which has been very much an active hobby for both of us.
Absolutely.
I've mostly come into contact with him through his current work; he now leads or co-leads the RISC-V soft CPU special interest group.
So I know he's doing some work on that front,
but he's definitely someone else who I'd like to have on the podcast in the future.
Oh, absolutely.
He's a very smart guy.
Certainly worth talking to.
Absolutely.
With regard to soft CPUs, we are both pretty much together the pioneers of doing that.
And it was actually through him asking some questions and then presenting his ideas that we discovered that almost concurrently
we had architected CPUs that looked extremely similar using Xilinx FPGAs.
That's awesome.
Well, maybe we'll get into that when we get through your story a little bit.
Going back to the beginning: I know when we were chatting off the air last week,
you mentioned that you didn't grow up in the United States, and that part of your plan for getting
into the industry, and you actually had a plan, right, was this process of moving to the US.
So I wonder if you could just take us back to growing up and your education,
and then what your plan was to enter the industry.
Yeah.
So I grew up in Melbourne, Australia.
And I was doing electronics and then computing from an extremely young age.
I probably started soldering components around eight years old with my mother's help, primarily by holding a screwdriver over the gas stove and getting it hot enough to melt solder, and then eventually someone bought me a soldering iron. You know, initially it was just making things like crystal radios and one- and two-transistor radios, that sort of stuff. Along the way... I don't know if you're aware of it, but there were these kits, like the 101 Projects kits that Radio Shack and other companies made. And Philips, which you've probably heard of, also used to make such kits. Mine was the Philips Electronic Engineer number eight kit, and so my earliest transistor constructions used that kit. The Philips kits probably weren't that prolific here in the US, because they really existed primarily in Europe, and Australia was probably more aligned with Europe than the US at the time. So here it would have been Radio Shack kits; in Australia, the UK, etc., it would have been these kits made by Philips.
The lasting memory is that they used these really strong springs, and for someone aged eight or nine years old, it really hurt my fingers putting them in. Basically, whereas the American kits had these fairly floppy cylindrical springs that you just pushed over to the side so you could slide something in, the Philips kits were built on a sheet of masonite with a one inch by one inch grid of holes. You pushed a pin from the underside, which couldn't go all the way through, and then you pushed this very strong cylindrical spring on from above, and if you pushed it down really hard, you'd have an exposed loop where you'd put your wires in, and the spring would then come up and hold them. But getting those springs in place, and pulling them off when you finished a project and wanted to do something different... I mean, it didn't quite bring your fingers to bleeding, but it was close. So that's what I remember most about the kit.
The school I went to, and we're jumping forward from that by about four years, so around age 12: there was a maths teacher there who got a computer, what's the right word,
a timeshare computer terminal.
So we're talking 1972.
That dates me a little bit.
This probably does it as well.
So around 1972, the school got a computer terminal.
And back in those days, computer terminals were something called an ASR-33,
which is also called a teletype.
So it transmits and receives at 10 or 11 characters a second.
And the mass storage available is called paper tape.
And so you would, with the ASR33 terminal not connected to the computer,
so you're not paying for services,
you would type in your program and punch out a paper tape.
And then when you'd finish creating your program, you'd then use a phone to dial in to the computer.
So this is a little bit like the bulletin boards of 20 years ago, in terms of dial-up type stuff, except the data rate was 110 baud, and the modem was the size of a PC mini tower case.
Okay.
Right, and all it did was 110 baud. And so what you'd do is quickly go through the login process, turn the paper tape reader on, and read in the tape that you'd typed. So it was, you know, banging away at 10 characters a second. And as soon as it finished reading the tape, you'd type RUN. The program would run for some amount of time, or it would loop forever or whatever,
and it would print out results at 10 characters a second.
And then unless there was a good reason,
you would then disconnect from the service. So it was how short a time can you stay connected?
And the way services back then were measured and charged
was with a unit you've never heard of before, called a kilo-core-tick.
Kilo was the amount of memory that your program was using, in kilobytes, so like five kilobytes or ten kilobytes. Core was core memory, and a tick was a second. So if you used 10K of memory for 50 seconds,
you'd be charged 500 kilo-core-ticks.
Right.
Common currency there.
Everyone's walking around with some kilo-core-ticks in their pocket.
Right.
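The kilo-core-tick unit described above is just memory footprint multiplied by connect time; here's a minimal sketch of the billing arithmetic (the function name and units are mine, not Honeywell's):

```python
def kilo_core_ticks(memory_kilobytes: int, seconds: int) -> int:
    """Charge described above: kilobytes of core held, times seconds of use."""
    return memory_kilobytes * seconds

# 10K of memory for 50 seconds:
print(kilo_core_ticks(10, 50))  # 500 kilo-core-ticks
```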
So anyway...
What kind of operations were you doing on the computer at that time?
So it was mostly math problems.
So this is, let's see, this would have been approximately eighth grade.
Okay.
In Australia, that's called Form 2.
The naming of this stuff is quite different in Australia from here. In Australia it's primary school, years one through six, and then secondary school is Form One through to Form Six. So it's still 12 years of schooling, but for secondary school it's just Form One, Two, Three, Four, Five, and Six. So it would have been Form Two. And we were doing some calculus-type stuff, some linear programming-type stuff, some simple algebra, some things like finding minima of functions, that sort of thing, and maybe writing some computer games as well. But because of the budget, which was for the whole school, you really couldn't get too carried away with what you were doing with this one terminal, shared across probably 60 or 70 students, although most of them weren't into it.
But, you know, there was a small group of us who were. For me, I needed much more. And so
it turns out the service was provided initially by General Electric, and then by Honeywell.
So the computer was actually a General Electric 235, just called GE 235.
But they sold that division to Honeywell.
And so it started off being GE Timeshare, then it became Honeywell Timeshare.
The computer center was located in central Melbourne
and that for me was a half hour trip on public transport and so after school one day I headed
into the city and found the GE building and went in there.
And, you know, I'm like a 12-year-old, right?
Right.
And I surprisingly know something about this service.
So I came up to the reception desk and I said, I've been using your service, but I really want to see the computer that we're connected to. Is there some way I can get to see it?
And the receptionist, when I came in, she was chatting with what turned out to be one of the computer system technicians.
Okay.
And so she handed me off to the technician.
And whether because he was interested in helping me
or maybe to impress the receptionist,
either of which is possible,
he took me in the elevator up to the computer center.
And there was a big glass window
where you could look in at all these machines and the people in white coats loading and unloading listings from the line printer, mag tapes, storage. Like, back in those days a disk drive would be a pack that was maybe 14 inches in diameter and maybe six to eight inches high, and that might be 20 megabytes.
Wow.
And that's removable media, right?
Right. So there was a plastic cake tin, as I refer to it, over the top, and a cover at the bottom.
And there was a way to hold the handle in the top, rotate a latch on the underside, which would remove the bottom.
You then place it into the disk drive, which looked like a washing machine.
Right. And then again, an appropriate twist of the wrist, lock the pack into the drive and you took the cake tin off.
And then the heads then came in from the side to access 20 megabytes of data.
I mean, I have text files that are a thousand times the size of that now.
So anyway, a big glass window to show me, you know, there's the computer.
And he pointed to different things and told me what they were,
and I asked appropriately intelligent questions.
So then it turns out that, as someone
who actually did maintenance on the machine,
he had an unlimited account, right?
No kilo-core-tick limits for him.
Right. I bet you were excited to hear this at 12 years old.
Yeah, it was like, wow. So it turns out Honeywell Timeshare had a training room with 10 teletypes and a desk and a big blackboard at the front. And so they did training for corporate customers. And he said, you know, let me set you up with some stuff you probably can't do from the school terminal. And it was some game programs, right? Nothing particularly exciting, and, you know, still at 10 characters a second.
Right.
So he left me there to play for, I don't know, half an hour or so. When he came back, I said, I've played enough, right? How do I get to really learn more about using it? And he said, well, for our corporate customers we have training programs. And so instead of just programming in BASIC, you could, if you wanted to spend the time, learn Fortran and ALGOL. And I was all for it. And so that started what was probably six months of me going into Honeywell Timeshare,
Wow.
three or four nights a week, getting there usually around four or 5 p.m. and staying till about 10 p.m.
And as I said, I did it for about six months, until my parents blew the whistle on me
because my school grades had fallen off a cliff.
But I was now writing three or four hundred line Fortran and ALGOL programs at age 12.
Oh, and BASIC as well, right? So basically this Honeywell Timeshare technician, I don't know his name, and I don't know how I would ever thank him, but he gave me a resource that no other 12-year-old in Australia would have had.
Right, right.
It was just phenomenal. And so there I am, and I've already learned... The only languages I didn't learn, because that machine didn't have them, were COBOL and assembler. So I hadn't yet done assembler.
What instruction set architecture was that machine?
I do not know.
I'll look it up and put it in the show notes.
Yeah, it almost certainly does get a reference in Wikipedia.
Okay. And there were several machines.
It was GE215, 235, and 245.
And the machine we had was a 235.
And, you know, these machines were 19-inch cabinets, maybe 16 to 20 of them in a row, where something like 4K of memory was a 19-inch rack: six foot high, three foot deep, and 25 inches wide, or whatever a 19-inch rack is by the time you put the wrapper on it.
So anyway, that was my introduction to programming, learning multiple languages, and already getting to do some comparative work.
The damage that did to my educational path was pretty severe and lasted me through to the end of school, because there was a lot of stuff that I just didn't learn. On the other hand, what I did learn has lasted me a lifetime. But, for instance, I'd be lost without a spell checker, because that was one of the things that went by the wayside.
Well, that's so interesting.
You know, in a lot of ways it's a more physical, real-world example of diving through the abstraction layers, right? You're at the end of this teletype, and you have to physically take public transportation to go see the thing. And it's interesting to compare that to experiences today. One of the things I'm trying to do with this podcast, for example, is pull back some of those abstraction layers and see what happens in the machine. But for me, a lot of times that looks like going to YouTube and watching an interview, or listening to another podcast.
Well, or listening to a graybeard on YouTube telling you this type of story.
Right.
Yeah, and I didn't mention that, of course,
that technician got me the appropriate credentials,
no idea how,
so that I didn't have to look for him every evening when I came in to play on the system. Basically, no one in management either knew about it or cared about it, but it was just a wonderful learning experience for me.
Yeah. So a few years later, around 1974, I started working on, well, playing with, playing and some work on a PDP-8.
So the DEC computers really dominated my early hands-on computing. So how did this happen? I was working a summer job. In Australia, the end of the school year is, you know, the end of November, and then we have like eight to ten weeks off, which covers the end of the school year and Christmas, and it's the middle of summer in Australia. And so we can get pretty long summer jobs, right?
Right.
Significantly longer than what you can get here in the U.S.
And so I got a summer job working in the electronics department of a hospital, right?
Interesting.
And this was arranged by my father, who was the head surgeon at that hospital. And he put a nice word into the head
of the electronics department. And so I don't know if they do it now, but at least back in those days,
hospitals had, as I said, an electronics department with maybe four or five electronic engineers, and primarily they were doing equipment repair. But they also did some design work, because the large hospitals, and this was a large hospital,
tended to be linked to a university.
And so you'd have doctors who taught at the university,
practiced at the hospital, and might be doing some research.
And if that research needed special instrumentation,
then the hospital had an electronics department who often helped along.
So I got this 10-week or so Christmas holiday job,
and the head of the department had me primarily doing,
go pick up this ECG machine on the eighth floor of the hospital and bring it back for repair,
or go see if they, you know, forgot to plug it in. There were always these jokes of nurses, you know, failing to plug in something, or getting terminals around the wrong way. Once there was some piece of equipment that was somewhat portable, in a box about that big by that big.
Whoops.
Right.
And someone said, disinfect it.
And what they probably meant was get some appropriate cleaning agent and wipe down the front panel or whatever. They put it into a bucket of disinfectant.
Right.
So, you know, sometimes some things would be in for repair. Anyway, because they had design capability,
the head of the department said, you know, as well as doing this repair stuff... Once I'd actually shown that I understood what was going on, I was actually doing repair of equipment. Some of it was fairly simple, you know, things like broken battery terminals, or wires to a potentiometer, or a frayed terminal to an electrode for an ECG machine. Some of it was simple, but some of it was bug tracking.
And he said, you know, we have the parts, why don't you make a digital clock? Right. And back then, you know, it was 7400 series ICs and Nixie tubes. And so I made a four-digit clock, and he let me just basically work through the data book, and gave me guidance when I asked for it. So as well as the experience of working in a real engineering department, and this is, so now I'm, I guess, around 14, I actually got to design a digital clock. And back then, you were having to put a lot of components together, right, to actually build different modulo counters, for 59 rolling over to 00, and 12 rolling over to 1, that sort of stuff. So I ended up with a digital clock, and I ended up with some experience. But as I said, this was a teaching hospital, and one of the departments that I served had a PDP-8 computer.
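The modulo counters he describes, 59 rolling over to 00 and 12 rolling over to 1, can be sketched in a few lines of software; this is only an illustration of the counting behavior, not the 7400-series hardware design:

```python
def tick(h: int, m: int, s: int):
    """Advance a 12-hour clock by one second using cascaded modulo counters."""
    s = (s + 1) % 60          # seconds counter: 59 rolls over to 00
    if s == 0:
        m = (m + 1) % 60      # carry into minutes: 59 rolls over to 00
        if m == 0:
            h = h % 12 + 1    # hours count 1..12, so 12 rolls over to 1
    return h, m, s

print(tick(11, 59, 59))  # (12, 0, 0)
print(tick(12, 59, 59))  # (1, 0, 0)
```

In the hardware version each modulo stage is a counter whose carry-out clocks the next stage, which is roughly why it took a lot of parts.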
And so I requested the ability to use it in the evenings when no one was using it.
And the medical researcher knew me not by my correct name, Philip Freidin; to him I was Joe Freidin's son, as in, my father's name is Joe, and I'm Joe Freidin's son.
Fair enough, fair enough.
And that carried insane cred in that hospital.
Right.
So I was allowed to play with the PDP-8. Now, the PDP-8 in this case had 8K of 12-bit memory, and it executed what today we would call pretty much a RISC instruction set. The PDP-8 family dates back to the mid-'60s and extends through to the end of the 1970s, across six or seven different models of machine, and the 8/E was probably by far the most prolific machine DEC ever sold. It was priced down around 10 to 12K.
It had an interpreted language similar to BASIC called FOCAL, which looks not that different from modem line noise: a little bit more readable than APL and a little bit less readable than BASIC. But it ran FOCAL. And so you'd have... well, actually, let me think. The interpreter was actually 2K, 2K words. And so on a 4K PDP-8, which is the smallest configuration, you've got 2K words left, which was approximately 4K half-words of six bits each. If you don't need lower case, because your teletype only does upper case, you can pack two characters per 12-bit word, and you put three 12-bit words together for floating point.
So for integers, it was 12-bit integers, so roughly plus or minus 2,000: minus 2048 through plus 2047. But floating point was a 12-bit exponent and a 24-bit mantissa, so its range was actually better than IEEE single precision floating point, which is 32 bits split as 8 and 24. There's four more bits of exponent, and so it actually had better range than IEEE, which didn't exist back then; that didn't come along for another 15 or 20 years. But you had this PDP-8 with only 4K of memory, and it had very good floating point, and you could write programs that fitted into that 2K words of memory,
including both your program and any array data.
And the PDP-8s could go up to a maximum of 32K.
Gotcha.
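Packing two upper-case-only characters per 12-bit word, as he describes, is just shifts and masks; a sketch (the specific six-bit codes here are stand-ins, not DEC's actual SIXBIT table):

```python
def pack(c1: int, c2: int) -> int:
    """Pack two 6-bit character codes into one 12-bit word."""
    assert 0 <= c1 < 64 and 0 <= c2 < 64
    return (c1 << 6) | c2

def unpack(word: int):
    """Recover the two 6-bit codes from a 12-bit word."""
    return (word >> 6) & 0o77, word & 0o77

word = pack(0o01, 0o02)  # two hypothetical character codes
print(oct(word))         # 0o102
print(unpack(word))      # (1, 2)
```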
And yeah, the PDP-8 has come up a few times in chatting with folks. And why did DEC call their machines PDPs rather than computers?
Right, they weren't called computers. They were called Programmed Data Processors. Customers often had a computer department that was in charge of buying all computers and keeping them behind big glass-walled rooms, with the high priests in their white coats. And so for researchers who wanted to get a computer in their lab, they could order it as a PDP.
Right.
And the computer department...
Didn't go through the red tape.
Right, bypass all the red tape.
So, and they were also very prolific in universities
because DEC made some large systems.
The PDP-10 and the PDP-20 were campus-sized machines
that might have 100 terminals connected to them.
So the terminals were still these ASR-33s, right?
These dominated the 70s.
They were eventually supplanted by what they called glass teletypes,
which we would now refer to as dumb terminals.
Okay.
Right, so 24 lines of 80 characters and a keyboard, and probably, depending on when you did this, there may still be a paper tape reader and punch associated with it.
Gotcha. Okay.
Yeah. So anyway, there was a department there that had a PDP-8,
and they didn't have anyone to program it.
And I said, I can program.
And so one of the researchers who had got the computer for his department
but hadn't managed to get the funding to get a programmer,
had me doing programming.
And so I was writing simple data acquisition programs,
absolute real-time stuff written either in Focal or in Fortran 2.
That's a much earlier version of Fortran than,
I mean, most people who learn Fortran learn Fortran 4
or something more recent,
but there was an earlier version called Fortran 2
that was sufficient for what we were doing.
And so this was programming on the PDP-8,
still everything with paper tapes,
but now with a Fortran compiler,
and that meant the Fortran compiler was on a paper tape.
So you'd load the Fortran compiler from a paper tape,
then load your program,
and it would then read it in. And you'd read in the paper tape several times, because it would first of all build a symbol table, then read it again. And so, you know, compiling might take two or three hours of feeding in, first of all, the Fortran compiler, then loading the linker, and then loading the libraries, and then loading the intermediate binary. And then, at the very end of three or four hours, you'd get a new paper tape, which is the binary version of what was your Fortran program.
Right. And were you, like, the linker and the libraries, you're manually loading these into the machine?
Mm-hmm.
Yeah.
So it's interesting.
It's like steps we still take today, right,
to compile and link a program,
but you're physically participating in the process.
Right, right.
In fact, so for the Fortran compiler,
it was like a two or three pass compiler.
And so you'd load, so the first paper tape you'd load would be Fortran compiler pass one,
and then put in your paper tape. It would build a symbol table in memory. You would then load in Fortran compiler pass two, which would overwrite pass one,
but it would now have the data left over from pass one.
Then you read in your program source a second time, right?
And now it's starting to build, I guess, the data structures and the code tree, and figuring out branch distances and that sort of stuff.
And then there was a third pass.
Oh, the third pass output was the one that took whatever data structures had been built and turned them into a relocatable binary, right? So now you've got a relocatable binary version of your program,
but no libraries yet.
Right.
Then you load the linker. And if you wrote your program in multiple modules, you'd have a relocatable binary paper tape for each one. And so you'd load in the linker program, then you'd load in all of your modules, and it would link between them wherever symbols matched between your different modules. And then it would see, here's the list of symbols I don't yet know, and it would tell you which library tapes it needed to merge into your program. So things like the floating point library, the sine and cosine library, that sort of thing.
Right. And then at the end, it would do all of that and compile a statically compiled binary? There's no, like, runtime dependencies?
Yeah. And it was a bare metal thing; there's no operating system in this environment.
So, yeah, in the paper tape version of this stuff,
there's no operating system.
Gotcha.
Okay.
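The pass structure he walks through, first building a symbol table and then re-reading the source to resolve references, is the same trick assemblers and linkers still use; here's a toy sketch of the idea, nothing like DEC's actual tools:

```python
# Toy two-pass resolver: pass 1 records where each label lands,
# pass 2 emits code with every reference patched to a real address.
program = [
    ("JMP", "end"),   # forward reference, unknown on the first read
    ("ADD", None),
    ("label", "end"),
    ("HLT", None),
]

symbols = {}
addr = 0
for op, arg in program:            # pass 1: build the symbol table
    if op == "label":
        symbols[arg] = addr
    else:
        addr += 1

binary = []
for op, arg in program:            # pass 2: resolve references
    if op != "label":
        binary.append((op, symbols.get(arg, arg)))

print(binary)  # [('JMP', 2), ('ADD', None), ('HLT', None)]
```

With paper tape, each "read of the list" was literally another feed of the source tape through the reader.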
At some point, you can add mass storage. DEC's primary thing was a tape drive that had a one-inch wide, or three-quarter-inch wide, tape called DECtape, D-E-C-T-A-P-E, and it was a randomly accessed tape that was pre-formatted into blocks. And so there was an operating system at the beginning of the tape, and you could have multiple files, and the BASIC compiler and the Fortran compiler lived on tape. And so all of that paper tape stuff went away, right, if you could afford to buy that operating system and have the mag tape drives.
Right. So a lot of tape spinning around, right?
Right. These are on little hubs that are only about four and a half, five inches in diameter, and the tape just goes off a reel, over the head,
and down onto the other reel.
Right.
So anyway, I managed to talk the researcher that I was doing work for into finding the money to let me design a floppy disk drive system. And so this was my first big hardware design, and that was to design a board to plug into the PDP-8 that could talk to a disk drive. And the disk drive was from Memorex. It used 8-inch floppies and could hold about a quarter of a megabyte.
And what was the interface that you'd be plugging into on the PDP-8?
Okay, so the PDP-8 has a backplane called the OMNIBUS. And actually, probably in terms of a sea change in how computers were built, I believe the PDP-8 was the first bus-based computer,
Okay.
where any slot in the backplane could take any card. Everything prior to that, every place that you plugged a card in, there was only one card that went into that position, right? But the PDP-8 used these boards that are, I don't know, about 10 inches wide and 8 inches high, and a single board was the clock, another one was the registers, another one was the ALU. Three of them together was 4K of memory: the middle board was the actual core plane, one of the boards was the read/write amplifiers, and the other one was the bus interface.
Gotcha.
The model I'm talking about is the PDP-8/E, and the bus basically ticked away at about 1.2 or 1.4 microseconds per cycle. Typical instructions took three clock cycles, so around four or five microseconds per instruction, so around 200K operations a second.
And so, you know, it's 12 bits of address, 12 bits of data,
and a bunch of timing signals that tell you whether you're doing reads or writes, etc.
And so not only did I have to design the board,
I had to write the device driver,
And then I had to get the operating system onto the floppy for the very first time. And the operating system was distributed
as about 20 paper tapes. So now, instead of using the Fortran compiler, I'm using the assembler, right?
And the assembler had similar multiple passes to do an assembly and a linking process, to end up with basically the binary version of my device driver, right? Right. And so the operating system was called OS/8, and there are characteristics of OS/8
that you will still see today in the DOS boxes on a Windows 11 PC, right? The syntax on
that command line all goes back to OS/8. So everything in OS/8: when you get to the PDP-11s, it was something called
RT-11. When you get into PCs, it was called DOS, MS-DOS, right? They all use the same basic
command line syntax. And, you know, that still exists today in the DOS box or the command box that you see
on your latest Windows computers. I mean, there are differences, but it basically looks very, very
similar. Right, right. So you're working on this disk drive, right? And... yeah, go ahead. And then I had to load the operating system. And DEC built
the operating system to be distributed as paper tapes, knowing that some people would have their
own device drivers rather than DEC's. And so there was a build phase for the OS/8 operating system
where you aren't actually running the operating system; you're running a program called BUILD, and you're telling
it which devices you want, right, and which services you want OS/8 to provide,
and you get to build a custom image of the operating system. And the very last
stage is you discover if the device driver you wrote actually works, right,
which is when it now has to write it to the media.
Right.
And if that fails, of course, you know, it's back to the assembler.
Right.
To figure out what went wrong.
So, you know, all these machines had front panels
with toggle switches and LEDs, right?
And, you know, you might want to insert an image.
There are lots of PDP-8 images available on the web that you could insert a pop-up of, right?
So, you know, basically you can single step through a program, examine any memory location,
change the value of any memory location,
all from the front panel.
So, you know, you can certainly just load the device driver, right,
and then create a little five-line program that calls it
and tells it to transfer 128 bytes from this place in memory
to that place on disk, right?
Right. Right. So anyway, this is me, aged 14, writing operating system device
drivers and then building operating systems. Right, right. And so this was now my path,
really getting into the lower levels, not just of writing assembler code,
but of understanding operating system principles.
And again, there weren't any other kids around
that were doing the same stuff as what I was doing.
Right. I can imagine it was fairly unique.
Yeah.
So, again, a great education. I would say the manuals that Digital Equipment made available for the PDP-8, and then later for the PDP-11...
they had some introductory programming books that started at the very low levels of assembler that were absolutely excellent,
for someone who had no detailed knowledge or university training. Their books are all available on Bitsavers. Right. Um, you know, I mean, it's kind of historical and, you know, not really relevant, but if
you want to have a read, there's a book that came out in 1969 called Introduction to Programming,
by Digital Equipment, for the PDP-8. And it takes you through assembler, and explains... they didn't call them pointers back then,
but indirect addresses and memory references, and a superb coverage of two's complement arithmetic,
and, you know, just understanding at the very lowest level. You mentioned sort of levels of abstraction. This has been very much something that I've thought a lot about, from back in those days, because
already I could see it. It's layers upon layers. And I've strongly held that the quality of the work
that you do at a given level of abstraction
is improved by understanding the level below
that you're not using,
and maybe the level above that you're providing a service to, right?
Right, yeah. So I heard a quote pretty recently that, you know, when programming, a lot of times
we try to be very respectful of the abstraction boundaries on either side, right, because that's
what allows things to interface well together. But from the programmer perspective, right, it's our job to actually venture across those boundaries
to understand them, to understand what happens when we interface with them. So I think that's a
key thing. Absolutely. In fact, really, there are two things that come from
understanding the level below where you're working. So let's say you're writing a C program or a BASIC program.
Well, actually, C and BASIC are different enough, right?
Because C gets compiled to Assembler.
Right.
BASIC, at least traditionally, is interpreted.
So there's another layer for BASIC that doesn't exist.
Well, it's a different layer. For C, it's the compiler; for BASIC, it's the interpreter. But then eventually you're executing
assembler to implement the desired effect, right? The better you understand that layer below,
whether it's BASIC or C, that is, the assembler level, the better you write your code, right?
When you're writing C code, if you know how it's likely to be compiled, right, there's a reason
that you might write A equals B left shift one, as opposed to A equals B plus B, right, or A equals B times two. I mean, they all achieve the
same thing, but if you don't have a lot of optimization going on, the A equals B left shift
one is by far the fastest, right? Right. Unless the machine happens to not optimize the execution time for that.
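The three equivalent forms sketched in C (illustrative only; function names are made up, and on a modern optimizing compiler all three typically produce the same instruction):

```c
#include <stdint.h>

/* Three ways to double a value. With little or no optimization, the
   shift often compiles to the cheapest instruction; modern optimizers
   usually compile all three identically. */
uint32_t double_by_shift(uint32_t b) { return b << 1; }
uint32_t double_by_add(uint32_t b)   { return b + b; }
uint32_t double_by_mul(uint32_t b)   { return b * 2; }
```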
Right. You know, in my experience, that kind of understanding... let's say this is
not always the whole story, right? There are many things that impact performance at the physical level. But
the number of instructions, right, that your code is going to be translated to, that's kind of
like the first level of understanding how, you know, the higher-level code you write is going to be running.
But then even beyond that, right, understanding, and it varies in complexity across different
instruction sets, but what's the performance impact of executing a given instruction, right?
And that goes into how the processor is designed and that
sort of thing. And it seems like you ventured down, kind of, you know, you kept going into
that level of understanding that. What was kind of like the impetus? Was it just continuing to
peel back the layers of the onion? Absolutely. Well, I mean, designing a disk controller,
having never done a tough design before... and back then, there weren't
disk controller chips, right? It was, you know, probably 60 or 70 TTL chips, right,
connecting it all. Right. So I had to learn bus timing and, you know, how the machine executed instructions. And, oh, that was the other thing: DEC provided full service manuals for
these machines, and detailed schematics and theory of operation. And so it was expected that some
percentage of customers would do all the maintenance of their computers, down to the chip level. Wow.
Right?
And so whereas IBM kind of did the exact opposite, right?
You couldn't buy an IBM, you had to lease it, and they sent out service people, right,
who maintained them. And that was true for most of the mainframe companies. For DEC, and then the other similar companies, Data General
and Naked Mini and a few others, they all provided enough service information that a
competent EE could actually maintain these machines.
So I actually did repair work on the PDP-8 as well.
And so that meant I was getting into the partitioning of the computer
into different sections and actually like,
hey, it doesn't add correctly anymore.
And the way you would nail it down to that specific thing is: DEC made diagnostic programs
on paper tape, right? And you would load them into memory, and if memory worked, the program would then
run, and maybe indicate, you know, hey, I'm trying to do this operation, it doesn't work.
And then what you could do is write little five- or ten-line programs and button them in from
the front panel, after you hand-assemble them, and then exercise a specific instruction. Say, okay, I've
got this operand here and this one here, and it's doing this instruction, and the result is wrong,
right? Right. And you stick that in a loop, and then you go in with an oscilloscope and you trace out the circuit while it's cycling
that one faulty instruction, right, or that one failing memory address or whatever.
Right, right. So, yeah, so anyway, we are really miles away from what you wanted to interview me about.
Oh no, I think you've actually intuitively
gone the direction that I'd like to. But yeah, let's keep going through your
journey. Fine. Okay, fine. Well, do you want me to keep it at this level of detail and pace?
Sure, that's great. I'll go as long as you'll stay on. So, well, we may be here for a
few hours before we get to lots... Yeah, well, we'll have to do a part
two if that's the case. Well, actually, you can just carve... you know, if we go too long, make it into two
episodes or whatever. Right. Okay. So I did, basically... so this started off
as a summer job, but I ended up doing, again, after-school work for this researcher.
And so for about two years, I was doing programming in the evening and building a disk drive controller and, you know, learning about the internals of PDP-8s.
And that continued up until around 1976, so two years later, when that researcher changed
from one hospital to another. And I followed him and he said, what are we going to get?
Are we going to get a PDP-8, or do you want to try one of these newfangled PDP-11s?
And the PDP-8 was what I knew.
And I said, well, I really know this one, but if you don't mind me taking time coming up to speed, let's go for a PDP-11.
Now, the PDP-11s had been around all through this time, but we couldn't, well, the doctor couldn't afford them.
But now we had more research money, and so we got probably one of the first PDP-11/34s. So there were four or five models before this,
and I won't go through the models, but they basically all execute almost the same instruction set.
That is, there were instruction forms that you would never normally write, right,
but they would execute differently. And so there was a set of different little tests you could do to figure out what machine you were running on.
Ah, I see.
Right.
Okay.
So I think the earliest machines were just a strict pile-of-gates type: decode the instruction and run it. By the time you get to the PDP-11/34,
it was a microcoded machine. Okay. Okay, so microcoded machines... and this is actually worth going into, because
this actually moves into a lot of other stuff. In that time frame, so from the late 60s onwards, there was a chip made
initially by Fairchild called the 74181. And it is a 4-bit ALU with an internal 4-bit carry chain,
and carry-in and carry-out, that let you put multiple of these 74181s side by side
to build arbitrary-width arithmetic paths.
And so for a PDP-8,
there would be three of those chips side by side.
For a PDP-11, there'd be four of them.
And when eventually you get to the VAX-11/780,
which was DEC's first 32-bit computer, there are eight of them, right? And so that's an ALU-only
chip. But it also introduced something special. So basically, the 7400 series is mostly just building-block chips, but there
were a few very function-specific parts. And so one of the things the 74181 ALU chip did was it
didn't just handle carry, carry-in and carry-out. It also had additional pins called generate and propagate.
Okay.
And generate and propagate are a fast analysis of the operands and the carry logic that bypasses the process of rippling the carry from one bit to the next, right? It
doesn't create an answer, but broadside, very, very quickly, it can say, for the four-bit
operation that that chip is doing: if there's a carry-in, it always will generate a carry-out. Ah, okay. So, sorry, regardless of the carry-in, right, it'll
always generate a carry-out. So that's a generate, right? So the generate pin of a 181 says: I know
I'm going to generate a carry, no matter what the carry-in is, right? The propagate signal coming out of a 74181 says: I will propagate a carry if there's a carry-in.
Gotcha. So that means, outside of the 181, you can look at the signal you're providing to the
carry-in pin, and the propagate signal and the generate signal, which are calculated independently of actually calculating the add or the subtract, right?
And you can pre-calculate, or concurrently calculate, the carry into the next
four bits, right? Right. So that lets you get a very fast carry across four bits. Right.
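The generate/propagate idea can be sketched behaviorally in C (a sketch only; the real 74181's pin polarities and logic levels differ, and the names here are made up):

```c
#include <stdint.h>
#include <stdbool.h>

/* Group generate/propagate for a 4-bit slice, in the spirit of the
   74181's G and P outputs (behavioral sketch; real-chip polarities differ). */
void slice_gp(uint8_t a, uint8_t b, bool *g, bool *p) {
    bool gen = false;   /* slice generates a carry regardless of carry-in? */
    bool prop = true;   /* slice propagates a carry-in to carry-out?       */
    for (int i = 0; i < 4; i++) {
        bool gi = (a >> i) & (b >> i) & 1;    /* bit i generates a carry  */
        bool pi = ((a >> i) | (b >> i)) & 1;  /* bit i propagates a carry */
        gen = gi || (pi && gen);
        prop = prop && pi;
    }
    *g = gen;
    *p = prop;
}

/* Carry out of the slice from G, P, and carry-in alone: no ripple needed. */
bool slice_carry_out(bool g, bool p, bool cin) {
    return g || (p && cin);
}
```

A 74182-style lookahead unit combines four such (G, P) pairs the same way, one level up the tree.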
If you're putting four of those chips together,
Fairchild designed a part called the 74182,
and it had generate and propagate inputs that would come from all four of those chips.
It would look at the carry-in at the very beginning of a 16-bit thing, and it
would generate a generate and a propagate across all 16 bits. Okay. Right, very fast. Right. And so,
depending on the size of the processor you're building, you built a tree of these 182 chips,
right? So the 181s are doing the arithmetic, having to deal with the actual carry ripples
going through the arithmetic path, right? But you've got these
generate/propagate signals that are predicting right ahead, right? So the domain of adding two numbers in two's complement, or one's complement
for that matter, has been an area of research for decades, right? And so if you go and look at how you
add two numbers together, right, there's the simplest stuff at one end, which is: you add the first two
bits, and you look at whether there's a carry into the next position, right?
Right, right. And then you do the next two bits, and then the next two bits, and the next...
you know, two... well, one bit, two operands, right? Right. And you keep doing that, right? Right. Then you have
things like the generate/propagate tree. But that isn't the fastest way to go.
There are other things.
There's things called carry select adders,
where you calculate both answers concurrently,
one with carry in and one without carry in,
and then you use the carry in signal to drive a MUX
to select between the two answers.
Right.
And you still use those generate/propagate 182s
to control your carry-select muxes.
Right.
It has the overhead of muxes.
It has the bonus that it's faster.
Right.
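A behavioral sketch of the carry-select idea in C, on an 8-bit add split into two 4-bit halves (the name and width are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* Carry-select for the upper half of an 8-bit add: compute both possible
   upper sums (carry-in 0 and carry-in 1), then let the real carry out of
   the low half drive the mux. In hardware all three adds run in parallel;
   here they are sequential, so only the structure is illustrated. */
uint8_t carry_select_add8(uint8_t a, uint8_t b) {
    uint8_t lo  = (a & 0xF) + (b & 0xF);    /* low nibble plus its carry */
    bool carry  = lo >> 4;
    uint8_t hi0 = (a >> 4) + (b >> 4);      /* upper sum if carry-in = 0 */
    uint8_t hi1 = (a >> 4) + (b >> 4) + 1;  /* upper sum if carry-in = 1 */
    uint8_t hi  = carry ? hi1 : hi0;        /* the mux */
    return (uint8_t)((hi << 4) | (lo & 0xF));
}
```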
I love examples, even if they are, you know,
not the ultimate optimization of something, like the generate and propagate,
because they make it abundantly clear that a big part of performance improvement is understanding what dependencies
do exist and what things are false dependencies, right? And if you can look at something that's a
false dependency and say, oh, we can actually do this at the same time, then sometimes you get an
incremental speed-up, sometimes you get big performance changes. Right, yeah. So, in fact, as your adders get bigger,
this generate propagate stuff actually improves performance
on an exponential curve, right?
Because you end up... I mean, if you were doing a 64-bit add,
you're doing it in the same time that it takes a five-bit add to occur, right? Right. Because most
of the stuff that takes time is now the generate/propagates, because all the other
stuff runs sort of concurrently. Right, right. So we actually see some of this in DSP, for why FPGAs dominate the very high end of the DSP marketplace.
So I'm going to take a little side trip to expand on this thought.
So one of the most common DSP functions that is insanely computationally heavy is digital filters
And... but they're simple, right? They're simple. You take a vector of coefficients and a vector of
data, and you multiply each element of the data vector by the appropriate coefficient, right?
And so it's actually... the term is sum of products.
You may have heard that term.
So the product is the multiply of the coefficient by the operand,
repeated for N different elements.
Right. And DSP chips have vector instructions where you point at the coefficient array and the data array, which
might have come in from an A to D converter, or it might be a scan line from a video image
or whatever.
And then it, as fast as possible, fetches two operands at once, multiplies them together,
and adds the result.
So if you start with two 16-bit operands, you multiply them together, the worst case
is a 32-bit result.
If you add two of those together, the worst case thing is 33 bits, right?
If both of those 32-bit numbers were near max, right, you'll end up with a 33-bit result.
So you need one extra bit if
you're only adding two of them. If you add four of them, you need one more bit. If you add eight of
them, you need one more bit. 16, 32... so depending on how many elements you're adding together, you need more
bits in that accumulator. Right, right. So now we look at: how big should that accumulator be?
And that, in fact, will limit how big the array can be before you have to do something special, right?
Right. So if you're going to, let's say... well, multiply 4,096 operands with 4,096 coefficients and add them,
if that's what you want to do,
then your accumulator needs 12 extra bits.
So it will be 32 bits plus 12.
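The accumulator-width rule described here can be written down directly (a sketch; the function name is made up):

```c
/* Accumulator width needed to sum n products of two w-bit operands
   without overflow: the product is 2*w bits worst case, and summing
   n of them adds ceil(log2(n)) guard bits. */
int accumulator_bits(int operand_bits, int n_terms) {
    int guard = 0;
    while ((1 << guard) < n_terms)  /* guard = ceil(log2(n_terms)) */
        guard++;
    return 2 * operand_bits + guard;
}
```

For 16-bit operands and 4,096 terms this gives 32 + 12 = 44 bits, matching the example above.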
And if you're doing floating point
as a way to avoid needing those extra bits,
then you have to consider whether you're really doing the arithmetic you think you're doing,
because you're going to start throwing away all your low-order bits: whether you add them
or multiply them or whatever, they're going to get pushed out by all the other arithmetic that you're doing.
Right.
Right?
Right.
So there's a balance of when does...
Well, first of all, floating point
typically is massively slower than integer.
Mm-hmm.
Right?
Typically.
Except for some chips
where they just throw an insane amount of hardware at it
and they bring floating point
not up to the performance of pure integer, but maybe to a factor of three to five times slower.
Right. It depends on the processor and how much silicon you want to throw at it. Okay. So that's
how DSP is done in a DSP processor. And the common task which I've described to you
has a name.
It's called an FIR filter, right?
And it's finite impulse response.
And I could go into what that means, but it's irrelevant to this discussion.
What's important is that if I have a, let's say I have a data vector of 1,024 elements, right,
then I've got to do 1,024 multiplies and 1,023 adds.
Right.
Right?
And so that's going to take me... even if I overlap the adds with the multiplies,
and the multiply-accumulate section
always does that, right?
You're up for approximately 1,024 system cycles.
Right.
Plus, you've got to fetch 2,048 things out of memory.
Right.
Right.
And so if you look at DSP chips,
you will find that different regions of memory can be accessed concurrently. The on-chip memory, which might be a quarter of a megabyte, is broken up into four blocks, right, of 128K,
and the processor can fetch four operands in one clock.
Right, right.
Right?
So that lets them, you know, have two pointers,
one fetching from the operand list,
one from the coefficient list,
pushing them through the multiplier, into the adder, into the accumulator,
and then keep doing that.
It still takes order N, right?
It takes 1,024 cycles.
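The order-N loop a DSP's multiply-accumulate unit performs, sketched in C (illustrative; a real DSP does this with dedicated MAC hardware and dual memory fetches, not a C loop):

```c
#include <stdint.h>
#include <stddef.h>

/* Sequential sum of products for an n-tap FIR: one multiply-accumulate
   per element, so roughly n cycles on a single-MAC DSP. */
int64_t fir_mac(const int16_t *coeff, const int16_t *data, size_t n) {
    int64_t acc = 0;                         /* wide accumulator with guard bits */
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)coeff[i] * data[i];  /* 16x16 -> 32-bit product */
    return acc;
}
```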
How would you ever improve that?
And then along come FPGAs. And FPGAs say: I've got 1,024 operations. I can do the first multiply-accumulate and the second multiply-accumulate
concurrently, right, and store the results, if I can fetch enough operands. Right, right. And some of
the DSP chips will do that too. Some of them have dual multiply-accumulators, right? But there's a
limit to how many they have. Right, right. In an FPGA... there are FPGAs that have tens of thousands of multiplier-accumulators in them.
That's current technology.
The early stuff, they were measured in tens to low hundreds.
But even so, let's say you've got a chip
that has 1,024 multipliers, right? If you set things up right,
maybe you can fetch all 1,024 coefficients all at once,
and you can take your data that was coming from an A-to-D converter
and have it going through a shift register that is a word wide.
Right.
And when all 1,024 words are in exactly the right place,
fire off all 1,024 multipliers all at once.
Right.
And you get 1,024 multiplies in one cycle.
Right. And you combine the results of two of those multipliers... so you
actually don't care about the intermediate adds, right? You don't have to store them. All you want
is to know the result. So you have 512 adders, right, that combine two different multipliers,
which on a DSP chip would have been two consecutive multiplies.
On the FPGA, it's a pair of concurrent multipliers, and you get 512 additions,
which all happen in one cycle. You follow those 512 adders with 256 adders, followed by 128, 64, 32, whatever.
You end up with about 11 stages deep.
Right.
Right.
And so in 11 clock cycles, plus the one clock to do all the multiplies, you have the result of 1,024 multiplies and 1,024 adds.
You have it in 11 clocks.
Right.
But each of those stages was only used for one eleventh of that time.
Right.
So take that data that was in that shift register
and shift it one position
and now do another 1 another 1024 multiplies.
Right.
And have those results marching down behind, through that adder tree.
You've got a pipeline situation going.
Now you've got something that took 1,024 clocks to fill the pipeline.
So that's your startup latency, right? But after you've paid that latency of 1,024 clocks, which is a millisecond, right, you're now delivering a result every single clock cycle.
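The adder tree can be simulated in C to show the log2 depth (a software sketch of the structure only; in the FPGA each stage is one pipelined clock and all stages run concurrently):

```c
#include <stdint.h>
#include <stddef.h>

/* Pairwise reduction of n products (n a power of two): the count halves
   each stage, so n = 1024 collapses in log2(1024) = 10 add stages; the
   "about 11 stages" above includes the multiply stage. Modifies the
   array in place and reports the stage count. */
int64_t adder_tree(int64_t *products, size_t n, int *stages) {
    *stages = 0;
    while (n > 1) {
        for (size_t i = 0; i < n / 2; i++)                    /* one stage of */
            products[i] = products[2*i] + products[2*i + 1];  /* pairwise adds */
        n /= 2;
        (*stages)++;
    }
    return products[0];
}
```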
Yep. So now... so here's how an FIR filter works. You set up the coefficients to be a pattern that you're looking for, right?
And you set up your data so it's streaming,
and somewhere in this data is the pattern I want to find.
And now, in a single clock cycle... whereas with a DSP chip,
it will take 1,024 clocks before you can shift the shift register once,
in the FPGA, you shift at every clock.
So the FPGA is running 1,024 times faster than a DSP chip.
And they both had the same startup latency, right?
Right.
They both couldn't start until you had your first 1,024 data items.
But the DSP chip, after it has shifted everything one position, it's another 1,024 cycles until you
get an answer, right? In the FPGA, it's one clock, because they were coming down through the pipeline of
adders. So the DSP chip could, in theory, also have 1,024 multipliers and then the
add tree as well, right? It's just that most of the time they do not. And you couldn't
control expanding it or contracting it as you needed, right? You couldn't make it application-specific,
and it wouldn't be able to do anything else. Right, right. Right?
And it would be,
instead of being the relatively cheap price
that a DSP chip is,
which might be in the $10 to $12 range,
it might end up being like these high-end FPGAs,
which are $2,000 to $3,000.
But for the FPGA,
if what you really needed was that 1,024-position FIR filter running at 100 megahertz, delivering a new result every 10 nanoseconds,
right, so every 10 nanoseconds it does 1,024 multiplies and 1,023 adds and delivers a result, that FPGA might cost
you a few thousand dollars, right? Right. Maybe... I don't know what the current prices are;
maybe it's a few hundred dollars. It doesn't matter, right? The point is, no amount of banging on the side of your DSP chip will hit that performance target.
Right.
Right.
And the cool thing for the FPGA is, if you only need to do that until the radar says,
I've now found the target, now switch from search mode to tracking mode, right? That's: load a different bitstream,
and now track the target that you found with that FIR filter, right? Or now go
into an image recognition algorithm, or, you know, whatever else, right? By the
way, there is, of course, a two-dimensional version. So what I've described is a single-dimensional FIR.
There are two-dimensional FIRs which get applied to images
and do things like edge sharpening and blurring
or feature recognition, et cetera.
The FIR filter is a workhorse of the DSP domain,
which covers image, audio, radar,
and almost everywhere where you're dealing
with real-world signals,
other than on and off,
there are DSP algorithms often involved.
Right, right.
So, and, you know, I mean, an FIR filter can also, you know,
boost the bass and trim the treble or whatever else.
I mean, it's a general-purpose algorithm,
and it's functionally easy to implement,
but it is computationally heavy.
Right.
And again, it's the example of looking at a DSP chip,
which is inherently sequential,
and the FPGA, which offers but doesn't enforce
a parallel solution.
Right.
If you're working only with DSP chips,
that parallel tree-based solution
just never occurs to you, because it's not part of the architecture. For the FPGA, it takes a while
to reshape into thinking in terms of the extreme parallelism that FPGAs are capable of, right? But that's the reason that, you know,
high-end communication systems,
satellite communications, wireless, 3G,
next-generation TV distribution...
FPGAs are used in all of these systems,
because they can implement these highly parallel solutions,
and you don't have to commit to a custom chip, which, when they go and change the standard,
will now be unusable.
Right, right.
Absolutely.
I definitely heard that.
For the FPGA, yeah.
Yeah, when the algorithm changes or the standard changes, the FPGA says, well, okay, send me a new bitstream.
Right, exactly.
Right.
So, an interesting view: understanding your architecture is really important.
You know, when you want to do optimization, I generally say look at your algorithms first before you start looking at boosting the clock speed or throwing more chips at it or putting
a water cooler on it and overclocking it.
Right.
Right.
Before you do those things, go look at your algorithm and see if there's a better way, right? In a very... well, I won't say cynical, but...
okay, it's cynical: I've been of the general impression that when someone uses floating point for an algorithm, they
don't understand their algorithm, right?
Floating point is the crutch for not knowing what's going on in your data, right?
If you really understand what your data looks like and how you're
manipulating it at each stage, for most applications, not all, but for most applications,
there is a pure integer solution,
or at worst, an integer solution
with an invisible binary point somewhere.
Right.
Breaking it into the integer part
and the fractional part of a binary number.
Right.
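The invisible-binary-point idea in C, using a Q8.8 format as an arbitrary example (the format choice and names are illustrative, not from the episode):

```c
#include <stdint.h>

/* Fixed point with an invisible binary point: Q8.8, i.e. 8 integer bits
   and 8 fractional bits in a 16-bit word. The radix point is purely a
   convention the programmer tracks; the hardware does plain integer math. */
typedef int16_t q8_8;

q8_8 q_from_int(int x)     { return (q8_8)(x << 8); }
q8_8 q_add(q8_8 a, q8_8 b) { return (q8_8)(a + b); }  /* plain integer add */
q8_8 q_mul(q8_8 a, q8_8 b) {
    /* the product has 16 fraction bits; shift back down to 8 */
    return (q8_8)(((int32_t)a * b) >> 8);
}
```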
And I hope you noticed I was very explicit that it's not the decimal point.
Right, right.
I get angry when people refer to that as a decimal point when they're looking at binary
or octal or hex data.
Right.
And by the way, all of this
comes from when I worked at AMD, where I was designing these types of chips.
That is, the bit-slice products and the RISC CPUs. Right.
I resolved that by giving it a generic name: the radix point.
Okay, yep.
So rather than the decimal point,
for whatever radix you're working,
whether it's decimal, hex, octal, or binary,
that dot is the radix point.
Right, right.
And then I get less frustrated and angry.
Right.
Well, to encourage you to continue listening to the podcast
after this episode, I'll try to ensure that other guests adhere to that
practice as well. But you mentioned AMD, right? And this is, I guess, fast-forwarding a few years
from where we last left you, but I think you joined AMD in... Is that right? That's correct.
So that's five.
We've jumped over about six or seven years there.
Okay.
And that's fine.
We could do it later or not do it. It doesn't matter.
You did ask on the notes that you gave me
how much planning was there.
And so let me at least talk a little bit about that because...
Yeah, please do.
Yeah, one of the things I guess I would say permeates my approach to stuff is planning, right?
I mean, I ended up in a department as a manager of product planning at AMD and then later at Xilinx. But my planning stuff dated back six years
before I ended up at AMD.
So I'm going to jump us back to that 74181.
Okay.
I'm going to jump over everything else that's relevant,
but I'm going to come back to the 181.
That approach to building the ALU as a slice of four bits
of what is a larger data path,
that ended up getting a name called bit slice.
And its origin was... and I don't remember who had it first,
but either Intel had a two-bit-slice ALU, or Monolithic Memories
had a four-bit-slice ALU, right? And these were more than the 74181: they added the register file.
So rather than having the two operands that feed into the ALU be in external chips, the bit-slice processors
were basically a 181 with a register file, right?
There was a two-bit version called the 3002, I think, from Intel.
And then there was the 6701, I think,
was the part number, from Monolithic Memories.
Okay.
And it was a 4-bit slice.
And the ALUs in both of them were not as good as the 74181.
Okay.
Right.
But they did do add, subtract, AND, OR, XOR, and shift left and shift right by one bit.
And actually, the shift left and shift right
meant you needed additional communication bits.
So if you're only doing add and subtract,
you just need a carry/borrow line.
For shift left,
you can use the carry chain.
But for shift right,
you need something that's pointing the
other way, right? Right. Okay. So, you know, that added an additional across-the-ALU type
control signal. Um, so MMI wasn't particularly successful with their bit-slice ALUs. I haven't ever heard of a system that was built with them.
But one of the managers at MMI moved, and this would have been probably in the
mid-70s, from MMI to AMD. And at AMD, the 2901 was born. And the 2901 was far and away the most successful bit-slice processor.
And it was a 4-bit ALU that was more functional than the 6701.
It had a better register file.
It had generate and propagate lines, which I don't think the 6701 had, so you could actually
use these bit-slice chips from AMD with the Fairchild 182 chip, right, and take
advantage of that generate/propagate logic. Right. And so lots of computers, both minicomputers and mainframe computers,
were built with 2901s.
So, for instance, the Data General answer to the DEC VAX:
when DEC went from the PDP-11 to the VAX-11/780,
which happened in 1978, I believe, right,
that was still using 181s. Data General had a competitive machine
called the MV/8000, and it used 2901s from AMD. And then you started seeing 2901s
in mainframe computers from pretty much everyone except IBM, right? And many computers, so later models of
PDP-11s, probably used 2901s. And there were a lot of other companies that looked at the
success of Data General and DEC and were lesser players, but still building CPUs. And then in the early 80s through to the late 80s,
there was a new range of computers built by a bunch of startups
which were called mini supercomputers.
So they weren't quite the Cray or CDC 6700 type real super-duper computers,
but they were machines that were bigger than the biggest mini computers,
but they could handle workloads and compute that typically used
RISC-type instruction sets and bit slice ALUs.
Yeah.
So anyway, how did I get down on that path? I don't remember, other than to say, you know, the style of building stuff got more integrated. So anyway, I was aware of all this stuff in the late 70s. And so by this time I'd already designed my first chip. I hadn't mentioned that. I designed my first chip in 1972, when I was 12 years old.
Okay.
At that time frame, Fairchild and National Semiconductor
actually had factories in Australia, in Melbourne.
Only an hour's travel by public transport for a 12-year-old.
And those, they were primarily, they were test and packaging.
Okay.
Right.
So the chips came in as wafers, already fully processed. They were then tested, diced, and packaged, right, some going to export and some feeding into the minuscule Australian electronics industry. So I went and visited one of them. Just as I went off to Honeywell's timeshare building, I went off to Fairchild, right, and asked for a tour. And so I got a tour by one of the sales engineers through the manufacturing floor, and it wasn't a particularly big facility. You know, there was probably less than 100 people working there, right? But I got to see all that, and I asked about, you know, so how do these chips get designed? The guy says, well, some people in America in the product planning group, and this is Fairchild, right, they look at the features that people want, or the salespeople tell them these are the features that customers are asking for, and then they plan the products. And this guy really didn't understand how that then turned into a product. So then, you know, stuff happens, and the chips end up here, and here the exciting stuff happens: we cut them up, we test them, and we package them, right?
Right.
So he skipped over the whole, you know, design the chip part
because he really didn't have any handle on that.
He just talked about what was locally there.
I said, so what if I had an idea for a chip
because I keep wanting this function,
but it isn't in the 7400 catalog?
And he says, I have no idea.
Yeah.
I said, well, if I wrote a datasheet for the chip, right, would you send it to those people? And he said, sure. And so I wrote a datasheet and laid out the pinout and the logic function and the timing table.
So basically all the stuff, you know, and I typed it up on a typewriter.
I didn't have a word processor back then, right?
And I gave him this four- or five-sheet datasheet for a chip that would have something that I was using often, which was an RS latch: a pair of cross-coupled NAND gates, which is used, among other things, to debounce switches, right, so that you get nice clean transitions. I said, you've got all these pins. You could have four of these in one chip, rather than me going through all of these 7400 NAND chips, right? You could have four of these circuits in one package. That chip is still built today, and it's called the 74LS279.
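For listeners who haven't met the circuit, here is a behavioral Python model (not an electrical one) of one such cross-coupled NAND latch with active-low inputs, showing why a bouncing switch contact still produces one clean output transition:

```python
class SRLatchNAND:
    """Behavioral model of a pair of cross-coupled NAND gates (one quarter
    of a 74LS279-style part). Inputs are active low: set_n=0 forces Q=1,
    reset_n=0 forces Q=0, both high holds the previous state."""
    def __init__(self):
        self.q = 0

    def update(self, set_n, reset_n):
        if set_n == 0 and reset_n == 1:
            self.q = 1
        elif reset_n == 0 and set_n == 1:
            self.q = 0
        # set_n == reset_n == 1: hold -- this is what absorbs the bounce
        return self.q

# A changeover switch thrown from the "reset" contact to the "set" contact,
# with contact bounce on arrival:
latch = SRLatchNAND()
outputs = []
bouncy_inputs = [
    (1, 0),  # resting on the reset contact
    (1, 1),  # in flight: neither contact touched -> hold
    (0, 1),  # touches the set contact -> Q goes 1
    (1, 1),  # bounces off -> hold, Q stays 1
    (0, 1),  # lands again -> still 1
]
for s_n, r_n in bouncy_inputs:
    outputs.append(latch.update(s_n, r_n))
print(outputs)  # a single clean 0 -> 1 transition despite the bounce
```

Because a bounce only ever breaks contact (both inputs high) rather than touching the opposite contact, the latch never sees a reason to change back, and the output stays glitch-free.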
So that was my first product planning experience.
Did you get any attribution in this process at all?
Never. Never. But I claim, and there's no one still alive to deny it, I claim I designed the 74LS279.
Well, that is probably not something many 12-year-olds can say.
Yeah, well, and who will believe it at this point, right? It doesn't matter. The point was, it got me into understanding, right, the process of: you need a datasheet, you need to have thought through the functionality, you need to make sure it doesn't need more pins than the package has, et cetera, et cetera.
Okay, so late 70s, I'm now programming PDP-11s. I've got a real job, actually working in Fortran in a company. We're going to skip over all of that, unless you want to go back at it later. And I'm looking at what are people using for building processors, and the answer is they're using AMD 2901s. And so I start learning about bit slice. I mean, I already had a handle on it from using the 181s, but now I'm looking at what replaced them, and I'm looking at how it changed. I said, damn it, I want to end up somewhere near that. So the only company I want to work for is AMD.
Right, right.
So I want to work for AMD. I figured there was no way I'd get to work on bit slice directly, so my goal was, how do I at least get to make coffee for those guys?
Right, right.
That was my plan: get to make coffee for the guys who worked in the bit slice department so I could talk with them.
Right.
From Australia, the closest I could get to AMD in Australia was AMD had a distributor.
So AMD didn't have any physical presence in Australia at all, but there was a distributor
called R&D Electronics, and they distributed
AMD's parts. They also handled Intersil, Monolithic Memories, Zilog, and a bunch of
other companies. So in Australia, the common thing is, since none of the silicon companies
had a presence in the country, because Fairchild and Nat Semi's packaging division
had decamped years ago due to some political silliness, right? So Australia had no silicon
industry whatsoever, but there was a bunch of distributor companies. These days, you'd talk
about Digi-Key and Mouser, et cetera, right?
So these are much smaller versions of that for the much smaller Australian market,
and they represented multiple product lines,
typically trying to avoid conflicts of interest, right?
So if you look at the product lines from AMD, Monolithic Memories, Intersil, Varo, and Zilog, there's almost no overlap
between them, right?
And so they were presented to customers.
If a customer says, I want this, we would look through our catalog of parts, right?
And say, well, we have a solution for you and it uses these MMI parts or whatever. So anyway, in Australia, all of these distributor-type companies
were just basically sales and shipping organizations
and a little bit of marketing.
The equivalent in the US, though,
they typically had something called an applications engineer.
In fact, if you look at a company like AMD,
it actually has two categories of application engineers,
internal and external.
So the internal application engineers answer the telephones,
solve problems, write the data sheets,
probably write the application notes,
and they take the calls from the other application engineers who are out in the field,
who are called field application engineers.
Right.
Big surprise.
So a field application engineer who has a problem with a specific customer would work with the customer to refine the problem description, right, whatever that might be, and then approach the applications department inside the company to have someone take ownership of it and either resolve it using the internal resources, which might include going off and talking to the chip
designer. Right. Right. Depends what was, you know, what level of skill is needed to resolve
the issue. So, in the US, the distributors all had application engineers, right, who were external.
Not all of them worked for AMD, right, or Intersil or whatever, right? They actually worked for the distributor.
So you had two categories of external FAEs.
You have field application engineers who are distributor FAEs, second-class citizens,
and then you have first-class citizens who worked in AMD's sales offices spread across the country. Okay.
Right?
Who primarily only ever talk to the really big customers.
Right.
Right?
For all the little customers, they sent them off to the distributors.
Right.
Right? But once a year, AMD held their international sales conference and FAE conference. And they upgraded the distributor FAEs,
and they were invited as well. And so they had a conference that would run for a week,
right? And they had all the distributor FAEs, all the company external FAEs, and all the internal application engineers,
all in a huge conference facility, in AMD's case, usually in Hawaii.
And then factory people from the engineering department would make presentations
about what is in the pipeline and what to look out for in terms of opportunities,
plus marketing and salespeople would say, here's the products we already have, right? Here's how
you go about, you know, here's the types of customers you look for, here's the products
you offer them, right? So, it's a sales and engineering conference. AMD had them once a year.
That included external FAEs that work for distributors. In Australia, none of the
distributors had any FAEs. So I approached Australia's AMD distributor, a company called
R&D Electronics, and said, have you considered having an application
engineer across all your products?
And the owner of the company said, we've been thinking about it, but we didn't know who
we'd go get.
Right.
And I put my hand up and said, I'll do it.
I would love to be Australia's first field application engineer.
Right.
And so that happened at the end of 1979.
I joined as an FAE.
And while my primary goal was eventually to make coffee for the bit slice people in California, the first thing was to become competent as an application engineer. And I did it across all of that company's
product lines that it represented. So I didn't just become an AMD FAE, I became an MMI FAE, and I became an Intersil FAE, and an FAE for several other companies.
The MMI FAE position turned out to be, in the short term, the most useful, because I got to see PALs before they got released. And so I was basically training people how to use PALs when they first became available from MMI.
Right.
Which was a wonderful experience.
Wonderful.
But I also became an FAE for AMD,
and so within my first year of joining this company,
they sent me to America.
Awesome.
Right.
And I got to attend the application,
the sales and application conference.
And I got to meet, it wasn't my expectation,
but I actually got to meet some of the people who were the actual
chip designers in the bit slice group, right, and had interesting discussions. So I stayed with that company for about three years, attending conferences on an annual basis and building up my contacts at AMD and at MMI and at Intersil and various other
companies. And I had a load of fun in Australia, talking to lots of people, lots of different
engineering teams, all with different problems. So I got to see a very broad range of electronic design and different levels of
competence of teams and whatever. Eventually I left that company, because I wanted to actually do some engineering, right? And I joined a pair of small companies, one after the other, both of which had serious management problems.
Okay.
And they didn't last very long.
Any Australian companies?
Yeah, yeah, these are Australian companies with under 20 employees doing primarily S-100-based stuff, both designing boards and building systems
and selling accounting software and business software
and that sort of stuff.
That didn't last more than a year.
So it was basically two companies, six months each,
and it was like, this is not for me.
Then all through that period,
I'd also been a lecturer at one of the local universities
teaching final year system design.
So fourth year EE degree students,
I was teaching system design and balancing trade-offs and
writing low-level code and a little bit of DSP and a little bit of making Ethernet work, et cetera, et cetera.
So after the two small companies that really didn't suit me, I asked the university if I could become a full-time
lecturer. And so I hadn't lost sight of wanting to work in BitSlice, but I didn't feel I was
quite ready to make the jump. And I didn't actually have any opportunity for that. So I took on a role in Australia. The job title was senior lecturer.
The US equivalent would be associate professor.
That is responsible for course creation and syllabus and examination and evaluation, et cetera. So, it was reasonably autonomous, but, you know,
not professor level, you know, high status. Right, right. Fair enough.
Right, right. But, you know, I basically created a new, you know, digital course that, you know,
was fun to teach. Unfortunately, when I moved from being a part-time lecturer where I was just
teaching two subjects a week, so all the time that I was being an application engineer, I was also
taking off half a day per week to teach at that university. So they already knew me. And that
company that I was working for was incredibly nice to let me have half a day off a week to teach at the university.
They figured I'm training their next generation of customers.
Right.
Because guess what?
The data books that I handed out for them to do their assignments, the data books were AMD and MMI and Intersil.
Right.
Right. So there was clearly some linkage there. Anyway, the politics when I became a full-time member of staff was unbearable. The trivial politics between the various academics just drove me nuts. So I sent my resume off to someone in the applications department at AMD, right, to see if I could become an AMD internal applications engineer.
So that was my path.
I figured with my credentials,
having been a field application engineer in Australia
and having multiple years of them seeing me,
that it would be pretty easy.
Unfortunately, it took three or four resumes being sent, because the guy kept throwing them in the trash, which was very sad. But eventually, instead of trashing it, he gave my resume to the head of the bit slice group.
Oh, wow.
The product planning group for bit slice.
Okay.
And so he called me, so a long-distance phone call.
He called me and interviewed me for about an hour.
Okay.
And I told him about how I had, with a bunch of friends of mine, so we've skipped over all of this, but starting in 1980 when I became an employee at this distributor company, I also bought parts at a discount.
And I bought a pile of bit slice processor chips and all the stuff that goes around it.
And a bunch of friends and I had actually built a 32-bit mainframe.
Oh, wow.
Right?
So we actually had some very interesting stuff.
It's not for this discussion, but as to, do I know anything about bit slice,
the answer was I have a 32-bit mainframe that my friends and I designed and built,
and it's four times the performance of DEC's VAX.
Right. I'm sure the interview went pretty well after that.
Yeah, it did, really well. And he said, we'll probably get back to you. And later that day, I got a call from HR with a job offer. And unfortunately, I had just started that semester.
I said, so all of this is great.
I didn't even care how much they were offering me.
I wasn't going to be making coffee for this group.
I would actually become part of that group as my entrance into AMD.
So I just started the semester.
I said, I'm going to have to delay by six months.
Otherwise, I'm going to be stranding, right, a whole class of students.
Right.
And they agreed to it.
You know what?
I half regret it, but I don't have any guilt of having stranded students, which if I'd instead jumped at it and stranded
those students, I'd probably have felt bad about it for at least a while.
But instead, I taught that semester and then packed up everything I owned and headed off
to America.
And so by this time, I'm now late 20s, and I joined AMD not as a junior in that department, but one of two managers.
So there was a guy in charge of that department, and then under him, he had slots for two managers, and then under those two managers were then the rest of the department.
And so he brought me in as a,
AMD's term was a section manager.
And so I was a section manager
in the programmable processors division.
Okay.
Right, responsible for, I thought, bit slice products.
But it turns out, nah-ah.
It turns out the department had a secret project
that was unannounced to the outside world,
which was a RISC CPU.
And so they had just finished a round of next-generation bit-sliced parts,
and rather than doing another round of it, they decided, for whatever reason, that RISC made more sense as the next thing for their skill set. So the rest of AMD was busy copying Intel instruction sets, right, whereas this group, the bit slice group, was designing instruction sets. And the RISC CPU would be designing the instruction set as well, so that's where it belonged, at least initially. And so they brought me in because of my bit slice experience, of actually having built stuff. So bit slice in general, microcoded processors, is not RISC at all. It's CISC, right? You're building interpreters that are below, that implement the assembler language.
So would the bit slice components, would they fit into the CISC processors as like a target, essentially, for the microcode inside of the larger system?
Very much.
Okay.
Yeah, right.
So pretty much all of these bit-sliced parts
were implementing CISC CPUs.
Right?
So the bit-sliced parts are pretty much inherently
not driven by assembler.
They're driven by micro-assembler code, or microcode. And so you have an instruction that might be 64 bits wide, or a lot wider, that's controlling all these different microcoded chips concurrently. In fact, the machine that we built in Australia had a microcode instruction that was 128 bits wide.
Right.
And so it was controlling a lot of stuff concurrently, which is part of what made it fast.
Right.
Right.
But it was still cycling around an interpreter loop, interpreting an upper level assembler.
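A sketch of what a wide horizontal microword buys you, with field names and widths invented purely for illustration: each clock, one word drives many units at once with no decoding in between.

```python
# A toy 32-bit horizontal microword, sliced into fields that each drive a
# different unit in the same clock. Real machines like the one described
# were 64 to 128 bits wide; this field layout is purely illustrative.
FIELDS = {              # name: (low bit, width)
    "alu_op": (0, 4),   # which ALU function this cycle
    "a_addr": (4, 6),   # register file read port A
    "b_addr": (10, 6),  # register file read port B
    "dest":   (16, 6),  # write-back address
    "shift":  (22, 2),  # shifter control
    "branch": (24, 8),  # next-microaddress / sequencing control
}

def decode(microword):
    """Split one wide word into per-unit control signals. There is no decode
    logic to speak of: every unit just sees its own bits concurrently."""
    return {name: (microword >> lo) & ((1 << width) - 1)
            for name, (lo, width) in FIELDS.items()}

word = (0x3 << 0) | (0x05 << 4) | (0x0A << 10) | (0x11 << 16) | (0x1 << 22) | (0x42 << 24)
print(decode(word))
```

Widening the word just adds more fields, which is exactly why a "memory is free" attitude made very wide microcode attractive.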
Right.
So anyway, I was coming in to work in the RISC group, but I'm the guy that doesn't believe in RISC, right? I have immersed myself for at least the last five or six years in CISC-type stuff. And this RISC group was built from the bit slice group, which was also of CISC architecture origin. So the manager, when he brought me in, said, I want you to spend a few weeks going through all of their documents and analyzing it, and tell me if we're heading in the wrong direction.
Right?
Because I'm coming in as a very solid CISC-type viewpoint
and experience of actually building CISC CPUs.
Is this RISC thing really better, right?
So I worked on it for two weeks, went through all the documentation they had, all their estimates of timing, the instruction flows, whatever else. And at the end, I said to the boss, this is insane. It's crazy to keep me doing this task. Put me on the RISC team. This is the future.
Wow.
Right. And so basically I became one of the co-architects. I mean, in total there were eight engineers who architected the 29000, right? Everybody contributed. So I wasn't like a lead architect or anything.
I was one of a team.
Although I had three other engineers under me,
which, by the way, I'd never had reports before.
And I'm not a great people person.
You could have called me.
Well, I'm not sure I did my best, but I'm much better at pushing gates than pushing people.
Fair enough.
So, I did my best and I really enjoyed working.
We really worked more as colleagues than as manager and underlings. So, for instance, myself and one of my underlings wrote the cross assembler for the 29000, right? And I worked with another person on various other tasks, right? So there was a lot of real work where, as a manager,
I still got my hands totally dirty along with everybody else in the team.
Right.
In that two-week window, though, that you were talking about
where you kind of became convinced, if you will,
I've read quite a bit about the RISC versus CISC academic debates in that time period.
What was it that you saw in those two weeks
that made you a convert, if you will?
So one of the things that I carried from my Australian project, when I convinced a bunch of friends to help me build this machine, which was all done as wire wrap, was I said, I think we need some guiding ideas
that are forward-looking, right?
Rather than working with what we have,
let's predict where something is going and design for that target, right?
And I said, among the things that we've seen
is that chips get more and more complex,
but the other thing is that the cost of memory keeps dropping
and the size of memory keeps going up.
Let's use as a guiding principle
for this CISC machine that we were building,
let's use as a guiding principle that memory is free.
So wherever we're making decisions
where memory might be involved,
don't treat it as a scarce resource.
Treat it as something that's free.
So the fact that the microcode memory was 128 bits wide,
not an issue.
The fact that it was a quarter of a megaword deep, when most companies were looking at like a kiloword or two kilowords, we were doing 256K-deep microcode memory, totally outside of any norm. But it let us do things, right? In terms of the way we cracked instructions in the interpretive loop, we could just have separate routines for groups of instructions that are very, very similar but would have an overhead to sort out the minor differences. With that much microcode, we just have unique microcode for each one and avoid a branch and a compare, right, as part of the decode loop.
Right.
As an example.
Late in the project, we virtualized it.
Okay.
So we had demand-paged microcode, right, because memory is free. The register file was 4096 registers, right? And the instructions that the microcode engine could execute could fetch any two registers, perform an ALU op, and write back to any third register every clock.
Right?
Because memory is free.
I mean, it wasn't.
And so we didn't implement everything.
Right.
But we used that as a guiding principle.
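That fetch-two, write-one-every-clock datapath can be sketched behaviorally like this (names invented; a model of the idea, not the actual machine):

```python
class ThreePortRegFile:
    """Behavioral model of a 4096-entry, two-read-one-write register file:
    every clock it services two reads, one ALU op, and one write-back,
    as described for the Australian machine."""
    def __init__(self, n=4096):
        self.regs = [0] * n

    def cycle(self, a_addr, b_addr, dest, alu_op):
        a, b = self.regs[a_addr], self.regs[b_addr]   # two fetches
        result = alu_op(a, b)                          # one ALU op
        self.regs[dest] = result                       # one write-back
        return result

rf = ThreePortRegFile()
rf.regs[1], rf.regs[2] = 10, 32
print(rf.cycle(1, 2, 3, lambda a, b: a + b))  # r3 = r1 + r2 in a single "clock"
```

In hardware this means a three-port memory (two read ports, one write port), which is expensive in area; under the "memory is free" principle, that cost was simply accepted.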
So, memory is free. I brought that with me when I started looking at the AMD project. And the number one thing that people pointed at for RISC is, because the instructions are so simple, you're going to have to have so many more of them, and memory's expensive. And I said, no, it isn't. Memory's free, right? Memory's free. And so that's no longer an issue for RISC, because in time, that memory that you're worried about how much it costs, it's not going to matter, right? The performance is going to matter. How fast can I fetch instructions? And with RISC, there were two things. One was that the memory-is-free thing got rid of one of the anti-RISC sort of arguments. But the other one, I didn't see it directly.
It was an education that I got from one of the guys that was...
Okay, so by the way, that team of architects on the 29000 at AMD: of the eight people, five of them were fresh out of college.
Oh, wow.
Right. So five of them were new grads who hadn't worked anywhere prior to AMD, and the rest of them, there was me, a guy from National Semiconductor, and a guy from IBM. The IBM guy had worked on IBM's RISC processor that eventually became PowerPC. So one of the guys who was fresh out of college educated me on compiler optimizations. And he was very strong on how, because instead of having microcode where everything is set in concrete for how each instruction is executed, on a RISC CPU, you can take a step up, a layer out on the onion, to the next layer out.
So you're not trying to execute single assembler instructions.
You're trying to do some algorithmic piece.
And there are optimizations you can do.
The best example is, let's say, instruction A has to store to a register and instruction B has to fetch from the same register, right, and they happen one after the other. Well, in a RISC CPU, maybe that's a direct path, and you're not touching the register file at all. There's an intermediate register that's otherwise invisible, right, but lets you, over two clocks, do something that would have taken three or four clocks and blown away a register along the way, which could have been used for holding a useful coefficient or something. So for the RISC CPUs, there were lots of opportunities for optimizations in the compiler that would help you do a better job with the limited resources that the RISC CPU provides. It basically allowed finer-grain optimizations, and I really liked that. So it was the finer-grain optimizations.
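The direct path Philip describes is what's now usually called result forwarding or bypassing. A minimal Python sketch of the idea, with all the structure invented for illustration:

```python
# Minimal sketch of result forwarding: instruction B reads the value that
# instruction A just produced from a bypass register, not the register file.
# Everything here is illustrative, not the 29000's actual pipeline.
class Pipeline:
    def __init__(self):
        self.regs = [0] * 8
        self.bypass = None  # (dest_reg, value) produced last cycle

    def execute(self, op, dest, src1, src2):
        def read(r):
            # Forwarding: if the previous instruction wrote this register,
            # take the value off the bypass path instead of the register file.
            if self.bypass and self.bypass[0] == r:
                return self.bypass[1]
            return self.regs[r]

        a, b = read(src1), read(src2)
        result = a + b if op == "add" else a - b
        self.regs[dest] = result
        self.bypass = (dest, result)
        return result

p = Pipeline()
p.regs[1], p.regs[2] = 5, 7
p.execute("add", 3, 1, 2)         # A: r3 = r1 + r2
print(p.execute("add", 4, 3, 2))  # B: r4 = r3 + r2, r3 arrives via the bypass
```

A microcoded interpreter that rigidly executes one assembler instruction at a time cannot see across the A/B boundary to exploit this; a compiler targeting a RISC pipeline can.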
And it was the memory-is-free answer to the memory-is-expensive argument. I mean, we had benchmarks. That same guy had written a GCC back end for the sketched-out architecture. The architecture that AMD had for the 29000 was not yet set in concrete.
The broad strokes were done, but the detailed instruction set hadn't been done.
We knew what types of instructions there'd be.
We knew the approximate layout of the bit fields in the instructions. Some of that surprisingly is important in that you want to be able to take instruction
bits and route them directly to where they're needed, like maybe controlling a MUX or directly
going into an ALU to select which ALU function, rather than having to be decoded and then
generate a control word, which is a different bit pattern.
You want to avoid that.
And so some of that is literally the artistry
of figuring out the bits of an opcode.
So the 29,000 is a 32-bit instruction,
eight bits of opcode, and three 8-bit fields.
And sometimes two of them are joined to create a 16-bit field. But the eight-bit instruction field, certain of those bits
are routed directly to the ALU. Certain bits go off other places. But suffice to say,
there's a lot of stuff there where you avoid having to put a mux in a critical path because you just lay out the bit fields.
So that was stuff I knew about.
And in fact, I was one that did the detailed design, right?
The instruction set had already been figured out by other people, but they hadn't yet figured out how does that get mapped to the op codes.
Right.
And so one of the tasks, I said, it's a tedious job, but I'm willing to do it. And so I actually figured out the exact bit patterns for every instruction and optimized them so that it would work. In fact, even things like: there's fetch two registers and write back to the third. Do you really want those three fields
to be in some arbitrary combination
of three 8-bit fields?
Or is there an optimal one?
And the answer is,
well, if I'm going to have 16-bit constants
and I want to write them to a register,
then maybe the 16-bit constant
should be the bottom 16 bits, and the write-back register should be the next eight bits, and the opcode should be the top eight bits, right?
Right.
As an example.
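A sketch of that layout as described in the conversation, with the opcode value invented and the details treated as illustrative rather than as a datasheet: the 16-bit constant already sits in the bottom 16 bits, so no shifting or muxing is needed to route it.

```python
# Sketch of a 32-bit instruction carved into an 8-bit opcode and three
# 8-bit fields, where the two low fields can be joined into a 16-bit
# constant -- the layout Philip describes; details are illustrative.
def fields(insn):
    return {
        "opcode": (insn >> 24) & 0xFF,   # top eight bits
        "dest":   (insn >> 16) & 0xFF,   # write-back register
        "src_a":  (insn >> 8) & 0xFF,
        "src_b":  insn & 0xFF,
        # Two low fields joined: a 16-bit constant already sitting in the
        # bottom 16 bits, so it routes to the datapath with no shifter.
        "const16": insn & 0xFFFF,
    }

# "load r5 with constant 0xBEEF" under this layout (opcode value invented):
insn = (0x60 << 24) | (5 << 16) | 0xBEEF
f = fields(insn)
print(hex(f["opcode"]), f["dest"], hex(f["const16"]))
```

In hardware, each of these slices is just a bundle of wires peeled off the instruction register, which is exactly the "route bits directly, avoid a decode step" point.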
Yeah, the simplest version of this, like, just considering where bits are in the instruction, that I've experienced at least, is implementing a RISC-V CPU. RISC-V, one of the nice things about it is arguments are always in the same place. There's lots of different instruction formats, but if there's RS1 and RS2, they're always at the same place in the instruction, if they're present. And once you actually try to implement the logic, right, for a CPU, you understand how important that is.
But, you know, if you haven't done that before, then it's not obvious.
If all you've ever done is programmed in Assembler,
that's opaque to you.
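Concretely, in the RISC-V base encoding, rs1 always sits in bits 19:15 and rs2 in bits 24:20 whenever they are present, across the R-, I-, S-, and B-type formats, so the register file read ports can be wired straight to those bits:

```python
# RISC-V base ISA: source register fields live at fixed bit positions in
# every format that has them, so decode can route them without muxes.
def rs1(insn): return (insn >> 15) & 0x1F   # bits 19:15
def rs2(insn): return (insn >> 20) & 0x1F   # bits 24:20
def rd(insn):  return (insn >> 7) & 0x1F    # bits 11:7

add_x3_x1_x2 = 0x002081B3  # R-type: add x3, x1, x2
sw_x2_0_x1   = 0x0020A023  # S-type: sw x2, 0(x1) -- no rd, but rs1/rs2 unchanged

print(rs1(add_x3_x1_x2), rs2(add_x3_x1_x2), rd(add_x3_x1_x2))
print(rs1(sw_x2_0_x1), rs2(sw_x2_0_x1))
```

The same extractors work on both formats, which is the whole point: the register file can start its reads before the opcode has even been classified.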
And what you're describing, literally the 29000 does that too, right?
So there are instructions that only have a destination register.
There are instructions that have a destination and one operand and a short immediate, right,
of eight bits.
Right.
And then there's instructions which fetch two registers and write back to a third.
And because there is a place in the data path where you have to choose between a register being fetched and that short immediate, or two registers being fetched and the long immediate, right, the 16-bit immediate, those mean there are muxes somewhere, right, that have to decide: am I taking the bits from the instruction, or am I taking the bits that the instruction is indexing into the register file? And there's a register file access that is pipelined, so I won't actually get that register for another clock. So there's a bunch of juggling there, right? And it wasn't like something I'd knocked out in an afternoon, right?
Right.
I probably spent two or three weeks designing the bit patterns
for the instruction set, right?
So anyway, there was just a bunch of things that I saw
in what the 29000 was capable of doing, and what I knew it was competing against in CISC would always require multiple instructions. And that interpretation process was blocking optimizations that were open to well-written compilers.
Mm-hmm.
Right.
Right.
You know, it may surprise some of the listeners that, if you know what's going on behind the scenes, RISC won, a hundred percent. And you'd say, oh, but x86 dominates the market. x86 on the package dominates the market. Inside, both AMD's and Intel's x86s are RISC CPUs, right? And they all have an on-the-fly translation engine and caches that translate the x86 instructions into RISC operations. AMD calls them ROPs, R-O-P-s. And they then go into the ROPs cache, right? And the actual compute engine is executing from the ROPs cache, running RISC instructions, right?
Right.
So, you know, some of those are doing optimizations on the fly as well, although, again, there are opportunities in the compiler to help things along. Okay, where would you like me to talk next?
Well, one of the other things, about the, I don't know if it was the 29000 or another chip that you worked on, but we also briefly, I think in the past, talked about register windows. Were those on the 29000?
Yeah, that was on the 29000.
Okay.
So it's interesting. I've wondered, so the comparison is,
so there were two processors that used a rolling window of registers
that the execution stream is busy modifying.
And one way, the term we used,
and you'll have to forgive me a little bit because we're talking 36, 37 years ago.
It's been a little while since I worked on this.
Someone commented to me that as you get older, you talk less about the great things that you'll do in the future and more about the great things you've done in the past.
So I'm clearly in the older category.
And some of it, the memory fades because it's back far enough.
But so one of the things that was part of the 29000's architecture, well before, so, I mean, a lot of work had been done before I arrived, a lot of work.
And one of them was that there'd be this register window.
And so basically what a register window is,
is it says that all the registers live in memory, right?
And there were processors before the 29000 that did that. Among them was the Texas Instruments 9900, which dates from the 70s, except it really had all the registers sitting in memory, right? So kind of slow.
Right, right.
The register file in the 29000, we referred to it as a stack cache. And there is, well, for the purpose of this discussion, we'll say there's 128 registers. There isn't. There's actually 192. It was originally 256, and we ran out of room.
So originally, because we had 8-bit fields,
it was intended to index into the register file
with 256 registers.
Well, it turns out the chip design guys
couldn't build the chip
because there were too many registers.
And so something had to go.
And we ended up throwing away 64 registers.
And so we have a lower half of 128 registers and an upper half of 64, which used to be 128, right? So there's a negative optimization, because we ran out of silicon, right? And that then propagated to every member of the family afterwards.
Okay. So let's talk about the half of the register file that was still there, right, which was 128 registers. Nominally, the stack pointer is a pointer somewhere into those 128
registers, and all the registers above that pointer position are parts of the stack that haven't been touched yet or are stale from previous execution, and the stuff below the stack pointer is active stack frames that we plan to look at as we unroll the stack. Okay. So the register file, 128 registers, is like a barrel, right? You have a pointer that says, here's where I am in the 128 registers.
There's a marker that says,
this is the first of the valid registers.
And so there up to the stack pointer
is the top of stack with N registers.
And then you have, from there, you wrap around to that same boundary register pointer, which are the registers that you could use, because this cache is sitting there; you just haven't dug deep enough in your subroutines yet, right? Okay. So when you want to do a subroutine call, you compare the current stack pointer, bottom seven bits,
to the boundary register, bottom seven bits, and you say, do I have enough room to allocate the
registers that I need for this function, right? So the compiler already knows: I'm passing three parameters plus a return address plus a frame pointer or something. And they're all going to get pushed on the stack. And then we're
going to jump off to the subroutine. So I need seven registers. Are seven registers available
above the current stack pointer? Right. And you look at it and say, yes, there are.
So you just update the frame register,
you update the stack pointer,
you put your parameters into the appropriate places
in the stack cache, and you go to your subroutine,
and it can then do indirect or offset references
to the new stack pointer to get its parameters off the stack.
And if it wants to call another routine, it does the same calling convention.
Right.
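As an aside for readers: the allocation check described above (compare the bottom seven bits of the stack pointer against the boundary register, around a 128-register barrel) might be sketched roughly like this. The function name and the exact wrap-around convention are illustrative assumptions, not the actual Am29000 hardware logic.

```python
# Rough sketch of the register-window allocation check on a call.
# The 128-register half of the file behaves like a barrel, so only the
# bottom seven bits (values mod 128) of the two pointers matter.

WINDOW = 128  # registers in the visible half of the file

def room_for_call(stack_ptr: int, boundary: int, needed: int) -> bool:
    """True if `needed` registers fit between the stack pointer and the
    spill boundary, with wrap-around on the 128-register barrel."""
    free = (boundary - stack_ptr) % WINDOW
    if free == 0:          # pointers coincide: treat the window as empty
        free = WINDOW
    return needed <= free
```

If the check fails, registers have to be written out to memory first, which is the spill case discussed below.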
Eventually, if you just keep digging down, right, deeper and deeper into a subroutine call tree,
eventually there won't be enough registers.
Right.
Right.
The registers that are furthest away,
which are right at that boundary register,
but looking at it from going backwards,
those are the ones you're not going to want to look at for a long time until you unwind the stack.
So you flush them out to real memory.
Right.
Okay.
That's called a spill, right? Right. And you just move the reference pointer by however many registers you threw away, and now, magically, you've got more stack. Right. So the cost of doing a spill, which happens when you're digging down through the subroutine tree, is the cost of writing the registers out and updating the pointer. And the choices are: spill only as many as you need, spill everything, or maybe spill half. Right. And I don't remember exactly, but we certainly didn't spill only as many as you need, and we certainly didn't spill everything. I think the default was you just always spill half.
Right. So you spilled 64 registers out to memory, and then you continued executing code. Now, eventually you start executing return instructions, and eventually the frame pointer is pointing back before the reference pointer, which says the registers you want aren't in the cache, right? Right. And so now you do a fill, right? You're unwinding the stack calls, so you do a fill, and the fill is: bring in 64 registers. Right. And so nominally, the stack pointer is always in the middle, with 64 registers behind it and 64 registers after it. And it turned out that whatever the spill and fill quantities were, they basically meant that a very high percentage of subroutine calls triggered neither a spill nor a fill. So you had something like 95%, 97% of subroutine calls didn't trigger a spill or a fill.
So the efficiency was outstanding, right?
So you always had the top of stack, right,
which had all your subroutine parameters
or return values or whatever,
are just always available
in the fastest memory the chip has, right?
Right.
The register file.
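To make the spill-half/fill-half behavior concrete, here is a toy model. The sizes, the policy, and the class name are illustrative assumptions based on the description above, not the real Am29000 implementation.

```python
# Toy model of a register stack cache with a spill-half / fill-half policy.
# Sizes and policy are assumptions taken from the discussion above.

class StackCache:
    def __init__(self, size=128, chunk=64):
        self.size = size    # registers in the visible window
        self.chunk = chunk  # registers moved per spill or fill
        self.used = 0       # registers currently live on-chip
        self.spills = 0
        self.fills = 0

    def call(self, frame):
        """Allocate `frame` registers, spilling a chunk to memory if full."""
        while self.used + frame > self.size:
            self.used -= self.chunk  # oldest chunk written out to memory
            self.spills += 1
        self.used += frame

    def ret(self, frame):
        """Free `frame` registers, filling a chunk back in on underflow."""
        self.used -= frame
        while self.used < 0:
            self.used += self.chunk  # chunk read back from memory
            self.fills += 1

# A 30-deep chain of calls needing 8 registers each, then the matching
# returns: only a handful of the 60 events touch memory at all.
cache = StackCache()
for _ in range(30):
    cache.call(8)
for _ in range(30):
    cache.ret(8)
```

In this toy run only 4 of the 60 calls and returns cause a spill or a fill, in the spirit of the high hit rates mentioned above.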
And so that's an optimization versus some other register window strategies. But just to compare to not using register windows at all: if you have registers you need to preserve across a subroutine call, you have to have a function prologue, or prelude, where you're going to push them onto the stack. And so one of the things you're saving is all of those operations where you're manually pushing and popping things to the stack. In x86 code, right, or ARM code for that matter, because ARM has a similar problem, you'll see, particularly for things that might end up being interrupt service routines, that they end up having to flush all the registers out to memory, because you need a working set. On the 29000, you say, my interrupt service routine needs X registers.
Can I just move the pointer and give it X of its own registers?
And when we're done, we'll return them all.
So absolute register numbers, they don't exist in 29,000 code.
They're all, what is the offset from the top of stack?
And the top of stack is always on chip, not out somewhere in main memory.
Right?
It's somewhere within those 128 registers.
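The "offset from the top of stack" register naming described here might be sketched as follows; the function and the exact modulo arithmetic are assumptions for illustration.

```python
# Illustrative sketch: a register name in an instruction is an offset from
# the current stack pointer, wrapped around the 128-register barrel, rather
# than an absolute register number.

WINDOW = 128  # registers in the visible half of the file

def physical_reg(stack_ptr: int, local_num: int) -> int:
    """Map an offset-from-top-of-stack name to a physical register index."""
    return (stack_ptr + local_num) % WINDOW
```

Because all names are relative, moving the stack pointer (for an interrupt handler, say) transparently renames the whole working set.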
I did some analysis. I basically heard about register windows from learning about SPARC. And I was looking at x86 and ARM and now RISC-V. And I was curious why none of these newer ones, if you will, right? I mean, some of these trace back to before. But why they don't use register windows.
And I did a bunch of reading.
I ended up writing a blog post
that kind of went through some of the trade-offs. But I had briefly mentioned to you seeing a comment that SPARC register windows were perhaps not implemented quite as efficiently. Do you think that contributed to register windows not being included in subsequent architectures? I don't think so. I think that the SPARC decision, so SPARC also has register windows that spill and fill. The difference between it and the 29000 is the pool of registers is smaller and the allocation size is fixed. So when you enter a subroutine, the movement of the frame register is always by the same amount, whereas for the 29000, it's optimized on a per-function basis. So if you only need three registers, you only consume three registers, whereas SPARC always consumes 12 or something. I don't remember the number, but whatever it is. On the 29000 that worked because of the call instruction that said, here's how many registers I need to allocate.
Right.
Right.
So in the 29000, part of that subroutine call is the frame increment.
Mm-hmm.
Right?
So, you know, that was because of the way that those bit fields were laid out: we had an 8-bit immediate field that you weren't needing for the subroutine call. Right. Actually, how was it done? It was actually not the immediate field; it was the destination register field, right, the one that was always available. And that left you with a 16-bit relative offset for how far away the function could be. So you could be up to 64K words of offset, right, which was split 32K before and after the current PC value. And if you needed more than that, there was a different call instruction that said call indirect through a register, right, which would have a 32-bit absolute address or a 32-bit offset from the current program counter.
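The 16-bit PC-relative reach described here (64K words, split 32K before and 32K after the PC) can be sketched as follows; the sign-extension handling and the 4-byte word size are assumptions for illustration.

```python
# Sketch of decoding a 16-bit PC-relative call: the 16-bit field is a
# signed word offset, giving a reach of 32K words either side of the PC.
# Treating a word as 4 bytes (32-bit machine) is an assumption here.

def call_target(pc: int, offset_field: int) -> int:
    """Sign-extend the 16-bit word offset and apply it to the PC."""
    assert 0 <= offset_field < (1 << 16)
    offset = offset_field - (1 << 16) if offset_field & 0x8000 else offset_field
    return pc + offset * 4
```

Targets beyond this reach would need the indirect form with a full 32-bit address in a register, as described above.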
Right, makes sense.
So, you know, I suspect it was just a function
of how the bits were laid out.
And, you know, there was also an overhead
that it required an extra 8-bit adder in a somewhat critical register
access path.
Mm-hmm.
Right?
So, you know, it wasn't without its own costs in the silicon implementation.
Right.
Yeah.
Right.
That makes sense.
Yeah. Maybe as a kind of wrap-up of where we've gone so far, and we haven't even gotten to Xilinx and FPGAs and everything, so I am absolutely going to have to have you back. But from your time at AMD, maybe specifically on the 29000 or other work that you did there: were there any other kind of lessons or takeaways that you had,
both from the perspective of things we've been talking about of,
you know,
specific to processor architecture,
but also just,
you know,
general organizational takeaways as well.
Oh,
I have,
I have a career worth of organizational takeaways.
Be really careful picking your boss. And when you do, make sure you're on their team and working to meet whatever their goals are. So that's really important. And understand what the hot buttons are of whoever is managing you, and make sure you don't do anything to shoot yourself in the foot. Right, I did that a few too many times. It helps if you have the ability to get a boss who really understands. I guess I've had this attitude that the function of a manager slash boss is to protect the people below from the crap above.
Right. So you have upper-level people, above your boss's level, who are busy jerking chains and changing schedules and, you know, doing things that would really upset what the team is trying to get done on what is probably a tight schedule. And so a good boss insulates the workers, the worker bees, from the politics that really shouldn't affect them. Right. And I've had bosses both who do that well and who don't do it well at all, and it really affects you. I have a fairly complex, I guess, architectural observation. It's not mine; this is from when I joined AMD. So, AMD's product planning, and I know it's in your question list, which is how different was AMD product
planning from Xilinx. And there's an Australian term which I think we should propagate to America: it's like chalk and cheese. Okay. Right. They might look the same, sort of, you know, just a piece of chalk, a piece of cheese. Look, take a bite, you'll see there's a difference, right? So anyway, here the equivalent is it's like apples and oranges, or something and pickup trucks, or whatever. But the Australian term is it's like chalk and cheese. It's just really different.
So anyway, one of the things that AMD had was actually a whole product planning department.
So the group I was part of, the processor products group, was one of several teams. Our team was like eight or nine people total, including its section manager. I think product planning was maybe 90 engineers. Oh, wow. Right, because AMD back in those days had a very broad range of proprietary products. They sold all of that stuff off, but there was a PLD group that did PALs. There was a networking group that built Ethernet chips and communication chips and RS-485 stuff.
There was a group that was doing video graphics chips
These are all 1980s-type stuff, right? So not, like, PC-type stuff, right; the PC wasn't really much on the horizon. It was basically monochrome monitors at that point, right? Right. So there was a graphics group.
There was a group doing disk controllers.
There was a group doing microcontrollers, sort of 8051 type stuff.
There was a memories group.
I used to joke with other people in my group that the memory group's product planners had the easiest job in the company: they needed to come in once a year, take a data sheet, mark it up by doubling all the numbers, yeah, throw it back into the thing, and then go away for a year. Right, right. So anyway, there were about 90 engineers in product planning. Almost all of them, not all, but almost all, had industry experience building systems. Okay. That is almost unique to AMD. Well, I think Intel has at least something similar,
in that their processor architects, I think,
have all worked on building motherboards, right?
Whereas AMD's, like the guys that ended up, you know, for instance, in AMD's product planning group for graphics chips
had worked somewhere else on graphic systems.
The disk controller guys had worked for some big disk manufacturer, right?
So the engineers in product planning at AMD
all had, not all, almost all had
actual domain-specific industry experience.
Right.
Usually when I'm being cynical (hard to believe I could be cynical) about chip companies in general, my common comment, and it's every time I see something stupid in some chip that I didn't design, right, is: these people couldn't design a board with two chips to save their life.
And I will say that I've seen that as true of most chip designers.
They go through the Carver Mead playing-with-rectangles path, with lambdas and whatever else. But actually building a system with 50 chips on a board, and making the clock tree work, and the race conditions work, and the logic work the first time, you know, actual system design? Most chip designers have made it through college and out with a degree saying, I know how to run Allegro or Cadence tools or whatever, and I can design you a billion-dollar chip. But give them, you know, an 8-bit microprocessor chip and an LED and an EPROM and ask them to put it together, and they have no idea which end of the soldering iron to pick up.
Right.
Right?
So that was certainly a difference with AMD's product planners.
At Xilinx, which is where I worked later, I was the first product planner, so I couldn't point to other people who, you know, didn't have that. But certainly the IC designers there, you know, I mean, they were great IC designers. But the common thing you see, and not just at AMD (I saw it at AMD, at Xilinx, I saw it at MMI, and I've seen it in data sheets from other companies), is chip designers will have a problem, and they'll push it out to the pins and say, the chip is perfect.
Right.
And they've left something that's now pushed out to the pins, right, which may be impossible to deal with. Right. I've certainly seen chips where, you know, literally there's stuff happening on the pins where there's no external way to make the chip work reliably. Right, right. Very sad. The products end up failing terribly, because everyone who tries to build with them either has to put enormous amounts of effort into working around the problem, or they don't spot the problem and they go into production, where 20% of the boards don't work, because the chip has this inherent bug and no amount of patch wires is going to make it suddenly better. Right, right. Interesting. Yeah, I think the analog that we have in the
software engineering world is you always want to work with product managers who
used to be engineers. That's frequently something we look for. Yeah, it's been a good rule to live by. Yeah. So anyway, the reason I talked about the fact that there were about 90 engineers in product planning at AMD was that there was a guiding light overall for AMD in that time frame, which was that AMD was building,
the products were building blocks, right?
So all of the bit slice was the building blocks
to build arbitrary size computers
from minicomputers and mini mainframes to mainframes.
The graphics chips were building block chips.
So there were things like chips that did texturing, chips that did the line rendering, right?
They were basically building blocks; different combinations might let you build different types of graphics systems.
And that permeated all of AMD's products in the 70s and the 80s. The guiding light within AMD's product planning group, which, you know, I was educated about from an early time after joining, was that the term was mechanisms, not policies. So create the mechanisms to let an end customer build the system he wants. Don't do something in the chip that sets a policy which he or she has no way of working around, or the chip is just not a good fit, because you've already decided on a policy that might not match what they need for their system, right? So when you look at all of
those types of products, that was kind of a guiding light. I mean, there were some products where,
you know, if it was a 2400 baud modem chip (AMD made a 2400 baud modem chip), then guess what, that one was policy from beginning to end, because there was a spec for what a 2400 baud modem chip does: these are the frequencies it has to transmit and receive on, these are the signal levels, here's where the serial comes in and out at RS-232 levels. Build a chip, right? Right. So there were some chips where, you know, policy was set by specs. But in general, it was that if you're doing building blocks, you build general functionality and flexibility, right?
And leave it to the end user to figure out
which combination of building blocks
is the best way to build his castle.
Right.
Awesome.
Well, like I said,
we're definitely going to have to have you back,
but this was an incredibly enjoyable conversation for me.
I hope you enjoyed it.
Awesome.
That's great to hear.
Well, Philip, thank you for joining us, and we'll look forward to next time having you on again.
Yeah.
Okay, Dan, thank you very, very much.