Advent of Computing - Episode 140 - Assembling Code

Episode Date: September 29, 2024

Programming, as a practice and study, has been steadily evolving for the past 70 or so years. Over the years, languages have become more sophisticated and user friendly. New tools have been developed that make programming easier and better. But what was that first step? When exactly did programmers start trying to improve their lot in life? It probably all started with assembly language. Well, probably…

Selected Sources:

https://albert.ias.edu/server/api/core/bitstreams/d47626a1-c739-4445-b0d7-cc3ef692d381/content - Coding for ARC

https://sci-hub.se/10.1088/0950-7671/26/12/301 - The EDSAC

http://bitsavers.informatik.uni-stuttgart.de/pdf//ibm/periodicals/Applied_Sci_Tech_Newsletter/Appl_Sci_Tech_Newsletter_10_Oct55.pdf - IBM Applied Sci Tech Newsletter

Transcript
Starting point is 00:00:00 The weather up here at the central office is a little... particular. I'm in that part of the Pacific Northwest that has mild winters, our warmest months are always in the fall, and we have more rainy and foggy days than we have blue skies. There are only a few roads in and out of the county I live in, and any good trail for some reason ends up being at the end of a pretty sketchy dirt road. For me, this means my life falls into pretty seasonal patterns. When the weather is good, I can basically live outside. But once winter actually sets in, I become something of a homebody. This is helped by the fact that once we hit the height of the
Starting point is 00:00:38 stormy season, roads to the outside world tend to get shut down, although to be fair that's become less and less frequent in recent years. One result of these cycles is that in winter I spend a lot more time programming, and invariably that leads me into long stints writing assembly language. Just about every January, sometimes a little earlier, I start a new assembly project. At one point, I did a full BF interpreter for DOS, I've done simple graphics packages, and most recently I was working on a new memory manager. I actually think I might get back to that one soon, it's been tickling my brain lately. Assembly language has always held my fascination. It's low enough down to the computer that you can actually get a feel for how the machine works.
Starting point is 00:01:30 At the same time, a good assembler gives you all these tools to work around the tedium of machine code. It turns programming, something I do as my day-to-day job, into this fascinating and grand puzzle. Every time I dip back into assembly, I end up finding some new way to solve problems, some new approach I'd never considered. And, well, we had the first cold day of the year here a few weeks ago. We haven't quite hit the hot fall yet, and it's been making me a little wistful, so I figure it's high time to go back to the source. It's about time we really talk about assembly language. Welcome back to Advent of Computing. I'm your host, Sean Haas, and this is episode 140,
Starting point is 00:02:25 Assembling Code. It's time to do one of those foundational comp sci kind of episodes. I bring up assembly language quite a bit on the show. It's a side character in many episodes, since it's hugely important to computing. That's especially true in the historical context. Not many people use assembly nowadays, but during earlier periods, it was a real force of nature. The thing is, I've never really explained assembly and its history in depth. This episode, I plan to fix that omission. There are two parts to this, a discussion of assembly languages, and then a discussion of where they came from.
Starting point is 00:03:04 I can tell half the story right off the top of my head. The other half is actually a bit of a mystery to me. Assembly languages are very simple in concept. An electronic digital computer can be instructed to carry out any number of different operations. A program is composed of a sequence of those operations. The computer only knows binary, ones and zeros. When we're talking about programs in binary, we call that machine code. That's not a very human-friendly type of code. Assembly languages put nice little mnemonics to those instructions. Instead of encoding the proper binary incantation
Starting point is 00:03:46 to add two numbers, you can just say add one and two. Then you have some program called an assembler that converts that nicer mnemonic code into binary for you. It's a simple solution to an obvious problem. At least, it seems simple looking back on it, and the problem seems very obvious in hindsight. But is that the case? Did people actually view machine code as hostile back in the day, or am I just ascribing my modern sensibilities to the past? The other complication here is, well, the simplicity of it all. The idea of assembly language is dead simple. This means there could be multiple starting points. I'm used to languages having these lineages, literal family
Starting point is 00:04:33 trees. But with assembly, that may not be the case. It may be so simple that it just gets reinvented whenever someone uses a computer unaided. In this episode, I want to untangle all of this. Does assembly have a starting point, a concrete beginning? Can it be considered a language in its own right, or is it just a collection of similarly looking tools that are unique to each computer that comes along? And why does assembly language appear in the first place? But first, I have an announcement corner that I need to go over. The first announcement is something that I don't know why I keep forgetting to say, but about a month and a half ago, I did a guest spot on Eastern Border again. This time I was talking about some networking news that was going on in Russia.
Starting point is 00:05:21 And Kristaps, I am sorry for not mentioning this sooner. I swear I didn't forget. It's just, sometimes the microphone does silly things to my neurons. Anyway, I highly recommend checking out the episode and checking out Eastern Border in general. It's a great podcast if you're interested in Soviet history, post-Soviet politics, or the war in Ukraine. You can find the show at theeasternborder.lv. Alright, the second announcement is my upcoming trip and the conflict with Spook Month. As some of you may know, I'm going to be gone for half of October. I finally have an honest-to-goodness vacation. The issue is, October is usually the only seasonal event I do on the podcast.
Starting point is 00:06:06 So to remedy that, I'm going a little bit off format. For the month of October, I'm going to make a collection of weekly short episodes, talking each in turn about something spooky in the history of computing. You should see one each week for that month. I'm doing that production so I can get a few in the tank for while I'm out of town. So anyway, if the feed looks like this, let's get into the episode itself. If you're freaked out, it's just going to be fun. We're just going to have a fun, spooky month. Assembly language is one of my favorite ways to use a computer.
Starting point is 00:06:46 Long-time listeners will know I'm a huge fan of x86 assembly language in particular. I've also done a little 6502, but I'm no expert there. But we're already hitting a point where I need to explain some things. So let's start at square one so we all know what's going on. So let's start at square one so we all know what's going on. As I said in the opening, assembly language just gives nice little mnemonics to machine code instructions. There's this one-to-one mapping between a line of assembly and a machine code operation. That means that each processor or computer has to have its own assembly language.
Starting point is 00:07:22 Assembly for the Intel 8086 will be radically different than that for the MOS 6502. That can make things a little tricky, since when I say assembly language, I don't really mean any one language. I mean a collection of these hardware-specific languages that have similar traits and features. That can be a little confusing, I know. Just look at it this way. Assembly languages are a category of languages. They're all tied closely to some specific computer. Within that category are some very similar ideas.
Starting point is 00:07:58 The core is that one-to-one translation thing. A line of assembly turns into an instruction. That's a feat shared by all assemblers. There are two other features that some assembly languages support. These both break from that one-to-one paradigm, but in very simple ways. The first is the concept of a label. These let a programmer label a chunk of code in their program with some symbolic name, just a string of characters. Then you can reference that name in other places. That should sound kind of boring, right? For labels to seem cool, you have to have some added context. An assembly language program, once turned into machine code,
Starting point is 00:08:37 is loaded into a computer's memory for execution. When it's in memory, each instruction is actually stored at some unique memory address. And memory addresses are numeric to the machine. Any instruction that deals with memory has to talk in terms of those numeric addresses. Let's take the jump instruction as an example, since I think every computer has an equivalent to jump. That instruction tells the computer to jump and start execution from some new address, as in jump to address 123. It's how you handle branching, or with some added stuff, how you implement subroutines and loops. The computer expects you to jump to a number,
Starting point is 00:09:20 since it only knows numbers. That works for a programmer, but it's not very human-friendly. Assembly language adds this concept of a label to help us poor, poor flesh folk. You can label a chunk of your program with a more readable name. You have a part of your program that prints to the screen, so you label it PRINT, usually by tradition in all caps. You just kind of have to roll with it. Then, any place in your program, you can jump to that part of your code by just writing JUMP PRINT. The assembler does all the bean counting and all the calculating needed to turn that label into a number. Once the computer gets the code, it just sees a normal jump to some number, but you, as the programmer, you just sees a normal jump to some number, but you as the programmer,
Starting point is 00:10:06 you get to use a nice name. The second added feature is the macro. As with labels, not all assembly languages have macros. These, to put them simply, let you define replacement rules. You have a chunk of code that always has to appear at the start of a function, let's say, and you get sick of writing it out every time. So you define a macro just called start function. Then every time you write a new function, you open it by typing start function. The assembler knows whenever it sees that specific name to replace it with the macro's body before assembling your code. Once everything is assembled and transformed into binary, the computer is none the wiser. Each function starts with a few instructions. Your assembly language code,
Starting point is 00:10:50 however, has this nice little trick that saves you some time and space. And that's it. That's the basic rundown. All assembly languages give you little mnemonic codes, most add in labels, and some have macros. Perhaps you're starting to see the shape of my issue. Assembly is so simple that it could be created almost by accident. So how do we try to build up a lineage? There is another fun issue for our study. Assembly is also something of a stepping stone on the path to more sophisticated languages, at least in a very rough manner of speaking. Assembly is the first jump from raw machine code to something more human-readable. In looking at programming's past, we run into a number of early languages that superficially look or act like assembly, but are actually an early step towards some other language.
Starting point is 00:11:47 IPL, for instance, can look like assembly or even machine code. In reality, however, it's a list processing language, like Lisp. It just happens to take on a very primitive form. So where should we even start searching? Well, luckily there is an often-cited origin point for assembly language, but I have some reservations that I want to discuss. I figure we can begin with the proposed story and work out from there. This takes us back to 1947 and a pair of proto-computer scientists named Andrew Booth and Kathleen Britton. At least, those were their names in 1947. A few years after 47, they would be married, which adds another little fun quirk to the paper trail. Some papers are authored by Booth and Britton, some by Britton, some by Booth,
Starting point is 00:12:39 and some by Booth and Booth. Really, really good for maintaining a good document trail, right? Just after the war, the two were working at Birkbeck College, part of the University of London. They also did some side work for the British Rubber Producers Research Association in the same period. It was while working through some kind of rubber experiment that Andrew struck upon an idea. Why not build an automatic calculator? It would save a lot of time and it would be a nice little project. In short order, he drafted up plans for ARC, the Automatic Relay Calculator. Now, there is scarce sourcing around ARC.
Starting point is 00:13:26 Now, there is scarce sourcing around ARC. This isn't helped by the fact that it's alternatively referred to as a calculator and a computer in some cases. ARC was built during this short transitionary period in the middle of the 40s. Computers were just starting to exist, and the idea of a stored program computer was out there, but there was yet to be an example of this type of machine. ARC was initially planned and built out to use a magnetic drum for memory, run calculations through a parallel circuit, and be configurable via a plugboard. Inputs and outputs would come by way of a paper tape. That would make it something like a souped-up calculator, or maybe a souped-up
Starting point is 00:14:05 tabulator, but development actually came in fits and starts. In 1947, Andrew and Kathleen made a trip across the Atlantic to Princeton, New Jersey, to meet, who else, but Johnny von Neumann. This is the inciting incident that changed the fate of the Booth's computers. They would both become devotees of the so-called von Neumann architecture. From what I gather, the majority of ARC was built after this encounter with Johnny, but the relay machine would stick with older practices. The next computer that Andrew Booth designed, SEC, would be a stored program machine. It used a single memory space for both code and data, so it was a true von Neumann architecture computer. The important piece here is that SEC could actually be programmed. It had somewhere that code could
Starting point is 00:14:57 be loaded and then executed from. So where do you think folk point to for the origin of assembly in this story? It's gotta be SEC, right? Well, no, that's too obvious. They point to a 1947 paper titled Coding for ARC. Confused? So am I. From everything I've read, including an article written by Andrew Booth himself, ARC was not a stored program computer. We don't get that until SEC.
Starting point is 00:15:33 But there could have been a bridge between ARC and SEC. According to some sources, there was another machine called ARC-2, which was a redesign of ARC that used a von Neumann architecture. That wouldn't have happened until late 47, meaning our coding paper is probably talking about ARC2. The reason for the confusion is simple. ARC2 wasn't an official name. Andrew doesn't mention ARC2 in his papers. We just get ARC and SEC. But the only way for this paper to make sense is if it's talking about some intermediary computer with a bad naming convention. So, ARC2 it is, I guess. That also explains something strange about the coding for ARC paper. When you read it, it seems a little vague about some pretty important details.
Starting point is 00:16:27 The paper describes a way to notate code for arc. The details here are all pure von Neumann architecture through and through. It addresses memory as a place for both code and data. It talks about code living in the same address space as useful numbers. So far, so good. Very normal. The notation here isn't quite what I'd expect from assembly language, but these are early days. Each line of code presents some expression with an operation, followed by a destination for the result. So addition would read roughly as do the accumulator plus two, store it in the accumulator. But we'll get back to that. This is where the vagueness starts to come in. The paper has no full listing of instructions. We don't get a list of operations that are available to us. Rather, the paper discusses
Starting point is 00:17:18 a few possible operations as they pertain to larger programs. Once we reach a section on multiplication, things get a little more clear. The paper goes through how addition, subtraction, and a left and right shift can be used to write a program to multiply two numbers. The paper then says, in many more words but I'm paraphrasing, that you shouldn't really write a program this way, and instead the computer should have a built-in multiplication operation. It does the same thing with conditional jumps. The implication here, at least if I'm reading between the lines right, is that ARC isn't actually running code in 1947. Andrew and Kathleen are still trying to nail down what the computer should do. So this first option is an assembly language for a computer that doesn't yet exist. Maybe that's a little too theoretical for your tastes. Now, you might say, why not just
Starting point is 00:18:21 reach for some paper on SEC to get the full instructions set? Sadly, we don't get that much detail on SEC either. We get mentions of its existence and some details about its construction. Besides that, we're kind of left in the dark. We get this arc coding paper and not really that much else. Let's get down to the brass tacks, though. What does this paper actually present in terms of a language? That is, of course, assuming that all the hardware is there and working in the right ways.
Starting point is 00:18:53 Simply put, we do get an assembly language of a sort. At least it meets the most basic requirement. Each line of arc code corresponds to one instruction. However, the language is pretty basic and a little strange. In most ladder assembly languages, a line is structured as an operation followed by optional arguments. Something like, move to AX. That would say, move the number 2 into the register AX. Or, add 1 AX says add 1 to the register AX,
Starting point is 00:19:27 that kind of thing. ARC's code doesn't follow that format. Here, each line is written as a mathematical expression, followed by where you want the results stored. So you get lines like 1 plus 1 to A, that's T-O. It actually uses the word to. This is a little confusing if you're coming to Arc from more modern assembly languages. In my beloved x86 assembly, mathematics are all their own operations, and storage is its own operation. A line like 1 plus 1 to a is saying to add 1 and 1 and then store that in the a register. On an 8086 or any more modern computer, this would be two instructions.
Starting point is 00:20:16 You'd have 1 to put a value of 1 into the register, and then you'd have a second where you add 1 to that register. This may initially feel like a different language, but that's not the case. This is an architectural difference. Once again, it's hard to compare assembly languages. It turns into comparing computers more than anything. Arc, whatever version is prophesied in this paper, must have just worked this way. It was set up to perform math and then store results wherever you wanted. I think in theory, and this is making some educated guesses, this means that the actual machine instructions on this theoretical arc
Starting point is 00:21:00 would be an operation in operands followed by a storage location. So the computer would be reading an operation followed by where to store those results. Once again, this is actually a one-to-one mapping. Things get more complex as we look at memory, though. One of the fun little rabbit holes in assembly language, and by proxy in computer architecture is this idea of addressing modes. That is, how can a computer discuss memory? Modern chips provide a truly staggering array of options. The x86 architecture has 17 modes for memory addressing. Usually, different modes have different rules about how and when you can use them, so you end up with this little flowchart in your head for working out how you're supposed to be talking
Starting point is 00:21:51 about memory in different contexts. That has to somehow be represented in your assembly language. For Arc, we get two modes, which, for lack of an actual name in the text, I'm going to just call direct mode and plus mode. In direct mode, you're just addressing memory directly at some address. The code represents that with a capital M for memory, then the address in parentheses. So M of N would be the word of memory at location n. Plus mode is a variant on direct mode. You could say m at n plus 1 to get a word at address n plus 1. There are also some spots where we see things like just plus 1 used as an address, which would mean to take the address of this line of code and add one. So it's something like a relative address, I guess. That's pretty close to modern assemblers,
Starting point is 00:22:52 just a little more verbose. You don't really write out the M in modern assembly, but you do have to use something like parentheses or square braces to tell the assembler, hey, I'm talking about memory, not a number. You even have plus mode in most modern assemblers, or something similar. So right now, that's fine. This is actually pretty close to what I do today. Where this gets weird is how the ARC paper handles symbolic memory addresses. The paper is mainly example code, so we kind of have to guess here. There are some spots where we see m of 10, or some literal number. That's very straightforward. There are other spots where we get m of k, or some other symbol in those parentheses.
Starting point is 00:23:43 or some other symbol in those parentheses. Crucially, these aren't referencing registers or some other variable. I know this because whenever the paper uses symbols inside M, they're always lowercase, and when the paper uses uppercase symbols, those are to talk about registers. This seems like a case of half-labels, if that's even a phrase I'm allowed to use. Let me explain.
Starting point is 00:24:08 In this paper, example programs start with an equation and then show how to translate it into assembly language for ARC. Those equations use lowercase letters for their variables. So there's kind of this built-in assumption that the lowercase variables are the same as the symbols used for memory addressing. So when you see something that says, oh, I want to get address k, that just means to point at some spot in memory that we've decided to use to store this variable named k. The half part is the language doesn't define where that label points to. We don't get a line that says, hey, K is stored at location 12 or anything like that.
Starting point is 00:24:50 In other assemblers, you would have to explicitly declare a new kind of label, give it a name and give it a location. But here, we just get the use of a label without definition. That's why I'm calling it a half label. I think this interpretation jives with the rest of the paper. There is a section where the process of coding is explained. It starts by describing how to draft a flowchart, the industry classic, then explains the next steps like this. Quote, when the flow diagram is complete, the next stage is to write out the detailed code which will contain 1. A full list of all control orders 2. Detailed memory locations for orders
Starting point is 00:25:32 3. Detailed memory locations for numbers and for any transient storage required during the computation End quote. I think this is where label addresses would be dealt with. You would have to manually figure out where the label should point. This is also a fun reminder of how early we are. This would have all been assembled by hand. You don't have some other program that reads your assembly code and spits out a binary. Oh no, nothing so fancy yet exists. You work with it all yourself, pen and paper. In that sense, ARC's assembly language is... I struggle to call it a development tool. It's more like a clerical process, not really a full language. You draw out a flowchart, then you
Starting point is 00:26:19 use that to write your symbolic code. You then have to work out the address for each instruction and what address each label needs to reference. From there, you manually transform that into binary for Arc to read. There is one big implication with this whole half-label thing. Since labels aren't explicit, you can't actually jump to a label. Really, the label is just a small tool used when writing out symbolic code. But you don't have labels in your code, so all jumps or control transfers, as the paper calls them, have to be to explicit memory locations. This
Starting point is 00:27:00 is another place where the paper is vague. Some example programs use line numbers that would then, presumably, be translated into memory addresses. Other programs just use memory addresses on their own. Either way, the half-implementation of labels makes the language look particularly archaic. One of the big reasons to use an assembler comes down to targeting jumps and calls. At least, that's one of my favorite things with an assembler. I'm gonna get a little extra technical here, and forgive me, but this is a really important feature. It's very common to do a conditional jump. As in, if some condition is true, then jump to this part of the program. Otherwise, just ignore this and keep going. So let's say you have a program that's worked up
Starting point is 00:27:51 to run a loop. You have a register set up as a countdown, so each iteration of the loop ends in something like, decrement the count register. If the register is not zero, then jump to the beginning of the loop. Otherwise, keep going or something. I don't know. If I wrote that, I'd use a label for the start of the loop. Usually I'd name it loop because I'm a bad programmer and I have no imagination for this. The specific instruction here on the 8086 is JNZ, jump if not zero. So you end up with a line that literally will read JNZ loop. Jump if not zero to loop. As a programmer, that's all you need to think about.
Starting point is 00:28:38 That's the end of my thought process. The assembler, the translation program, pulls a really nice trick for you here. As we already know, it replaces the symbol loop, that label, with the correct memory location. In the code, that's just a line that says loop, usually with a colon or some kind of syntax around it. Really simple. The nuance here is that the location of loop can actually
Starting point is 00:29:05 change around as you work on your program. If you add code before you define loop, then the address of loop in the final program will be larger. If you remove code before that address then the address goes lower, or you could even move the entire loop altogether. Without an explicit label, you, as the programmer, have to take all these changes into consideration. You would have to calculate, by hand, just where loop actually is in memory. That isn't just annoying, it introduces a whole class of bugs into your code. This is what Grace Hopper means when she talks about eliminating bugs by making programmers do less. For a first outing, Arc's code is pretty neat, but it's clear to see how early the discipline is.
Starting point is 00:29:56 It's definitely an assembly language, it's just that it's the most bare minimum set of features. It's for a theoretical machine, maybe. And it's all handled by hand. Where do things progress from here? Well, it's not necessarily a straight line. ARC code doesn't get adopted and adapted to other machines. Instead, it seems like different groups of computer users just had different ideas about how to make programming possible for humans. We start seeing something closer to traditional mnemonics as early as 1949. By traditional mnemonics, I mean characters or groups of characters that represent operations.
Starting point is 00:30:42 Arc starts off a little strange because it uses expressions to represent machine code instructions. That doesn't form a trend, however. ARC is the outlier. In 1949, EDZAC begins operation. This is one of the earliest stored program computers. No hand-waving or theory involved.
Starting point is 00:31:02 EDZAC, however you want to pronounce it, has memory that you actually load code into, and data into. It runs instructions encoded in binary. It has some quirks, but in general, we're looking at a recognizable machine. How does assembly factor in? Well, it turns out EDZAC had an assembler. factor in. Well, it turns out EDZAC had an assembler. Kind of. This gets weird less because of weird design and more because of antiquity. EDZAC's primitive assembler was called the Initial Orders. Does that sound exciting enough? I've said before that a computer is not useful on its own. It needs software in order to do anything. That's always been the case. I think this is put best by some old UNIVAC manuals I recently read. You need software to make a computer, quote, do your bidding. If you just turn on a computer without
Starting point is 00:31:57 any code, it doesn't do anything. Most machines are designed to start looking for software to run as soon as they're flipped on, and in the absence of that software, it just keeps looking. Different machines will start that search in different ways. EDSAC specifically tries to execute whatever data was stored as the first address in memory, and then moves forward from there. If there wasn't anything in memory, the machine would do, well, you should know by now, nothing. So in order to do anything with EDSAC, some initial program had to be loaded into memory. That was what the initial orders were. Turn on, do this thing, and then keep going. How were the initial orders an assembler? Here, I'm working off the 1949 paper The EDSAC,
Starting point is 00:32:47 an electronic calculating machine by Wilkes and Renwick. The paper is one of the first to fully describe EDSAC publicly, and gives us an early glimpse into the initial orders. One thing to note, and a point that is expanded on in the paper, is that the initial orders weren't meant as some unchanging pillar. They were meant to be expanded and swapped out over time. It would be possible for this boot-up code to change as needs and research progressed. We're just going to be looking at the initial orders as they stood in this 1949 telling. This program functioned like a type of ROM, a read-only memory. Remember,
Starting point is 00:33:27 early days, so there is some weirdness. The initial orders were hand-wired into a rotating drum. At the flip of a switch, that drum would spin and the initial orders would be read into memory. Then the program could actually take over. It starts by attempting to read the paper tape. The orders expect tape to be formatted in a very specific way. It has to start with a number which specifies how many words of data to read from that tape into memory. That's followed by code, which is very close to assembly language. To quote the paper,
Starting point is 00:34:04 First, there is a letter which defines the function of the order. Next, a group of one to four figures representing the numerical part in decimal form. And finally, the letter S or L, indicating that the number to be transferred to or from the store is short or long. End quote. transferred to or from the store is short or long. End quote. Store here is just EDSAC phraseology for memory, and short or long is just in reference to the number's size. What we have are human-readable letters followed by arguments. You're able to feed human-readable code directly into EDSAC as soon as it starts booting up. From there, the initial orders convert this code into machine code. The program, quote, converts the numerical part to binary form and assembles the order with the functional part
Starting point is 00:34:57 and the numerical part in the correct relative positions, end quote. We even get the word assembles. It does an operation to convert between the letter it reads and the binary operation code EDZAC can understand. It turns integers into binary, and everything is packed up, assembled, in just the right way. That sounds an awful lot like an assembler to me. The paper even mentions one possible expansion to the initial orders. It claims it would be possible to add frequently used routines to the orders, and then call them up by name. In other words, labels for some library of functions. For 1949, that is shockingly sophisticated stuff.
Starting point is 00:35:44 But let's hold on for a minute. How user-friendly was this? And what should we even call this initial order's code? The documents I've read call it symbolic code, or the symbolic form of a program. I think of it as the symbolic form myself. A program can take on many different forms, and an assembler is really just translating between the symbolic form and the binary form, so that makes good sense. That's pretty modern and reasonable phraseology. From what I've seen, programming EDZAC would have
Starting point is 00:36:19 been a little more reasonable than talking straight binary. An add operation is just the letter A, for instance. But that's not the whole picture. We get some mnemonic codes that line up really well. S for subtract, L and R for left and right shift. Then we get things like V for multiply, or G for branch if negative. I to read characters from the tape, it's all a little messy, but it's kind of what you get when you only have one character to work with, so that can be forgiven. But here's where it starts to get even more messy. You see, assembly on EDSAC is actually a trick. It's all a sleight of hand. To be clear, I'm not throwing shade at EDZAC. This is a cool trick, and it needs to be considered when looking at its symbolic code in the context of assembly.
Starting point is 00:37:11 On EDZAC, opcode numbers, the actual binary number used to signal different instructions to the computer, were picked so that they'd line up with specific character encoding. were picked so that they'd line up with specific character encoding. A is add because, to EDZAC, the character encoding to print an A is the same as the binary encoding for the add operation. Both are 111000. If you told EDZAC to print out an ADD instruction, the printer would read A. That means that the initial orders don't actually translate operations. An A is read from take and put into memory.
Starting point is 00:37:59 Since data and code are treated the same by EDZAC, it sees an A, knows that's the opcode for ADD, and just does it. The initial orders are just composing arguments and doing a little formatting between integer and binary numbers. While that may deflate some sales, keep the context in mind here. This is still a step towards more automation. The previous assembly language for ARC was a full manual affair. The initial orders still automate away some of the process of programming. A few wise choices in EDZAK's design made programming more human-friendly, at least to a point, right? Now, EDZAK wasn't the only machine to pull this character-to-opcode trick. UNIVAC-1, the purported first commercial computer, did the same thing. On Univac,
Starting point is 00:38:48 opcodes line up with character codes, and A is add, S is subtract, all that jazz. The scheme here is very similar, albeit without the load program that EDZAC relied on. As for timing, well, that's an interesting story. UNIVAC was one of those machines caught up in development hell. To understand this, we need to go back to ENIAC. That machine, developed by J. Presper Eckert and John Moutley and a pile of other researchers, was a compromise machine. The team had wanted to create a computer that had things like memory and could store a program, but they had deadlines and budgeting to think about, so it was decided to create a more
Starting point is 00:39:31 low-tech machine that would be done sooner. Behind the scenes, the ENIAC team kept hammering out details for a better computer. They would call this theoretical computer EDVAC. In 1946, a draft report on their work, written by John von Neumann, is leaked by someone in the lab. We don't know who exactly leaked the paper, but there are theories. We do know that it was none of the three Johns. This ends up being where the idea of a stored program computer picks up steam. The draft, and copies of the draft, spread around the world. They end up inspiring the Booths to make a better machine. It directly inspires the team that creates EDZAC.
Starting point is 00:40:15 This ends up becoming one of the foundational papers in these dark arts. That same year, 1946, two Johns spin off their own company, the Eckert-Mouchley Computer Company. The business plan is to make and sell stored program computers. These would be pretty explicitly EDVAC-style machines. Early documents even call their first proposed computer EDVAC-2. The plan, however, turns out to be nigh impossible to pull off. External forces also conspire to make things really hard on EMCC. The first EDVAC-2, now renamed UNIVAC, isn't completed until 1951, but its architecture was, in large part, based off research done for EDVAC. I mean,
Starting point is 00:41:06 in large part based off research done for EDVAC. I mean, it was designed and built by some of the same people that designed EDVAC, so there is a bit of a lineage here. In other words, I don't think EDZAC inspired UNIVAC's coding, rather there's this shared lineage. And what's the root of that lineage? Well, it's somewhere inside the minds of all the Johns and their co-workers, but we do get a little window into their world. The EDVAC draft report doesn't talk about assembly or encoding characters and operands, but there is a kernel of something in it. The report explains that a computer should carry out a series of instructions. Those instructions will need to be encoded as binary for the computer to understand them. Von Neumann gives a rundown of instructions, some suggestions for binary encoding, and he gives
Starting point is 00:41:52 each instruction a so-called short symbol, to quote, the short symbol to be used in verbal or written discussions of the code, and in particular, in all further analysis of this paper, and when setting up problems for the device." Right at the beginning, we get this recommendation that binary instructions need a human-readable form. The symbols von Neumann used are a little abstract. He describes a computer in terms of organs that each had a single letter identifier. C for the control organ, A for arithmetic, and so on. He used arrows to show sequences of moving data. There would be a one-to-one equivalence, but it wouldn't be super clear. So is this the Wellspring we've been searching for? Maybe.
Starting point is 00:42:47 This is at the very start of stored program computing. It suggests this one-to-one human to machine equivalence would be necessary. It has weird little mnemonics, but this is still the bare minimum for assembly language. Let's spice stuff up a little bit. After all, we deserve a little extra. What's been missing from this episode so far? Well, I'll tell you. It's a three-letter word we should all know and love. IBM. Ever onward. And for the next part of our story, we are moving onward to the 1950s. By the middle of the 50s, computing has a slightly better footing. We've reached actual mass production, culture is starting to form, and programming has become more of a discipline. Still a long way to go, but we're getting there. It's out of this period that we see one of the first widely used symbolic assemblers.
Starting point is 00:43:47 It's called SAP, which can either stand for Symbolic Assembly Program or Share Assembly Program. This was the result of a collaborative effort with Share, an IBM user group, and, in fact, the first computer user group. What exactly do I mean by symbolic here? To put it simply, we're talking about labels still, sometimes termed symbols or mnemonic labels. The idea is that you, as a programmer, give a human-readable name to some location in
Starting point is 00:44:18 memory. It could be the location of the start of a loop, or it could be where you plan to store some data later on. Once that's set up, you can reference the label by name. In other words, you can speak symbolically instead of numerically. This is something that programmers were already doing by hand. Univax manuals even suggest creating these tables of symbols to reference when hand-converting code for the computer. Funnily enough, this antiquated process is actually what's called coding. Maybe think of that whenever someone says they're coding or they're a coder. Do they actually mean
Starting point is 00:44:57 programming in a real high-level language? Or are they stooped over a keypunch converting numbers to binary? A big missing piece early on was automation. Why would programmers want to give up this lovable pastime of encoding ones and zeros? Here, I'm going to actually pull a period argument from an IBM technical newsletter in 1955. This is out of a paper called Symbolic Coding and Assembly for the IBM Type 650. Here, the author is discussing organizations involved in software development. Quote, Coding in actual machine language presents many serious objections for organizations of the second type above. Among these are the following. 1. Changes are difficult to make. 2. Portions of the coding cannot be easily relocated in memory. 3. It is actual coding and as such represents
Starting point is 00:45:56 a compromise between feasible machine design and programming requirements. Since, of course, machine design gets more than programming does from the compromise, actual coding is not an effective means of programming. End quote. The first gripe here is simple enough. Humans don't really like to read binary. It's not a very fleshy language. The second part is more subtle. When you write a program in machine code, you have to reference memory by address, at least in some capacity. You say you want to read data from address 123, or you want to store data in address 345. That on its own doesn't affect relocation that much. As long as you're referencing data, well, you're a-OK. Just assume that those chunks of data are always free for you to use.
Starting point is 00:46:48 But jumps, that's another matter. You say jump to location 456, which you know will be part of your program, as long as your program is loaded into the right spot in memory. If that program gets moved, all bets are off. The computer will dutifully jump anywhere you tell it to. It doesn't care if your code's been moved out from under it. It will jump right off the digital cliff. Older programming practices make relocation even more dangerous. The reason is self-modification. Older computers weren't all that capable. They didn't have much memory space, they didn't have many instructions,
Starting point is 00:47:26 and the instructions they did have weren't the most flexible. To get around this, many programmers wrote self-modifying code. As the program ran, it would rewrite part of itself. This could change the target of jumps on the fly to emulate more complex logic. It was used to implement indexing on machines that didn't have indexed memory modes. It could do all kinds of really wild and powerful tricks. Self-modification in this period wasn't just a hack, either. It was actually recommended by manufacturers. It was expected behavior.
Starting point is 00:48:02 And, crucially, the program was still living in memory. You had to reference addresses to modify them, so in practice, just about any program would have weird and unpredictable issues if it was relocated. There are a few ways to make relocation possible, though. One is to use virtual memory or segmentation, which wouldn't really be a thing for about a decade still. Another is to use relative addressing for everything. Instead of referencing memory by a fixed number, you say, go x bytes ahead of me or x bytes before me. If you only use relative addresses, then your code can be relocated just fine. However, the computer has to support a relative addressing mode.
Starting point is 00:48:51 That's a common feature today, but not all computers in the 50s had relative addressing. Thus, we have to turn to something, well, almost sophisticated. There's this whole class of complex and highly engineered solutions to limitations of computers. I've personally written a lot of code that falls into that category. In theory, you could use labels to make code relocatable. That's what the paper I cited earlier, Symbolic Coding and Assembly for the IBM Type 650, proposes. The idea is somewhat simple. You write your code using labels instead of addresses. Then you pass it through the assembler that converts labels into addresses. The assembler works up a mapping between those symbolic labels and numeric
Starting point is 00:49:40 locations in memory. You can also instruct the assembler where in memory you want to load the program. That's then taken into account during its calculations. That way, if you need to suddenly use a different region of memory, all you'd have to do is rerun the assembler with some different inputs and get a new binary. This isn't the only paper on the subject. By the mid-50s, there's a lot of work to this end. That 650 paper appears side by side with a number of different papers, all taking different passes at relative addressing tricks, labeling, automatic coding schemes, and proposed assembly languages. It appears that there was a huge problem with hand-coding machine code. But there wasn't a good solution. It just wasn't a solved problem yet. There were some solutions, but none had
Starting point is 00:50:30 been widely adopted. But there was one waiting in the wing. For this part of the story, I'm using the fabulous book IBM's Early Computers from MIT Press. It's a bit of a tome, but it's a great reference for this era. In 1955, a team inside IBM started work on a project called SOAP, the Symbolic Optimal Assembly Program. This would be one of the first symbolic assemblers, and it would run on the IBM 650. SOAP pulled three tricks in one program. The first was mnemonic assembly language, the by now familiar one-to-one mnemonic-to-machine code conversion. The second was symbology. This part actually looked pretty modern. You could start a line with a label, composed of normal, alphabetic characters. Then you have your usual line of code.
Starting point is 00:51:27 Forevermore, that label refers to that location in memory. The assembler takes care of switching between labels and addresses. It does this by keeping a table of labels used during assembly. In that way, it's literally just an automation of the hand-coding methods that were already in practice. The third trick is something peculiar to the IBM 650, memory optimization. And I don't mean optimizing for less memory use. The 650 didn't use random access memory. It used a drum.
Starting point is 00:51:59 Data was stored on a spinning magnetic drum. This meant that access speed of an address depended on where the drum currently was in its rotation. This quirk was paired with a little quirk of the 650s instruction format. Each instruction actually ended with the address of the next instruction to execute. Many drum machines worked this way back in the day. You're basically telling the computer to do such and such an operation, and then to jump to some address for the next instruction. In practice, this could be used to optimize for memory speed. You know how long an addition takes.
Starting point is 00:52:39 You know how fast the drum spins. So you can work out the best place for the next instruction. That's complicated and annoying, but it's just a result of these early memory systems. Honest to goodness, RAM just isn't there yet, and when it does arrive, it's very much revolutionary. But until then, we just get some weird stuff going on. Due to drums, programmers had to also be thinking about how to optimize the layout of their program. That's yet another hassle to deal with that's not actually programming. SOAP had routines to automatically optimize code placement on the magnetic drum. This was possible because those placements were all well-defined patterns.
Starting point is 00:53:26 You just have to do some calculations and you can spit out the right path to go down. SOAP would do that automatically for the poor programmer. That, altogether, makes for a somewhat modern-looking assembly language that was uniquely tailored to its platform. And make no mistakes, was uniquely tailored to its platform. And make no mistakes, SOAP was cutting edge for the time. IBM's early computers explains that SOAP was only possible because that specific 650 being used at this one lab in IBM had been recently outfitted with a wondrous new device. The upgrade in question was the alphabetic device, a circuit that allowed the computer to read every character of the alphabet. Truly staggering that that would be an upgrade,
Starting point is 00:54:16 but that was the state of things. Early Computers further explains that SOAP was intended as an internal tool to help IBM developers. It would quickly break containment and spread outside to almost every 650 installation as alphabetic devices became more available. What's interesting here is that SOAP was actually a huge boon to productivity. There's this memo that's cited in early computers that I wish I could find. It's called The Programmer as Clerk, and apparently was an explanation of how hard it was to program without SOAP. I just want to read that so bad. Programs like SOAP
Starting point is 00:54:58 automate away so much tedious work. Hand assembling, handling labels, and optimizing addresses were all long and error-prone facts of life. Programmers the world over were trying to move past that tedium. SOAP was one of the first full solutions. And just speaking from my own experience, this kind of automation is crucial. It gives you more time to actually program, since you have less overhead to work with. I usually call these types of projects investments, at least that's how I justify them to management at work. You're investing time now to save time later. Sure, it may take a lot of work to develop a new assembler, but in the long run, it will make your job much more efficient. Freeing up time causes this feedback loop where you're able to take on projects that
Starting point is 00:55:47 you couldn't even dream of before. The net result is a huge gain in productivity. Share was founded in 1955 as a way to facilitate this whole investment cycle. The goal of the organization was to bring IBM users together to collaborate on larger projects and to share information and software. By spreading out investment work, programming tools, and the like, share could make investments pay off even more. If you spend a few months writing a new tool to use inside your company, well, that can only go so far. That only benefits you and your
Starting point is 00:56:23 co-workers. Under the share model, that same work could be spread out over multiple organizations, and multiple companies could reap the benefit. That makes the cost-to-benefit ratio go way up. The investment is much more valuable. But that requires coordination. In 55, during some of the first share meetings, the topic of a symbolic assembler came up. The time frame here was crucial. IBM's new 704 had just come out the year prior. Share was actually initially founded just for this new computer.
Starting point is 00:56:56 But since it was so new, there were very few tools. There wasn't one standard assembler, for instance. That doesn't mean there were no assemblers floating around. Share members were working on a number of different tools for assembling, coding, and optimizing, but none of them really did it all. The problem here is the model isn't quite up and running yet. Investment isn't being socialized, but it was about to be. Over the first and second share meetings, the issue was discussed. By the end of the second, members had landed on one project, an assembler being developed by Roy
Starting point is 00:57:32 Nutt. That assembler was SAP, a fully symbolic assembler in the mold of SOAP. The features of SAP itself were chosen, in part, by share members. During the first meeting, they had worked up a list of features that any standard assembler would need. By the second meeting, it was discovered that SAP met all those requirements. Among those was something neat that I just want to address on a technical level. SAP introduced a fake kind of addressing mode. This new assembler allowed a programmer to compose an address by using a mathematical expression. You could reference a label and then modify the value, say loop plus
Starting point is 00:58:11 one to point to the instruction after the start of your loop. The final value would be calculated by sap, so the computer would just see a number. But to the programmer, this offered another way to handle memory. This kind of rotation is wildly useful for, say, structured data. Best of all, it's a pretty simple feature to implement inside an assembler. And to the programmer, it would look like suddenly your computer has a new addressing mode. That's just a really interesting and powerful feature. That's not the only cool new feature, so check this out. SAP could create two different kinds of outputs, absolute and relocatable. Remember the whole memory relocation issue we discussed? Well, SAP had that solved. When you loaded the assembler, you could ask it to create a normal absolute binary.
Starting point is 00:59:07 In that mode, it would use absolute addresses for everything, concrete numbers pointing at memory. Alternatively, you could switch it into relative mode. The 704 supported relative addressing, at least, sort of. There were enough memory features that you could work up relative addresses. This meant SAP could create fully relocatable binaries, programs that could be executed from any location in memory. Once again, it's automating away extra work and considerations for the programmer. The final big feature of SAP is less part of the program itself and more an
Starting point is 00:59:43 added value of share. That is, the subroutine library. This is one of those really cool confluences of features. The 704 supported subroutines. It's a little clunky compared to later processors. There isn't a call stack, if that means anything to you. But you could make a call to a chunk of code and expect it to return when done. Of course, the 704 expected you to call an address. SAP made it possible to call a subroutine by a label, which made code much more readable. Then we have the library. Programming libraries weren't anything new. Programmers had been bundling up and reusing handy code since 1947, at least. Maybe. It's at least as old as programming itself.
Starting point is 01:00:31 The principle here is simple. It's another one of those investment things. Let's say you've worked up a slick little routine that you use all the time. The usual example in these early days is either a square root function or some kind of trig routine. So you take that code and bundle it up for later use. In the early days, this could have been as simple as saving it on some punch cards and putting it in a box that you could later pull out. It evolved into creating these shared binaries. Basically, you could load all these useful routines into memory and then consult a table
Starting point is 01:01:05 of addresses whenever you needed to call them. Other programmers could also use those routines, thus saving time and effort. Investment returned. The share library kicked things up a notch. It was a collection of these useful routines created by share members. Better still, it could be used inside SAP. These routines were actually named. You could actually call up sine or cosine without breaking out an addressing table or hand-coding anything. That right there is some real power. Alright, we've reached the end of this episode. I think this leaves us with an interesting conclusion. It seems that assembly language doesn't really have a hard and fast origin.
Starting point is 01:01:55 Allow me to explain my thinking, and allow me to admit I am open to counter-arguments. We see technology similar to assembly language showing up at the same time as stored program computers. That means that, in some capacity, assembly has always existed side-by-side with machine code. The language developed along with the development of stored program computers. However, not all computers had fully-fledged assembler programs. Even in the absence of assemblers, programmers were still using mnemonic codes. There were entire procedures to convert between flowcharts, mnemonic code, and machine code, all by hand. In that sense, assembly language isn't some revolutionary idea, but instead it's an automation of an existing practice.
Starting point is 01:02:46 idea, but instead it's an automation of an existing practice. Where it appears, it's something like a formalization of a cultural practice. That's the best way I can think to describe it. It's a tool that allows the programmer to be less of a clerk and more of a programmer. And oh man, I do really want to find that memo one day. Thanks for listening to Advent of Computing. I'll be back in, well, I guess in a week, with a spooky episode to kick off October. And hey, if you like the show, there are a few ways you can support it. If you know someone else who'd be interested in the history of computing, then take a minute to share the podcast with them. You can also rate and review the show on Apple Podcasts and on Spotify.
Starting point is 01:03:25 If you want to support the show directly, you can do so by buying Advent of Computing merch or becoming a patron on Patreon. Patrons get early access to episodes, polls for the direction of the show, and bonus content. You can find links to everything at my website, adventofcomputing.com.
Starting point is 01:03:42 And as always, have a great rest of your day.
