Advent of Computing - Episode 96 - What Exactly IS A Database? Part I
Episode Date: November 28, 2022

I've fallen into a bit of a data rabbit hole, and you get to join me. In this episode I'm starting my journey to understand where databases came from, and how they started to evolve. This will serve as a foundation for next episode, when we will dive into one of the most popular databases from the 1970s: SQL. Along the way we wrestle with GE, the realities of the Apollo Program, and try to figure out what a database really is.

Selected Sources:

https://sci-hub.se/10.1109/MAHC.2009.110 - A history of IDS

https://archive.org/details/TNM_Integrated_Data_Store_introduction_-_General__20171014_0141 - Learn IDS for yourself!

https://archive.org/details/bitsavers_ibm360imsRGuide1969_8480205/page/n6/mode/2up - Educational guide to IBM's IMS
Transcript
There are a handful of pretty technical terms that have crossed over into the mainstream.
Someone might casually refer to a mainframe as in, oh, they're breaking into the mainframe.
Folk talk about networks all the time, or at least the word shows up despite being a very abstract and complicated idea.
And databases, well, those even get mentioned on the nightly news. It seems like
every other week some database somewhere is getting compromised by someone, whatever that
actually means. There's good reason for this last example. Databases really make modern computing
work. You listening to this podcast is possible because somewhere there's a database that
lists shows you can listen to. There's some stored structure that has every episode of Advent of
Computing neatly laid out. When you download an episode, it gets added to a smaller database on
your phone or computer that lists all the episodes you can play locally. You might even post about
this episode online saying how impressed you are,
and that would actually get fired off and stored in some database or another somewhere.
But what exactly is a database? The short answer is it's a place where you store data.
That's succinct, but doesn't really differentiate it from, say, a filing cabinet.
A database is much more complicated and much more sophisticated than just a pile of data.
I've actually worked with databases for most of my professional career. Most days I kind of live
inside a database. I think this makes me more than qualified to at least explain what a
database is. That said, there's one major gap in my knowledge here. I don't really know the origins
of the database. Oddly enough, this is one of those spots where I have plenty of up-to-date
knowledge but lack the deeper archaic lore that I usually go for. So, dear listener, let's go on a journey together.
Let's see where databases come from, and let's follow that lineage into the modern day.
I already have a thread in mind, so it's just a matter of pulling it until the entire sweater unwinds.
Welcome back to Advent of Computing.
I'm your host, Sean Haas, and this is episode 96.
What exactly is a database? Part 1.
This is going to be one of those episodes, or rather series, where I'm going in kind of blind.
At least, on the historical side. You see,
as I said in the hook, I actually do work with databases on a daily basis in my professional career. I write statistics software, which relies on databases to function. I've been on this bench
for years, so I know my way around quite a few of the big brand name databases that
are currently in use. As with anyone who works in the software industry as long as I have,
you start to form opinions. In this case, I have my favorite databases and I have my least favorite.
One of those that falls into the latter category is SQL, the Structured Query Language.
As with all good stories, I actually do have an inciting incident for these episodes.
A buddy of mine was recently looking for a new job. Most of the listings he saw were seeking
employees with some SQL experience. Now, my friend is a pretty industrious and capable fellow, so he decided
to learn SQL so, you know, he could send his resume to more places. What better way to learn
a new tool than with a new project? We were talking one evening and he told me that he had
just installed MySQL and was starting to figure things out. I, in turn, gave him a stern warning, as any good friend should do.
I told him to watch out, that SQL works in weird and surprising ways, that it uses an archaic language interface, that joins can be a little confusing at first until you just learn them,
and that its rigid schemas can really suck when you're starting off a new
project. You know, a here be dragons sort of speech. At the time, he didn't seem that perturbed.
He said how bad could it be, right? It must be fine, especially if so many companies use it.
Well, a few days later, he got back to me. He wanted to complain about SQL, and I was more than receptive.
I'm not sharing this story to say that SQL sucks.
The database has its own applications, and that's borne out by the state of programming.
SQL is still one of the most popular databases out there.
I've always thought that this is probably due to some legacy reasons. It can't be because
SQL is the best thing out there, because it simply isn't. There are better options for certain use
cases. Sure, SQL has its uses, but it can be a bit of a blunt instrument at times. Let's just
leave it at that before I get too far into shop talk. I've been meaning to set aside some time to talk on the show about databases in general,
because I really do want to get to the root of this whole SQL thing.
I know off the top of my head that it's an old language interface, that it's an old database,
but I'm not entirely sure how old.
To do that, I've worked out that it'll take two whole episodes.
Today we're going to be starting with the earliest roots of the database, and then we'll move on to
SQL itself in the next episode. I'd like to figure out why dragons have chosen to stick around this specific program, and where those dragons were brooded. But hey, this is Advent of Computing we're talking about.
My pedagogy is firmly rooted in long-reaching rambles, so we of course can't just look at
SQL on its own.
So we have to dive into the earliest origins of the database.
This will help give us a good footing to figure out why SQL is how it is, and why I've always felt it's so
archaic. Like I said, I don't want to spend a whole episode complaining about a program I don't like.
That's not productive, and it doesn't really get me anywhere. Instead, I want to see where it's
coming from. I want to understand what SQL was made to replace, what problems it solved.
I think that could take me beyond mere frustration, so let's exorcise some of my demons.
Before we get into this episode proper, as is custom, I have to plug notes on computing history.
We have a couple new articles that have come in through the editing pipeline,
so we're well on our way to issue one, but we still need more articles. If you want to submit, you don't need any prior writing experience; just get in touch with me. Like I said, I'm not looking for people that have experience per se,
more anyone who's interested in writing articles about the history of computing.
If that sounds like you, then go to history.computer and read
how to submit. You just send me an email and I'll pair you up with an editor. Now, with that out of
the way, let's get into the episode proper. First off, I should just start by explaining what a
database is. I realize that if you aren't in IT, then you probably haven't been exposed to this stuff. Like I mentioned,
it's kind of back office technology to begin with anyway. A database is just a structured system
used to store and manage information on a computer. A database will be running on some
server somewhere, then you can connect to it in order to manipulate or otherwise access the data stored within.
This can range from adding new information to searching for records,
all the way to pretty complex computational tasks.
In most cases, a database has some smallest unit of storage.
This can be called a record or a row or a document.
I personally like document since I think it's a
nice generic name here. Each record will contain fields that store data. So you might have a zoo
database that has records for each animal. Each record would have a field for the animal's name,
its species, where it came from, its
birthday, and where it's currently living. The query part, the ability to carry out
searches, is really where a database earns its salt. In the zoo database
example, you could look for every animal record with the species dog, or you could
look for every animal from North America, or you could build a query to find, for instance, every animal born in May.
In this way, a database can be used to analyze information,
to make sense out of random piles of data.
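To make that zoo example a little more concrete, here's a minimal sketch in Python. The records and field values are made up for illustration, and there's no real database engine involved; it just shows the record-and-field idea and the kinds of queries described above.

```python
# Each record ("document") is a set of named fields. These specific
# animals and fields are invented for this example.
from datetime import date

animals = [
    {"name": "Rex",   "species": "dog",     "origin": "North America", "born": date(2019, 5, 3)},
    {"name": "Koko",  "species": "gorilla", "origin": "Africa",        "born": date(2015, 7, 4)},
    {"name": "Maple", "species": "moose",   "origin": "North America", "born": date(2021, 5, 20)},
]

# Query: every animal record with the species "dog".
dogs = [a for a in animals if a["species"] == "dog"]

# Query: every animal from North America.
north_american = [a for a in animals if a["origin"] == "North America"]

# Query: every animal born in May.
born_in_may = [a for a in animals if a["born"].month == 5]
```

A real database does the same kind of filtering, just over data that lives outside your program and at a much larger scale.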
Much like computers in general, on their own, a database isn't that useful.
These are tools, so to make them shine, they have to be used with a bit of
programming. This, as far as I'm concerned, is what makes databases a big deal. Every database
has some accompanying library so you can use the database with some other programming language.
This is what I mean when I say a database is really a tool. In this sense,
it becomes a cog in a larger machine. This allows you to use a database for persistent information
storage. Let me try to explain this since this is going to be important to the larger story.
Normally, when you run a program, there isn't any persistence of data. The program starts up and you start making variables for storing information.
Those variables only live in the computer's memory, which is intended for very short-term use.
Once the program ends, all your data just goes away.
That is, unless you do something to save your data while the program is still
running. The easiest thing to do is just print the data to screen. Then someone can jot those
numbers down and keep a record of the program's outputs. That's fine, but it sucks. It's not very
automatic. If you want to get fancier, you could output data to a file. This has the benefit of being fully digital and easy to manage.
You can also read one of those data files when the program starts up again.
So you can have some persistence there.
This is better than paper notes, but still has problems.
Files aren't structured on their own.
You have to spend some time and some code working up a way to structure these data files for later use.
There's also the matter of file size.
Usually when you read a file, the entire file just gets loaded into the computer's memory.
That's kind of a dumb way to do it, but it's often the default.
If you have a file larger than your computer's memory,
then you need to get tricky with how you handle that data. You have to write more code to solve
another annoying problem. Another issue is concurrency. What if some other program or
even another part of your own code needs to access that file? Well, that turns out to be a really bad thing. Thus, you need to write up
code to carefully manage your data file. You just keep running into these little ancillary issues
that you have to solve on your own. A good database will handle all of that for you. On the back end,
a database might even just use normal data files. But the database software has already solved this mountain of little issues for you.
You get a canned solution that saves you time, complexity, and gives you more features than raw files ever could.
This lets you focus on programming your project and gives you more power in how you manipulate your own data.
Using a database, you can simply ask for some data once your program starts,
then send your results back to the database when you're done.
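As a modern illustration of that workflow, here's a small sketch using Python's built-in sqlite3 module. Nothing here is period-accurate, and the table and column names are invented; the point is that the program hands its persistence headaches off to the database instead of juggling raw files.

```python
# A minimal sketch of database-backed persistence using Python's
# standard-library sqlite3 module. Table and column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")  # a real program would use a file path
conn.execute("CREATE TABLE results (run_id INTEGER, value REAL)")

# "Send your results back to the database when you're done."
conn.executemany("INSERT INTO results VALUES (?, ?)", [(1, 3.14), (1, 2.72)])
conn.commit()

# On the next run, "simply ask for some data once your program starts."
rows = conn.execute("SELECT value FROM results WHERE run_id = 1").fetchall()
# File layout, sizes, and concurrency are the database's problem, not ours.
```

The file handling, structuring, and locking the episode describes all happen behind those two or three calls.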
That's the general rundown, and that's also a very modern view of databases.
But where did all this come from?
The short answer is General Electric.
The long answer is a little more interesting.
You see, old school computers were dumb in quite a number of ways.
This isn't a knock against anyone or anything.
It's just that we didn't really know how to best utilize resources yet.
Back in the day, prior to timesharing, computers operated in what's
known as batch mode. Tasks would be scheduled out on a physical calendar. Once a new time slot
rolled along, the proper program would be fetched, often in the form of a stack of punch cards,
fed into the computer, and executed. You'd eventually get outputs, and then those outputs
would be passed along to whoever asked for the program to be run.
The cycle would then start anew back at the top.
This is a very serial way to use a computer.
One task comes after another after another.
Now, I've kind of beaten the time-sharing thing to death on this podcast, because it's a very important point in history. The transition from running one task at a time to running multiple tasks at once, that fundamentally
changed how computers could be used. That's the code-centric view of things, at least.
There's also the data side of things here. I think of this as an interesting interplay between technology and human practices.
At this time, pre-1960s, long-term storage devices were almost all serial. Punch cards
stored one bit after another after another, and could only be reasonably used in a specific order.
Paper tapes stored one bit after another, and had to be scanned from start to
finish. Magnetic tape, well, that's the same as paper, just fancier electronics. The digital brain
was only furnished one number at a time, so just like batch code execution, data was also handled in batches. It's just how the beast was built.
That's not a very relatable mode of operation for us, since, you know, we now, in the modern day,
have random access media. Old sequential storage is basically dead, save for certain niche
applications that most of us will never get near. I mean, I've held
a backup tape, but I've never used one to carry out a backup. Hard drives changed this at a very
fundamental level, but this happened a little slowly. The first commercial drives were released by IBM in 1956. Adoption occurred, but it's not until the 1960s that many third-party
drives hit the market. As far as I'm concerned, third-party adoption is kind of a good benchmark
of wider spread of a technology. Random access changed things considerably. One change it set the stage for was more complex data management.
This brings us up to 1961 at General Electric. Now, you probably don't think of GE as a computer
outfit, but back in the day, they actually manufactured a lot of computing machines.
Their computer concern was just one part of a much larger operation.
Each of these operating units ran with some autonomy, which is kind of weird to read about.
Separate operating units within GE would sometimes contract with other operating units,
almost like they were unconnected entities. This might have been something to get around antitrust laws, but I'm not an expert on
the history of GE, so let's just skip past that. Anyway, in one of these sections, there was a
programmer named Charles W. Bachman. At the time, GE bundled large projects into these yearly
integrated system projects, which sounds something like long-term development
sprints. Bachman recounted the outcome of one of these ISPs in a 2009 IEEE paper that he titled
The Origin of the Integrated Data Store. You see, Bachmann could, in some circles,
be considered the father of the database. The lead-up to this early database,
IDS, is very specific. This big 1961 project at GE had to do with streamlining production.
The company's high-voltage switching gear factory had recently run into an issue.
Customers weren't getting their high-voltage switching gear on time. That's a bit of a problem
for manufacturers. After some digging, a culprit was found. At the time, the factory had been using
a computer to track production. In theory, this was a simple and foolproof setup. An order would
come in, it would be entered into the shop's computer, and then the machine would schedule the actual manufacturing process.
You'd end up with a set of cards for each step, and each step would have a completion date punched onto it.
In theory, this would make operations a breeze, just kick back and let the machine manage everything.
No human overhead necessary.
But practice was a different story. From Bachman,
quote,
The root of the problem became clear after observing the many tote bins stacked under
the windows and along the aisles of various shops and assembly areas. Each tote bin contained a pile
of parts in the process of being manufactured, and had a small packet of punch cards wired to the box.
Each card represented a planned manufacturing step.
It displayed a date indicating when that step was supposed to be started.
Many cards for steps that should have been finished were still attached to the tote bins.
The original computer-generated completion dates were frequently being missed,
and the original schedule was seldom used. End quote. The computerized manager wasn't a very savvy operator.
How could it be?
It was just a machine, after all.
It planned everything under the assumption that tasks would be completed on time,
that nothing would go wrong
during manufacturing. It was supposed to be getting inputs fed to it when steps were done,
but it sounds like no one really bothered to deal with that. The machine would just spit out poorly
planned schedules anyway, so why bother? So we get two things that need correcting here. A better system would be able to account for issues in production, while also being easier to use. So that's a start for this bigger project
at GE. Bachman's analysis showed that there were many possible points of failure in the production
pipeline. He characterized the process as a network of
interconnected steps. Before components were soldered to a board, for instance, those components
would have to be sourced. The boards would have to be produced. The soldering station would have
to be available for use. Each of those prerequisites would have their own prerequisites.
Call them dependencies, if you will. It's a lot
to keep track of, and there weren't good tools to handle this kind of interconnected data.
That is, unless we talk about really early research systems like LISP at the time.
This left Bachmann and his colleagues in a bit of a bind. To solve this manufacturing problem,
they needed a new kind of tool,
one that could be used to model interconnected data. Further, the tool had to be powerful and
fast enough to actually be used. The solution was IDS, the Integrated Data Store. This is where we
have to break out the documentation. Bachman does a good job explaining
how IDS was developed and why and how it was used early on. His accounting is neat and all,
but it doesn't have the firm technical details that I like to see. I want to start by looking
at the actual data structures used in IDS because, well, it's a data base. The data representation matters here
quite a bit. So what does IDS's manual have to say? The primary data structure used here is
something that's called a chain. It's similar to something like a doubly linked list, but the older name here is a bit
more evocative, I think. The basic idea is that you construct a list out of elements,
each element having a link to the next element on the list as well as the previous element.
This lets you traverse up and down the list by following links. Something specific to IDS and to these chains is that once you reach the end of a chain,
it links back to the first element.
So these are a circular affair.
There's also one other complication that kind of makes these different than normal doubly linked lists.
Each element of the chain will have a link back to the record that
owns the chain. That's something that I'll explain a little later on, but just keep in mind for now
that these aren't very normal linked lists. These are special. So why does it matter that these are
doubly instead of singly linked elements? Well, in a normal linked list,
each element just links to the next element. You can't traverse backwards, only forwards.
In most cases, this is fine. You can program around needing these backwards links. But
sometimes the added data does help. In IDS, these backlinks allowed for more rich data modeling,
and it gives the database a very distinctive style of use.
I'm just going to use the documentation's example here,
since it's pretty good.
Let's say you're building an inventory system,
which is what IDS was really meant for.
You have three types of information
to track. Vendors, orders, and inventory items. The vendors are obviously at the top of this
hierarchy. Orders come next since each vendor has a set of orders you send out to them.
Now, this is where the data modeling really opens up. Each entry in IDS has a set of defined fields.
A vendor would have something like a vendor name, shipping address, phone number, and so on.
You can also have lists associated with these entities.
These lists are, of course, the fancy chains I described earlier.
The vendor entries in this case would each come with their own list of orders.
Those orders would each link out to the inventory items that were actually shipped.
So at each layer, you have chains being used to manage data.
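Here's a rough Python sketch of how such a chain might be modeled: a circular, doubly linked ring where the owner record is part of the ring, and every member links back to its owner. The class name, field names, and records here are my own invention for illustration, not actual IDS structures.

```python
# A sketch of an IDS-style "chain": circular, doubly linked, with every
# member pointing back at the record that owns the chain.
class Record:
    def __init__(self, name, **fields):
        self.name = name
        self.fields = fields
        self.next = self.prev = self  # a new chain is just its owner, alone
        self.owner = None

    def append(self, member):
        """Insert `member` just before the owner, i.e. at the chain's tail."""
        member.owner = self
        member.prev, member.next = self.prev, self
        self.prev.next = member
        self.prev = member

# A vendor record owning a chain of order records, as in the manual's example.
vendor = Record("vendor-1", address="1 Plant Rd", phone="555-0100")
vendor.append(Record("order-1", item="relay"))
vendor.append(Record("order-2", item="breaker"))

# Walking forward from the owner eventually wraps back around to it,
# and any member can jump straight back to its owner.
first = vendor.next
assert first.next.next is vendor
assert first.owner is vendor
```

This is what makes the chains "special" compared to a plain doubly linked list: the ring closes on itself, and the owner is always one hop away.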
We've encountered this type of modeling before.
This is a very pointer-rich dataset, as in, there are a lot of links out to other chunks of data.
There are two immediate comparisons that spring to mind.
First is Lisp, and the second is Hypertext.
Those may not sound all that connected, but all three of these technologies share a core feature.
IDS, Lisp, and Hypertext are all about links. Lisp is the most ideologically simple here.
The language is built around building, manipulating, and traversing singly linked
lists. Each list element can point to the next list element, and it can also point out to another list.
This allows Lisp to represent very complex linked data structures like trees.
It's pretty easy to see how Lisp and IDS differ.
Lisp is a programming language, so it's all about dealing with data in memory,
whereas IDS is a database, so it's all about representing and storing data.
Then we have the whole hypertext thing. The distinction between hypertext and databases is,
well, it's something that keeps me up at night. Let me explain my thinking here. The two technologies
feel very different to me. Hypertext uses links
to add context and organization to data, and is intended to present the data to end users.
It's a very human-centric kind of thing. IDS uses links to add context and organization to data,
but it's intended to be presented to other programs. The actual
nuts and bolts here are really similar. A database like IDS could even be used as part of a larger
hypertext system. The key difference, at least in my mind, is intent. Hypertext and hypermedia are
all about the human factor. The entire field is dedicated
to making data accessible and understandable to end users. Databases in general, and IDS
specifically, are geared more towards furnishing data to software. An end user isn't going to sit
down at a terminal and issue raw commands to IDS.
By the same token, hypertext systems aren't intended to be driven by other programs.
I know this kind of sounds like an I'll-know-it-when-I-see-it sort of explanation,
but I swear, my convictions here are firm.
Just trust me, hypertext and databases are different beasts.
Back to IDS.
The link is a central part of the database's construction.
The next key aspect is the idea of a schema.
At least, that's what we call it today.
A schema is just a fancy way of saying how each record is internally structured.
What are those fields named? What data type is each field?
That sort of thing. In IDS lingo, this is called the data description or record format. Here's how
the docs describe the description, and I think it really shows where IDS is coming from. Quote, in general, the format and use of record descriptions required for IDS are the same as those in COBOL record descriptions. The exception to the normal
usages are described below. End quote. The invocation of COBOL here is a little bit of a wrinkle.
IDS was created a few years after COBOL hit the scene.
The database would initially be used with GE's own programming language,
but IDS would see a whole lot of use with COBOL.
The connection is kind of unavoidable.
These data descriptions are roughly similar to how COBOL handles data.
In IDS, you can describe each field in terms of data type and size. That's kind of how COBOL does data types, but there are differences. COBOL uses pretty terse and cryptic syntax for describing
data types. It uses a few letters followed by numbers with some special
punctuation. IDS appears to use predefined data classes. For instance, the class integer 1
describes any integer between 0 and 999. I haven't been able to find a list of all data classes, but they should all follow a similar convention.
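As a purely hypothetical reconstruction, a scheme of data classes like this could be modeled as a table of allowed ranges. Only integer 1's range, 0 through 999, comes from the documentation; the second entry here is an invented extension following the same decimal-digit convention, not a documented IDS class.

```python
# Hypothetical sketch of decimal-digit "data classes". Only INTEGER 1's
# range (0..999) is from the IDS documentation; INTEGER 2 is an assumed
# extension of the same convention, invented for this example.
DATA_CLASSES = {
    "INTEGER 1": range(0, 1_000),      # three decimal digits, per the docs
    "INTEGER 2": range(0, 1_000_000),  # assumed: six decimal digits
}

def validate(data_class, value):
    """Check a field value against its declared data class."""
    return value in DATA_CLASSES[data_class]
```

Notice that the ranges fall on powers of ten, not powers of two, which is exactly the oddity discussed next.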
Now, this presents something to take note of.
999 isn't a very digital number.
It doesn't have that nice power of 2 ring to it.
It doesn't line up with a 32-bit word, or any other standard word size.
Why would you stray from that convention? Well, there are two possible
options. The worst answer is that, oh, the conventions just weren't as ironclad in the 1960s.
There were 24-bit machines, for instance. Stuff was very much still in flux. But the better answer,
or at least the more interesting one, is that Bachman was trying to make IDS somewhat platform independent. You see, early on, Bachman had to jump platforms.
The high-voltage switching supply plant never did end up using IDS. But the low-voltage switching
supply plant was interested. These departments used different GE computers.
They weren't vastly different, but Bachman does explain that he had to rewrite a lot of IDS to
work on the smaller machine over at the low-voltage plant. From early on, Bachman was dealing with
different platforms. I think in that context, it makes sense to use whatever numbers you want. IDS might not
always be on a 32-bit machine. Maybe one day it would be used on a 10-bit computer, or even a 64-bit
computer. There's quite a bit more detail about the low-level implementation here, but I think it
would behoove us to escape from that part of IDS, at least for the time
being. So far, we've just been talking about the database itself. But what was it like to actually
use IDS? How did all this data representation stuff bear fruit? It's at this point that I have
the honor of adding another confusing facet to the database story. When
programmers talk about a database, that usually encompasses a few different things. One is the
actual database program itself, the software that handles everything. Another, at least tangentially,
is the programming interface, that's the library that you use to access the database from within your own code.
And finally, there's the database's own native tongue. The third one can lead to some confusion
depending on the nomenclature. When you're working with SQL, for instance, you will have an SQL
server, but you'll also write commands in SQL, the structured query language.
It can be an annoying naming convention.
Some newer packages do get away from this, but that's slow going.
IDS starts this trend in a fittingly confusing way.
Initially, there were two languages involved here, the data description language and the data manipulation language.
We've already been talking about the first one, DDL, although somewhat obliquely.
That was just the special syntax used to describe IDS records.
This brings us into something of an interplay with COBOL.
You see, this business language was codified in 1959, just two years
before development started on IDS. I was actually just dipping my toes into COBOL land preparing a
bonus episode, so I don't want to dive in again, but I think I have to. COBOL was designed as
something of a proto-database system.
So here's the deal.
A COBOL program is composed of two main divisions, a data division and a procedure division. The data division describes all the variables and data structures you're going to use.
This can be simple things like just normal variables, or it can be entire structured
tables of data.
DDL is the IDS equivalent of the data division.
It describes your data structures, their field names, field types, and the overall shape
of your data.
The only real difference is that DDL was used to describe external data structures, whereas
COBOL's data division was an internal-only thing.
IDS was built to spread around its data, in theory.
The second language in play is the IDS Data Manipulation Language, or DML.
This is equivalent to COBOL's Procedure Division.
All your queries, all your reads and writes are written in DML.
At least, there is a DML at some point in the process.
By the time IDS is actually out in the wild and well-documented, everything is mediated
through COBOL.
IDS was being accessed using a COBOL library that spliced DML commands into COBOL itself.
At least, that's kind of what it looks like to me.
All the syntax for DML is pretty in line with traditional COBOL-style syntax.
DML is the more important of the interfaces here.
Most of the code you write for a database is involved with manipulating data. You only have to describe your data one time, at least that's the ideal situation.
Sometimes we mess up, but hey, DML is where it's at. The thing is, the operating principles
used in IDS are… strange. It's a lot different than a modern database, so it's been a little hard for me to
wrap my head around it. IDS operations used this scratchpad thing that Bachman calls the
working space. This was used to store records that were being actively used. If you wanted to
update your record, you'd first load it into the working space, make your changes, and then store the updated record back in IDS.
It's a little odd, but I guess not that out of the ordinary.
Most data access works this way in the background; IDS is just explicit about the setup.
There's also the matter of chains and links.
This is really what confused me for
quite some time. In IDS, you don't just ask the system for some specific record. Let's stick with
the vendor order item database from earlier. If you wanted a specific order but didn't know the
full record name, then you'd first retrieve the vendor record.
That would be your starting point.
From there, you'd retrieve the first order on the order chain.
Then you'd get the next order on the chain, and so on.
You'd keep following the chain until you hit the order you wanted or were returned to the first order in the chain.
Instead of writing complex queries, a programmer was expected to navigate this sea of links.
As such, IDS and similar systems have been retroactively called navigational databases.
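That navigation loop can be sketched in a few lines of Python. The ring of plain dicts and the record names are invented for illustration, and this is not real IDS syntax; the point is the shape of the access pattern: retrieve the first chain member, then keep retrieving the next one until you find your record or wrap back to the start.

```python
# A sketch of navigational retrieval over a circular chain of plain dicts.
# The records and field names are made up; only the access pattern matters.
vendor = {"name": "vendor-1"}
order_a = {"name": "order-1", "item": "relay"}
order_b = {"name": "order-2", "item": "breaker"}

# Wire up the circular chain: vendor -> order-1 -> order-2 -> vendor.
vendor["next"], order_a["next"], order_b["next"] = order_a, order_b, vendor

def retrieve(owner, wanted_item):
    """Follow next-links until a match, or until we wrap back to the owner."""
    current = owner["next"]          # retrieve the first member of the chain
    while current is not owner:      # reaching the owner means we've wrapped
        if current.get("item") == wanted_item:
            return current           # conceptually: load into the working space
        current = current["next"]    # retrieve the next record on the chain
    return None                      # walked the whole chain with no hit

found = retrieve(vendor, "breaker")
missing = retrieve(vendor, "transformer")
```

Every lookup the programmer wants has to be spelled out as a walk like this, which is exactly the complexity being pushed onto the programmer.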
There are distinct benefits to this over earlier sequential data storage, but there are also downsides.
Really, I think this is leaning into random access in the wrong way. Sure, it's faster
to follow a handful of links instead of sequentially searching the entire table. The downside is that
this complexity is pushed onto the programmer. So while IDS did handle a lot of things for the
programmer, it didn't go all the way. Every paper or document
on IDS that I've seen includes these crazy maps of how a database can be laid out. I'll link to
some of the documentation in the show's description so you can see it yourself, because this has to
be seen to be believed. They're basically these big circular webs. Arrows represent links with boxes representing
documents. And you can get from any record to any other by following these arrows through other
records, but it can take quite a few operations to do so. The result is you end up navigating IDS
like driving through an unfamiliar city. You get street signs and
everything does connect up to everything else, it just might take some time to get from point A to
point B, especially if you don't know your path by heart. During this cruise, there's one command
you'll rely on. Retrieve. It's sort of the do-it-all command of IDS, kind of like the move operation in most
assembly languages. Using retrieve, you grab a record by name. IDS calls this the data name or
level one name. It's basically a key that each record must have, and it's unique.
Call retrieve with this name, and your record is loaded
into the working space. From there, you start to navigate. But how do you navigate, you ask?
Well, simply run another retrieve. You can use the same command to follow links forward and
backwards. Some records even have links going back to the top-level record
that they're all connected to. Those links are followed with, you guessed it, Retrieve.
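A rough Python rendering of those three uses of retrieve: get a record by name, get the next record on a chain, and follow a link back up to the owning record. The field names here are invented for the sketch.

```python
# One multi-purpose operation, three roles -- roughly how retrieve
# covered direct lookup, chain traversal, and master-record links.

records = {
    "vendor1": {"first_on_chain": "order1"},
    "order1":  {"next_on_chain": "order2", "owner": "vendor1"},
    "order2":  {"next_on_chain": "order1", "owner": "vendor1"},
}

def retrieve(name):                 # RETRIEVE record-name
    return records[name]

def retrieve_next(name):            # RETRIEVE NEXT RECORD OF ... CHAIN
    return records[name]["next_on_chain"]

def retrieve_master(name):          # follow the link back to the owner
    return records[name]["owner"]

print(retrieve_next("order1"))      # -> order2
print(retrieve_master("order2"))    # -> vendor1
```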
This doesn't smack of bad design to me, it's just kind of weird. Due to this multi-use nature,
Retrieve has a variable syntax. You would say retrieve record name to just get a record,
so you might write retrieve vendor one to get the first vendor. Then retrieve next record of
record name chain to follow the chain, so something like retrieve next record of vendor one chain.
There are also a handful of other ways to invoke retrieve, but I won't list them all here. They're all in a similar vein.
What I would like to draw attention to is that second example, retrieve next record of name chain. That's pretty verbose. You have the main operation, retrieve, the argument,
your record name, and then just a pile of words. I could see the syntax being simplified to
something like retrieve next record name, or heck, maybe just next if the record is already in the working space, because you do have
to retrieve the record first. This verbosity is something to keep an eye out for. Another
interesting feature is the sparseness of the syntax. Everything is just an English word.
You know, the kind you find in a dictionary. DML doesn't use any special characters, not even punctuation.
Every command is composed of simple words and record names.
This is somewhat unique for a language of the era.
Contemporaries like Fortran, Lisp, or Algol are chock-a-block with special characters.
Fortran used parens for accessing arrays and
calling functions. Lisp is composed almost entirely of parentheses. And Algol likes to
use colons of all kinds. But IDS doesn't work that way. Part of this could be thanks to the
relative simplicity of DML. It's not a full programming language. You don't have
loops or even really complex data structures. The entire language is just for describing and
navigating data. You aren't even making complex queries, so you don't need to group arguments
together. So hey, maybe special characters just weren't needed. There's also the lineage factor at play here.
IDS was initially developed for use with GECOM, the General Compiler.
This was an internal language used at GE and with GE computers.
Now, GECOM is weird.
At the time, it was billed as the future.
Quote,
The GE225 GECOM system, an advanced and effective automatic coding method,
provides the next logical step in programming evolution.
GECOM is a step towards fulfillment of the much-needed total system concept,
a concept that deems an information processing system
to be an integration of application, programming,
and information processing computer.
End quote.
Now, I don't want to veer off into a discussion of GECOM
because I have a feeling I could talk about this for hours.
This is a language that can be compiled
into other programming
languages. Call it a programming language programming language. That brings up some
very interesting philosophical questions and technical questions. But to digress a little bit,
GECOM was built in an attempt to further streamline programming. At this point, we had
languages that were compiled down to machine code, so why not take the next logical step?
I don't know how well that worked in practice, I don't even think that's very logical,
and I'm guessing it didn't really sell too well either. GECOM could output a number of languages. That in itself is a
wild sentence. The primary language used here, and the inspiration for GECOM's own look and feel,
was a language that starts with the letter C. Now, I'm not talking the Bell Labs original. I'm talking COBOL. All roads lead to COBOL today. GECOM was
partly an effort to generalize COBOL by integrating features from other languages. But despite all the
new trappings, the bottom line is that GECOM looked like COBOL. It even had separate data and procedure divisions. It was mainly
written in simple English words. So we can say that IDS was, in a certain way,
built for use with COBOL. That, I think, explains a lot of its syntax. And here's
the final thing I want to say about DML. Much like my man MF Doom, it was written
in all caps. This is one of those small nitpicky details that really cracks open a story for me.
This probably wasn't an intentional design choice, just how things had to be. Lowercase was a bit of
a luxury at the time. Either terminals would only support
uppercase characters, keyboards couldn't do lowercase, or character encoding only supported
uppercase letters. Punch cards only did uppercase, for instance, so any language that used cards
had to be in uppercase. COBOL, bless its heart, is an ALL CAPS kind of language.
But to be fair, so is FORTRAN. Modern compilers for these antiquated languages will let you break
tradition, but doing so may invoke old ghosts. Anytime you see a language or documentation that
shows code in all capital letters, that should be a red flag. That's a
warning that you're dealing with something considerably vintage. Or at least, something
that's tightly connected to older traditions and older ways. So that's the first database.
But what followed? Well, this gives us the familiar story of spread, diversification, and down-the-line standardization.
The idea of a system for structuring data and providing slick access to said data, well, that was just plain good.
IDS would see use both inside and outside of GE.
It would see a number of ports to different GE machines, but we're still in the era where software was firmly tied to hardware.
GE didn't release a version of IDS for IBM computers.
GE customers got this swanky new database thing while others had to wait.
It didn't take long for vendors to catch on.
Big Blue themselves came in hot on GE's heels.
Their offering had a similar name, IMS,
the Information Management System. Like any good nation-state-sized entity, IBM provides a nice official story about IMS. To quote their propaganda, IMS has been an important part of worldwide computing since its inception.
On May 25th, 1961, U.S. President John F. Kennedy challenged American industry to send an American man to the moon and return him safely to Earth.
The feat was to be accomplished before the end of the decade as part of the Apollo program.
American Rockwell won the bid to build the spacecraft for the Apollo program and, in 1965, they established a partnership with IBM to fulfill the requirement of an automated system to manage large bills of material for the construction of the spacecraft.
End quote.
The story is all true, but I really don't think JFK had databases on the mind when he started pushing for a
lunar program.
We can already see how IMS was birthed out of the same type of problem that we've seen
just very recently here.
The simple fact is that building a spacecraft is a huge feat.
At the time, the Saturn V rocket, the design that would get us to the moon, was one of the largest machines ever built.
Plus, the entire rocket had to be assembled by hand.
IBM would play a huge role in the Apollo program, at least they were a big player behind the scenes.
IBM mainframes, for instance, were used to plan, test, and run the actual spacecraft involved in Apollo.
And as it turns out, IBM also provided software and hardware for managing the manufacturing of these craft themselves. We actually arrive at a
theoretical database structure here that's similar to the vendor order item example that I was using
earlier. That should go a long way towards showing that IBM and Rockwell were dealing with the same issues GE faced just a few years prior.
This is all pure manufacturing administrivia, which takes a lot of work and a lot of effort.
Rockwell needed a system that could track every part that went into the hardware for the Apollo program.
That would include rockets, spaceships, and a sundry array of
smaller machines. That is very much in the realm of what a database is good at handling.
Development on IMS started in 1966. From the beginning, this was a joint venture.
Most personnel on the project were IBMers, who were joined by programmers from Rockwell and
Caterpillar Tractor Company.
From what I understand, Caterpillar built some of the ground-based support equipment
for use in the Apollo program. Now, here's the hiccup we have to deal with.
IBM is usually tight-lipped about their corporate history. Part of this is thanks to the fact that
IBM survives into the modern day as
more or less the same corporate entity it was back during the rise of computing.
So there haven't been any grand leaks of documents after their fall. There has been no fall to
produce such leaks. This is also made worse thanks to the fact that IMS is still in use today, if you can believe it.
One of IBM's great faults, I think, is that the company hasn't published much of its history.
What we do get comes from their official history site, which is little more than a timeline,
really. You get a few paragraphs, maybe a full page, about every major event and product.
That's far from useful.
It's also very impersonal coverage.
IBM just presents everything as coming from Big Blue itself, almost by immaculate conception.
You get a few programmer names, but not many details.
Here's what we know for sure.
IMS was developed by a group of programmers, one of whom
was Vern Watts. Watts was an IBM career man who managed the development of the database.
Most early design decisions came from Watts himself. So what exactly did Watts unleash upon
the world? Well, IMS doesn't use the same architecture as IDS, but there are similarities. IMS is another
one of these pointer-rich programs. However, we don't have the weird cyclical chains of links.
Instead, IMS uses links to construct a hierarchy of records. IDS, the old GE special, may have sounded like a hierarchical system, but that's not really
the case. The hierarchy there was more implicit, something that just sort of happened naturally
due to how the database was structured. IMS, on the other hand, enforced hierarchical data.
Each database started with a type of root record. All other records would be linked off of
this base structure. Sticking with our canonical example, the first level of records might be
vendor records. From there, you might have WidgetsCo, Sean LLC, and so on. Each vendor
then has a set of orders that are linked to from the vendor record. This
link is unidirectional. You can traverse from vendor to order, but you can't go from order to
vendor. It's also interesting to note that these are somewhat implicit links. In IDS, a link was
an actual accessible field in your record. If you wanted to connect up with order
123, then you had a field called order ID with the value 123, or something similar to that.
The IBM system, at least as described in its docs, hides the linking information somewhat.
So you might not have an explicit field that says which orders a vendor owns, but the link can still be followed.
Even though this isn't hypertext,
we can apply a similar contextual understanding to these links.
In a hierarchical system like IMS,
these one-way links imply something like ownership.
A vendor owns a set of orders.
On the next layer, we might have a similar set of links for items.
Thus, we could say an order owns some items, at least in an abstract way of speaking.
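Here's a small Python sketch of that one-way hierarchy, with invented names. Walking down from vendor to order to items is trivial; notice that nothing stored on an item points back up to its vendor.

```python
# Sketch of IMS-style hierarchical records: links run one way only,
# vendor -> orders -> items. There is no link from item back to vendor.

tree = {
    "WidgetsCo": {"orders": {
        "order1": {"items": ["item5239", "item100"]},
    }},
    "SeanLLC": {"orders": {
        "order2": {"items": ["item200"]},
    }},
}

def items_for(vendor, order):
    """Going down the hierarchy is easy: vendor -> order -> items."""
    return tree[vendor]["orders"][order]["items"]

print(items_for("WidgetsCo", "order1"))   # -> ['item5239', 'item100']
```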
The idea of ownership is also baked into the software here.
Big Blue has always been about big business.
That's what the B stands for, after all.
In order to better serve large customers,
IMS was developed to be a multi-tenant system. That's the cool wording we use in the biz. It
just means that you can have more than one user accessing the database. Maybe not all at once,
but over time. This presents a bit of a problem. Let's say that our bill of materials database was accessible by each vendor.
I might log in to check that an order placed with Sean LLC was properly filed,
as is my prerogative as the corporation's owner.
But, you know, I'm a crafty guy.
I have all these self-sealing stem bolts that I want to unload.
I think NASA might like them, but I worry
that Widgets Co. may already be supplying these special bolts. So I decide to pull up the Widgets
vendor record and navigate down to check all their orders. That is some kind of crime, I think.
It at least feels like corporate espionage. You could sue me after the fact, but Sean LLC has already struck.
The data is already compromised.
However, Watts was proactive in this case.
IMS has separate user accounts with specific permission sets.
So my user could be restricted to only accessing records that are linked to from the Sean LLC vendor record.
This kind of access restriction is really a core feature nowadays, so it's neat to see it appearing in such an old database.
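The permission idea might look something like this in Python. The account names and the shape of the permission check are my own invention for illustration, not IMS's actual mechanism.

```python
# Sketch of per-user access restriction on a hierarchical database:
# each user may only enter the tree through certain root records.

tree = {
    "WidgetsCo": {"orders": ["order1"]},
    "SeanLLC":   {"orders": ["order2"]},
}

permissions = {"sean": {"SeanLLC"}, "widgets_rep": {"WidgetsCo"}}

def get_orders(user, vendor):
    """Only return orders if the user is allowed into this vendor record."""
    if vendor not in permissions[user]:
        raise PermissionError(f"{user} may not read {vendor}")
    return tree[vendor]["orders"]

print(get_orders("sean", "SeanLLC"))   # -> ['order2']
# get_orders("sean", "WidgetsCo") would raise PermissionError
```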
Those are the big features here that matter.
In general, many of the overall concepts share a lot with IDS, so we don't need to retread that ground. That said,
I do want to bring up an issue with IBM's Homespun database. What would happen if you got a call from
someone on the assembly line saying there was an issue with a part? Item number 5239 arrived
broken. Someone needs to get in touch with the vendor and order a new batch. In the database
I've described so far, it would be hard to go from an item up to a vendor. You could accomplish this
via software, but you have to traverse the entire database. If I was dealing with this, I'd start
with each vendor, then follow the links down to get a list of each vendor's items. Once I had every item for a given
vendor, I'd then manually conduct a search for the proper item number. If found, I'd be able to
report the offending vendor. While that would work, that'd be slow. You'd have to traverse
most of the database. In the worst case, you'd actually traverse the entire database. If you only had
a few hundred items, that's fine. But what if you had tens of thousands or millions? In general,
you want a database to be fast. It has to be able to deliver answers more quickly than a brute force
attempt could. Otherwise, there's no reason to use a database. It's just a nuisance. Just use brute force instead.
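That brute-force search can be sketched in a few lines of Python, again with invented names and structure. Since links only point downward, finding the vendor for a broken item means walking the whole hierarchy from the top.

```python
# Reverse lookup in a downward-only hierarchy: in the worst case,
# this visits every record in the database.

tree = {
    "WidgetsCo": {"order1": ["item5239", "item100"]},
    "SeanLLC":   {"order2": ["item200"]},
}

def vendor_for_item(item):
    """Scan every vendor's every order until the item turns up."""
    for vendor, orders in tree.items():
        for order, items in orders.items():
            if item in items:
                return vendor
    return None

print(vendor_for_item("item5239"))   # -> WidgetsCo
```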
It would be possible to create an inverse data set
to have items that owned orders that owned vendors.
That would let you ask questions like where an item came from.
However, that doubles your storage requirement.
Now you need to store a front-to-back and back-to-front version of the database.
This has been pointed out many times as a key issue with these older navigational databases.
I'm not making some unique argument here.
This is a well-tread problem.
An ancillary issue that falls out of this whole discussion is that navigational databases
relied on outside software to be useful.
Queries for these databases were simplistic. Retrieve this record by its key, follow this link,
save this modified record, insert a new record. That's about it. You couldn't ask a database
itself to answer some complicated question. You couldn't say, ask for all orders from Sean LLC
that arrived on Tuesdays. To answer a question like that, you had to rely on your own software.
Your query, in this sense, was just the program you wrote to navigate these darn databases.
I think this is evidence enough that databases had a lot of growing left to do.
One of the huge reasons to use a database is because it helps you program less.
You can offload the work of managing data to another program.
Some of our hyper-modern databases let you offload almost all of that task.
You can use these databases to crunch numbers for you, or run all kinds of really complex
analyses.
In the late 60s, we weren't even close to that level of sophistication, so there was
a lot of room to improve.
However, the path to success started with a big hurdle.
You want to push more and more work off to the database.
That's the goal here.
Navigational databases, however,
weren't well suited for that. There just wasn't much room to change their restrictive design.
Not everything in the world can be modeled as a strict hierarchy of records or a linked chain.
Those are just two types of data structures. That gives us two weapons out of a much larger arsenal to work with.
Both IMS and IDS would see a lot of use.
IMS is still being used today if you believe IBM.
However, for databases to reach their full potential, a new design was needed.
Some fundamental changes were required.
Alright, that brings us to the end of this episode. We've now reached the halfway point in our discussion of the humble yet very important database. The furthest flung origins are
unassuming to say the least. The first database, IDS, is just a fancy
way for GE to coordinate manufacturing. Bigger to-dos like IBM's IMS don't stray that far from
the original system. That database is still just a way to manage manufacturing. It's very exciting
stuff, I know. The other thing we found is an issue with these
old databases, a fundamental flaw. The first generation here, the navigational databases,
are just limited. They aren't flexible. They can't be used to express certain types of data
relationships. The whole item versus vendor flip is only one example. Early databases are tools
built for a simple job, tracking manufacturing. You'd have a hard time building, say, a family
tree using IMS. You'd have to have a whole host of duplicate databases if you wanted to figure out
something as simple as someone's parents. These limitations meant that, in practice,
there were only so many problems an early database could solve.
That was ultimately a restriction in their use.
Sure, IMS and IDS were still hugely effective,
but there was a cap on that effectiveness.
These issues were well-known very early on, and something better
was just around the corner. Next time we'll be discussing the second act, this sequel, if you
will. Thanks for listening to Advent of Computing. I'll be back in two weeks time with another piece
of computing's past, and if you like the show, there are a few ways you can support it. If you
know someone else who'd be interested in listening, then why not take a minute to
share it with them?
You can also rate and review on Apple Podcasts.
If you want to be a super fan, you can support the show directly through Advent of Computing
merch or signing up as a patron on Patreon.
Patrons get early access to episodes, polls for the direction of the show, and bonus episodes.
You can find links to everything on my website, adventofcomputing.com.
If you have any comments or suggestions for a future episode, then go ahead and shoot me a tweet.
I'm at Advent of Comp on Twitter. And as always, have a great rest of your day.