Advent of Computing - Episode 96 - What Exactly IS A Database? Part I
Episode Date: November 28, 2022

I've fallen into a bit of a data rabbit hole, and you get to join me. In this episode I'm starting my journey to understand where databases came from, and how they started to evolve. This will serve as a foundation for next episode, when we will dive into one of the most popular databases from the 1970s: SQL. Along the way we wrestle with GE, the realities of the Apollo Program, and try to figure out what a database really is.

Selected Sources:

https://sci-hub.se/10.1109/MAHC.2009.110 - A history of IDS

https://archive.org/details/TNM_Integrated_Data_Store_introduction_-_General__20171014_0141 - Learn IDS for yourself!

https://archive.org/details/bitsavers_ibm360imsRGuide1969_8480205/page/n6/mode/2up - Educational guide to IBM's IMS
Transcript
There are a handful of pretty technical terms that have crossed over into the mainstream.
Someone might casually refer to a mainframe as in, oh, they're breaking into the mainframe.
Folk talk about networks all the time, or at least the word shows up despite being a very abstract and complicated idea.
And databases, well, those even get mentioned on the nightly news. It seems like
every other week some database somewhere is getting compromised by someone, whatever that
actually means. There's good reason for this last example. Databases really make modern computing
work. You listening to this podcast is possible because somewhere there's a database that
lists shows you can listen to. There's some stored structure that has every episode of Advent of
Computing neatly laid out. When you download an episode, it gets added to a smaller database on
your phone or computer that lists all the episodes you can play locally. You might even post about
this episode online saying how impressed you are,
and that would actually get fired off and stored in some database or another somewhere.
But what exactly is a database? The short answer is it's a place where you store data.
That's succinct, but doesn't really differentiate it from, say, a filing cabinet.
A database is much more complicated and much more sophisticated than just a pile of data.
I've actually worked with databases for most of my professional career. Most days I kind of live
inside a database. I think this makes me more than qualified to at least explain what a
database is. That said, there's one major gap in my knowledge here. I don't really know the origins
of the database. Oddly enough, this is one of those spots where I have plenty of up-to-date
knowledge but lack the deeper archaic lore that I usually go for. So, dear listener, let's go on a journey together.
Let's see where databases come from, and let's follow that lineage into the modern day.
I already have a thread in mind, so it's just a matter of pulling it until the entire sweater unwinds.
Welcome back to Advent of Computing.
I'm your host, Sean Haas, and this is episode 96.
What exactly is a database? Part 1.
This is going to be one of those episodes, or rather series, where I'm going in kind of blind.
At least, on the historical side. You see,
as I said in the hook, I actually do work with databases on a daily basis in my professional career. I write statistics software, which relies on databases to function. I've been on this bench
for years, so I know my way around quite a few of the big brand name databases that
are currently in use. As with anyone who works in the software industry as long as I have,
you start to form opinions. In this case, I have my favorite databases and I have my least favorite.
One of those that falls into the latter category is SQL, the Structured Query Language.
As with all good stories, I actually do have an inciting incident for these episodes.
A buddy of mine was recently looking for a new job. Most of the listings he saw were seeking
employees with some SQL experience. Now, my friend is a pretty industrious and capable fellow, so he decided
to learn SQL so, you know, he could send his resume to more places. What better way to learn
a new tool than with a new project? We were talking one evening and he told me that he had
just installed MySQL and was starting to figure things out. I, in turn, gave him a stern warning, as any good friend should do.
I told him to watch out, that SQL works in weird and surprising ways, that it uses an archaic language interface, that joins can be a little confusing at first until you just learn them,
and that its rigid schemas can really suck when you're starting off a new
project. You know, a here be dragons sort of speech. At the time, he didn't seem that perturbed.
He said how bad could it be, right? It must be fine, especially if so many companies use it.
Well, a few days later, he got back to me. He wanted to complain about SQL, and I was more than receptive.
I'm not sharing this story to say that SQL sucks.
The database has its own applications, and that's borne out by the state of programming.
SQL is still one of the most popular databases out there.
I've always thought that this is probably due to some legacy reasons. It can't be because
SQL is the best thing out there, because it simply isn't. There are better options for certain use
cases. Sure, SQL has its uses, but it can be a bit of a blunt instrument at times. Let's just
leave it at that before I get too far into shop talk. I've been meaning to set aside some time to talk on the show about databases in general,
because I really do want to get to the root of this whole SQL thing.
I know off the top of my head that it's an old language interface, that it's an old database,
but I'm not entirely sure how old.
To do that, I've worked out that it'll take two whole episodes.
Today we're going to be starting with the earliest roots of the database, and then we'll move on to
SQL itself in the next episode. I'd like to figure out why dragons have chosen to stick around this specific program, and where those dragons were brooded. But hey, this is Advent of Computing we're talking about.
My pedagogy is firmly rooted in long-reaching rambles, so we of course can't just look at
SQL on its own.
So we have to dive into the earliest origins of the database.
This will help give us a good footing to figure out why SQL is how it is, and why I've always felt it's so
archaic. Like I said, I don't want to spend a whole episode complaining about a program I don't like.
That's not productive, and it doesn't really get me anywhere. Instead, I want to see where it's
coming from. I want to understand what SQL was made to replace, what problems it solved.
I think that could take me beyond mere frustration, so let's exorcise some of my demons.
Before we get into this episode proper, as is custom, I have to plug notes on computing history.
We have a couple new articles that have come in through the editing pipeline,
so we're well on our way to issue one, but we still need more articles. If you want to submit, you don't need any prior writing experience; just get in touch with me. Like I said, I'm not looking for people that have experience per se,
more anyone who's interested in writing articles about the history of computing.
If that sounds like you, then go to history.computer and read
how to submit. You just send me an email and I'll pair you up with an editor. Now, with that out of
the way, let's get into the episode proper. First off, I should just start by explaining what a
database is. I realize that if you aren't in IT, then you probably haven't been exposed to this stuff. Like I mentioned,
it's kind of back office technology to begin with anyway. A database is just a structured system
used to store and manage information on a computer. A database will be running on some
server somewhere, then you can connect to it in order to manipulate or otherwise access the data stored within.
This can range from adding new information to searching for records,
all the way to pretty complex computational tasks.
In most cases, a database has some smallest unit of storage.
This can be called a record or a row or a document.
I personally like document since I think it's a
nice generic name here. Each record will contain fields that store data. So you might have a zoo
database that has records for each animal. Each record would have a field for the animal's name,
its species, where it came from, its
birthday, and where it's currently living. The query part, the ability to carry out
searches, is really where a database earns its salt. In the zoo database
example, you could look for every animal record with the species dog, or you could
look for every animal from North America, or you could build a query to find, for instance, every animal born in May.
In this way, a database can be used to analyze information,
to make sense out of random piles of data.
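To make that zoo example a little more concrete, here's a minimal sketch in Python. The records and field values are made up for illustration, and there's no real database engine involved; it just shows the record-and-field idea and the kinds of queries described above.

```python
# Each record ("document") is a set of named fields. These specific
# animals and fields are invented for this example.
from datetime import date

animals = [
    {"name": "Rex",   "species": "dog",     "origin": "North America", "born": date(2019, 5, 3)},
    {"name": "Koko",  "species": "gorilla", "origin": "Africa",        "born": date(2015, 7, 4)},
    {"name": "Maple", "species": "moose",   "origin": "North America", "born": date(2021, 5, 20)},
]

# Query: every animal record with the species "dog".
dogs = [a for a in animals if a["species"] == "dog"]

# Query: every animal from North America.
north_american = [a for a in animals if a["origin"] == "North America"]

# Query: every animal born in May.
born_in_may = [a for a in animals if a["born"].month == 5]
```

A real database does the same kind of filtering, just over data that lives outside your program and at a much larger scale.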
Much like computers in general, on their own, a database isn't that useful.
These are tools, so to make them shine, they have to be used with a bit of
programming. This, as far as I'm concerned, is what makes databases a big deal. Every database
has some accompanying library so you can use the database with some other programming language.
This is what I mean when I say a database is really a tool. In this sense,
it becomes a cog in a larger machine. This allows you to use a database for persistent information
storage. Let me try to explain this since this is going to be important to the larger story.
Normally, when you run a program, there isn't any persistence of data. The program starts up and you start making variables for storing information.
Those variables only live in the computer's memory, which is intended for very short-term use.
Once the program ends, all your data just goes away.
That is, unless you do something to save your data while the program is still
running. The easiest thing to do is just print the data to screen. Then someone can jot those
numbers down and keep a record of the program's outputs. That's fine, but it sucks. It's not very
automatic. If you want to get fancier, you could output data to a file. This has the benefit of being fully digital and easy to manage.
You can also read one of those data files when the program starts up again.
So you can have some persistence there.
This is better than paper notes, but still has problems.
Files aren't structured on their own.
You have to spend some time and some code working up a way to structure these data files for later use.
There's also the matter of file size.
Usually when you read a file, the entire file just gets loaded into the computer's memory.
That's kind of a dumb way to do it, but it's often the default.
If you have a file larger than your computer's memory,
then you need to get tricky with how you handle that data. You have to write more code to solve
another annoying problem. Another issue is concurrency. What if some other program or
even another part of your own code needs to access that file? Well, that turns out to be a really bad thing. Thus, you need to write up
code to carefully manage your data file. You just keep running into these little ancillary issues
that you have to solve on your own. A good database will handle all of that for you. On the back end,
a database might even just use normal data files. But the database software has already solved this mountain of little issues for you.
You get a canned solution that saves you time, complexity, and gives you more features than raw files ever could.
This lets you focus on programming your project and gives you more power in how you manipulate your own data.
Using a database, you can simply ask for some data once your program starts,
then send your results back to the database when you're done.
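As a modern illustration of that workflow, here's a small sketch using Python's built-in sqlite3 module. Nothing here is period-accurate, and the table and column names are invented; the point is that the program hands its persistence headaches off to the database instead of juggling raw files.

```python
# A minimal sketch of database-backed persistence using Python's
# standard-library sqlite3 module. Table and column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")  # a real program would use a file path
conn.execute("CREATE TABLE results (run_id INTEGER, value REAL)")

# "Send your results back to the database when you're done."
conn.executemany("INSERT INTO results VALUES (?, ?)", [(1, 3.14), (1, 2.72)])
conn.commit()

# On the next run, "simply ask for some data once your program starts."
rows = conn.execute("SELECT value FROM results WHERE run_id = 1").fetchall()
# File layout, sizes, and concurrency are the database's problem, not ours.
```

The file handling, structuring, and locking the episode describes all happen behind those two or three calls.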
That's the general rundown, and that's also a very modern view of databases.
But where did all this come from?
The short answer is General Electric.
The long answer is a little more interesting.
You see, old school computers were dumb in quite a number of ways.
This isn't a knock against anyone or anything.
It's just that we didn't really know how to best utilize resources yet.
Back in the day, prior to timesharing, computers operated in what's
known as batch mode. Tasks would be scheduled out on a physical calendar. Once a new time slot
rolled along, the proper program would be fetched, often in the form of a stack of punch cards,
fed into the computer, and executed. You'd eventually get outputs, and then those outputs
would be passed along to whoever asked for the program to be run.
The cycle would then start anew back at the top.
This is a very serial way to use a computer.
One task comes after another after another.
Now, I've kind of beaten the time-sharing thing to death on this podcast, because it's a very important point in history. The transition from running one task at a time to running multiple tasks at once, that fundamentally
changed how computers could be used. That's the code-centric view of things, at least.
There's also the data side of things here. I think of this as an interesting interplay between technology and human practices.
At this time, pre-1960s, long-term storage devices were almost all serial. Punch cards
stored one bit after another after another, and could only be reasonably used in a specific order.
Paper tapes stored one bit after another, and had to be scanned from start to
finish. Magnetic tape, well, that's the same as paper, just fancier electronics. The digital brain
was only furnished one number at a time, so just like batch code execution, data was also handled in batches. It's just how the beast was built.
That's not a very relatable mode of operation for us, since, you know, we now, in the modern day,
have random access media. Old sequential storage is basically dead, save for certain niche
applications that most of us will never get near. I mean, I've held
a backup tape, but I've never used one to carry out a backup. Hard drives changed this at a very
fundamental level, but this happened a little slowly. The first commercial drives were released by IBM in 1956. Adoption occurred, but it's not until the 1960s that many third-party
drives hit the market. As far as I'm concerned, third-party adoption is kind of a good benchmark
of wider spread of a technology. Random access changed things considerably. One change it set the stage for was more complex data management.
This brings us up to 1961 at General Electric. Now, you probably don't think of GE as a computer
outfit, but back in the day, they actually manufactured a lot of computing machines.
Their computer concern was just one part of a much larger operation.
Each of these operating units ran with some autonomy, which is kind of weird to read about.
Separate operating units within GE would sometimes contract with other operating units,
almost like they were unconnected entities. This might have been something to get around antitrust laws, but I'm not an expert on
the history of GE, so let's just skip past that. Anyway, in one of these sections, there was a
programmer named Charles W. Bachman. At the time, GE bundled large projects into these yearly
integrated system projects, which sounds something like long-term development
sprints. Bachman recounted the outcome of one of these ISPs in a 2009 IEEE paper that he titled
The Origin of the Integrated Data Store. You see, Bachmann could, in some circles,
be considered the father of the database. The lead-up to this early database,
IDS, is very specific. This big 1961 project at GE had to do with streamlining production.
The company's high-voltage switching gear factory had recently run into an issue.
Customers weren't getting their high-voltage switching gear on time. That's a bit of a problem
for manufacturers. After some digging, a culprit was found. At the time, the factory had been using
a computer to track production. In theory, this was a simple and foolproof setup. An order would
come in, it would be entered into the shop's computer, and then the machine would schedule the actual manufacturing process.
You'd end up with a set of cards for each step, and each step would have a completion date punched onto it.
In theory, this would make operations a breeze, just kick back and let the machine manage everything.
No human overhead necessary.
But practice was a different story. From Bachman,
quote,
The root of the problem became clear after observing the many tote bins stacked under
the windows and along the aisles of various shops and assembly areas. Each tote bin contained a pile
of parts in the process of being manufactured, and had a small packet of punch cards wired to the box.
Each card represented a planned manufacturing step.
It displayed a date indicating when that step was supposed to be started.
Many cards for steps that should have been finished were still attached to the tote bins.
The original computer-generated completion dates were frequently being missed,
and the original schedule was seldom used. End quote. The computerized manager wasn't a very savvy operator.
How could it be?
It was just a machine, after all.
It planned everything under the assumption that tasks would be completed on time,
that nothing would go wrong
during manufacturing. It was supposed to be getting inputs fed to it when steps were done,
but it sounds like no one really bothered to deal with that. The machine would just spit out poorly
planned schedules anyway, so why bother? So we get two things that need correcting here. A better system would be able to account for issues in production, while also being easier to use. So that's a start for this bigger project
at GE. Bachman's analysis showed that there were many possible points of failure in the production
pipeline. He characterized the process as a network of
interconnected steps. Before components were soldered to a board, for instance, those components
would have to be sourced. The boards would have to be produced. The soldering station would have
to be available for use. Each of those prerequisites would have their own prerequisites.
Call them dependencies, if you will. It's a lot
to keep track of, and there weren't good tools to handle this kind of interconnected data.
That is, unless we talk about really early research systems like LISP at the time.
This left Bachmann and his colleagues in a bit of a bind. To solve this manufacturing problem,
they needed a new kind of tool,
one that could be used to model interconnected data. Further, the tool had to be powerful and
fast enough to actually be used. The solution was IDS, the Integrated Data Store. This is where we
have to break out the documentation. Bachman does a good job explaining
how IDS was developed and why and how it was used early on. His accounting is neat and all,
but it doesn't have the firm technical details that I like to see. I want to start by looking
at the actual data structures used in IDS because, well, it's a data base. The data representation matters here
quite a bit. So what does IDS's manual have to say? The primary data structure used here is
something that's called a chain. It's similar to something like a doubly linked list, but the older name here is a bit
more evocative, I think. The basic idea is that you construct a list out of elements,
each element having a link to the next element on the list as well as the previous element.
This lets you traverse up and down the list by following links. Something specific to IDS and to these chains is that once you reach the end of a chain,
it links back to the first element.
So these are a circular affair.
There's also one other complication that kind of makes these different than normal doubly linked lists.
Each element of the chain will have a link back to the record that
owns the chain. That's something that I'll explain a little later on, but just keep in mind for now
that these aren't very normal linked lists. These are special. So why does it matter that these are
doubly instead of singly linked elements? Well, in a normal linked list,
each element just links to the next element. You can't traverse backwards, only forwards.
In most cases, this is fine. You can program around needing these backwards links. But
sometimes the added data does help. In IDS, these backlinks allowed for more rich data modeling,
and it gives the database a very distinctive style of use.
I'm just going to use the documentation's example here,
since it's pretty good.
Let's say you're building an inventory system,
which is what IDS was really meant for.
You have three types of information
to track. Vendors, orders, and inventory items. The vendors are obviously at the top of this
hierarchy. Orders come next since each vendor has a set of orders you send out to them.
Now, this is where the data modeling really opens up. Each entry in IDS has a set of defined fields.
A vendor would have something like a vendor name, shipping address, phone number, and so on.
You can also have lists associated with these entities.
These lists are, of course, the fancy chains I described earlier.
The vendor entries in this case would each come with their own list of orders.
Those orders would each link out to the inventory items that were actually shipped.
So at each layer, you have chains being used to manage data.
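Here's a rough Python sketch of how such a chain might be modeled: a circular, doubly linked ring where the owner record is part of the ring, and every member links back to its owner. The class name, field names, and records here are my own invention for illustration, not actual IDS structures.

```python
# A sketch of an IDS-style "chain": circular, doubly linked, with every
# member pointing back at the record that owns the chain.
class Record:
    def __init__(self, name, **fields):
        self.name = name
        self.fields = fields
        self.next = self.prev = self  # a new chain is just its owner, alone
        self.owner = None

    def append(self, member):
        """Insert `member` just before the owner, i.e. at the chain's tail."""
        member.owner = self
        member.prev, member.next = self.prev, self
        self.prev.next = member
        self.prev = member

# A vendor record owning a chain of order records, as in the manual's example.
vendor = Record("vendor-1", address="1 Plant Rd", phone="555-0100")
vendor.append(Record("order-1", item="relay"))
vendor.append(Record("order-2", item="breaker"))

# Walking forward from the owner eventually wraps back around to it,
# and any member can jump straight back to its owner.
first = vendor.next
assert first.next.next is vendor
assert first.owner is vendor
```

This is what makes the chains "special" compared to a plain doubly linked list: the ring closes on itself, and the owner is always one hop away.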
We've encountered this type of modeling before.
This is a very pointer-rich dataset, as in, there are a lot of links out to other chunks of data.
There are two immediate comparisons that spring to mind.
First is Lisp, and the second is Hypertext.
Those may not sound all that connected, but all three of these technologies share a core feature.
IDS, Lisp, and Hypertext are all about links. Lisp is the most ideologically simple here.
The language is built around building, manipulating, and traversing singly linked
lists. Each list element can point to the next list element, and it can also point out to another list.
This allows Lisp to represent very complex linked data structures like trees.
It's pretty easy to see how Lisp and IDS differ.
Lisp is a programming language, so it's all about dealing with data in memory,
whereas IDS is a database, so it's all about representing and storing data.
Then we have the whole hypertext thing. The distinction between hypertext and databases is,
well, it's something that keeps me up at night. Let me explain my thinking here. The two technologies
feel very different to me. Hypertext uses links
to add context and organization to data, and is intended to present the data to end users.
It's a very human-centric kind of thing. IDS uses links to add context and organization to data,
but it's intended to be presented to other programs. The actual
nuts and bolts here are really similar. A database like IDS could even be used as part of a larger
hypertext system. The key difference, at least in my mind, is intent. Hypertext and hypermedia are
all about the human factor. The entire field is dedicated
to making data accessible and understandable to end users. Databases in general, and IDS
specifically, are geared more towards furnishing data to software. An end user isn't going to sit
down at a terminal and issue raw commands to IDS.
By the same token, hypertext systems aren't intended to be driven by other programs.
I know this kind of sounds like an I'll-know-it-when-I-see-it sort of explanation,
but I swear, my convictions here are firm.
Just trust me, hypertext and databases are different beasts.
Back to IDS.
The link is a central part of the database's construction.
The next key aspect is the idea of a schema.
At least, that's what we call it today.
A schema is just a fancy way of saying how each record is internally structured.
What are those fields named? What data type is each field?
That sort of thing. In IDS lingo, this is called the data description or record format. Here's how
the docs describe the description, and I think it really shows where IDS is coming from. Quote, in general, the format and use of record descriptions required for IDS are the same as those in COBOL record descriptions. The exception to the normal
usages are described below. End quote. The invocation of COBOL here is a little bit of a wrinkle.
IDS was created a few years after COBOL hit the scene.
The database would initially be used with GE's own programming language,
but IDS would see a whole lot of use with COBOL.
The connection is kind of unavoidable.
These data descriptions are roughly similar to how COBOL handles data.
In IDS, you can describe each field in terms of data type and size. That's kind of how COBOL does data types, but there are differences. COBOL uses pretty terse and cryptic syntax for describing
data types. It uses a few letters followed by numbers with some special
punctuation. IDS appears to use predefined data classes. For instance, the class integer 1
describes any integer between 0 and 999. I haven't been able to find a list of all data classes, but they should all follow a similar convention.
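As a purely hypothetical reconstruction, a scheme of data classes like this could be modeled as a table of allowed ranges. Only integer 1's range, 0 through 999, comes from the documentation; the second entry here is an invented extension following the same decimal-digit convention, not a documented IDS class.

```python
# Hypothetical sketch of decimal-digit "data classes". Only INTEGER 1's
# range (0..999) is from the IDS documentation; INTEGER 2 is an assumed
# extension of the same convention, invented for this example.
DATA_CLASSES = {
    "INTEGER 1": range(0, 1_000),      # three decimal digits, per the docs
    "INTEGER 2": range(0, 1_000_000),  # assumed: six decimal digits
}

def validate(data_class, value):
    """Check a field value against its declared data class."""
    return value in DATA_CLASSES[data_class]
```

Notice that the ranges fall on powers of ten, not powers of two, which is exactly the oddity discussed next.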
Now, this presents something to take note of.
999 isn't a very digital number.
It doesn't have that nice power of 2 ring to it.
It doesn't line up with a 32-bit word, or any other standard word size.
Why would you stray from that convention? Well, there are two possible
options. The worst answer is that, oh, the conventions just weren't as ironclad in the 1960s.
There were 24-bit machines, for instance. Stuff was very much still in flux. But the better answer,
or at least the more interesting one, is that Bachman was trying to make IDS somewhat platform independent. You see, early on, Bachman had to jump platforms.
The high-voltage switching supply plant never did end up using IDS. But the low-voltage switching
supply plant was interested. These departments used different GE computers.
They weren't vastly different, but Bachman does explain that he had to rewrite a lot of IDS to
work on the smaller machine over at the low-voltage plant. From early on, Bachman was dealing with
different platforms. I think in that context, it makes sense to use whatever numbers you want. IDS might not
always be on a 32-bit machine. Maybe one day it would be used on a 10-bit computer, or even a 64-bit
computer. There's quite a bit more detail about the low-level implementation here, but I think it
would behoove us to escape from that part of IDS, at least for the time
being. So far, we've just been talking about the database itself. But what was it like to actually
use IDS? How did all this data representation stuff bear fruit? It's at this point that I have
the honor of adding another confusing facet to the database story. When
programmers talk about a database, that usually encompasses a few different things. One is the
actual database program itself, the software that handles everything. Another, at least tangentially,
is the programming interface, that's the library that you use to access the database from within your own code.
And finally, there's the database's own native tongue. The third one can lead to some confusion
depending on the nomenclature. When you're working with SQL, for instance, you will have an SQL
server, but you'll also write commands in SQL, the structured query language.
It can be an annoying naming convention.
Some newer packages do get away from this, but that's slow going.
IDS starts this trend in a fittingly confusing way.
Initially, there were two languages involved here, the data description language and the data manipulation language.
We've already been talking about the first one, DDL, although somewhat obliquely.
That was just the special syntax used to describe IDS records.
This brings us into something of an interplay with COBOL.
You see, this business language was codified in 1959, just two years
before development started on IDS. I was actually just dipping my toes into COBOL land preparing a
bonus episode, so I don't want to dive in again, but I think I have to. COBOL was designed as
something of a proto-database system.
So here's the deal.
A COBOL program is composed of two main divisions, a data division and a procedure division. The data division describes all the variables and data structures you're going to use.
This can be simple things like just normal variables, or it can be entire structured
tables of data.
DDL is the IDS equivalent of the data division.
It describes your data structures, their field names, field types, and the overall shape
of your data.
The only real difference is that DDL was used to describe external data structures, whereas
COBOL's data division was an internal-only thing.
IDS was built to spread around its data, in theory.
The second language in play is the IDS Data Manipulation Language, or DML.
This is equivalent to COBOL's Procedure Division.
All your queries, all your reads and writes are written in DML.
At least, there is a DML at some point in the process.
By the time IDS is actually out in the wild and well-documented, everything is mediated
through COBOL.
IDS was being accessed using a COBOL library that spliced DML commands into COBOL itself.
At least, that's kind of what it looks like to me.
All the syntax for DML is pretty in line with traditional COBOL-style syntax.
DML is the more important of the interfaces here.
Most of the code you write for a database is involved with manipulating data. You only have to describe your data one time, at least that's the ideal situation.
Sometimes we mess up, but hey, DML is where it's at. The thing is, the operating principles
used in IDS are… strange. It's a lot different than a modern database, so it's been a little hard for me to
wrap my head around it. IDS operations used this scratchpad thing that Bachman calls the
working space. This was used to store records that were being actively used. If you wanted to
update your record, you'd first load it into the working space, make your changes, and then store the updated record back in IDS.
It's a little odd, but I guess not that out of the ordinary.
Most data access works this way in the background; IDS is just explicit about the setup.
There's also the matter of chains and links.
This is really what confused me for
quite some time. In IDS, you don't just ask the system for some specific record. Let's stick with
the vendor order item database from earlier. If you wanted a specific order but didn't know the
full record name, then you'd first retrieve the vendor record.
That would be your starting point.
From there, you'd retrieve the first order on the order chain.
Then you'd get the next order on the chain, and so on.
You'd keep following the chain until you hit the order you wanted or were returned to the first order in the chain.
Instead of writing complex queries, a programmer was expected to navigate this sea of links.
As such, IDS and similar systems have been retroactively called navigational databases.
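That navigation loop can be sketched in a few lines of Python. The ring of plain dicts and the record names are invented for illustration, and this is not real IDS syntax; the point is the shape of the access pattern: retrieve the first chain member, then keep retrieving the next one until you find your record or wrap back to the start.

```python
# A sketch of navigational retrieval over a circular chain of plain dicts.
# The records and field names are made up; only the access pattern matters.
vendor = {"name": "vendor-1"}
order_a = {"name": "order-1", "item": "relay"}
order_b = {"name": "order-2", "item": "breaker"}

# Wire up the circular chain: vendor -> order-1 -> order-2 -> vendor.
vendor["next"], order_a["next"], order_b["next"] = order_a, order_b, vendor

def retrieve(owner, wanted_item):
    """Follow next-links until a match, or until we wrap back to the owner."""
    current = owner["next"]          # retrieve the first member of the chain
    while current is not owner:      # reaching the owner means we've wrapped
        if current.get("item") == wanted_item:
            return current           # conceptually: load into the working space
        current = current["next"]    # retrieve the next record on the chain
    return None                      # walked the whole chain with no hit

found = retrieve(vendor, "breaker")
missing = retrieve(vendor, "transformer")
```

Every lookup the programmer wants has to be spelled out as a walk like this, which is exactly the complexity being pushed onto the programmer.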
There are distinct benefits to this over earlier sequential data storage, but there are also downsides.
Really, I think this is leaning into random access in the wrong way. Sure, it's faster
to follow a handful of links instead of sequentially searching the entire table. The downside is that
this complexity is pushed onto the programmer. So while IDS did handle a lot of things for the
programmer, it didn't go all the way. Every paper or document
on IDS that I've seen includes these crazy maps of how a database can be laid out. I'll link to
some of the documentation in the show's description so you can see it yourself, because this has to
be seen to be believed. They're basically these big circular webs. Arrows represent links with boxes representing
documents. And you can get from any record to any other by following these arrows through other
records, but it can take quite a few operations to do so. The result is you end up navigating IDS
like driving through an unfamiliar city. You get street signs and
everything does connect up to everything else, it just might take some time to get from point A to
point B, especially if you don't know your path by heart. During this cruise, there's one command
you'll rely on. Retrieve. It's sort of the do-it-all command of IDS, kind of like the move operation in most
assembly languages. Using retrieve, you grab a record by name. IDS calls this the data name or
level one name. It's basically a key that each record must have, and it's unique.
Call retrieve with this name, and your record is loaded
into the working space. From there, you start to navigate. But how do you navigate, you ask?
Well, simply run another retrieve. You can use the same command to follow links forward and
backwards. Some records even have links going back to the top-level record
that they're all connected to. Those links are followed with, you guessed it, Retrieve.
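A rough Python rendering of those three uses of retrieve: get a record by name, get the next record on a chain, and follow a link back up to the owning record. The field names here are invented for the sketch.

```python
# One multi-purpose operation, three roles -- roughly how retrieve
# covered direct lookup, chain traversal, and master-record links.

records = {
    "vendor1": {"first_on_chain": "order1"},
    "order1":  {"next_on_chain": "order2", "owner": "vendor1"},
    "order2":  {"next_on_chain": "order1", "owner": "vendor1"},
}

def retrieve(name):                 # RETRIEVE record-name
    return records[name]

def retrieve_next(name):            # RETRIEVE NEXT RECORD OF ... CHAIN
    return records[name]["next_on_chain"]

def retrieve_master(name):          # follow the link back to the owner
    return records[name]["owner"]

print(retrieve_next("order1"))      # -> order2
print(retrieve_master("order2"))    # -> vendor1
```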
This doesn't smack of bad design to me, it's just kind of weird. Due to this multi-use nature,
Retrieve has a variable syntax. You would say retrieve record name to just get a record,
so you might write retrieve vendor one to get the first vendor. Then retrieve next record of
record name chain to follow the chain, so something like retrieve next record of vendor one chain.
There are also a handful of other ways to invoke retrieve, but I won't list them all here. They're all in a similar vein.
What I would like to draw attention to is that second example, retrieve next record of name chain. That's pretty verbose. You have the main operation, retrieve, the argument,
your record name, and then just a pile of words. I could see the syntax being simplified to
something like retrieve next record name, or heck, maybe just next if the record is already in the working space, because you do have
to retrieve the record first. This verbosity is something to keep an eye out for. Another
interesting feature is the sparseness of the syntax. Everything is just an English word.
You know, the kind you find in a dictionary. DML doesn't use any special characters, not even punctuation.
Every command is composed of simple words and record names.
This is somewhat unique for a language of the era.
Contemporaries like Fortran, Lisp, or Algol are chock-a-block with special characters.
Fortran used parens for accessing arrays and
calling functions. Lisp is composed almost entirely of parentheses. And Algol likes to
use colons of all kinds. But IDS doesn't work that way. Part of this could be thanks to the
relative simplicity of DML. It's not a full programming language. You don't have
loops or even really complex data structures. The entire language is just for describing and
navigating data. You aren't even making complex queries, so you don't need to group arguments
together. So hey, maybe special characters just weren't needed. There's also the lineage factor at play here.
IDS was initially developed for use with GECOM, the General Compiler.
This was an internal language used at GE and with GE computers.
Now, GECOM is weird.
At the time, it was billed as the future.
Quote,
The GE225 GECOM system, an advanced and effective automatic coding method,
provides the next logical step in programming evolution.
GECOM is a step towards fulfillment of the much-needed total system concept,
a concept that deems an information processing system
to be an integration of application, programming,
and information processing computer.
End quote.
Now, I don't want to veer off into a discussion of GECOM
because I have a feeling I could talk about this for hours.
This is a language that can be compiled
into other programming
languages. Call it a programming language programming language. That brings up some
very interesting philosophical questions and technical questions. But to digress a little bit,
GECOM was built in an attempt to further streamline programming. At this point, we had
languages that were compiled down to machine code, so why not take the next logical step?
I don't know how well that worked in practice, I don't even think that's very logical,
and I'm guessing it didn't really sell too well either. GECOM could output a number of languages. That in itself is a
wild sentence. The primary language used here, and the inspiration for GECOM's own look and feel,
was a language that starts with the letter C. Now, I'm not talking the Bell Labs original. I'm talking COBOL. All roads lead to COBOL today. GECOM was
partly an effort to generalize COBOL by integrating features from other languages. But despite all the
new trappings, the bottom line is that GECOM looked like COBOL. It even had separate data and procedure divisions. It was mainly
written in simple English words. So we can say that IDS was, in a certain way,
built for use with COBOL. That, I think, explains a lot of its syntax. And here's
the final thing I want to say about DML. Much like my man MF Doom, it was written
in all caps. This is one of those small nitpicky details that really cracks open a story for me.
This probably wasn't an intentional design choice, just how things had to be. Lowercase was a bit of
a luxury at the time. Either terminals would only support
uppercase characters, keyboards couldn't do lowercase, or character encoding only supported
uppercase letters. Punch cards only did uppercase, for instance, so any language that used cards
had to be in uppercase. COBOL, bless its heart, is an ALL CAPS kind of language.
But to be fair, so is FORTRAN. Modern compilers for these antiquated languages will let you break
tradition, but doing so may invoke old ghosts. Anytime you see a language or documentation that
shows code in all capital letters, that should be a red flag. That's a
warning that you're dealing with something considerably vintage. Or at least, something
that's tightly connected to older traditions and older ways. So that's the first database.
But what followed? Well, this gives us the familiar story of spread, diversification, and down-the-line standardization.
The idea of a system for structuring data and providing slick access to said data, well, that was just plain good.
IDS would see use both inside and outside of GE.
It would see a number of ports to different GE machines, but we're still in the era where software was firmly tied to hardware.
GE didn't release a version of IDS for IBM computers.
GE customers got this swanky new database thing while others had to wait.
It didn't take long for vendors to catch on.
Big Blue themselves came in hot on GE's heels.
Their offering had a similar name, IMS,
the Information Management System. Like any good nation-state-sized entity, IBM provides a nice official story about IMS. To quote their propaganda, IMS has been an important part of worldwide computing since its inception.
On May 25th, 1961, U.S. President John F. Kennedy challenged American industry to send an American man to the moon and return him safely to Earth.
The feat was to be accomplished before the end of the decade as part of the Apollo program.
American Rockwell won the bid to build the spacecraft for the Apollo program and, in 1965, they established a partnership with IBM to fulfill the requirement of an automated system to manage large bills of material for the construction of the spacecraft.
End quote.
The story is all true, but I really don't think JFK had databases on the mind when he started pushing for a
lunar program.
We can already see how IMS was birthed out of the same type of problem that we've seen
just very recently here.
The simple fact is that building a spacecraft is a huge feat.
At the time, the Saturn V rocket, the design that would get us to the moon, was one of the largest machines ever built.
Plus, the entire rocket had to be assembled by hand.
IBM would play a huge role in the Apollo program, at least they were a big player behind the scenes.
IBM mainframes, for instance, were used to plan, test, and run the actual spacecraft involved in Apollo.
And as it turns out, IBM also provided software and hardware for managing the manufacturing of these craft themselves. We actually arrive at a
theoretical database structure here that's similar to the vendor order item example that I was using
earlier. That should go a long way towards showing that IBM and Rockwell were dealing with the same issues GE faced just a few years prior.
This is all pure manufacturing administrivia, which takes a lot of work and a lot of effort.
Rockwell needed a system that could track every part that went into the hardware for the Apollo program.
That would include rockets, spaceships, and a sundry array of
smaller machines. That is very much in the realm of what a database is good at handling.
Development on IMS started in 1966. From the beginning, this was a joint venture.
Most personnel on the project were IBMers, who were joined by programmers from Rockwell and
Caterpillar Tractor Company.
From what I understand, Caterpillar built some of the ground-based support equipment
for use in the Apollo program. Now, here's the hiccup we have to deal with.
IBM is usually tight-lipped about their corporate history. Part of this is thanks to the fact that
IBM survives into the modern day as
more or less the same corporate entity it was back during the rise of computing.
So there haven't been any grand leaks of documents after their fall. There has been no fall to
produce such leaks. This is also made worse thanks to the fact that IMS is still in use today, if you can believe it.
One of IBM's great faults, I think, is that the company hasn't published much of its history.
What we do get comes from their official history site, which is little more than a timeline,
really. You get a few paragraphs, maybe a full page, about every major event and product.
That's far from useful.
It's also very impersonal coverage.
IBM just presents everything as coming from Big Blue itself, almost by immaculate conception.
You get a few programmer names, but not many details.
Here's what we know for sure.
IMS was developed by a group of programmers, one of whom
was Vern Watts. Watts was an IBM career man who managed the development of the database.
Most early design decisions came from Watts himself. So what exactly did Watts unleash upon
the world? Well, IMS doesn't use the same architecture as IDS, but there are similarities. IMS is another
one of these pointer-rich programs. However, we don't have the weird cyclical chains of links.
Instead, IMS uses links to construct a hierarchy of records. IDS, the old GE special, may have sounded like a hierarchical system, but that's not really
the case. The hierarchy there was more implicit, something that just sort of happened naturally
due to how the database was structured. IMS, on the other hand, enforced hierarchical data.
Each database started with a type of root record. All other records would be linked off of
this base structure. Sticking with our canonical example, the first level of records might be
vendor records. From there, you might have WidgetsCo, Sean LLC, and so on. Each vendor
then has a set of orders that are linked to from the vendor record. This
link is unidirectional. You can traverse from vendor to order, but you can't go from order to
vendor. It's also interesting to note that these are somewhat implicit links. In IDS, a link was
an actual accessible field in your record. If you wanted to connect up with order
123, then you had a field called order ID with the value 123, or something similar to that.
The IBM system, at least as described in its docs, hides the linking information somewhat.
So you might not have an explicit field that says which orders a vendor owns, but the link can still be followed.
Even though this isn't hypertext,
we can apply a similar contextual understanding to these links.
In a hierarchical system like IMS,
these one-way links imply something like ownership.
A vendor owns a set of orders.
On the next layer, we might have a similar set of links for items.
Thus, we could say an order owns some items, at least in an abstract way of speaking.
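Here's a small Python sketch of that one-way hierarchy, with invented names. Walking down from vendor to order to items is trivial; notice that nothing stored on an item points back up to its vendor.

```python
# Sketch of IMS-style hierarchical records: links run one way only,
# vendor -> orders -> items. There is no link from item back to vendor.

tree = {
    "WidgetsCo": {"orders": {
        "order1": {"items": ["item5239", "item100"]},
    }},
    "SeanLLC": {"orders": {
        "order2": {"items": ["item200"]},
    }},
}

def items_for(vendor, order):
    """Going down the hierarchy is easy: vendor -> order -> items."""
    return tree[vendor]["orders"][order]["items"]

print(items_for("WidgetsCo", "order1"))   # -> ['item5239', 'item100']
```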
The idea of ownership is also baked into the software here.
Big Blue has always been about big business.
That's what the B stands for, after all.
In order to better serve large customers,
IMS was developed to be a multi-tenant system. That's the cool wording we use in the biz. It
just means that you can have more than one user accessing the database. Maybe not all at once,
but over time. This presents a bit of a problem. Let's say that our bill of materials database was accessible by each vendor.
I might log in to check that an order placed with Sean LLC was properly filed,
as is my prerogative as the corporation's owner.
But, you know, I'm a crafty guy.
I have all these self-sealing stem bolts that I want to unload.
I think NASA might like them, but I worry
that Widgets Co. may already be supplying these special bolts. So I decide to pull up the Widgets
vendor record and navigate down to check all their orders. That is some kind of crime, I think.
It at least feels like corporate espionage. You could sue me after the fact, but Sean LLC has already struck.
The data is already compromised.
However, Watts was proactive in this case.
IMS has separate user accounts with specific permission sets.
So my user could be restricted to only accessing records that are linked to from the Sean LLC vendor record.
This kind of access restriction is really a core feature nowadays, so it's neat to see it appearing in such an old database.
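The permission idea might look something like this in Python. The account names and the shape of the permission check are my own invention for illustration, not IMS's actual mechanism.

```python
# Sketch of per-user access restriction on a hierarchical database:
# each user may only enter the tree through certain root records.

tree = {
    "WidgetsCo": {"orders": ["order1"]},
    "SeanLLC":   {"orders": ["order2"]},
}

permissions = {"sean": {"SeanLLC"}, "widgets_rep": {"WidgetsCo"}}

def get_orders(user, vendor):
    """Only return orders if the user is allowed into this vendor record."""
    if vendor not in permissions[user]:
        raise PermissionError(f"{user} may not read {vendor}")
    return tree[vendor]["orders"]

print(get_orders("sean", "SeanLLC"))   # -> ['order2']
# get_orders("sean", "WidgetsCo") would raise PermissionError
```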
Those are the big features here that matter.
In general, many of the overall concepts share a lot with IDS, so we don't need to retread that ground. That said,
I do want to bring up an issue with IBM's Homespun database. What would happen if you got a call from
someone on the assembly line saying there was an issue with a part? Item number 5239 arrived
broken. Someone needs to get in touch with the vendor and order a new batch. In the database
I've described so far, it would be hard to go from an item up to a vendor. You could accomplish this
via software, but you have to traverse the entire database. If I was dealing with this, I'd start
with each vendor, then follow the links down to get a list of each vendor's items. Once I had every item for a given
vendor, I'd then manually conduct a search for the proper item number. If found, I'd be able to
report the offending vendor. While that would work, that'd be slow. You'd have to traverse
most of the database. In the worst case, you'd actually traverse the entire database. If you only had
a few hundred items, that's fine. But what if you had tens of thousands or millions? In general,
you want a database to be fast. It has to be able to deliver answers more quickly than a brute force
attempt could. Otherwise, there's no reason to use a database. It's just a nuisance. Just use brute force instead.
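That brute-force search can be sketched in a few lines of Python, again with invented names and structure. Since links only point downward, finding the vendor for a broken item means walking the whole hierarchy from the top.

```python
# Reverse lookup in a downward-only hierarchy: in the worst case,
# this visits every record in the database.

tree = {
    "WidgetsCo": {"order1": ["item5239", "item100"]},
    "SeanLLC":   {"order2": ["item200"]},
}

def vendor_for_item(item):
    """Scan every vendor's every order until the item turns up."""
    for vendor, orders in tree.items():
        for order, items in orders.items():
            if item in items:
                return vendor
    return None

print(vendor_for_item("item5239"))   # -> WidgetsCo
```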
It would be possible to create an inverse data set
to have items that owned orders that owned vendors.
That would let you ask questions like where an item came from.
However, that doubles your storage requirement.
Now you need to store a front-to-back and back-to-front version of the database.
This has been pointed out many times as a key issue with these older navigational databases.
I'm not making some unique argument here.
This is a well-tread problem.
An ancillary issue that falls out of this whole discussion is that navigational databases
relied on outside software to be useful.
Queries for these databases were simplistic. Retrieve this record by its key, follow this link,
save this modified record, insert a new record. That's about it. You couldn't ask a database
itself to answer some complicated question. You couldn't say, ask for all orders from Sean LLC
that arrived on Tuesdays. To answer a question like that, you had to rely on your own software.
Your query, in this sense, was just the program you wrote to navigate these darn databases.
I think this is evidence enough that databases had a lot of growing left to do.
One of the huge reasons to use a database is because it helps you program less.
You can offload the work of managing data to another program.
Some of our hyper-modern databases let you offload almost all of that task.
You can use these databases to crunch numbers for you, or run all kinds of really complex
analyses.
In the late 60s, we weren't even close to that level of sophistication, so there was
a lot of room to improve.
However, the path to success started with a big hurdle.
You want to push more and more work off to the database.
That's the goal here.
Navigational databases, however,
weren't well suited for that. There just wasn't much room to change their restrictive design.
Not everything in the world can be modeled as a strict hierarchy of records or a linked chain.
Those are just two types of data structures. That gives us two weapons out of a much larger arsenal to work with.
Both IMS and IDS would see a lot of use.
IMS is still being used today if you believe IBM.
However, for databases to reach their full potential, a new design was needed.
Some fundamental changes were required.
Alright, that brings us to the end of this episode. We've now reached the halfway point in our discussion of the humble yet very important database. The furthest flung origins are
unassuming to say the least. The first database, IDS, is just a fancy
way for GE to coordinate manufacturing. Bigger to-dos like IBM's IMS don't stray that far from
the original system. That database is still just a way to manage manufacturing. It's very exciting
stuff, I know. The other thing we found is an issue with these
old databases, a fundamental flaw. The first generation here, the navigational databases,
are just limited. They aren't flexible. They can't be used to express certain types of data
relationships. The whole item versus vendor flip is only one example. Early databases are tools
built for a simple job, tracking manufacturing. You'd have a hard time building, say, a family
tree using IMS. You'd have to have a whole host of duplicate databases if you wanted to figure out
something as simple as someone's parents. These limitations meant that, in practice,
there were only so many problems an early database could solve.
That was ultimately a restriction in their use.
Sure, IMS and IDS were still hugely effective,
but there was a cap on that effectiveness.
These issues were well-known very early on, and something better
was just around the corner. Next time we'll be discussing the second act, this sequel, if you
will. Thanks for listening to Advent of Computing. I'll be back in two weeks time with another piece
of computing's past, and if you like the show, there are a few ways you can support it. If you
know someone else who'd be interested in listening, then why not take a minute to
share it with them?
You can also rate and review on Apple Podcasts.
If you want to be a super fan, you can support the show directly through Advent of Computing
merch or signing up as a patron on Patreon.
Patrons get early access to episodes, polls for the direction of the show, and bonus episodes.
You can find links to everything on my website, adventofcomputing.com.
If you have any comments or suggestions for a future episode, then go ahead and shoot me a tweet.
I'm at Advent of Comp on Twitter. And as always, have a great rest of your day.