Advent of Computing - Episode 97 - What Exactly IS A Database? Part II

Episode Date: December 13, 2022

We approach the beast itself: SQL. Or, as it used to be known, SEQUEL. In this episode we will discuss how early navigational databases failed, and how we were able to move past them into a relational future. It's a fascinating tale about how careful research and planning can lead to much better tools.

Selected sources:

https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf -- Dr. Codd on relational databases

https://web.archive.org/web/20070926212100/http://www.almaden.ibm.com/cs/people/chamberlin/sequel-1974.pdf -- The first SEQUEL paper

https://people.eecs.berkeley.edu/~brewer/cs262/SystemR.pdf -- A History and Evaluation of System R

Transcript
Starting point is 00:00:00 Hey, it's your boy, Editing Bay Sean. Um, you may have noticed this episode is late, so I thought I'd hop in at the start, break my format a little, and explain why. While I was recording this episode, I was starting to get sick. And about halfway through recording, my voice gave out. So the first half of this episode is recorded with my normal setup, with my nice mic and everything, but there was a bit of a complicating factor. I was set to go on the road the day after I recorded this, so I didn't have a chance to record this while I wasn't ill. So I'm currently finishing up the last half of this episode in a closet in an Airbnb. It's a tried and true recording space. So as such, there's a change in audio quality
Starting point is 00:00:56 right now and about halfway through the episode. If you can't stand that, I'm sorry. This is the best I can do. Next episode will be back to normal with my normal recording setup, so just bear with the slight interruption. The democratization of information tends to be a force lurking in the shadows of many developments. I know, it's always great when an episode starts off sounding pretty pretentious, but there is some truth to that statement. Ted Nelson's entire career has been devoted to this idea in the form of Xanadu, this grand hypermedia system that would allow us all to share and pursue information with ease. In 1972, Alan Kay proposed this pretty wild machine called the Dynabook, basically a tablet computer that could be built using 70s tech. One of the big selling points
Starting point is 00:01:53 for his idea was that it would make education and the transmission of information easier. I mean, imagine everyone having a computer that they could just carry around. Those are great examples of the democratization of information because, let's face it, both of those examples are flashy and straightforward. Xanadu was planned as something very close to the modern internet. It's easy for us to see how that would make data more spreadable and accessible. We have Dynabooks nowadays, or at least we have a host of devices that are somewhat similar to Kay's original concept. We can visualize that, we can hold it in our hands. Of course, personal and portable computers have made information more
Starting point is 00:02:37 easy to access. But we can take this idea deeper. You see, most lines of code in the world are never visible to end users. Most stuff exists on the backend. And what better place to see changes in data accessibility than at this lower layer on the digital dip? So, can we see a democratizing effect in the evolution of databases? I think the answer is yes. in the evolution of databases? I think the answer is yes. It's undeniable that information has gotten easier and easier to come by as computer interfaces have improved. This trend isn't isolated to the cushy world of graphical interfaces. Better programming languages,
Starting point is 00:03:17 better data structures, and, yes, better databases, have all made information more democratic, more easy to grab, more approachable by the common people. Don't believe me? Then let me offer an example in the form of SQL. Welcome back to Advent of Computing. I'm your host, Sean Haas, and this is episode 97. What exactly is a database? Part 2. Last time, we started my journey towards understanding why SQL, one of the most popular databases that's still in use today, works how it does, why it is how it is. That first episode chronicled the early origins of the database itself, and the strange navigational databases that formed during the
Starting point is 00:04:13 1960s. I do recommend checking that episode out, since it will give better context for today's discussion. We closed out last episode by looking at the fatal flaw in navigational databases, inflexibility. To effectively use a navigational database, you have to follow links from one record to another. You can't use complex queries to find records. You have to specify your entire path, your entire algorithm, in your software. This makes it so that your program basically is the query. These databases push a lot of the actual work off onto the programmer. Useful, still, but not supremely so. I've been ruminating on why this rubs me the wrong way for quite a bit now, and I think I have a succinct way to explain it. To make good
Starting point is 00:05:06 use of a navigational database, you have to know how each record flows into another. You need to keep up this complex mental map of how to navigate through records. A good practitioner would know their dataset and know just how to follow the flow of pointers. Maybe you have a well-worn route from user account records into order records and on into inventory receipt listings. It sounds very kung fu to borrow the jargon file meaning of the term. You could work up some cool technobabble and throw this into a movie, even. But these are databases we're talking about. These aren't elegant tools for more civilized ages. These are tools meant for handling tax filing or manufacturing timetables or payroll
Starting point is 00:05:55 accounts. If you have to have some kind of trained and disciplined digital warrior to handle earnings reports, then you aren't hearkening back to a civilized time. You're going to get audited by the IRS. There's nothing kung fu about that. That presents a pretty big problem. Sure, early databases were good tools, but there were hard and fast limits to how useful they could be. The way around these limitations would require a major paradigm shift. A new type of database would be birthed from deep within IBM. This is where the relational database enters the fold. This is where we get into SQL proper. SQL is, in the modern day, one of the most popular types of databases. It's a huge standard in the IT industry that could lead you to the
Starting point is 00:06:46 conclusion that SQL must be somewhat modern. But, dear listener, perish the thought. SQL is of considerable vintage. In fact, it was one of the first relational databases created. Or, database languages? Standards? It gets a little muddled. The point is, SQL is old. The first lines of structured query language are contemporary with the first microprocessors. I also wrote some SQL earlier this week, so the stuff's everywhere and every when. This episode will be navigating through the transition from navigational to relational. The goal here is to try and appreciate SQL for what it is, to look at what problems it solved and why it was such a big deal back in the day. I'm hoping to come out the other end with a better understanding and appreciation for this
Starting point is 00:07:41 antiquated database. Who knows, I might even finally memorize which types of joins do what. Before we get into the episode proper, I want to reiterate my usual call for authors. Notes on Computer History, my upcoming semi-academic journal on the history of computing, is getting pretty close to issue zero. At least, we're getting close to critical mass on articles, but we still need more. So if you've ever thought about writing or researching the history of computing, which if you're listening to this show, then you've probably had a passing fancy, I highly recommend you write in and send me an article. You don't need any prior writing
Starting point is 00:08:21 experience to get started. Just find something interesting, write a draft, and send it in. Information on submissions are up on my website for the project. It's history.computer. Anyway, let's get into the show itself. One of my big concerns with navigational databases was the inversion problem that I outlined in the last episode. This is most apparent in IBM's IMS. It used unidirectional links to represent ownership between records. Take a database where the root records are vendors. Those records own orders, and each order in turn
Starting point is 00:08:59 owns item records. You can't take an item ID and figure out which vendor owns that item. Links only go down, not up. GE's IDS could be more flexible. It could actually solve this kind of issue, but you had to navigate through multiple layers of links to get to an answer. This isn't just future Sean talking, this would have been a very contemporary gripe. I think as a programmer, it would have been the first issue I would have spotted. If I was handed a manual for IDS and asked to build a database, I'd get frustrated pretty quickly. That could be due to my exposure to modern databases, but I don't think that's entirely the case. Over the course of my computer history research, I've learned that humans have
Starting point is 00:09:46 always had the same problems. The specific solutions change, but we all have similar hangups. A programmer sitting at GE in 1967 has a lot more similarities to a programmer in 2022 than we might like to admit. That's just one issue. Like any good flawed system, the issues are myriad. That's usually an indication of a systemic flaw versus a small problem that can just be waved away. Better still, we have a good amount of contemporary scholarship about these gripes. And you know me, I can't resist academic papers that contain complaints. There's something endearing about it. The best source to reach for here, in terms of coherence and impact factor, comes from
Starting point is 00:10:33 Dr. Edgar Codd. That's like the fish, but with an extra D. Now, Codd was another IBM man. However, he was also an academic. The earlier databases we've looked at weren't influenced by academics. They were, instead, built on the industry side of things. This isn't to say that you have to have a degree to do meaningful work. Just that folks with a doctor in front of their name tend to leave behind a larger paper trail. Codd definitely earned his DR. In 1955, he completed his thesis at the
Starting point is 00:11:07 University of Michigan-Ann Arbor. The paper in question was concerning cellular automata. This was an expansion to von Neumann's earlier work that definitely gives Codd some theoretical bona fides in my book. In 1970, he turned his attention to the database. That very year, he published a relational model of data for large shared databanks. It's a combination of diss track on navigational databases and a fascinating rephrasing of data management in general. The spooky part, at least for me, is it feels shockingly familiar. It feels almost too modern. The language used in the sources on early databases that I've been reading on these navigational databases is still archaic. IDS is discussed in terms of segments and offsets,
Starting point is 00:12:03 data descriptors and pages. Those are terms partially borrowed from the host hardware, which at this point only exists in museums. Codd uses strikingly modern terms. The trick here is that he was developing what we'd call a relational database, and that's a model that's still in widespread use today. So let's start with the problems Codd found. The inversion and navigation issue was one of them. Of course, Codd puts it in better terms than I can. He calls the issue access path dependency. To get to a record, you have to traverse some known path in some specific order. There may be a few ways to get the data you want, but you still need to follow the links. COD presents two more of these dependence
Starting point is 00:12:54 issues. Ordering dependence and index dependence. These are more abstract ideas, so I want to just touch on them. Ordering dependence is an issue that seems to me the most clear in IDS. In that database, you traverse chains, these strange doubly and triply linked lists, to access data. These lists, as with all lists, are inherently ordered. One entry comes after the next, and that ordering is preserved from use to use. A frustrating outcome here is that IDS, despite using random access storage, isn't really a random access medium. The order of each chain matters.
Starting point is 00:13:37 There's a dependence there that can impact performance as well as program complexity. As a programmer, this is one of those points where you get to wrestle with your tools, which is not a good feeling. Indexing dependence is more abstract still. An index is a neat feature to databases that has actually been around since the beginning of time. The idea is that you can construct a digital index that functions much like the index of a book. It stores key data that can be used to speed up queries. There's a whole subfield of study just in indexing alone. In the most simple case, you might index on some primary key, something like an item number. Creating this index will make a search for specific item numbers much faster. A good
Starting point is 00:14:26 database will handle indexing in the background. You spend some time up front to build the index, then the database picks up the rest of the slack. Creating new records adds to the index and queries are thus forever improved. All the tedious bookkeeping is automated away. All you should notice are faster queries. However, not all early databases did this. IDS, Codd explains, constructed a manual index. To get faster queries, a programmer had to actually look through the index by hand. That's more wrestling, which us scrawny programmers tend to dislike. Here's where Cod knocks the whole thing out of the park. If it was me, I would have looked at
Starting point is 00:15:11 a way to improve these specific features, make indexing easier to use, make links doubly-sided, and stop using lists. But Cod, he's a professional. He used these separate issues to diagnose a larger problem. As Codd phrased it, The problems treated here are those of data independence, the independence of application programs and terminal activities from growth in data type and changes in data representations. in data type and changes in data representations. The core idea here is that early databases expose too much of their inner workings to the programmer. To effectively use a navigational database, you have to be familiar with its nuts and bolts. This isn't something I got into last episode, but IDS and IMS both let the programmer choose the layout of data on the physical disk. In the IDS manual, it even shows little diagrams of how you can tell the database to put your
Starting point is 00:16:17 chunk of text on different physical parts of a disk. That's something that would be inconvenient today and practically inconceivable. So why would this level of control be a bad thing? Let's say you spend days or weeks setting up the perfect IDS database. You have your disk laid out just so, your indexing routines are fast, and your lists are all in optimal order. What happens if a new hotshot programmer rolls into the office and decides to add another field to one of your records? Well, all your careful planning is immediately undone. You now have to spend a few more weeks reorganizing your on-disk layouts and, worst of all, rewriting your database access routines.
Starting point is 00:17:06 You've just lost a lot of time. This cascade of work happens because your code is too closely tied to the inner workings of your database. There's no isolation here. You have to cozy up to IDS's guts. In this earlier era, a database is just a way to manage data. Codd proposes that a database should, instead, be an abstraction on top of data management. Now, this is a subtle distinction that is in line with larger developments around programming. Codd is talking about portability, at least in a roundabout sort of way. During this time period, programmers were coming to grips with the idea of software that could run on more than one type of computer. This is probably best exemplified by work around BCPL, B, and eventually C. These were all languages
Starting point is 00:17:59 designed to smooth over the rough edges of a host platform and thus insulate a programmer from all the gritty details. Programming languages in general started out as a way to implement this kind of abstraction. Fortran, although initially tied closely to IBM hardware, was invented as a way to elevate programmers out of the digital doldrums. The same goes for all of the first generation languages. Once portability entered the mix, this whole abstraction discussion became much more important. Portable programs have to have some way to deal with changes in hardware. This could be something simple. Maybe more RAM gets added to your computer and you want to take advantage of it.
Starting point is 00:18:50 A good implementation of C, backed with a reasonable standard library, will just go with the flow. More RAM means you can allocate more space inside your program. This could be more extreme. You could be migrating from a TI mainframe to an IBM PC. In theory, the same C program could run on both of those machines. At least, the same code could be massaged to do the same thing. This is possible because in C, you express your program in a higher-level fashion. You wouldn't start loading up the arithmetic logic unit and then go to town. Instead, you'd say x equals y plus z. The same holds true for all operations in the language.
Starting point is 00:19:28 Under the hood, there will be a huge difference between a C program running on different machines, but to the programmer, those differences are minimal if they even exist at all. In 1970, Codd was proposing that we apply that same approach to databases, or at least he was proposing a very similar style of approach. The best way to do this would be to establish some type of standardized interface between the programmer and the database. Something like a programming language geared towards describing data. Now, this actually encompasses more than just the language itself. There had to
Starting point is 00:20:07 be some sort of model for how data was presented to the programmer. You needed a formalized way to talk about your stored data. Codd would call his proposed solution the relational database model, or just relational model. This is where we can finally break out modern terminology, or at least I can take a bit of a breath. COD's version of a relational database ends up becoming the basis for SQL. We're going to get into how eventually here, in the meantime just absorb that fact. One outcome here is that COD's early database paper actually sounds pretty familiar. It sounds close to modern. He was writing about a theoretical database back in 1970, but the terms and descriptions
Starting point is 00:20:53 are all things that a modern database enjoyer can at least recognize. That's not to say that Codd's relational model paper is easy to read or digest. About half of it is really just written in set theory notation and equations used for describing data. Instead, it feels familiar. Let's just go with that. The core of the model is its namesake, the relation. A relation is described as a collection of domains, each domain being a set of values that share the same meaning. In other words, we're talking about a table with columns, where each column represents a specific value, and each row is some entry. That's just like any old table of values, or think of it as a fancy spreadsheet if you're that specific variety of wizard.
Starting point is 00:21:46 In these tables, each record would just be a single row of columns. Simple as that. The next major concept here is the cross-reference. This is where the terminology might be a little confusing. You see, Dr. Codd is using relational in the set theory sense. Relation just means a collection of domains, so when you hear relational, it might be better to just sub in the word tabular. Cross-references are COD's version of links, so they do express a certain type of relationship between data, but not a mathematical relationship per se. Cross-references work using a combination of primary and foreign keys. A primary key is a specific field in a table that serves as some type of unique identifier for a record. These are usually sequential, so you'd have a table with primary keys starting at 1
Starting point is 00:22:41 and working their way up with no repeated values. with primary keys starting at 1 and working their way up with no repeated values. Any record can be fully and unambiguously addressed with the name of its table and its primary key. It's like a home address. How would you connect this record up to another record? Let's say you have two tables, users and vendors. Each vendor can belong to one user, and each user already has a primary key set up for you. All you have to do is set up the vendor record to have a foreign key field. Call it user ID or something similar, and then plop in the primary key of the user you want to cross-reference.
Starting point is 00:23:20 Let me give a more concrete example just to make the point clear. I have a user account. It's userid1, since... I'm the winner. I run Sean LLC, a vendor that sells fine self-sealing stimbolts. In this database system, the vendor record for Sean LLC would have a foreign key field with my userid in it. So just a field with a 1 in it. Now, let's say I pull a hostile takeover of Widgets Co. They've been a competitor in the Stimbolt market for years, but thanks to
Starting point is 00:23:53 some corporate espionage, I was able to steal a few of their big clients. I finally secured the funding to buy them out at a pretty low price. That means the database has to be updated. Well, that's easy. Just go to the vendor record for Widgets.co and change their user ID foreign key to point to my user ID. Now the database properly shows that both vendors are owned by my user record. Here's a cool implication of this schema. You can access users on their own, you can access vendors on their own, and you can link the two together to do a sort of super access. There's no navigation involved here, and frankly minimal programming. In the case of a super access, you can just tell the database that you want to get information
Starting point is 00:24:44 for a given vendor and the user it links to. The nitty-gritty of that query, following links and constructing a return value, is handled behind the scenes. You just get a big blob of information back. I think it's clear to see that COD was outlining a totally new approach to database management. What's more, this change in approach wasn't coming from some new industry need. It wasn't part of a larger manufacturing project. The navigational databases that we looked at earlier were tailor-made to solve very specific manufacturing issues. They were tools built for specific jobs. There is nothing wrong with having a handcrafted tool that's suited to a single task,
Starting point is 00:25:27 but you can't use it for some new job that comes down the line. A Phillips head screwdriver can't deal with a Torx screw. Codd's relational database doesn't follow this earlier trend. His work, at least at this point, was theoretical. It was backed by math and musings, not industry realities. In my mind, this puts the new database into a different category. It's more in line with something like Lisp, a language built from first mathematical principles. In the case of Lisp, this led to a programming language that, to many, is the essence of programming itself.
Starting point is 00:26:06 language that, to many, is the essence of programming itself. It's a pared-down language that has just what you need and can be used to craft elegant solutions. Very much an elegant tool for a more civilized age, and all the best implications of that phrase. I think this bodes well as far as ideological pedigrees. So, what became of Codd's work? Let's take a beat and set the scene. The concept of a relational database is born in 1970. The next year, the Intel 4004 is announced, which is a reasonable marker for the beginning of the microprocessor era. In 1972, IBM starts work on the SCAMP, one of the first portable computers. 1973 sees the creation of the Xerox Alto, the first graphical and personal computer. That same year, the TV typewriter is unveiled, which paves the way for home microcomputers.
Starting point is 00:27:02 In the span of a few very short years, the digital landscape is changing profoundly. This brings us to 1974. This is the year that a team inside IBM develops a language called SQL. It's one of the first implementations of COD's earlier work. At least, 74 is the usual year given for SQL's birth. The language was actually developed during this fast-moving period in the beginning of the 70s. The project started in 1972 when two IBM employees, Don Chamberlain and Ray Boyce, met Dr. Codd. These two were in attendance at a lecture Codd gave about his ideas on relational databases. Boyce and Chamberlain were both researchers working in IBM's lab, so they were in the same sort of academic-y world as Codd. Their first exposure to this relational model really stuck
Starting point is 00:27:58 with them. At the time, they'd been using DBTG. It's this gross standardized database language that had grown out of IDS. They were trapped in a navigational world, and they had just seen a way out. As Chamberlain put it in his paper Early History of SQL, For the first time, we could see how a query that would require a complex program in the dbtg language could be reduced to a few simple lines using one of COD's relational languages. It became a game for the two of us to invent queries and challenge each other to express them in various query languages. End quote. This is how SQL started, as a game between two co-workers. There were already some proposed languages that
Starting point is 00:28:47 implemented the relational model, but they were practically unusable. Codd had even offered a few options. They were just, well, there was no practical way to put them to use. Remember how Codd was a math nerd? One consequence of that was that he chose to express queries in terms of set theory, funny symbols and all. There's a practical problem there. Computers back in the day had very limited character sets. The wild operator symbols that Codd used just didn't exist in the digital vocabulary. So right off the bat, Codd's own languages weren't really usable. There's also the human factor to keep in mind. Very few people are actual diehard math nerds. A subset of those math nerds are familiar with set theory.
Starting point is 00:29:40 There's a relation you can compose there, and it's not a very favorable one. The point is, very few people could sit down and understand COD's languages. The relational model could make databases much easier to use and much more practical, but there is still this language barrier to overcome. The BNC team saw this as a huge issue. The point of the relational model was that it made a database more accessible. So it had to have an accessible language to go with it. The duo started by sketching out a simplified mathematical-based language, but that still had some issues.
Starting point is 00:30:19 It wouldn't be until 1973 that SQL itself rolled along. This next stage of development would happen under the banner of the System R project. This project, which, believe me, we won't be talking about in detail, was IBM's attempt to create a next-generational database. You can guess by the R that it was planned to be relational. Boyce and Chamberlain developed SQL as a possible language front-end for System R, a front that users would actually interact with. Chamberlain uses an illustrative example during his discussion of the history of SQL. I'm going to paraphrase it here to save a little space. Say you have a database that stores salary information for
Starting point is 00:31:05 everyone at a company. Each record has the person's name, their salary, and a reference to their supervisor. It could be a link, it could be a foreign key, or just the supervisor's name. It just matters that the information's there. How do you find all employees that make more money than their direct manager? Using an old-style navigational database, you'd have to write a program that traversed all of your dataset. You would have to know how records were connected, how everything was linked together, and your code would be tied to that specific structure. You run into the dependency problems pretty quickly. Boyce and Chamberlain realized that a relational database just greatly simplifies the salary question.
Starting point is 00:31:50 You can write a single query that will find the right employee records. You don't have to have extensive knowledge of how the data is constructed. You don't even have to use an external programming language. You can just issue a command. This is the root of how SQL was constructed and conceived of. It was a simple-to-understand language that would complement this new simple-to-understand database. We can even see this ideology in the language's name. SQL was originally an awkward shortening of Structured English Query Language. Something I didn't expect
Starting point is 00:32:28 is how much SQL was targeted for non-computer nerds. The 1974 description of the language, basically its first official paper trail, tucks this line in its opening passage, There is an increasing need to bring the non-professional user into effective communication with a formatted database. Much of the success of the computer industry depends on developing a class of users other than trained computer specialists, end quote. Even at the outset, SQL is being put into a pretty grandiose position. Even at the outset, SQL is being put into a pretty grandiose position.
Starting point is 00:33:12 It's set to be a simple language that's accessible, as well as a cog in the larger expansion of the computing industry. The key here is simplicity, which we need to keep in the front of our minds. SQL was built to work with tables of data. We're talking the same type of domain relations that Codd outlined a few years prior. Boyce and Chamberlain would end up actually renaming a number of Codd's ideas to make them more accessible, more palatable to non-math people. In SQL, each table is composed of columns of data, each column with a given data type. Thus, each row of a table is a full relation, or just a row if you prefer the simplified terms. Much like earlier database languages, SQL almost exclusively used alphanumeric characters. Everything was written in big, all-caps English words.
Starting point is 00:34:08 in big all-caps English words. The most used of these words is, by far, SELECT. This is the do-it-all command that ties queries together. To retrieve a record, you simply write out SELECT FIELDS from TABLE. To filter results, you'd add a WHERE FIELD equals blah. That just goes at the end of your query. I think it's plain to see how really simple this all is. That's the best word for it. You can actually read a line of SQL aloud and it makes good sense. The only special syntax in this example, if you can even call it that, is an equal sign. That on its own lowers the barrier to entry. You could shout a query across a room and have a novice type it in without issues. That's some power right there. Select from where commands are still a defining feature of SQL in the modern day. By modern day here, I don't even mean like 1990 or 2000. I mean, in 2020, SQL still uses the very same syntax for SELECT.
Starting point is 00:35:11 This very early SQL paper describes a language that looks shockingly similar to anyone who's used SQL. However, there are some oddities at play. To start with, we do have something that breaks the only English rule. You see, the SQL paper describes this thing called the union operator. This is where we can see set theory creeping in. The union operator, represented by this lovely upside-down capital U, takes the union of two results. In other words, you get all results from both arguments without any duplication in data. The example given shows the output of two select statements being combined. Not only is this character missing
Starting point is 00:35:59 from a keyboard, but if you aren't a mathlete, then you probably don't know what to expect from a union. There are a few other weird set theory hints around here, at least if you know where to look. Select was initially described a little wrong. In modern SQL, a select just gives you whatever you ask for. A SQL select, on the other hand, gives you a unique result set. This is a subtle difference, but one that does matter. In set theory, a set is a unique list of values. So while SQL is starting the move towards a more vernacular approach to data, there's still this mathematical DNA very much in the forefront. There's also just a bunch of weird and clunky syntax going on. Old-school SQL doesn't do join operations. That's the more modern way to handle cross-references. Instead, you use nested selects to accomplish a somewhat similar thing.
Starting point is 00:37:06 Then there's the matter of labels. This is kind of a technical gripe, but I think it's an important thing to look at. In SQL, it's really common to alias column and table names. You might write a query like select name from employees as e, where e.id equals one. Using the as keyword, you can temporarily rename a column or table. This new name only applies within that query. In this case, I'm renaming the employees table as just E and then referencing that in my where clause. This is kind of a trivial example.
Starting point is 00:37:50 It's not very useful since you don't need to use the table name inside this where clause. More often, you use aliases when constructing really complex queries that involve multiple tables in one line. selecting really complex queries that involve multiple tables in one line. You might alias employees as e and departments as d to save on code later on. You can also do math inside select statements, and you usually want to alias the results of a math operation into something reasonable. This matters on a pretty basic level because every column in a relational database is uniquely identifiable. Each column has an address, which, that's important because queries are all about juggling
Starting point is 00:38:36 columns around. In SQL, that address is simply the table name and the column name. It's expressed as table.column. Simple. So an employee's name is stored in employees.name. You can alias that up to something like e.name or even e.in if you're nasty and you hate your co-workers. When you don't use a fully qualified column name, SQL assumes you're talking about the table you're accessing. That's the from part of a query. If you're selecting something from employees, then you don't need to specify the table you're using in the where clause. There's only one table you could be talking about. There's no
Starting point is 00:39:25 ambiguity. But how can you break this? In SQL, you don't often run into ambiguity. The most common issue I can think of is if you have shared column names between tables. But you can get around that by using fully qualified column names. Just add table. and you're good. These ambiguities can show up in joins, these pesky, fancy cross-reference operations. But SQL, at least early on, doesn't have joins. It nests. That can lead to a subtle issue, though. This is where we hit the bottom of this weird rabbit hole.
Starting point is 00:40:07 When you make a select, you're technically creating a brand new relation. The first argument to a select is a list of columns to return. At least, that's how the user sees it. But to SQL, to SQL, you're actually creating a new relation, but it doesn't belong to any table. It just exists somewhere in some ambiguous space. What would happen if, say, you wanted to reference some column in this new barely extant relation? That may sound like an edge case, but it's actually not. This is one way that cross-references worked in SQL. This even shows up in our canonical example of the episode, selecting employees who earn more than their managers. In our beautiful 74 vintage SQL,
Starting point is 00:41:02 this query is very possible, but only with the use of what's called a block label. This is a label you can give to the results of a select, basically a way to temporarily name that temporary relation. But the syntax, this is where it gets really apparent how early SQL is. A labeled query would start like this. Label colon select name from and then so on. This looks like a small detail, but it's actually a big inconsistency. It's a break in the established reasoning for SQL's syntax. The label isn't announced at all.
Starting point is 00:41:53 It doesn't fit the usual flow of select from where. It's just a word tacked on the front, followed by a colon. It very much makes me think, oh no, our example code doesn't actually work. Let's quickly fix it. This is Chamberlain and Voice's first pass at a non-mathematical programming language. Some issues are just kind of bound to appear. Now, I'm harping on this so much because, well, it caught me off guard. To me, modern SQL doesn't feel modern at all. It feels a lot like programming Fortran or COBOL. It has all these gotchas I don't expect, documentation comes in ALL CAPS, and it's super verbose. I just assumed it must be one of these totally fossilized languages.
Starting point is 00:42:46 Fortran has changed over the years, but those changes are mainly expansions. SQL, on the other hand, has dropped some of its vestigial syntax. At least, I haven't seen a block label in the years I've spent programming SQL. Maybe we should look at SQL less as an early version of SQL and more as a rough draft. Now, there's still the matter of lineage here. SQL isn't just a forced acronym. Boyce and Chamberlain had a reason to force things to fit that word in particular. You see, this language was their sequel to an earlier programming language. The language in question was called Square. The 74 sequel paper actually spends about maybe half of its length discussing Square, so it's something we should at least glance at.
Starting point is 00:43:47 Square incorporated much more mathematical notation than SQL ever did. This actually makes it almost impossible to read in standard type. It uses a whole lot of superscripts and subscripts, which is fine, But for whatever reason, period IBM papers notate that by using three lines of text. You have the center line used for the normal script, superscript on the line above, and subscript on the line below. The effect is that square code is presented in an almost impenetrable wall of text. A very fine first impression, if I do say so myself. A study was conducted by Phyllis Reisner, another IBM researcher,
Starting point is 00:44:38 on how easy it was to learn square versus SQL. The results were pretty unambiguous. Students were able to pick up SQL more quickly and more easily. The batch of students included programmers and non-programmers. So while Square is a cool footnote, it was never a practical kind of language. So that's our rundown of SQL, the language. But that's only part of a database, only one component. The language is really just the interface. To go deeper, we need to look at an implementation. What better example than SystemR, the very first experimental SQL database? SystemR started development in 1973. It was IBM's attempt to develop a next-generation relational database.
Starting point is 00:45:29 It's a nice line, but it doesn't tell us much. Near the end of the 70s, a paper discussing the development, history, and analysis of System R was published. It's kind of a treasure trove for me, at least in researching this early era of relational databases. The paper lists seven key goals to the overall project. I'm going to mix, match, and reduce those down into something more digestible. First of all, System R was experimental. It was not intended as a product. experimental. It was not intended as a product. Instead, it was viewed as a way to see if relational databases would, you know, actually be useful to people. This is far different than
Starting point is 00:46:13 the roots of IDS and IMS. System R was a dry run, part of a larger experimental project. This is in distinct contrast to purpose-built software that dominated the early database scene. That distinction alone is, once again, very, very important. System R was intended to have a high-level language interface. That's the SQL-slash-SQL connection. The language was intended to only serve as an interface into the database program, not as a way to specify lower-level functionality. As a result, SQL doesn't offer some of the nitpicky options that were found in navigational databases. You can't, for instance, use SQL to define file layouts.
Starting point is 00:47:05 You can define tables and ask for data, but that's it. This feeds into something that I hadn't initially considered. This comes directly from a history and evaluation of Systemar. Quote, to support a rapidly changing database environment in which tables, indexes, views, transactions, and other objects could easily be added to and removed from the database without stopping the system. That's the third goal in the overall SystemR project. Not only does this high-level interface make it easier to use System R, but it also lets you gloss over changes in the underlying data. That high-level interface allows for changes in
Starting point is 00:47:55 transactions, tables, indexes, views, and other objects. New tables might be added, information could be removed, or whole new fields could be added to existing tables, but your old queries will still work. You're never messing with underlying data when you're programming SQL. You're just asking for specific bits and pieces. It's up to system R to figure out how to serve those queries. This means that really a whole lot can change on the backend without a user ever noticing. You could even totally revamp how files are being stored or how tables are internally structured. As long as your employee's table still has a name field, then you can find every employee named Dave. You won't even need to change your code. This is that whole abstraction point we discussed earlier. Yes, portability is
Starting point is 00:48:56 one reason to go this route. That said, there is another huge incentive. Cost cost. This is a really old-school reason for constructing a programming language. In fact, it's why programming languages emerged from the primordial soup in the first place. In the 50s, we reach a point where the cost of programming outweighed the cost of building hardware. Researchers like Rear Admiral Grace Hopper approached this issue by creating the first programming languages. High-level languages, that is, anything more accessible to us humans than assembly language,
Starting point is 00:49:38 make fantastic tools for cutting cost. At a basic level, they just make programming faster. A single line of code can do more than a single CPU instruction. It also makes code easier to read and understand, thus rendering it easier to debug. Those two factors save you hourly costs. That shows up as a nice line item in payroll. There's a more subtle way these kinds of languages earn their keep. They expand who can be a programmer. Fortran was specifically developed to get non-programmers using computers. SQL 2 was designed as a language that non-programmers could easily pick up. I mean, there were even
Starting point is 00:50:26 human factor studies conducted to make sure that goal was being achieved. That saves money because you no longer have to hire people with PhDs to crunch numbers. Suddenly, database administration turns from a high-octane programming job to something at least a little closer to clerical work. This is a democratizing effect, sure, but it also means you can hire less skilled workers to crunch numbers. That, in turn, means companies can save a few bucks. Again, this is getting us to the root of what made SQL different from earlier databases. This thing is a monster that comes straight out of the research lab, not a tool developed for specific industry needs. It has nothing to do with manufacturing at GE or subcontracting on the Apollo program.
Starting point is 00:51:20 SQL and System R are research projects, built straight from first principles and backed with good reasons. I think that speaks volumes as to why SQL has had such staying power. We can see more of these lab-grown bona fides in the development of System R itself. Not only was System R totally experimental, it also had a prototype phase. And I gotta say, I love this kind of stuff. The nerds inside IBM started with what they called System R Phase Zero. That straight up sounds like something out of a bad sci-fi film. A database has no right to sound that cool.
Starting point is 00:52:06 Now, Phase Zero was intended as a way to rough out a relational database and really try to scope out issues before starting in on the real project. As such, System R Phase Zero was intended to be scrapped as soon as testing was done. This was the first implementation of SQL. It was also around this time that the name change occurred. The language started showing up in official documents as SQL, the Structured Query Language. The name was different, but this was still SQL, the same language we've been discussing. So yeah, SQL has a direct lineage going back to 1973. If we're a little generous, then there's a less direct chain going back to 1970. Now, there are some cool aspects about Phase Zero that we need to discuss.
Starting point is 00:53:03 I think we've hit the language part pretty hard, so I want to bring us around to the back end. Phase Zero was a single user system. This was done to facilitate rapid development and sidestep the issues that came with multi-user systems. Those were problems that IBMers had already encountered, so there were existing solutions around the office. This prototype was also layered, much like an onion. SQL was the top layer, the abstraction layer that users and programmers would actually see. But SQL didn't command SystemR directly, I don't know. Instead, SQL statements were used as gentle suggestions from the user.
Starting point is 00:53:49 Remember, you can't specify the specific algorithm used to manipulate data. You just send out a query. System R has to figure out how best to execute that query. The database program knows about the underlying structure of the database itself, how files are stored, all that low-level jazz. It's able to make optimizations and tweaks to make that query fast and functional. SQL queries were actually compiled to execute on an intermediary layer. That layer was called XRM, and there's not a lot of information about it on the net. This is due to the fact that it was kind of an IBM internal thing, so the papers about it aren't super accessible. XRM was also dropped after phase 0, so it wasn't in use with system R for very long. What matters here is that XRM acted as an
Starting point is 00:54:48 interface between SQL and System R itself. From what I gathered, it was a lower-level programming interface that would let you describe the full algorithm to execute. So, something kind of like the older navigational database languages, but geared towards relational data, if that makes any sense. I keep talking about how important abstraction was to SQL. Well, this is how that abstraction was actually implemented. Phase 0 would show enough promise that the team was able to move to Phase 1, total rewrite. This was still experimental, but Phase 1 would wind up being used both internally and by select, very lucky IBM customers. This was the opportunity to put relational databases really through their paces.
Starting point is 00:55:46 relational databases really through their paces. This entailed a full redesign, and this is where XRM is dropped in favor of a layer called RSS. It served the same purpose but was more refined, at least that's what the documents lead me to believe. Phase 1 also saw more of a user population. System R was now a multi-user system with all the trappings that entailed. Phase 1 is also where the dreaded join was used for the first time in real life. Boyce and Chamberlain had called for a join command when they initially drafted SQL, or at least a little bit after the first paper, But it hadn't been implemented in phase zero. A join is dreaded because it's one of the more complex operations you can do in SQL. You join together multiple tables based off some cross-reference key, usually a foreign key in one table paired
Starting point is 00:56:41 up to a primary key in another. Think of it as lining up two tables to form a much larger table. You can actually join more than two tables, leading to these cumbersome chains of joins using aliases and all kinds of nasty business. I can see why IBM wouldn't want to do that in their beautiful Phase Zero prototype. But something made the SystemR team change their minds. Quoting again, Observation of some of the applications of Phase Zero convinced us of the importance of the join formulation of SQL. End quote.
Starting point is 00:57:22 Nice and vague. My guess is they were looking for more flexibility. A join is complex enough that it's almost like its own little program. You're telling SQL that you want it to scan a table and for each employee look up the employee ID in the payroll table and then get their salary. You can do something very similar using nested selects, but you only get limited information back. A select only returns columns from one table. It's from table. A join can give you columns from really any number of tables, from every table involved. In some cases, that will let you write fewer queries to get the same amount of data. In phase one, we reach the point where SystemR is ready to
Starting point is 00:58:13 interface with other languages. Think way back to last episode. The whole point of a database is that it backs up other software. It's a tool in the programmer's ever-growing belt. Phase 1 had three interfaces. A native command line, a COBOL library, and a PL1 library. And here's the brilliant part, and something we still grapple with today. All three of these interfaces were the same. It was SQL all the way. To access your database in COBOL, you wrote inline SQL queries.
Starting point is 00:58:52 If you knew how to run the System.R command line, then you could program it in PLI. Everything was the same across languages. And I cannot emphasize how crucial this choice was. Today, this is a given, but it wasn't so back in the 70s. One thing that struck me about IMS, IBM's earlier navigational database, was that it was really hard to understand its documentation. Its interfaces used different names for the same calls. So you might be able to drive IMS's native interface, but you wouldn't necessarily know how to use it with COBOL. This is something that we still struggle with today. The official Python library for MongoDB
Starting point is 00:59:41 actually falls into this trap, and I will never forgive them. That library changes the name of some functions from MongoDB's native interface. The newest version of PyMongo uses count documents to get a count of records, where MongoDB just uses count, for instance. I don't know how they justify it, but it is the wrong choice, and they will never convince me otherwise. I don't care what the rationalization is. The native interface should be the golden standard for a library. That lets programmers switch between running on-the-fly queries and just programming. It also means that the library's authors don't have to reinvent the wheel.
Starting point is 01:00:25 You just write from a spec. You never need two ways to do the same thing. When users got their hands on System R, they really appreciated this consistency. It sounds like everyone thought it was a good idea, because it kinda is. idea. Because it kinda is. There is the downside that it, in some cases at least, can look strange to have random SQL mixed in with your code. But hey, it works and folk liked it. This is still the default way for dealing with SQL. You can hide your queries under more abstraction layers. and hide your queries under more abstraction layers. Libraries exist for just that purpose.
Starting point is 01:01:09 But if you open up a standard SQL library, chances are you'll be issuing queries in normal SQL syntax. This does bring in something that will show up way down the line here. You see, usually when you run an SQL query from another language, you call up some special function that will send your query off to the SQL server. That query is written and sent as a string, just a chunk of unadorned text. I don't think that was the case with COBOL and PLI, but more recent implementations roll this way. SQL statements look really similar to COBOL, so they could mesh well with COBOL's code. The compiler probably wouldn't really care. But SQL and PHP, a language very commonly used with the database nowadays,
Starting point is 01:02:03 well, those two languages look totally different. If you put a raw line of SQL in a PHP file, your code wouldn't just be ugly, it wouldn't run. So you have to pass around SQL queries as strings, as variables. So let's say you have a form that lets you search for users by name. You type in a name and you hit send. On the back end, a select query is constructed, your name is put into the where section, and then the query is executed. But what if you put in a really dumb name?
Starting point is 01:02:38 Like, let's just say you have a friend named quote semicolon drop table users. Say what you will, but it's a family name. This family happens to break databases. What I've laid out is a pretty traditional SQL injection attack. If your code is just blindly shoveling users' inputs into queries, then you're asking for a bad time. A malicious user can exploit that to cause havoc with your database. So there are some trade-offs with this uniform syntax approach. But hey, injection attacks don't show up for quite some time. The final note here is that even during testing, System R and SQL continued to grow.
Starting point is 01:03:29 IBM was very receptive to feedback from users. This was an experiment after all. At least four new SQL keywords were added from user suggestions. I think this really shows that IBM was taking System R really seriously. It was a well-thought-out and well-built project. One of these user-added commands, like, is something I use all the time. It lets you perform partial matches, so you can look for fields that contain some string of letters. It's one of those core functionality things, and it's neat to know that it came from experimental feedback. The bottom line here is that SQL was part of a serious attempt to make a better kind of database. In latter years, that dedication would help it conquer the world.
Starting point is 01:04:22 would help it conquer the world. Alright, thus we reach the conclusion of our database deep dive. In these two episodes, we've covered the don of the database up to SQL. As far as I'm concerned, that drops us right into the existing status quo. So, what have we learned? Have I found a way to see past my frustrations with SQL? I like to think I have. Early databases, these navigational types,
Starting point is 01:04:57 would have really sucked to work with. Even outside all the technical problems, navigational databases were just cumbersome beasts. That was their very nature, so it was hard to correct. The best way out was to just scrap them and start over, to go back to first principles. Dr. Codd started that process in 1970. The idea of a relational database was brewing during a time of really rapid technological change. Core tenants were established by drawing from set theory. This led to something fundamentally new, both in terms of intent and execution.
Starting point is 01:05:38 Earlier databases had been designed to fill a role, built inside industry as a tool to solve problems. These were useful, but they had their limits. The relational database, especially System R, was built from research. It was a more methodical approach to creating a database. SQL, knee SQL, was the friendly face of this new type of software. It served as the user interface and the programmer's connection all rolled up into one. I think that explains a lot of my problem with SQL. It was never meant as a programmer's language, never a highly refined tongue. SQL was designed as an accessible language that anyone can learn to use. In that sense, it's closer to basic than it is to Lisp. That doesn't make me love the
Starting point is 01:06:36 language anymore, but I think now I can better understand why SQL is the way it is. It's not necessarily meant for someone like me. But hey, compared to older competition, I think I'd take SQL any day. Thanks for listening to Advent of Computing. I'll be back in two weeks time with another piece of computing's past. And hey, if you like the show, there are a few ways you can support it. If you know someone else who'd be interested in computer history, then please take a minute to share the show with them. You can also rate and review on Apple Podcasts. And if you want to be a super fan, you can support the show directly through Advent of Computing merch
Starting point is 01:07:16 or signing up as a patron on Patreon. Patrons get early access to episodes, polls for the direction of the show, and bonus content. You can find links to everything on my website, adventofcomputing.com. get early access to episodes, polls for the direction of the show, and bonus content. You can find links to everything on my website, adventofcomputing.com. If you have any comments or suggestions for a future episode, then go ahead and shoot me a tweet. I'm at Advent of Comp on Twitter. And as always, have a great rest of your day.
