Screaming in the Cloud - Defining a Database with Tony Baer

Episode Date: September 12, 2023

Tony Baer, Principal at dbInsight, joins Corey on Screaming in the Cloud to discuss his definition of what is and isn’t a database, and the trends he’s seeing in the industry. Tony explains why it’s important to try and have an outsider’s perspective when evaluating new ideas, and the growing awareness of the impact data has on our daily lives. Corey and Tony discuss the importance of working towards true operational simplicity in the cloud, and Tony also shares why explainability in generative AI is so crucial as the technology advances.

About Tony

Tony Baer, the founder and CEO of dbInsight, is a recognized industry expert in extending data management practices, governance, and advanced analytics to address the desire of enterprises to generate meaningful value from data-driven transformation. His combined expertise in both legacy database technologies and emerging cloud and analytics technologies shapes how clients go to market in an industry undergoing significant transformation. During his 10 years as a principal analyst at Ovum, he established successful research practices in the firm’s fastest-growing categories, including big data, cloud data management, and product lifecycle management. He advised Ovum clients regarding product roadmap, positioning, and messaging, and helped them understand how to evolve data management and analytic strategies as the cloud, big data, and AI moved the goalposts. Baer was one of Ovum’s most heavily billed analysts and provided strategic counsel to enterprises spanning the Fortune 100 to fast-growing privately held companies.

With the cloud transforming the competitive landscape for database and analytics providers, Baer led deep-dive research on the data platform portfolios of AWS, Microsoft Azure, and Google Cloud, and on how cloud transformation changed the roadmaps for incumbents such as Oracle, IBM, SAP, and Teradata. While at Ovum, he originated the term “Fast Data,” which has since become synonymous with real-time streaming analytics.

Baer’s thought leadership and broad market influence in big data and analytics have been formally recognized on numerous occasions. Analytics Insight named him one of the 2019 Top 100 Artificial Intelligence and Big Data Influencers. Previous citations include Onalytica, which named Baer one of the world’s Top 20 thought leaders and influencers on data science; Analytics Week, which named him one of 200 top thought leaders in big data and analytics; and KDnuggets, which listed Baer as one of the Top 12 data analytics thought leaders on Twitter. While at Ovum, Baer was the firm’s most visible and publicly quoted analyst, and was cited by Ovum’s parent company Informa as a Brand Ambassador in 2017. In raw numbers, Baer has 14,000 followers on Twitter, and his ZDNet “Big on Data” posts are read 20,000 to 30,000 times monthly. He is also a frequent speaker at industry conferences such as Strata Data and Spark Summit.

Links Referenced:
dbInsight: https://dbinsight.io/

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This episode is brought to you in part by our friends at Red Hat. As your organization grows, so does the complexity of your IT resources.
Starting point is 00:00:39 You need a flexible solution that lets you deploy, manage, and scale workloads throughout your entire ecosystem. The Red Hat Ansible Automation Platform simplifies the management of applications and services across your hybrid infrastructure with one platform. Look for it on the AWS Marketplace. Welcome to Screaming in the Cloud. I'm Corey Quinn. Back in my early formative years, I was an SRE/sysadmin type. And one of the areas I always avoided was databases, or frankly, anything stateful. Because I am clumsy and unlucky, and that's a bad combination to bring within spitting distance of anything that, you know, can't be spun back up intact like databases. So as a result, I tend not to spend a lot of time historically living in that world. It's time to expand horizons and think about this a little bit differently.
Starting point is 00:01:38 My guest today is Tony Baer, principal at dbInsight. Tony, thank you for joining me. Oh, Corey, thanks for having me. And by the way, we'll try and basically knock down your primal fear of databases today. That's my mission. We're going to instill new fears in you because I was looking through a lot of your work over the years and the criticism I have, and always the best place to deliver criticism is massively in public, is that you take a very conservative, stodgy approach to defining a database, whereas I'm on the opposite side of the world. I contain information. You can ask me about it, which we'll
Starting point is 00:02:11 call querying. That's right. I'm a database. But I've never yet found myself listed in any of your analyses around various database options. So what is your definition of databases these days? Where do they start and stop? Oh, gosh. Because I think anything can be a database if you hold it wrong. I think one of the last things I've ever been called is conservative and stodgy. So, this is certainly a way to basically put the thumbtack on my chair. Exactly. I'm trying to normalize my own brand of lunacy. So, we'll see how it goes. Exactly. Because that's the role I normally play with my clients. So,
Starting point is 00:02:43 now that the shoe is on the other foot, what I view a database is, is basically a managed collection of data. And it's managed to the point where essentially a database should be transactional. In other words, when I basically put some data in, I should have some positive confirmation. I should hopefully, depending on the type of database, have some sort of guidelines
Starting point is 00:03:05 or schema or model for how I structure the data. So a database, you know, even though you keep hearing about unstructured data, the fact is... Schema-less databases and data stores. Yeah, it was all the rage for a few years. Yeah, except that they all have schemas, it's just that those schema-less databases have very variable schemas. They're still schemas. A question that I have is, you obviously think deeply about these things, which should not come as a surprise to anyone. It's like, well, this is where I spend my entire career. Imagine that. I might think about the problem space a little bit. But you have, to my understanding, never worked with databases in anger yourself. You don't have a history as a DBA or as an engineer.
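Tony's point that "schema-less" stores still carry schemas, just very variable ones, lends itself to a quick sketch. The helper below is purely illustrative, not any vendor's API: it reads back the implicit schema from a batch of JSON-style documents.

```python
from collections import defaultdict

def infer_schema(documents):
    """Collect the field names and value types actually present
    across a batch of 'schema-less' documents."""
    schema = defaultdict(set)
    for doc in documents:
        for field, value in doc.items():
            schema[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}

# Three documents with no declared schema still reveal one:
docs = [
    {"id": 1, "name": "widget", "price": 9.99},
    {"id": 2, "name": "gadget"},                 # price is optional
    {"id": 3, "name": "gizmo", "price": "call"}, # price varies in type
]
print(infer_schema(docs))
# {'id': ['int'], 'name': ['str'], 'price': ['float', 'str']}
```

The schema never went away; it just moved from a declared `CREATE TABLE` into the documents themselves, optional fields, mixed types and all.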
Starting point is 00:03:46 But what I find very odd is that unlike a whole bunch of other analysts that I'm not going to name, but people know who I'm talking about regardless, you bring actual insights into this that I find useful and compelling instead of reverting to the mean of, well, I don't actually understand how any of these things work in reality. So I'm just going to believe whoever sounds the most confident when I ask a bunch of people about these things. Are you just asking the right people who also happen to sound confident? But how do you get away from that very common analyst trap? Well, a couple of things. One is I purposely play the role of outside observer. In other words, like the idea is that if basically an idea is supposed to
Starting point is 00:04:26 stand on its own legs, it has to make sense. If I've been working inside the industry, I might take too many things for granted. And a good example of this, this goes back actually to my early days. Actually, this goes back to my freshman year in college where I was taking an organic chem course for non-majors. And it was taught as a logic course, not as a memorization course. And we were given the option at the end of the term to either basically take a final or do a paper. So, of course, me being a writer, I thought, I can BS my way through this.
Starting point is 00:04:57 But what I found, and this is what fascinated me, is that as long as certain technical terms were defined for me, I found a logic to the way things work. And so that really informs how I approach databases, how I approach technology today, as I look at the logic on how things work. That being said, in order for me to understand that, I need to know twice as much as the next guy
Starting point is 00:05:17 in order to be able to speak that, because I just don't do this in my sleep. That goes a big step toward, I guess, addressing a lot of these things. But it also feels like, and maybe this is just me paying closer attention, that the world of databases and data and analytics have really coalesced or emerged
Starting point is 00:05:36 that, oh, all the data we store, that's a storage admin problem. And that was about managing NetApps and SANs and the rest. And then you had the database side of it, which functionally from the storage side of the world is just
Starting point is 00:05:53 a big file or series of files that are the backing store for the database. And okay, there's not a lot of cross-communication going on there. Then with the rise of object store, it started being a little bit different. And even the way that everyone is talking about getting meaning from data has really seemed to be evolving at an incredibly intense clip lately. Is that an accurate perception or have I just been asleep at the wheel for a while and finally woke up?
Starting point is 00:06:20 No, I think you're onto something there. And the reason is that, one, data is touching us all around ourselves, and not just with the cloud, but with smart mobile devices. Don't blame that. We are all each fonts of data, and rich fonts of data. And people in all walks of life, not just in the industry, are now becoming aware of it. And there's a lot of concern about, can we have any control, any ownership over the data that should be ours?
Starting point is 00:07:03 So I think that phenomenon has also happened in the enterprise where essentially where we used to think that the data was the DBA's issue, but it's become the app developer's issue. It's become the business analyst's issue because the answers that we get, we're ultimately accountable for. It all comes from the data. It also feels like there's this idea of databases themselves becoming more contextually aware of the data contained within them. Originally, this used to be in the realm of,
Starting point is 00:07:33 oh, we know what's been accessed recently, and we can tier out where it lives for storage optimization purposes. Okay, great. But what I'm seeing now almost seems to be a sense of, people like to talk about pouring ML into their database offerings. And I'm not able to tell whether that is something that adds actual value or if it's marketing-ware. Okay. First off, let me kind of spell out a couple of things. First off, it's not a question of the database becoming aware. Databases are not sentient. Neither are some engineers, but that's neither here nor there. That would be true. But then again, I don't want anyone with shotguns lining up at my door after this
Starting point is 00:08:09 interview is published. But more to the point, I can see a couple of roles for machine learning in databases. One is in the database itself. The logs are an incredible font of data, of operational data. And you can look at trends in terms of, when the pattern of these logs goes this way, that is likely to happen. So the thing is that I could very easily say, and we're already seeing it, machine learning being used to help optimize the operation of databases. Oracle, for instance, will say, hey, we can have a database that runs itself. The other side of the coin is being able to run your own machine learning models in-database, as opposed to having to go out into a separate cluster and move the data. And that's becoming more and more of a
Starting point is 00:08:50 checkbox feature. However, that's going to be for essentially probably like the low-hanging fruit, like the 80-20 rule. It'd be like the 20% of relatively rudimentary, let's say, predictive analyses that we can do inside the database. If you're going to be doing something more ambitious, such as a large language model, you probably do not want to run that in database itself. So there's a difference there. One would hope. I mean, one of the inappropriate uses of technology that I go for all the time is finding ways to, as directed or otherwise in off-label uses, find ways of tricking different services into running containers for me. It's kind of a problem. This is probably why everyone is very
Starting point is 00:09:31 grateful I no longer write production code for anyone. But it does seem that there's been an awful lot of noise lately. I'm lazy. I take shortcuts very often. And one of those is that whenever AWS talks about something extensively through multiple marketing cycles, it becomes usually a pretty good indicator that they're on their back foot on that area. And for a long time, they were doing that about data and how it's very important to gather data.
Starting point is 00:09:59 It unlocks the key to your business. But it always felt a little hollow slash hypocritical to me because you've been to some of the same events that I have that AWS throws on. You notice how you have to fill out the exact same form with a whole bunch of mandatory fields
Starting point is 00:10:13 every single time, but there never seems to be anything that gets spat back out to you that demonstrates that any human or system has ever read any of that. It's basically a "do what we say, not what we do" style of story. And I've always found that to be a little bit disingenuous.
Starting point is 00:10:30 I don't want to just harp on AWS here. Of course, we can always talk about the two pizza box rule and the fact that you have lots of small teams there, but I'd rather generalize this. And I think really what you're just describing has been my trip through the healthcare system. I had some sports-related injuries this summer, so I've been through a couple of surgeries to repair sports injuries.
Starting point is 00:10:50 And it's amazing that every time you go to the doctor's office, you're filling out the same HIPAA information over and over again, even with healthcare systems that use the same electronic health record software. So it's more a function of that. It's not just that the technologies are siloed, it's that the organizations are siloed. That's what you're seeing. That is fair. And I think on some level, I don't know if this is a weird extension of Conway's law or whatnot, but these things all have different backing stores as far as data goes. And
Starting point is 00:11:21 the hard part, it seems, in a lot of companies, once they hit a certain point of maturity, is not just getting the data in, because they've already done that to some extent, but it's also then making it actionable and helping various data stores internal to the company reconcile with one another and start surfacing things that are useful. It increasingly feels like it's less of a technology problem, more of a people problem. It is. I mean, put it this way. I spent a lot of time last year, I burned a lot of brain cells, working on data fabrics, which is an idea that's in the eye of the beholder. But the ideal of a data fabric is that it's not the tool that necessarily governs your data or secures your data or moves your data or transforms your data, but it's supposed to be the master orchestrator that brings all that stuff together. And maybe sometime 50 years in the future, we might see that.
Starting point is 00:12:08 I think the problem here is both technical and organizational. The problem is you have all these what we used to call silos. We still call them silos or islands of information. And actually, ironically, even though in the cloud we have technologies where we can integrate this, the cloud has actually exacerbated this issue because there's so many islands of information coming up, and there's so many different little parts of the organization that have their hands on that. That's also a large part of why there's such a big discussion about, for instance, data mesh last year. Everybody is concerned about owning their own little piece of the pie, and there's a lot of question in terms of how do we get some consistency there? How do we all read from the same sheet of music? That's going to be an ongoing problem.
Starting point is 00:12:48 You and I are going to get very old before that ever gets solved. Yeah, there are certain things that I am content to die knowing that they will not get solved. If they ever get solved, I will not live to see it. There's a certain comfort in that on some level, but it feels like this stuff is also getting more and more complicated than it used to be. And terms aren't being used in quite the same way as they once were. Something that a
Starting point is 00:13:12 number of companies have been saying for a while now has been that customers overwhelmingly are preferring open source. Open source is important to them when it comes to their database selection. And I feel like that's a conflation of a couple of things. I've never yet found an ideological purity-driven customer decision around that sort of thing. What they care about is, are there multiple vendors who can provide this thing so I'm not going to be using a commercially licensed database that can arbitrarily start playing games with seat licenses and wind up distorting my cost structure massively with very little notice. Does that align with your understanding of what people are talking about when they say that, or am I missing something fundamental, which is, again, always possible?
Starting point is 00:13:51 No, I think you're onto something there. Open source is a whole other can of worms, and I've burned many, many brain cells over this one as well. And today you're seeing a lot of pieces that are giving eulogies for open source, like HashiCorp has finally changed its license, and a bunch of others have. In the database world, what open source has meant, I think for practitioners, for DBAs and developers, has been: here's a platform that's been implemented by many different vendors, which means my skills are portable. And so I think that's really been the key to why,
Starting point is 00:14:26 for instance, MySQL and especially PostgreSQL have really exploded in popularity, especially Postgres of late. And it's like, you look at Postgres, it's this very unglamorous database. If you're talking about stodgy, it was born to be stodgy because they wanted to be an adult database from the start. They weren't the LAMP stack like MySQL. And the secret of success with Postgres was that it had a very permissive open source license, which meant that as long as you don't hold the University of California at Berkeley liable, have at it, kids. And so you see a lot of different flavors of Postgres out there, which means that a
Starting point is 00:15:03 lot of customers are attracted to that, because if I could get up to speed on one Postgres database, my skills should be transferable, should be portable, to another. So I think that's a lot of what's happening. Well, I do want to call that out in particular because when I was coming up in the noughts, the mid-2000s decade, the lingua franca on everything I used was MySQL, or as I insist on mispronouncing it, MySqueal. And lately, same vein, Postgresqueal seems to have taken over the entire universe when it comes to the de facto database of choice. And I'm old and grumpy, and learning new things is always challenging. But so I don't understand a lot of the ways that that thing gets managed from the context coming from where I did
Starting point is 00:15:45 before. But what has driven the massive growth of mindshare among the Postgresqueal set? Well, I think it's a matter of it's 30 years old. And number one, Postgres always positioned itself as an Oracle alternative. And in the early years, you know, this is a new database, how are you going to be able to match? At that point, Oracle had about a 15-year head start on it. And so it was a gradual climb to respectability. And I have huge respect for Oracle. Don't get me wrong on that. But you take a look at Postgres today, and they have basically filled in a lot of the
Starting point is 00:16:18 blanks. And so it now is, in many cases, it's a credible alternative to Oracle. Can it do all the things that Oracle can do? No. But for a lot of organizations, it's the 80-20 rule. And so I think it's more just a matter of like Postgres coming of age. And the fact is, as a result of it coming of age, there's a huge marketplace out there and so much choice and so much opportunity for skills portability. So it's really one of those things where its time has come. I think that a lot of my own biases are simply a product of the era in which I learned how a lot of these things work. I am terrible at Node, for example,
Starting point is 00:16:54 but I would be hard-pressed not to suggest JavaScript as the default language that people should pick up if they're just entering tech today. It does front-end, it does back-end, it even makes fries, apparently. That is the lingua franca of the modern internet in a bunch of different ways. That doesn't mean I'm any good at it, and it doesn't mean at this stage
Starting point is 00:17:14 I'm likely to improve massively at it, but it is the right move, even if it is inconvenient for me personally. Right, right. Put it this way, we've seen, and as I said, I'm not an expert in programming languages, but we've seen a huge profusion of programming languages and frameworks. But the fact is that there's always been a draw towards critical mass.
Starting point is 00:17:34 At the turn of the millennium, we thought it was between Java and .NET. Little did we know that basically JavaScript, which at that point was just a web scripting language, we didn't know that could work on the server. We thought it was just a client. Who knew? That's like using something inappropriately as a database. I mean, good heavens.
Starting point is 00:17:50 That would be true. I mean, I could have easily just used a spreadsheet or something like that. But so, I mean, who knew? Just like, for instance, Java itself was originally conceived for a set-top box. You never know how this stuff is going to turn out. It's the same thing that happened with Python.
Starting point is 00:18:05 Python was also a web scripting language. Oh, by the way, it happens to be really powerful and flexible for data science. And whoa, now Python, in terms of data science languages, has become the new SAS. It really took over in a bunch of different ways. Before that, Perl was great. You're like, well, why would I use, why would I write in Python when Perl's available? It's like, okay, you know how to write Perl, right? Yeah.
Starting point is 00:18:27 Have you ever read anything a month later? Oh, it's very much a write-only language. It is inscrutable after the fact. And Python at least makes that a lot more approachable, which is never a bad thing. Yeah. Speaking of what you touched on toward the beginning of this episode,
Starting point is 00:18:43 the idea of databases not being sentient, which I equate to being self-aware, you just came out very recently with a report on generative AI and a trip that you wound up taking on this, which I've read. I love it, using the phrase "English is the new most common programming language" once a lot of this stuff takes off. But what have you seen? What have you witnessed as far as both the ground truth reality as well as the grandiose statements that companies are making as they trip over themselves trying to position as the forefront leader in all of this thing that didn't really exist five months ago? Well, what's funny is, and that's a perfect question, because if on January 1st you asked, what's going to happen this year? I don't think any of us would have thought about generative AI
Starting point is 00:19:32 or large language models. And I will not identify the vendors, but I was on some advanced briefing calls back around the January, February timeframe. They were talking about things like serverless. They were talking about things like in-database machine learning and so on and so forth. They weren't saying anything about generative. And all of a sudden in April, it changed. And essentially, this is another case of the tail wagging the dog. Consumers were flocking to ChatGPT and enterprises had to take notice.
Starting point is 00:20:02 And so what I saw in the spring was, and I was at conferences from SAS, I'm trying to remember, SAP, Oracle, IBM, Mongo, Snowflake, Databricks, and others, that they all very quickly changed their tune to talk about generative AI. What we were seeing was for the most part position statements, but we also saw, I think the early emphasis was, as you're saying, basically English as the new default programming language or API. So basically coding assistants, what I'll call conversational query. I don't want to call it natural language query because we had stuff like Tableau Ask Data, which was very robotic. So we're seeing a lot of that. And we're also seeing a lot of attention towards foundation models because, I mean, what organization is going to have the resources of a Google or an OpenAI to develop their own
Starting point is 00:20:51 foundation model? Yes, some of the Wall Street houses might, but I think most of them are just going to say, look, let's just use this as a starting point. I also saw a very big theme of "your models with your data." And where I got a hint of that was a throwaway LinkedIn post. It was back in, I think, like February. Databricks had announced Dolly, which was kind of an experimental foundation model, just used with your own data.
Starting point is 00:21:16 And I just wrote three lines in a LinkedIn post. It was on a Friday afternoon. By Monday, it had 65,000 hits. I've never seen anything like it. I mean, yes, I had a lot when I was writing about data mesh last year, but that didn't get anywhere near this. So, I mean, that really hit a nerve. And another thing I saw was people starting to look at vector storage and how that was going to be supported.
Starting point is 00:21:37 Was it going to be a new type of database? And, hey, let's have AWS come up with, like, an 80th database here. Or is this going to be a feature? I think for the most part, it's going to be a feature. And of course, under all this, everybody's just falling all over themselves to get in the good graces of NVIDIA. In capsule form, that's kind of what I saw. That feels directionally accurate. And I think databases are a great area to point out one thing that's always been more than a little disconcerting for me.
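The vector storage Tony mentions, whether it lands as its own database or as a feature, boils down to ranking stored embeddings by similarity to a query embedding. A toy sketch of that core operation, using made-up three-dimensional "embeddings" (real ones run to hundreds or thousands of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity: the core ranking primitive of a vector store."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- invented values purely for illustration.
docs = {
    "postgres docs": [0.9, 0.1, 0.0],
    "mysql docs":    [0.6, 0.4, 0.2],
    "cookie recipe": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # an embedded question about databases

# "Vector search" is just: return the stored vector nearest the query.
best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)
# postgres docs
```

Whether that ranking lives in a dedicated engine or as an index type inside an existing database is exactly the feature-versus-product question being debated.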
Starting point is 00:22:06 The way that I've always viewed databases has been, unless I'm calling a RAND function or something like it, and I don't change the underlying data structure, I should be able to run a query twice in a row and receive the same result deterministically both times. Generative AI is effectively non-deterministic for all realistic measures of that term. Yes, I'm sure there's a deterministic reason things are under the hood. I am not smart enough or learned enough to get there, but it just feels like sometimes we're going to give you the answer you think you're going to get.
Starting point is 00:22:41 Sometimes we're going to give you a different answer. And sometimes in generative AI space, we're going to be supremely confident and also completely wrong. That feels dangerous to me. Oh gosh, yes. I mean, I take a look at ChatGPT, and to me, the responses are essentially a high school senior coming out with an essay response without any footnotes. It's the exact opposite of an ACID database. The reason why in the database world we're very strongly drawn towards ACID is because we want our data to be consistent.
Starting point is 00:23:09 And if we ask the same query, we're going to get the same answer. And the problem is that with generative, based on large language models, computers sound sentient, but they're not. Large language models are basically just a series of probabilities. And so hopefully those probabilities will line up
Starting point is 00:23:25 and you'll get something similar. That kind of scares me quite a bit. And I think as we start to look at implementing this in an enterprise setting, we need to take a look at what kind of guardrails we can put on there. And the thing is that what this led me to was that the missing piece that I saw this spring with generative AI, at least in the data and analytics world, is that nobody had a clue in terms of how to extend AI governance to this, how to make these models explainable. And I think that's still a large problem. That's a huge nut that it's going to take the industry a while to crack.
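The determinism contrast Corey draws can be made concrete with a small, self-contained sketch: SQLite standing in for "a database," and weighted random sampling standing in, very loosely, for how a language model picks tokens from a probability distribution.

```python
import random
import sqlite3

# Deterministic side: same data, same query, same answer, every time.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
db.executemany("INSERT INTO accounts VALUES (?, ?)",
               [("alice", 120.0), ("bob", 80.0)])
query = "SELECT name FROM accounts WHERE balance > 100 ORDER BY name"
first = db.execute(query).fetchall()
second = db.execute(query).fetchall()
assert first == second == [("alice",)]  # repeat runs agree exactly

# Probabilistic side: a crude stand-in for token sampling in an LLM.
tokens, weights = ["yes", "no", "maybe"], [0.5, 0.3, 0.2]
answers = {tuple(random.choices(tokens, weights, k=5)) for _ in range(20)}
print(len(answers) > 1)  # almost surely True: same "question", varying outputs
```

Pinning a seed (`random.seed(0)`) would make the sampling repeatable, which is roughly what temperature-zero or seeded decoding aims at, but the default behavior is the non-determinism Corey describes.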
Starting point is 00:24:00 Yeah, but it's incredibly important that it does get cracked. Oh, gosh, yes. One last topic that I want to get into. I know you said you don't want to over-index on AWS, which, fair enough. It is where I spend the bulk of my professional time and energy focusing on, but I think this one's fair because it is a microcosm of a broader industry question. And that is, I don't know what the DBA job of the future is going to look like, but increasingly it feels like it's going to primarily be picking which purpose-built AWS database, or, larger story, purpose-built database, is appropriate for a given workload. Even without my inappropriate misuse of things that are not databases as databases, there are legitimately 15 or 16 different AWS services
Starting point is 00:24:46 that they position as database offerings. And it really feels like you're spiraling down a well of analysis paralysis trying to pick between all these things. Do you think the future looks more like general purpose databases or very purpose built and each one is this
Starting point is 00:25:02 beautiful bespoke unicorn? Well, this is basically a hit on a theme that we've all been thinking about for years. And the thing is, there are arguments to be made for multi-model databases versus fit-for-purpose databases. That being said, okay, two things. One is that, what I've been saying in general is, and I wrote about this way, way back. I actually did a talk at the old Strata, or one of those conferences. It was a throwaway talk; I threw it together. And it's basically looking at the emergence of all these specialized databases, but how I saw also there's going to be kind of an overlapping, not that we're going to come back
Starting point is 00:25:41 to a Pangea per se, but that, for instance, a relational database will be able to support JSON. And Oracle, for instance, has some fairly brilliant ideas up its sleeve, what they call JSON duality, which sounds kind of scary, but which basically says we can store data relationally, but superpose GraphQL on top of all this, and this is going to look really JSON-y. So I think on one hand, you are going to be seeing databases that do overlap. Would I use Oracle for a MongoDB use case? No. But would I use Oracle for a case where I might have some document data? I could certainly see that. That said, the cloud vendors, for all the talk they give you of operational simplicity and agility, are making things very complex with this expanding cornucopia of services. And what they need to do, and I'm not saying, you know, let's close down the patent office, what I think we need to do is provide some guided experiences that say, tell us the use case. We will now blend these particular services together,
Starting point is 00:26:45 and this is the package that we would suggest. I think cloud vendors really need to go back to the drawing board from that standpoint and look at how do we bring this all together? How would we really simplify the life of the customer? That is honestly, I think, the biggest challenge that the cloud providers have across the board. There are hundreds of services available at this point from every hyperscaler out there. And some of them are brand new and effectively feel like they're there for three or four different customers, and that's about it. And others are universal services that most people are probably going to use. And most things fall between those two extremes, but it becomes such an analysis
Starting point is 00:27:25 paralysis moment of trying to figure out, what do I do here? What is the golden path? And what that means is that when you start talking to other people and asking their opinion and getting their guidance on how to do something when you get stuck, it's, oh, you're using that service? Don't do it. Use this other thing instead. And if you listen to that, you get midway through every problem before starting over again because, oh, I'm going to pick a different selection of underlying components. It becomes confusing and complicated. And I think it does customers largely a disservice. What I think we really need on some level is a simplified golden path with easy on-ramps and easy off-ramps, where, in the absence of a compelling reason, this is what you should be using.
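The guided experience being asked for here can start far simpler than machine learning: a rules-first triage from coarse workload characteristics to a starting-point recommendation. Everything in this sketch, the characteristic names, the thresholds, and the category labels, is illustrative rather than anyone's official guidance.

```python
def recommend_datastore(workload):
    """Map coarse workload characteristics to a starting-point choice.
    Purely illustrative defaults -- not vendor guidance."""
    if workload.get("relationships") == "highly-connected":
        return "graph database"
    if workload.get("model") == "document" and workload.get("schema") == "variable":
        return "document database"
    if workload.get("access") == "key-value" and workload.get("latency_ms", 100) < 5:
        return "in-memory key-value store"
    # In the absence of a compelling reason, the golden path:
    return "general-purpose relational database"

print(recommend_datastore({"model": "document", "schema": "variable"}))
# document database
print(recommend_datastore({}))
# general-purpose relational database
```

The point of the default branch is exactly the "golden path" idea: an unremarkable workload gets the boring, well-understood answer, and the specialized engines have to earn their way in with a compelling characteristic.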
Starting point is 00:28:04 Believe it or not, I think this would be a golden case for machine learning. In other words, submit to us the characteristics of your workload, and here is a recipe that we would propose. Obviously, we can't trust AI to make our decisions for us, but it can provide some guardrails. Yeah, use a graph database. Trust me, it'll be fine. That's your general purpose approach. Yeah, that'll end well. I would hope that the AI would basically be trained on a better set of training data to not come out
Starting point is 00:28:32 with that conclusion. One could sure hope. Yeah, exactly. I really want to thank you for taking the time to catch up with me around what you're doing. If people want to learn more, where's the best place for them to find you? My website is dbinsight.io. And on my homepage, I list my latest research. So you just have to go to the homepage where you can basically click on the links to the latest and
Starting point is 00:28:57 greatest. And as I said, after Labor Day, I'll be publishing my take on my generative AI journey from the spring. And we will, of course, put links to this in the show notes. Thank you so much for your time. I appreciate it. Hey, it's been a pleasure, Corey. Good seeing you again. Tony Baer, Principal at dbInsight. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a review. And if your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point.
Starting point is 00:30:06 Visit duckbillgroup.com to get started.
