Screaming in the Cloud - Use Cases for Couchbase’s New Columnar Data Stores with Jeff Morris

Episode Date: November 27, 2023

Jeff Morris, VP of Product & Solutions Marketing at Couchbase, joins Corey on Screaming in the Cloud to discuss Couchbase’s new columnar data store functionality, specific use cases for columnar data stores, and why AI gets better when it communicates with a cleaner pool of data. Jeff shares how more responsive databases could allow businesses like Domino’s and United Airlines to create hyper-personalized experiences for their customers. Jeff dives into the linked future of AI and data, and Corey learns about Couchbase’s plans for the re:Invent conference. If you’re attending re:Invent, you can visit Couchbase at booth 1095.

About Jeff

Jeff Morris is VP of Product & Solutions Marketing at Couchbase (NASDAQ: BASE), a cloud database platform company that 30% of the Fortune 100 depend on.

Links Referenced:

Couchbase: https://www.couchbase.com/

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Welcome to Screaming in the Cloud. I'm Corey Quinn.
Starting point is 00:00:34 This promoted guest episode of Screaming in the Cloud is brought to us by our friends at Couchbase. Also brought to us by Couchbase is today's victim, for lack of a better term. Jeff Morris is their VP of Product and Solutions Marketing. Jeff, thank you for joining me. Thanks for having me, Corey, even though I guess I paid for it. Exactly. It's always great to say thank you when people give you things.
Starting point is 00:00:57 I learned this from a very early age, and the only people who didn't were rude children and turned into worse adults. Exactly. So you are effectively announcing something new today. And I always get worried when a database company says that because sometimes it's a license that is going to upset people. Sometimes it's died so deep in the wool of generative AI that we're now supporting vectors or whatnot. Well, most of us don't know what that means. Fortunately, I don't believe that's what you're doing today. What have you got for us? So you're right as well. What I'm doing
Starting point is 00:01:30 is that we're announcing new stuff inside of Couchbase and helping Couchbase expand its market footprint. But we're not really moving away from our sweet spot either, right? We like building or being the database platform underneath applications. So push us on the operational side of the operational versus analytic kind of database divide. But we are announcing a columnar data store inside of the Couchbase platform so that we can build bigger, better, stronger analytic functionality to feed the applications that we're supporting with our customers. Now, I feel like I should ask a question around what a columnar data store is, because my first encounter with the term was when I had a very early client for
Starting point is 00:02:13 AWS bill optimization when I was doing this independently. And I was asking them the polite question of why do you have 283 billion objects in a single S3 bucket? That is atypical and kind of terrifying. And their answer was, oh, we built our own columnar data store on top of S3. This might not have been the best approach. It's like, I'm going to stop you there. With no further information, I can almost guarantee you that it was not. But what is a columnar data store?
Starting point is 00:02:42 Well, let's start with the, everybody loves more data and everybody loves to count more things, right? But a columnar data store allows you to expedite the kind of question that you ask of the data itself by not having to look at every single row of the data while you go through it. You can say, if you know you're only looking for data that's inside of California, you just look at the column value of find me everything in California, and then I'll pick all of those records to analyze. So it gives you a faster way to go through the data while you're trying to gather it up and perform aggregations against it. It seems like it's one of those, well, that doesn't sound hard
Starting point is 00:03:20 type of things when you're thinking about it the way that I do in terms of a database being more or less a medium to large size Excel spreadsheet. But I have it on good faith from all the customer environments I've worked with that, no, no, there are data stores that span even larger than that, which is, you know, one of those sad realities of the world. And everything at scale begins to be a heck of a lot harder. I've seen some of the value that this stuff offers, and I can definitely understand a few different workloads, in which case that's going to be super handy. What are you targeting specifically? Or is this one of those areas where you're going to learn from your customers? Well, we've had analytic functionality inside the platform. It just, at the size and scale customers actually wanted to roam through the data, we weren't supporting
Starting point is 00:04:03 that that much. So we'll expand that particular footprint. It'll give us better integration capabilities with external systems or better access to things in your bucket. But the use case problem is, I think, going to be driven by what new modern application requirements are going to be. You're going to need, we call it hyper-personalization because we tend to cater to B2C style applications, things with a lot of account profiles built into them. So you look at account profile and you're like, oh, well, Jeff likes blue, so sell him blue stuff. And that's a great current level personalization. But with a new analytic engine against this, you can maybe start aggregating all the inventory information that you might have of all the blue stuff that you want to sell me and do that in
Starting point is 00:04:51 real time. So I'm getting better recommendations, better offers as I'm shopping on your site or looking at my phone and, you know, looking for the next thing I want to buy. I'm sure there's massive amounts of work that go into these hyper-personalization stories. The problem is that the only time they really rise to our notice is when they fail hilariously. Like, you just bought a TV. Would you like to buy another? Now, statistically, you are likelier to buy a second TV right after you buy one. But for someone who's just, well, I'm replacing my living room TV after 10 years, it feels ridiculous. Or when you buy a whole bunch of nails and they don't suggest,
Starting point is 00:05:25 would you like to also perhaps buy a hammer? It was one of those areas where it just seems like a human putting thought into this could make some sense, but I've seen some of this stuff that can come out of systems like this and it can be incredible. I also personally tend to bias towards use cases that are less, here's how to convince you to buy more things and start aiming in a bunch of other different directions where it starts meeting emerging use cases or changing situations rapidly, more rapidly than a human can in some cases. The world has, for better or worse, gotten an awful lot faster over the last few decades. Yeah. And think of it in terms of how responsive can I be at any given moment? And so let's pick on one of the more recent interesting failures that has popped up. I'm a Giants fan, San Francisco Giants fan, so I'll pick on the Dodgers.
Starting point is 00:06:13 The Dodgers during the baseball playoffs, Clayton Kershaw, three-time MVP, Cy Young Award winner, great, great pitcher, had a first inning meltdown of colossal magnitude. Gave up 11 runs in the first inning to the Diamondbacks. Well, my customer Domino's Pizza could end up, well, let's shift the focus of our marketing. The Dodgers are the best team in baseball this year in the National League. Let's focus our attention there. But with that meltdown, let's pivot to Arizona and focus on our market in Phoenix. And they could do that within minutes or seconds even with the kinds of capabilities that
Starting point is 00:06:51 we're coming up with here so that they can make better offers to that new environment and also do the decision intelligence behind it. Like, do I have enough dough to make a bigger offer in that big market? Do I have enough drivers? Or do I have to go and spin out and get one of the other food delivery folks, Uber Eats or something like that to jump on board with me and partner up on this kind of system? It's that responsiveness in real, real time, right? That's always been kind of the conundrum between applications and analytics. You get an analytic insight, but it takes you an hour or a day to incorporate that into what the application is doing. This is intended to
Starting point is 00:07:32 make all of that stuff go faster. And of course, when we start to talk about things in AI, right, AI is going to expect real-time responsiveness as best you can make it. I figure we have to talk about AI. That is a technology that has absolutely sprung to the absolute peak of the hype curve over the past year. OpenAI released ChatGipity either late last year or early this year, and suddenly every company seems to be falling all over itself to rebrand itself as an AI company, where we've been working on this for decades, they say, right before they announced something that very clearly was crash developed in six months. And every company is trying to drape themselves in the mantle of AI. And I don't
Starting point is 00:08:16 want to sound like I'm a doubter here. Unlike most fans, I see an awful lot of value here, but I am curious to get your take on what do you think is real and what do you think is not in the current hype environment? So, yeah, I love that. I think there's a number of things that are, you know, are real is it's not going away. It is going to continue to evolve and get better and better and better. One of my analyst friends came up with the notion that the exercise of generative AI, it's imprecise. So it gives you similarity things, and that's actually an improvement in many cases over the precision of a database. Databases, a transaction either works or it doesn't.
Starting point is 00:08:55 It has failover or it doesn't. It's ideally deterministic when you ask the same question a second time, assuming it's not time-bound. Gives you the right answer. Yeah, or at least the same answer. The same answer. And your Gen AI may not. So that's part of the oddity of the hype. But then it also helps me kind of feed our storyline of, if you're going to try and make
Starting point is 00:09:15 Gen AI closer and more accurate, you need a clean pool of data that you're dealing with. Even though you've got probably your previous design was such that you would use a relational database for transactions, a document database for your user profiles. You probably attach your website to a caching database because you needed speed and a lot of concurrency. Well, now you've got three different databases there that you're operating. And if you're feeding data from each of those databases back to AI, one of them might be wrong or one of them might confuse the AI. Yet, how are you going to know? The complexity level is going to become exponential.
Starting point is 00:09:57 So our premise is because we're a multi-model database that incorporates in-memory speed and documents and search and transactions and the like, if you start with a cleaner pool of data, you'll have less complexity that you're offering to your AI system. And therefore, you can steer it into becoming more accurate in its response. And then, of course, all the data that we're dealing with is on mobile, right? Data is created there for, let's say, your account account profile and then it's also consumed there because that's what people are using as their application interface of choice so you also want to have mobile interactivity and synchronization and local storage kind of capabilities built in
Starting point is 00:10:36 there so those are kind of you know a couple of the principles that we're looking at of you know json is going to be a great format for it, regardless of what happens. Complexity is kind of the enemy of AI, so you don't want to go there. And mobility is going to be an absolute requirement. And then related to this particular announcement, large scale aggregation is going to be a requirement to help feed the application. There's always going to be some other bigger calculation that you're going to want to do relatively in real time and feed it back to your users or the AI system that's helping them out. I think that that is a much more nuanced use case than a lot of the stuff that's grabbing customer attentions, where you effectively have the Chad Chippity story of it being an incredible parrot. Where I have run into trouble with the generative story has been people putting the thing that the robot
Starting point is 00:11:27 that's magic and from the future has come up with off the cuff and just hurling that out into the universe under their own name without any human review. And that's fine sometimes, sure, but it does get it hilariously wrong at some points. And the idea of sending something out under my name that has not been at least
Starting point is 00:11:46 reviewed by me, if not actually authored by me, is abhorrent. I mean, I review even the transactional yes, you have successfully subscribed or sorry to see you go email confirmations on stuff because there's an implicit hugs and puppies, love Corey, at the end of everything that goes out under my name. But I've gotten a barrage of terrible sales emails and companies that are trying to put the cart before the horse where either the support rep, quote unquote, that I'm speaking to in the chat is an AI system or else needs immediate medical attention because there's something going on that needs assistance. Yeah, they just don't understand.
Starting point is 00:12:22 Right. And most big enterprise stories that I've heard so far that have come to light have been around the form of, we get to fire most of our customer service staff, an outcome that basically no one sensible wants. That is less compelling than a lot of the individualized consumer use cases. I love asking it, here's a blog post I wrote, give me 10 title options. And I'll usually take one of them. One of them will usually be not half bad. Then I can modify it slightly. And you'll change four words in it. Yeah. Yeah, exactly. That's a bit of a different use case. It's been an interesting, even as we've all become familiar, or at least junior prompt engineers, right, is your information is only going to be as good as
Starting point is 00:13:00 you feed the AI system. The return is only going to be as good. So you're going to want to refine that kind of conversation. Now, we're not trying to end up replacing the content that gets produced or the writing of all kinds of pros, other than we do have a code generator that works inside of our environment called Capella IQ that talks to chat GPT. But we try and put guardrails on that too, right? Is always make sure that it's talking in terms of the context of Couchbase rather than where's Taylor Swift this week, which I don't want it to answer because I don't want to spend the GPT money to answer that question for you. And it might not know the right answer, but it might very well spit out something
Starting point is 00:13:41 that sounds plausible. Exactly. But I think the kinds of applications that we're steering ourselves toward can be helped along by the Gen AI systems. But I don't expect all my customers are going to be writing automatic blog post generation kinds of applications. I think what we're ultimately trying to do is facilitate interactions in a way that we haven't dreamt of yet, right? One of them might be if I've opted into two loyalty programs, like my United account and my American Express account. That feels very targeted at my lifestyle as well. So please continue. Exactly. And so what I really want the system to do is for Amex to reward me when I hit 1K status on United while I'm on the flight. And have the flight attendant come up and be like, hey, you did it. Either here's a free upgrade from American Express, that would be hyper-personalization because you booked your
Starting point is 00:14:37 plane ticket with it, but they also happen to know or they cross-consumed information that I've opted into. I've seen them congratulate people for hitting a million miles flown mid-flight, but that's clearly something that they've been tracking and happens a heck of a lot less frequently. This is how you start scaling that experience. Yes, but that happened because American Airlines was always watching, because that was an American Airlines ad ages ago, right? But the same principle holds true.
Starting point is 00:15:02 But I think there's going to be a lot more of these, how much information am I actually allowing to be shared amongst the cult loyalty programs, but the data sources that I've opted into. And my God, there's hundreds of them that I've personally opted into, whether I like it or not, because everybody needs my email address, kind of like what you were describing earlier. A point that I have, I think, agrees largely with your point is that few things to me are more frustrating than when I'm signing up, for example,
Starting point is 00:15:28 oh, I don't know, an AWS event. Gee, can't imagine there's anything like that going on this week. And I have to fill out an entire form that always asks me the same questions. How big my company is, whether we have multiple workloads on, what industry we're in.
Starting point is 00:15:42 And no matter what I put into that, first, it never remembers me for the next time, which is frustrating in its own right. But two, no matter what I put in to fill the thing out, the email I get does not change as a result. At one point, I said, all right, I'm picking randomly. I am a venture capitalist based in Sweden. And I got nothing that is differentiated from the other normal stuff I get tied to my account because I use a special email address for those things sometimes just to see what happens. And no, if you're going to make me jump through the hoops to give you the data, at least use it to make my experience better.
Starting point is 00:16:16 It feels like I'm asking for the moon here, but I shouldn't be. Yes, immediately to make your experience better and say, you know, here's four companies in Malmo that you ought to be talking to. And they happen to be here at the AWS event and you can go find them because their booth is here, here, and here. That kind of immediate responsiveness could be facilitated. And to our point, ought to be facilitated. It's exactly like that. That kind of thing is use the data in real time. I was talking to somebody else today that was discussing that most data, right, becomes stale and unvaluable. Like 50% of the data,
Starting point is 00:16:52 its value goes to zero after about a day. And some of it is stale after about an hour. So if you can end up closing that responsiveness gap that we're describing, and this is kind of what this columnar service inside of Capela is going to be like, is react in real time with real-time calculation and real-time lookup and real-time find out how you might apply that new piece of information right now. And then give it back to the consumer or the user right now. So Couchbase takes a few different forms. I should probably, at least for those who are not steeped in the world of exotic forms
Starting point is 00:17:32 of database, I always like making these conversations more accessible to folks who are not necessarily up to speed. Personally, I tend to misuse anything as a database if I can hold it just the wrong way. The wrong way? I've caught that about you. Yeah, it's not for these a database if you hold it wrong. But you folks have a few different options. You have a self-managed commercial offering. You're an open source project.
Starting point is 00:17:57 So I can go ahead and run it on my own infrastructure however I want. And you have Capella, which is Couchbase as a service. And all of those are useful and have their points, and I'm sure I'm missing at least one or two along the way. But do you find that the columnar use case is going to disproportionately benefit folks using Capella in ways that the self-hosted version would not be as useful for? Or is this functionality already available in other expressions of Couchbase? It's not already available in other expressions, although there is analytic functionality in the self-managed version of Couchbase. But it's, as I mentioned, I think earlier,
Starting point is 00:18:33 it's just not as scalable or as really real-time as we're thinking. So it's going to, yes, it's going to benefit the database as a service deployments of Couchbase available on your favorite three clouds and still interoperable with environments that you might self-manage and self-host. So there could be even use cases where our development team or your development team builds in AWS using the cloud-oriented features, but is still ultimately deploying and hosting and managing a self-managed environment. You could still do all of that. So there's still a great interplay and interoperability amongst our different deployment options. But the fun part, I think, about this is not only is it going to help the Capella user, there's a lot of other things inside Couchbase that help address the developer's penchant for trading zero cost for degrees of complexity that you're willing to
Starting point is 00:19:28 accept because you want everything to be free and open source. And Couchbase is my fifth open source company in my background. So I'm well, well versed in the nuances of what open source developers are seeking. But what makes Couchbase, its origin story really cool too, though, is it's the peanut butter and chocolate marriage of Memcached, the people behind that, and Membase, and CouchDB from Couch1. So I can't think of that many, maybe Red Hat, projects and companies
Starting point is 00:20:01 that formed up by merging to complementary open source projects. So we took the scale- You had OpenTelemetry, I think, that did that once. You see occasional mergers, but it's very far from common. It's very, very infrequent. But what that made the Couchbase people end up doing is make a platform that will scale, make a data design that you can auto-partition anywhere, anytime, and then build
Starting point is 00:20:28 independently scalable services on top of that. One for SQL++, the query language. Anyone who knows SQL will be able to write something in Couchbase immediately. Then I've got this AI automator, IQ, that makes it even easier. You just say, write me a SQL++ query that does this, and it'll do that. But then we added full-text search. We added eventing so you could stream data. We added the analytics capability originally, and now we're enhancing it and use JSON as our kind of universal data format so that we can trade data with applications really easily. So it's a cool design to start with. And then in the cloud, we're steering towards things like making your entry point
Starting point is 00:21:10 and using our database as a service Capella really, really, really inexpensive, so that you get that same robustness of functionality, as well as the easy cost of entry that today's developers want. And it's my analyst friends that keep telling me the cloud is where the market's going to go. So we're steering ourselves towards that hockey puck location. I frequently remark that the role of the DBA might not be vanishing, but it's definitely changing, especially since the last time I counted, if you hold them and use as directed, AWS has something on the order of 14 distinct managed database offerings. Some are general purpose, some are purpose-built.
Starting point is 00:21:50 And if this trend keeps up in a decade, the DBA role is going to be determining which of its 40 databases is going to be the right fit for a given workload. That seems to be the counter-approach to a general purpose database that works across the board. Clearly, you folks have opinions on this. Where do you land? Oh, so absolutely. There's the product that is a suite of capabilities or that are individual capabilities. And then there's ones that are, in my case, kind of multi-model and do lots of things at once.
Starting point is 00:22:20 I think historically you'll recognize, because this is, let's pick on your phone, the same holds true for, you know, your phone used to be a watch, used to be a Palm Pilot, used to be a StarTAC telephone, and your calendar application, your day planner all at the same time. Well, it's not anymore. Technology converges upon itself. It's kind of a historical truism. And the database technologies are going to end up doing that and continue to do that even right now. So that notion that it's a 10-year-old notion of use a purpose built database for that particular workload, maybe sometimes in extreme cases, that is the appropriate thing. But in more cases than not right now, if you need transactions when you need them,
Starting point is 00:23:02 that's fine. I can do that. You don't necessarily need Aurora or RDS or Postgres to do that. But when you need search and geolocation, I support that too. So you don't need Elastic. And then when you need caching and everything, you don't need Elastic Cache. It's all built in. So the multi-model notion of operate on the same pool of data, it's a lot less complex for your developers. They can code faster and better and more cleanly. Debugging is significantly easier. As I mentioned,
Starting point is 00:23:32 the SQL++ is our language. It's basically SQL syntax for JSON. We're a reference implementation of this language along with Asterisk DB is one of them. And actually the original author of that language also wrote DynamoDB's particle. So it's a common language that you wouldn't necessarily imagine, but the ease of entry in all of this, I think is still going to be a driving goal for people. The old people like me and you are running around worrying about, am I going to get a particular really specific feature out of the full-text search environment? Or the other one that I pick on now is, am I going to need a vector database too? And the answer to me is no, right?
Starting point is 00:24:13 There's going to be the database vendors like ourselves and like Mongo's announced and a whole bunch of other NoSQL vendors. We're going to support that. It's going to be just another mode. And you get better bang for your buck when you've got more modes than a single one at a time. The consensus opinion that's emerging is very much across the board, that Vector is a feature, not a database type.
Starting point is 00:24:38 Not a category. Yeah, me too. And yeah, we're well on board with that notion as well. And then, like I said earlier, the JSON as a vehicle to give you all of that versatility is great, right? You can have vector information inside a JSON document. You could have time series information in the document. You could have graph node locations and ID numbers in a JSON array. So you don't need index-free adjacency or some of the other cleverness
Starting point is 00:25:05 that some of my former employers have done. It really is all converging upon itself. And hopefully everybody starts to realize that you can clean up and simplify your architectures as you look ahead so that you do, if you're going to build AI-powered applications, feed it clean data, right? You're going to be better off. So this episode is being recorded in advance, thankfully, but it's getting released the first day of reInvent. What are you folks doing at the show for those who are either there and for some reason listening to a podcast rather than going to get marketed to by a variety of different pitches that all mention AI or might even be watching from home and trying to figure out what to make of it. Right. So, of course, we have a booth, and my notes don't have in front of me what our booth
Starting point is 00:25:50 number is, but you'll see it on the signs in the airport. So we'll have a presence there. We'll have an executive briefing room available, so we can schedule time with anyone who wants to come talk to us. We'll be showing not only the capabilities that we're offering here, we'll show off Capella IQ, our coding assistant. Okay, so yeah, we're on the AI hype band, but we'll also be showing things like our mobile sync capability where my phone and your phone can synchronize data amongst themselves
Starting point is 00:26:18 without having to actually have a live connection to the internet. So long as we're on the same network locally within the Venetians network, we have an app that we have people download from the Apple store. And then it's a color synchronization app or a picture synchronization app. So you tap it and it changes on my screen and I tap it and it changes on your screen. And we'll have, I don't know, as many people who are around standing there,
Starting point is 00:26:45 synchronizing what, maybe 50 phones at a time? It's actually a pretty slick demonstration of why you might want a database that's not only in the cloud, but operates around the cloud, operates mobily, operates, you know, can connect and disconnect to your networks. It's a pretty neat scenario.
Starting point is 00:27:03 So we'll be showing a bunch of cool technical stuff, as well as talking about the things that we're discussing right now. I will say you're putting an awful lot of faith in connectivity working at reInvent, be it Wi-Fi or the cellular network. I know both of those have bitten me in various ways over the years, but I wish you the best on it. I think it's going to be an interesting show based upon everything I've heard in the run-up to it. I'm just glad it's here. Now, this is the cool part about what I'm talking about, though.
Starting point is 00:27:29 The cool part about what I'm talking about is we can set up our own wireless network in our booth. And, you know, we still, you'd have to go to the App Store to get this application. But once there, I can have you switch over to my local network and play around on it. And I can sync the stuff right there and have confidence that in my local network that's in my booth, the system's working. I think that's going to be ultimately our design there.
Starting point is 00:27:52 Because, oh my gosh, yes, I have a hundred stories about connectivity and someone blowing a demo because they're yanking on a cable behind the pulpit, right? I always build in a... And assuming there's no connectivity, how can I fake my demos? Just because it's, I've only had to do it once, but I, you wind up planning in advance when you start doing a talk to a large enough or influential enough audience where you want things to go right. There's a delightful acceptance right now of recorded videos and demonstrations that people sort of accept that way because of exactly all this. And I'm sure we'll be showing that in our booth there too. Given the non-deterministic
Starting point is 00:28:30 nature of generative AI, I'm sort of surprised whenever someone hasn't mocked the demo in advance just because, yeah, it gives you the right answer in the rehearsal, but every once in a while it gets completely unglued. Yes, and we see it pretty regularly. So the emergence of clever and good prompt engineering is going to be a big skill for people. And hopefully, you know, everybody's going to figure out how to pass it along to their peers. Excellent. We'll put links to all this in the show notes. And I look forward to seeing how well this works out for you. Best of luck at the show. And thanks for speaking with me. I appreciate it. Yeah, Corey, we appreciate the support and I think the show is going to be
Starting point is 00:29:08 very strong for us as well. Again, thanks for having me here. Always a pleasure. Jeff Morris, VP of Product and Solutions Marketing at Couchbase. This episode has been brought to us by our friends at Couchbase and I'm cloud economist, Corey Quinn.
Starting point is 00:29:23 If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment. But if you want to remain happy, I wouldn't ask that podcast platform what database they're using. No one likes the answer to the point. Visit duckbillgroup.com to get started.
