Screaming in the Cloud - The Ever-Growing Ecosystem of Postgres with Álvaro Hernandez

Episode Date: February 9, 2023

Álvaro Hernandez, Founder of OnGres, joins Corey on Screaming in the Cloud to discuss his hobby project Dyna53, the balkanization of AWS services, and all things Postgres. Álvaro and Corey discuss what it means to be an AWS Community Hero these days, and Álvaro shares some of his experiences as one of the first Heroes to provide feedback on AWS services. Álvaro also shares his thoughts on why people shouldn't underestimate the importance of selecting the right database, why he feels Postgres and Kubernetes work so well together, and the ever-growing ecosystem of Postgres.

About Álvaro

Álvaro is a passionate database and software developer. Founder of OnGres ("ON postGRES"), he has been dedicated to Postgres and R&D in databases for more than two decades. Álvaro is at heart an open source advocate and developer. He has created software like StackGres, a platform for running Postgres on Kubernetes, and ToroDB (MongoDB on top of Postgres). As a well-known member of the PostgreSQL community, Álvaro founded the non-profit Fundación PostgreSQL and the Spanish PostgreSQL User Group. He has contributed, among other things, the SCRAM authentication library to the Postgres JDBC driver. You can find him frequently speaking at PostgreSQL, database, cloud (becoming an AWS Data Hero in 2019), and Java conferences. In the last 10 years, Álvaro has delivered more than 120 tech talks (https://aht.es).

Links Referenced:
OnGres: https://ongres.com/
Dyna53: https://dyna53.io/
Personal Website: https://aht.es
Twitter: https://twitter.com/ahachete
LinkedIn: https://www.linkedin.com/in/ahachete/

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Today's episode is brought to you in part by our friends at MinIO, the high-performance Kubernetes-native object store that's built for the multi-cloud,
Starting point is 00:00:40 creating a consistent data storage layer for your public cloud instances, your private cloud instances, and even your edge instances, depending upon what the heck you're defining those as, which depends probably on where you work. Getting all of that unified is one of the greatest challenges facing developers and architects today. It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload,
Starting point is 00:01:05 and the footprint to run anywhere. And that's exactly what MinIO offers. With superb read speeds in excess of 360 gigs and a 100 megabyte binary that doesn't eat all the data you've got on the system, it's exactly what you've been looking for. Check it out today at min.io slash download and see for yourself. That's min.io slash download. And be sure to tell them that I sent you. This episode is sponsored in part by our friends at Logicworks. Getting to the cloud is challenging enough for many places, especially maintaining security, resiliency, cost control, agility, etc., etc., etc.
Starting point is 00:01:44 Things break, configurations drift, technology advances, and organizations, frankly, need to evolve. How can you get to the cloud faster and ensure you have the right team in place to maintain success over time? Day two matters. Work with a partner who gets it. Logicworks combines the cloud expertise and platform automation to customize solutions to meet your unique requirements. Get started by chatting with a cloud specialist today at snark.cloud slash logicworks. That's snark.cloud slash logicworks. And my thanks to them for sponsoring this ridiculous podcast.
Starting point is 00:02:25 Welcome to Screaming in the Cloud. I'm Corey Quinn. If I could be said to have one passion in life, it would be inappropriately using things as databases. Because frankly, they're all databases. If you have knowledge that I want access to, that's right, I can query you. You're a database. Enjoy it. Today's guest has helped me take that one step further. Álvaro Hernandez is the founder at OnGres, which we will get to in due course. But the reason he is here is that he has built the rather excellent Dyna53. Álvaro, thank you for joining me first off. Thank you for having me. It's going to be fun, I guess. Well, I certainly hope so. So I have
Starting point is 00:03:06 been saying for years now, correctly, that Route 53 is Amazon's premier database offering. Just take whatever data you want, stuff it into TXT records, and great, it's got a 100% SLA, it is globally distributed, the consistency is eventual, and effectively it works gloriously. Disclaimer, this is mostly a gag. Please don't actually do this in production, because the last time I made a joke like this, I found it at a bank. So you have taken the joke, horrifying though it is, a step further by sitting down one day to write Dyna53. What is this monstrosity? Okay, actually it took a little bit more than one day. But essentially, this is a hobby project. I just wanted to have some fun. Most of my day is managing
Starting point is 00:03:51 my company and not programming, which I love. So I decided, let's program something. And there were reasons I wanted to do this. We can get to that later. But essentially, what it is: Dyna53 is essentially DynamoDB where the data is stored in Route 53. So you laid out the path, right? Data can be stored in TXT records. Actually, I use both TXT and SRV records, and we can get into the tech details if you want to. But essentially, when you run Dyna53 on top of Route 53, I store the data on records.
Starting point is 00:04:21 But it exposes a real database interface because otherwise it's not a real database until it mimics a real database interface. So you can use the Amazon CLI with DynamoDB. You can use any GUI program that works with DynamoDB. And it works as if you're using DynamoDB, except the data is stored on a zone on your Route 53. And it's open source, by the way. Under the hood, does it use DynamoDB at all? Not at all. Excellent. Because if it did that, it wouldn't be, I guess, necessary in the least.
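To make the mechanics concrete: each string in a DNS TXT record is capped at 255 bytes, so any Route-53-as-a-database scheme has to encode and chunk its items to fit. The following is not Dyna53's actual code, just a minimal sketch of the kind of encoding such a tool needs; the function names are invented:

```python
import base64

MAX_TXT_STRING = 255  # DNS caps each character-string in a TXT record at 255 bytes

def encode_item(value: bytes) -> list[str]:
    """Base64-encode an item and split it into TXT-record-sized strings."""
    text = base64.b64encode(value).decode("ascii")
    return [text[i:i + MAX_TXT_STRING] for i in range(0, len(text), MAX_TXT_STRING)]

def decode_item(chunks: list[str]) -> bytes:
    """Join the stored strings back together and decode the original bytes."""
    return base64.b64decode("".join(chunks))

# Round-trip an item larger than a single TXT string
item = b'{"table": "demo", "key": "42", "value": "hello"}' * 10
chunks = encode_item(item)
assert all(len(c) <= MAX_TXT_STRING for c in chunks)
assert decode_item(chunks) == item
```

In the real project, those chunks then have to be spread across record sets while staying under Route 53's quotas (10,000 records per zone by default), which is part of the engineering game Álvaro describes later.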
Starting point is 00:04:52 And it would also be at least 80% less ridiculous, which is kind of the entire point. Now, you actually do work with data for a living. You're an AWS data hero, which is their bad name for effectively community folk who are respected, prolific, and know what they're talking about in the ecosystem around certain areas. You work with these things for a living. Is there any scenario in which someone would actually want to do this, or is it strictly something to be viewed as a gag and maybe a
Starting point is 00:05:20 fun proof of concept, but never do anything like this in reality? No, I totally discourage using this for anything serious. I mean, technically speaking, and if you want to talk about billing, which as far as I know is something you care about, you could save some money for super small testing scenarios, where you might want to run Dyna53 on Amazon Lambda against a hosted zone that maybe you already have, so you don't need to pay for it. And you may save some money as long as you do fewer than three or four million transactions per month and have fewer than 10,000 records, which is the limit of Route 53 by default. And you're going to save two, three dollars per month.
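The back-of-the-envelope math behind that estimate can be sketched as follows, using illustrative US-region list prices as assumptions (on-demand DynamoDB at roughly $1.25 per million writes and $0.25 per million reads, Route 53 at $0.40 per million standard queries, with record changes free); verify against current pricing before trusting any of it:

```python
# Hypothetical small, write-heavy test workload, in millions of requests per month
writes_m = 2.5
reads_m = 1.0

# Illustrative prices in USD per million requests (assumptions, not quotes)
DDB_WRITE, DDB_READ = 1.25, 0.25
R53_QUERY = 0.40  # Route 53 record changes (the "writes") cost nothing

dynamodb_cost = writes_m * DDB_WRITE + reads_m * DDB_READ  # what you would avoid paying
dyna53_cost = reads_m * R53_QUERY                          # what you pay instead

savings = dynamodb_cost - dyna53_cost
print(f"~${savings:.2f}/month saved")
```

Which lands right around the couple of dollars a month Álvaro describes: less than the coffee.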
Starting point is 00:06:00 So, I mean, yeah, it's not money, right? Your Starbucks coffee is going to be more than that. So essentially, yes, don't use it. It's a gag. It's just for fun. But actually, it was a surprisingly fun engineering game. If you want to look at the code, anybody can do this, please star it on GitHub. It's a joke project anyway. I'm not going to get into the Arctic Vault because of this. It's a funny exercise in how to route around all the limitations of the TXT records that are on Route 53, and all the quotas, and all the, you know, there's a lot of stuff going on there. So I think from an architectural perspective and a code perspective, it's a
Starting point is 00:06:34 fun thing to look at. I'm a big fan of when people take ideas that I've come up with more or less on the fly, just trying to be ridiculous and turn it into something. I do appreciate the fact that it does come with a horrible warning of, yeah, this is designed to be funny. This is not intentionally aimed at something you should run your insurance company on top of. And I do find that increasingly as my audience gets larger, and I mean that in terms of raw numbers, not volume, the problem that I keep smacking into is that people don't always have context. I saw a Reddit post a couple of years ago after I started the Route 53 as a database gag and saw someone asking, well, okay, I get it's designed to be a gag, but would this actually work in production? At which point I had to basically go crashing in.
Starting point is 00:07:22 Yeah, this actually happened once at a company I worked at, which is where the joke came from. Here are the problems you're going to have with it. It's a nice idea. There are better ways in 2023 to skin that cat. You probably do not want to do it. And if you do, I disclaim all responsibility, et cetera, et cetera. Otherwise, there's always the danger of people taking you too seriously. Speaking of being taken seriously,
Starting point is 00:07:48 during the day, you are the founder at OnGres. What do you folks do over there? OnGres means on Postgres. So we are essentially a Postgres-specialized shop. We offer both professional services, which is 24/7 support, monitoring, consulting. And we do a lot of, for example, migrations from other databases, especially Oracle, to Postgres. But we also develop software for the Postgres ecosystem. Right now, for example, we have developed the fully open source StackGres project, a stack of components on top of Postgres, which is what you essentially need to run Postgres on Kubernetes for production-quality workloads.
Starting point is 00:08:24 So Postgres, Postgres, Postgres. I'm known in many environments as the Postgres guy. And if you say Postgres three times, I typically pop up and answer whatever question is there. I find that increasingly over the past few years, there has been a significant, notable shift as far as, I guess, the zeitgeist, or what most of the community centralizes around
Starting point is 00:08:46 when it comes to databases. Back when I used to touch production systems in anger and, oh, I was oh so angry, I found that MySQL was sort of the default database engine that most people went with. These days, it seems that almost anything Greenfield that I see starts with Postgres, or as I insist on calling it, Postgresqueel.
Starting point is 00:09:05 And as a result, I find that that seems to have become a de facto standard kind of out of nowhere. Was I asleep at the wheel, or have I missed something that's been happening all along? What is the deal here? Well, this is definitely the case. Postgres is becoming the de facto standard, especially, as you said, for greenfield deployments, and not only at the Postgres level, but also Postgres-compatible and Postgres-derived projects. If you look at Google's cloud offering, they have now added a compatibility layer with Postgres. And if you look at AlloyDB, it's only compatible with Postgres, and not MySQL. It's kind of Aurora's equivalent, and it's only Postgres there, not Postgres and MySQL. A lot of databases are adding a Postgres compatibility layer, the
Starting point is 00:09:49 protocol, so you can use Postgres drivers. So Postgres as a wide ecosystem, yes, it's becoming the de facto standard for new developments and many migrations. Why is that? I think it's a combination of factors, most of them being that, and pardon me, all MySQL fans out there, but I believe that Postgres is more technically advanced, has more features, more completeness of SQL, and more capabilities, therefore, in general, compared to MySQL. And it has a strong reputation: very solid, very reliable, never corrupting data. You might think it performs a little bit better, a little bit worse, but in reality, what you care about is your data being there all the time.
Starting point is 00:10:28 And that it's stable and it's rock solid. You can throw stones at Postgres, and it will keep running. You could really misconfigure it and make it go slow, but it will still work. It will still be there when you need it. So I think it's the thing that cannot go wrong.
Starting point is 00:10:43 If you choose Postgres, you're very likely not wrong. If you look for a new fancy database for a fancy new project, things may or may not work. But if you go Postgres, it's essentially the Swiss army knife of today's modern databases. If you use Postgres, you may get 80% of the performance or 80% of what your really new fancy database will do, but it will work for almost any workload, any use case that you may have. And you may not need specialized databases if you just stick to Postgres,
Starting point is 00:11:11 so you can standardize on it. On some level, it seems that there are two diverging philosophies, and which one is the right path invariably seems to be tied to whatever the person telling the story wants to wind up selling. There's the idea of the general-purpose database, the, oh great, you can have it in any color you want as long as it's black style
Starting point is 00:11:34 of approach, where everything should wind up living in a database, a specific database. And then you have the Cambrian explosion of purpose-built databases, where that's sort of the AWS approach, where it feels like the DBA job of the future is deciding which of Amazon's 40 managed database services by then are going to need to be used for any given workload. And that doesn't seem like it's
Starting point is 00:11:55 necessarily the right approach either on some level. It feels like it's a spectrum. Where do you land on it? So let me speak actually about Postgres extensibility. Postgres has an extensibility mechanism called extensions, not super original, which is essentially like plugins. Think of your browser's plugins that augment functionality. And it's surprisingly powerful. You can take Postgres and transform it into something else. So Postgres has, built into the core, a lot of functionality for JSON, for example; but then you have extensions for GraphQL, you have extensions for sharding Postgres, you have extensions for time series, you have extensions for geo, for almost anything that you can think of.
Starting point is 00:12:34 So in reality, once you use these extensions, you can get, you know, very close to what a specialized, purpose-built database may get, maybe, you know, as I said before, like 80% there, but, you know, at the cost of just standardizing everything on Postgres. So it really depends where you are at. If you are planning to run everything as managed services, you may not care that much, because someone is managing them for you. I mean, from a developer perspective, you still need to learn these 48 or however many APIs, right?
Starting point is 00:13:12 But if you're going to run things on your own, then consolidating technologies is a very interesting approach. And in this case, Postgres is an excellent home for that approach. One of the things that I think has been, how do I put this in a way that isn't going to actively insult people who have gone down certain paths on this? There's an evolution, I think, of how you wind up interacting with databases. At least that is my impression of it. And let's be clear, in my background as a production engineer, I tended to focus on things that were largely stateless, like web servers. Because when you break the web servers, as I tended to do, we all have a good laugh. We reprovision them because they're stateless, and life generally goes on. Let me wander too close to the data warehouse and we don't have a company anymore. So people learn pretty damn quick not to let me near things like
Starting point is 00:13:53 that. So my database experience is somewhat minimal. But having built a bunch of borderline horrifying things in my own projects, and having seen what happens with the joy of technical debt as software projects turn into something larger and then have to start scaling, there are a series of common mistakes it seems that people make their first time out. Such as, they tend to assume in every case that they're going to be using a very specific database engine. Or, well, this is just a small application, why would I ever need to separate out my reads from my writes? Which becomes a significant scaling problem down the road, and so on and so on. And then you have people who decide
Starting point is 00:14:30 that, you know, CAP theorem doesn't really apply. That's not really a real thing. We should just turn it on globally. Google says it doesn't matter anymore. And well, that's adorable. But it's those things that you tend to be really cognizant of the second time, because the first time, you mess it up and you wind up feeling embarrassed by it. Do you think it's possible for people to learn from the mistakes of others? Or is this the sort of thing that everyone has to find out for themselves once they really feel the pain of getting it wrong? It actually surprises me that I don't see a lot of due diligence when selecting a database technology. And databases tend to be centerpieces,
Starting point is 00:15:05 maybe not anymore with these more microservices-oriented architectures. It's a little bit of a joke here. But essentially, what I mean is that a database should be a serious choice to make, and you shouldn't take it lightly. And I see people not doing a lot of due diligence and picking technologies just because they claim to be faster, or because they're cool, or because they're the new thing, right? But then it comes with technical debt. So in reality, databases are extremely powerful software. I had a professor at university who said they're the most complex software in the world, second only to compilers.
Starting point is 00:15:35 Whether you agree with that or not, databases are packed with functionality. And it's not a smart decision to ignore that and just reinvent the wheel on your application side. So leverage your database capabilities, learn your database capabilities, and pick your database based on what capabilities you want to have, rather than rewriting them in your application with bugs and inefficiencies, right? One of these examples could be, and there's no pun intended here, MongoDB and the schema-less approach, MongoDB being the champion of this schema-less approach. Schema-less is a good fit for certain cases, but not for all cases. And it's a kind of a trade-off.
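To make that trade-off concrete, here is a hedged sketch (not from any real codebase) of what reading code tends to look like once documents written by several versions of an application coexist in a schema-less store:

```python
def full_name(user: dict) -> str:
    """Normalize user documents written by different versions of the app."""
    if "full_name" in user:            # v3 documents store the name pre-joined
        return user["full_name"]
    if "name" in user and isinstance(user["name"], dict):
        n = user["name"]               # v2 documents nest a sub-document
        return f"{n.get('first', '')} {n.get('last', '')}".strip()
    return user.get("name", "unknown") # v1 documents store a plain string

# Three generations of the "same" record can live in one collection
assert full_name({"name": "Ada Lovelace"}) == "Ada Lovelace"
assert full_name({"name": {"first": "Ada", "last": "Lovelace"}}) == "Ada Lovelace"
assert full_name({"full_name": "Ada Lovelace"}) == "Ada Lovelace"
```

With a relational schema, one migration would have normalized all three shapes once, in the database, instead of every reader branching on them forever.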
Starting point is 00:16:12 Like, you can start writing data faster, but when you query the data, then the application is going to need to figure out, oh, is this key present here or not? Oh, there's a sub-document here, and which fields can I query there? So you start creating versions of your objects, depending on the evolution of your business logic, and so on and so forth. So at the end, you are shifting a lot of business logic to the application that, with another database, say Postgres, could have been done by the database itself. So you start faster, you grow slower. And this is a trade-off that some people have to make. And I'm not saying which one is better. It really depends on your use case, but it's something that people should be aware of, in my opinion. While I've got you here, you've been a somewhat
Starting point is 00:16:55 surprising proponent for something that I would have assumed was a complete non-starter, a don't-do-this. But again, I haven't touched either of these things in anger, at least in living memory. You have been suggesting that it is not completely ridiculous to run Postgres on top of Kubernetes. I've always taken the perspective that anything stateful would generally be better served by having a robust API that the things in Kubernetes talk to. In an AWS context, in other words, oh great, you use RDS and talk to that as your persistent database service, and then have everything in the swirling maelstrom
Starting point is 00:17:30 that is Kubernetes that no one knows what's going on with. You see a different path. Talk to me about that. Yeah, actually, I've given some talks about why I say that you should primarily run Postgres on Kubernetes. Just as I'm saying that Postgres should be your default database option, I also say that Postgres on Kubernetes, likewise, should be the default deployment model, unless you have strong reasons not to.
Starting point is 00:17:56 There's this conventional wisdom, which has become just a myth at this point, that Kubernetes is not for stateful workloads. It is. It wasn't many years ago. It is today. No question. And we can get into the technical details, but essentially, it is safe and good for that. But I would take it even further. It actually could be much better than other environments, because Kubernetes essentially is an API. And this API allows really, really high levels of automation to be created. It can automate compute, it can automate storage, it can automate networking at a level that is not even
Starting point is 00:18:31 possible with the other virtualization environments of some years ago. So databases are not just day-one or day-zero operations, like deploying them and forgetting about them. You need to maintain them. You need to perform operations. You need to perform vacuums and repacks and upgrades and a lot of things that you need to do with your database. And those operations in Kubernetes can be automated. Even if you run on RDS, you cannot run an automated repack or a vacuum of the database. You can do automated upgrades, but not the other ones. You cannot automate a benchmark. I mean, you can do it all yourself, but it's not provided. On Kubernetes, this can be provided.
Starting point is 00:19:07 For example, the operator that we developed for this, StackGres, automates all the operations that I mentioned, and many more are coming down the road. So it is safer from a stateful perspective. Your data will not get lost. Behind the scenes, if you're wondering,
Starting point is 00:19:19 the data will, by default, go to EBS volumes. You can change that, but by default it will go to EBS volumes. So even if the nodes die, the data will remain on the EBS volumes. It can automate even more things than you can automate today in other environments. And it's safe from that perspective. It also heals: a failing node is typically healed much faster than in other environments. So there's a lot of advantages to running on Kubernetes. But in the particular case of Postgres, and if you compare it to managed services, there's additional reasons for that. If you move to Kubernetes because you believe Kubernetes is
Starting point is 00:19:52 bringing advantages to your company, to your velocity, compatibility with production environments, there's a lot of reasons to move to Kubernetes. If you're moving everything to Kubernetes except for your database, well, you're not enjoying all those advantages and reasons why you decided to move to Kubernetes. Move all the way in, if that's the case, to leverage all those advantages. But on top of that, if you look at a managed service like RDS, for example, there are certain extensions that are not available. And you may want to have all the extensions that Postgres developers use. So here you have complete freedom; you can essentially own your environment. So there's multiple reasons for that.
Starting point is 00:20:31 But the main one is that it's safe from a state perspective, and you can get to higher levels of automation than you get today in non-Kubernetes environments. It also feels on some level like it makes it significantly more portable. Because if you wind up building something on AWS and then, for some godforsaken reason, want to move it to another cloud provider, again, a practice that is not highly recommended in most cases, having to relearn what their particular database service's peculiarities are, even if they're both Postgres, let's be clear, seems like there's enough of a discordance or a divergence between them
Starting point is 00:21:04 that you're going to find yourself in operational hell without meaning to. Yeah. Actually, the first thing I would never recommend doing is running Postgres by yourself on a bare EC2 instance. Yes, you're going to save costs compared to RDS. RDS is significantly more expensive than a bare
Starting point is 00:21:19 instance, but I would never recommend doing that yourself. You're going to pay everything else in engineer hours and downtime. But when you talk about services that can automate things, like RDS or even more than RDS, the question is different. And there's a lot of talk recently, and you know probably much more than me, about things like repatriation: going back on-prem, or taking some workloads on-prem or to other environments. And that's where Kubernetes may really help to move workloads across, because you're going to change a couple of lines in one of your YAML files, and that's it, right? So it definitely helps in this regard.
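The "couple of lines" point can be illustrated with a sketch. In a Kubernetes volume claim, the cloud-specific part is essentially the storage class, so retargeting often reduces to patching that one field. The manifest below is pared down, and the class names are typical defaults rather than guarantees; check your own clusters:

```python
import copy

# A pared-down PersistentVolumeClaim, expressed as Python data for illustration
claim = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "postgres-data"},
    "spec": {
        "storageClassName": "gp3",  # an EBS-backed class on an AWS cluster
        "resources": {"requests": {"storage": "100Gi"}},
    },
}

def retarget(manifest: dict, storage_class: str) -> dict:
    """Return a copy of the claim pointed at a different provider's storage class."""
    out = copy.deepcopy(manifest)
    out["spec"]["storageClassName"] = storage_class
    return out

onprem = retarget(claim, "local-path")  # e.g., a common on-prem provisioner
assert onprem["spec"]["storageClassName"] == "local-path"
assert claim["spec"]["storageClassName"] == "gp3"  # original left untouched
```

Everything else in the claim, and in the workloads consuming it, stays the same across providers.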
Starting point is 00:21:54 This episode is sponsored in part by our friends at Strata. Are you struggling to keep up with the demands of managing and securing identity in your distributed enterprise IT environment? You're not alone, but you shouldn't let that hold you back. With Strata's Identity Orchestration Platform, you can secure all your apps on any cloud with any IDP, so your IT teams will never have to refactor
Starting point is 00:22:17 for identity again. Imagine modernizing app identity in minutes instead of months, deploying passwordless on any tricky old app, and achieving business resilience with always-on identity, all from one lightweight and flexible platform. Want to see it in action? Share your identity challenge with them on a discovery call, and they'll hook you up with a complimentary pair of AirPods Pro. Don't miss out. Visit strata.io slash screamingcloud. That's strata.io slash screamingcloud.
Starting point is 00:22:47 You're right. You will pay more for RDS than you will for running your own Postgres on top of EC2. And as a general rule, at certain points of scale, I'm a staunch advocate of the position that your people will cost you more
Starting point is 00:22:59 than you pay for in cloud services. But there is a tipping point of scale: I've talked to customers running thousands of EC2 nodes, running databases on top of them. And when I did the simple math of, okay, if you just migrate that over to RDS, oh dear, the difference there would mean you have to lay off an entire team of people who are managing those things. There's a clear economic win to running your own at this point. Plus, you get the advantage of a significantly higher degree of control. You can tweak things. You can have maintenance happen exactly when you want it to, rather than at some point during a window,
Starting point is 00:23:34 et cetera, et cetera. There are advantages to running this stuff yourself. It feels like there's a gray area. Now, for someone building something at small scale, yeah, go with RDS and don't think about it twice. When is it time, from your position, to start re-evaluating the common wisdom that applies when you're starting out? Well, it's not necessarily related to scale. RDS is always a great choice. Aurora is always a great choice.
Starting point is 00:23:59 Just go with them. But it's also about the capabilities that you want and where you want them. So most of the people right now want a database as a service experience. But again, on something like Kubernetes with operators, you can also have that experience. So it really depends. Is this a greenfield development happening in containers with Kubernetes?
Starting point is 00:24:18 cost perspective, to running on your bare instances. Other than that, for example, Postgres is a fantastic, well, what am I going to say about Postgres, right? It's this database that I love. It's a fantastic database, but it's not batteries-included for production workloads. If you really want to take Postgres to production from an apt-get install, you're going to take a long, long road. You need a lot of components that don't come with Postgres, right? You need to configure them. You need high availability. You need monitoring. You need backups. You need log management. You need
Starting point is 00:24:54 connection pooling. None of those come with Postgres. You need to understand which component to pick from the ecosystem, how to configure it, how to tune it, how to make them all work together. That is a long road. So even if you reach enough scale in your teams that you have talented people, knowledgeable about Postgres environments, who could do all this, it's still probably not worth it. I would just say, go with a solution that packages all this, because it's really a lot of effort. Yeah, it feels like early optimization tends to be a common mistake that people make. We talk about it in terms of software: say, you don't have those scaling problems right now at your three-person startup. But that disagrees with current orthodoxy within the technical community. So I smile, nod, and mostly tend to stay away from that mess.
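Since connection pooling came up as one of the batteries Postgres doesn't include: the idea is simply to reuse a small, fixed set of expensive connections across many callers. Real poolers such as PgBouncer do far more, but the core mechanism can be sketched generically; the connection factory here is a stand-in, not a real driver:

```python
import queue

class ConnectionPool:
    """Minimal pool: hand out at most `size` connections, recycling them."""
    def __init__(self, factory, size: int):
        self._idle = queue.LifoQueue()  # LIFO keeps recently used connections warm
        for _ in range(size):           # pay the connection cost once, up front
            self._idle.put(factory())

    def acquire(self, timeout: float = 5.0):
        return self._idle.get(timeout=timeout)  # blocks if all are in use

    def release(self, conn) -> None:
        self._idle.put(conn)            # hand the connection to the next caller

# Stand-in for an expensive database connection
made = []
def fake_connect():
    made.append(object())
    return made[-1]

pool = ConnectionPool(fake_connect, size=2)
c1 = pool.acquire()
pool.release(c1)
c2 = pool.acquire()    # reuses the same connection instead of opening a third
assert c2 is c1
assert len(made) == 2  # only `size` connections were ever created
```

The payoff is a bounded connection count on the server and no per-request connection setup cost.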
Starting point is 00:25:56 Yeah, but let me give you an example, right? We have one of the users of StackGres, it's open source software, right? They're a company, they have like 200, 300 developers, and they want, for whatever reason, each of those developers to have a dedicated Postgres cluster. Now, if you do that with a managed service, you can imagine what the cost is going to be, right? So instead, what they do is run StackGres on Kubernetes
Starting point is 00:26:17 and run overprovisioned clusters, because they're barely used on average, right? And they can give each developer a dedicated Postgres cluster, and they can even turn them off during the weekend. So there's a lot of use cases where you may want to do these kinds of things, but building your own clusters with monitoring and high availability and connection pooling on your own for 100 developers, that is a huge task. Changing gears slightly, I wanted to talk to you about what I've seen to be an interesting, we'll call it balkanization, I suppose, of what used to be known as the AWS Community Heroes.
Starting point is 00:26:52 They started putting a whole bunch of different adjectives in front of the word hero, which on one level feels like a weird thing to call community advocates, because who in the world wants to self-identify as a hero? It feels like you're breaking your arm patting yourself on the back on some level. But, surprising no one, AWS remains bad at naming things. They have split their heroes into different areas: serverless, community, data in your case. And as a result, I'm starting to see
Starting point is 00:27:17 that the numbers have swelled massively. And I think that sort of goes hand in glove with the idea that you can no longer have one person who can wrap their head around everything that AWS does. It's gotten too vast as a result of their product strategy being a Post-it note that says yes on it. But it does seem, at least from the outside, like there has been an attenuation of the hero community, where you can no longer fit all of them in the same room on some level to talk about shared experiences, just because they're so vast and divergent from one another. What's your take on it, given that you're actually in those rooms? Yeah. Actually, even just within the data category, could I claim to be an expert on all
Starting point is 00:27:58 database technologies or data technologies within AWS? For obvious reasons, I'm definitely not, right? So it is a challenge for everybody in the industry, hero or not, to keep up with the pace of innovation and the number of services that Amazon has. And I miss a little bit those sessions, the idea of sitting together in a room; we've done that, obviously.
Starting point is 00:28:19 And the first time, when I joined as a Data Hero, we were the first batch: nine people, if I remember right. We were the first Data Heroes, nine to eleven people at most. We sat with a lot of Amazon people. We had a great time. We learned a lot. We shared a lot of feedback.
Starting point is 00:28:34 And that was highly valuable to me. It's just that, with bigger numbers now, we need to deal with all of this somehow. I don't know if this is the right path or not; I don't think I'm the person to make the call on this balkanization process. And I definitely don't know about all the Amazon services which are not data services, right? So I don't know if this is the only option, but it kind of makes sense, from the perspective that when people come to me and say, oh, you know, can you give me an opinion or some counsel about something related to data? I, you know, I can deal with that.
Starting point is 00:29:10 If someone asks me about some of the other services, I may or may not know about them. So at least it gives guidance to users on what to reach out to you about. But I wouldn't mind also having to say, you know, I don't know about this, but I know who does. On some level, I feel like my thesis that everything is a database if you hold it wrong has some areas where it tends to be relatively accurate. For example, I tend to view the AWS billing environment as being a database problem, whereas people sometimes look at me strangely, like, it's just sending invoices, what do you mean? It's, yeah, it's an exabyte-scale data problem, based upon things that AWS has said publicly about the billing system. What do you think that runs on? And I don't know about you, but when I try to open a, you know, terabyte-large CSV file, Excel catches fire and my computer starts to smell like burning metal.
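The terabyte-CSV complaint is the whole argument for treating billing as a data problem: a spreadsheet loads the file into memory, while a database, or even a small script, can stream it row by row. A toy sketch of that idea in Python, using an in-memory sample with made-up, simplified column names (real Cost and Usage Reports have hundreds of columns):

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for an AWS Cost and Usage Report. The column names here
# are hypothetical simplifications for illustration only.
SAMPLE = """\
product_code,usage_type,unblended_cost
AmazonEC2,BoxUsage:t3.micro,10.40
AmazonS3,TimedStorage-ByteHrs,2.15
AmazonEC2,DataTransfer-Out-Bytes,0.45
"""

def cost_by_service(lines):
    """Stream the report one row at a time, so a terabyte file never
    has to fit in memory the way it would in a spreadsheet."""
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row["product_code"]] += float(row["unblended_cost"])
    return dict(totals)

result = {k: round(v, 2) for k, v in cost_by_service(io.StringIO(SAMPLE)).items()}
print(result)  # {'AmazonEC2': 10.85, 'AmazonS3': 2.15}
```

Streaming keeps memory flat no matter how large the file grows; a real pipeline would COPY the rows into Postgres and do the aggregation in SQL, with indexes and data that never needs to fit in RAM.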
Starting point is 00:30:03 So there's definitely a tooling story. There's definitely a way of thinking about these things. On some level, I feel like I'm being backed into becoming something of a data person just based on necessity. The world changes all the time, whether we want it to or not. I can't imagine how much work you're doing analyzing bills; they're really detailed and complicated. I mean, we're doing this ourselves: we import data into a Postgres database and run queries on the billing. So I'm sure you can do that, and maybe you would benefit from it. But actually, you touched on a topic that I like very much, which is trying to
Starting point is 00:30:35 guess the underlying architecture of some of the Amazon services. I've had some good fun trying to do this. Let me give a couple of examples. So, for example, there's DocumentDB, right? This MongoDB-compatible service. And there was some discussion on Hacker News about how it's built, because it's not using MongoDB source code, for legal reasons, for licensing reasons. So what is it built on?
Starting point is 00:31:00 And I claimed from the very beginning that it's written on top of Postgres. First of all, because I know this can be done: I wrote the open source software called ToroDB, which is essentially MongoDB on top of Postgres, so I know it can be done very well. But on top of that, I found in the documentation some hints that are clearly, clearly Postgres characteristics, like identifiers at the database level cannot be more than 63 characters, or that the null character, the zero byte in UTF-8, cannot be represented. So anyway, Amazon has never confirmed nor denied it, but I know, and I claim publicly, that
Starting point is 00:31:38 DocumentDB is based on Postgres, probably Aurora, but essentially Postgres technology behind the scenes. Same applies to DynamoDB. This is more public, I would say, but it's also a relational database under the hood. I mean, there are several HTTP routing and sharding layers, but under the hood of each of those shards is a modified MySQL. It could have been Postgres, whatever, but it's still a relational database. So it's a fun exercise for me, trying to guess how the services work. I've also done exercises into how serverless Postgres works, et cetera. Now, I haven't dug deeper into the billing system and what technology is under there, and I'd advise no one to do that. But okay, let me give you my bet. My bet is that there is a relational database under the hood,
Starting point is 00:32:20 I mean, or clusters of relational databases, probably sharded by customer or by groups of customers. And because we know that Amazon relied a lot, in the early days, on Oracle, and then they migrated, or so they claimed, everything to Postgres, I'm also going to claim it's somehow Postgres, probably Aurora Postgres, under the hood. But I have no facts to sustain this claim. Just a wild guess. Come to find out, someone at the billing system is sitting in a room surrounded by just acres of paper, and they're listening to this episode like, oh my god, we can use computers. Yeah, it'll be great. That'll be great.
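As a sketch of the fingerprinting Álvaro describes: the 63-character limit is Postgres's compile-time NAMEDATALEN constant (64 bytes, minus one terminating byte), and Postgres truncates longer identifiers rather than rejecting them. A rough Python approximation of that rule; the real server truncates by bytes and emits a NOTICE, details this sketch only imitates:

```python
NAMEDATALEN = 64  # Postgres compile-time constant; usable length is 63 bytes

def truncate_identifier(name: str) -> str:
    """Mimic Postgres identifier truncation: names longer than
    NAMEDATALEN - 1 bytes are silently cut down (the server emits a
    NOTICE rather than an error). Truncation happens on UTF-8 bytes,
    without splitting a multi-byte character in half."""
    limit = NAMEDATALEN - 1
    encoded = name.encode("utf-8")
    if len(encoded) <= limit:
        return name
    # Back up past any UTF-8 continuation bytes (0b10xxxxxx) so the
    # cut lands on a character boundary.
    while limit > 0 and (encoded[limit] & 0xC0) == 0x80:
        limit -= 1
    return encoded[:limit].decode("utf-8")

print(len(truncate_identifier("a" * 100)))  # 63
```

Spotting limits like this one, or a rejected zero byte in text values, in a "MongoDB-compatible" service's documentation is exactly the kind of tell that suggests Postgres underneath.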
Starting point is 00:32:58 No, I'm kidding. They're very sharp people over there. I have no idea what's under the hood. It's one of those areas where I just can't fathom having to process data at those volumes. And it's the worst kind of data, because if they drop the ball and don't bill people for usage, no one in the world externally is going to complain. No one is happy about the billing system. It's, "oh, good, you're here to shake me down for money. Glorious." The failure modes are all invisible, or to the outside world's benefit.
Starting point is 00:33:26 But man, does that sound like a fun problem. Yeah, yeah, absolutely. And very likely they run reports on Redshift. Oh, yeah. I really want to thank you for being so generous with your time. If people want to learn more about what you're up to
Starting point is 00:33:40 and see what other horrifying monstrosities you've created on top of my dumb ideas, where can people find you? Okay, so I'm available in all the usual channels. People can find me mainly on my website. It's very easy: aht.es. Those are my initials. And that's where I keep track
Starting point is 00:33:59 of all my public speaking: talks, videos, blog posts, et cetera. So, aht.es. But I'm also easy to find on Twitter, LinkedIn, and my company's website, ongres.com. Feel free to ping me anytime. I'm really open to receiving feedback and ideas, especially if you have a crazy idea
Starting point is 00:34:17 that I may be interested in, let me know. And we will, of course, put links to that in the show notes. Thank you so much for your time. I appreciate it. I'm excited to see what you come up with next. Okay, I may have some ideas. But no, thank you very much for hosting me today. It's been a pleasure.
Starting point is 00:34:33 Álvaro Hernandez, founder of OnGres. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, insulting comment that goes way deep into the weeds of what database system your podcast platform of choice is most likely using. If your AWS bill keeps rising and your blood pressure is doing the same,
Starting point is 00:35:08 then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started. This has been a HumblePod production. Stay humble.
