Screaming in the Cloud - The Ever-Growing Ecosystem of Postgres with Álvaro Hernandez
Episode Date: February 9, 2023

Álvaro Hernandez, Founder of OnGres, joins Corey on Screaming in the Cloud to discuss his hobby project Dyna53, the balkanization of AWS services, and all things Postgres. Álvaro and Corey discuss what it means to be an AWS Community Hero these days, and Álvaro shares some of his experiences as one of the first Heroes to provide feedback on AWS services. Álvaro also shares his thoughts on why people shouldn't underestimate the importance of selecting the right database, why he feels Postgres and Kubernetes work so well together, and the ever-growing ecosystem of Postgres.

About Álvaro

Álvaro is a passionate database and software developer. Founder of OnGres ("ON postGRES"), he has been dedicated to Postgres and R&D in databases for more than two decades. Álvaro is at heart an open source advocate and developer. He has created software like StackGres, a platform for running Postgres on Kubernetes, and ToroDB (MongoDB on top of Postgres). As a well-known member of the PostgreSQL community, Álvaro founded the non-profit Fundación PostgreSQL and the Spanish PostgreSQL User Group. He has contributed, among other things, the SCRAM authentication library to the Postgres JDBC driver. You can find him frequently speaking at PostgreSQL, database, cloud (becoming an AWS Data Hero in 2019), and Java conferences. In the last 10 years, Álvaro has given more than 120 tech talks (https://aht.es).

Links Referenced:
OnGres: https://ongres.com/
Dyna53: https://dyna53.io/
Personal Website: https://aht.es
Twitter: https://twitter.com/ahachete
LinkedIn: https://www.linkedin.com/in/ahachete/
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Today's episode is brought to you in part by our friends at Minio,
the high-performance Kubernetes native object store that's built for the multi-cloud,
creating a consistent data storage layer for your public cloud instances, your private cloud instances,
and even your edge instances, depending upon what the heck you're
defining those as, which depends probably on where you work.
Getting that unified is one of the greatest challenges facing developers and architects
today.
It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run
any workload, and the footprint to run anywhere. And that's exactly what Minio offers. With superb read
speeds in excess of 360 gigs and a 100 megabyte binary that doesn't eat all the data you've got
on the system, it's exactly what you've been looking for. Check it out today at min.io slash download and see for yourself.
That's min.io slash download.
And be sure to tell them that I sent you.
This episode is sponsored in part by our friends at Logicworks.
Getting to the cloud is challenging enough for many places,
especially maintaining security, resiliency, cost control, agility, etc., etc., etc.
Things break, configurations drift,
technology advances, and organizations, frankly, need to evolve. How can you get to the cloud
faster and ensure you have the right team in place to maintain success over time? Day two matters.
Work with a partner who gets it. Logicworks combines the cloud expertise and platform automation to customize solutions
to meet your unique requirements.
Get started by chatting with a cloud specialist today at snark.cloud slash logicworks.
That's snark.cloud slash logicworks.
And my thanks to them for sponsoring this ridiculous podcast.
Welcome to Screaming in the Cloud. I'm Corey Quinn. If I could be said to have one passion
in life, it would be inappropriately using things as databases. Because frankly, they're all
databases. If you have knowledge that I want access to, that's right, I can query you. You're
a database. Enjoy it. Today's guest has helped me take that one step further.
Álvaro Hernandez is the founder at OnGres, which we will get to in due course. But the reason he
is here is he has built the rather excellent Dyna53. Álvaro, thank you for joining me first off.
Thank you for having me. It's going to be fun, I guess.
Well, I certainly hope so. So I have
been saying for years now, correctly, that Route 53 is Amazon's premier database offering. Just
take whatever data you want, stuff it into text records, and great, it's got 100% SLA,
it is globally distributed, the consistency is eventual, and effectively it works gloriously. Disclaimer,
this is mostly a gag. Please don't actually do this in production because the last time I made
a joke like this, I found it at a bank. So you have taken the joke, horrifying though it is,
a step further by sitting down one day to write Dyna53. What is this monstrosity?
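For anyone who wants to see how the gag works mechanically, a minimal sketch of stuffing a value into a Route 53 TXT record with boto3 might look like the following. The hosted zone ID and record name are placeholders invented for the example, and, as Corey says, please don't do this anywhere that matters.

```python
# A sketch of the "Route 53 as a database" gag: stuff an arbitrary value into a TXT record.
# The hosted zone ID and domain below are placeholders; do not do this in production.
import boto3

route53 = boto3.client("route53")

def put_value(zone_id: str, name: str, value: str) -> None:
    # TXT record values must be wrapped in double quotes, and each string is capped at
    # 255 characters, which is exactly the kind of constraint Dyna53 has to route around.
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "TXT",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": f'"{value}"'}],
                },
            }]
        },
    )

put_value("Z0000000000EXAMPLE", "user42.db.example.com", "hello-world")
```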
Okay, actually it took a little bit more than one day.
But essentially, this is a hobby project. I just want to have some fun. Most of my day is managing
my company and not programming, which I love. So I decided, let's program something. And there
was reasons I wanted to do this. We can get on that later. But essentially, so what it is,
so Dyna53, it's essentially DynamoDB where the data is stored in Route 53.
So you laid out the path, right?
Like, data can be stored in TXT records.
Actually, I use both TXT and SRV records.
And we can get into the tech details if you want to.
But essentially, when you run Dyna53 on top of Route 53, the data is stored in DNS records.
But it exposes a real database interface because otherwise it's not a real
database until it mimics a real database interface. So you can use the Amazon CLI with DynamoDB.
You can use any GUI program that works with DynamoDB. And it works as if you're using DynamoDB,
except the data is stored in a hosted zone in your Route 53 account. And it's open source, by the way.
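To make the "any DynamoDB client works" claim concrete: a stock client only needs its endpoint overridden. The URL below is a placeholder for wherever a Dyna53 instance happens to be listening, and the table name is invented; this is an illustration of the pattern, not Dyna53's documented setup.

```python
# Because Dyna53 speaks the DynamoDB API, a standard client just points at a different endpoint.
# The endpoint URL is a placeholder; it assumes a table named "demo" already exists on that side.
import boto3

dynamo = boto3.client(
    "dynamodb",
    endpoint_url="http://localhost:8000",  # wherever Dyna53 is running (assumption)
    region_name="us-east-1",
)

dynamo.put_item(
    TableName="demo",
    Item={"pk": {"S": "greeting"}, "value": {"S": "stored in DNS, regrettably"}},
)
print(dynamo.get_item(TableName="demo", Key={"pk": {"S": "greeting"}}))
```

The same override works with the AWS CLI's `--endpoint-url` flag, which is what lets existing DynamoDB GUIs and tooling talk to it unchanged.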
Under the hood, does it use DynamoDB at all?
Not at all.
Excellent.
Because if it did that, it wouldn't be, I guess, necessary in the least.
And it would also be at least 80% less ridiculous,
which is kind of the entire point.
Now, you actually do work with data for a living.
You're an AWS Data Hero,
which is their bad name for effectively community folk who
are respected, prolific, and know what they're talking about in the ecosystem around certain
areas. You work with these things for a living. Is there any scenario in which someone would
actually want to do this, or is it strictly something to be viewed as a gag and maybe a
fun proof of concept, but never do anything like this in reality? No, I totally discourage using this in anything serious. I mean, technically speaking, and if you
want to talk about billing, which as far as I know is something you care about, you could save some
money for super small testing scenarios where you might want to run Dyna53 on AWS Lambda
against a hosted zone that maybe you already have,
so you don't need to pay for it.
And you may save some money as long as you do fewer than three or four million transactions
per month on fewer than 10,000 records, which is the Route 53 limit by default.
And you're going to save two, three dollars per month.
So, I mean, yeah, it's not money, right?
So your Starbucks coffee is going to be more than
that. So essentially, yes, don't use it. It's a gag. It's just for fun. But actually, it was a
surprising, you know, fun engineering game. If you want to look at the code, anybody can do this,
please star it on GitHub. It's a joke project anyway. I'm not going to get into the Arctic
Vault because of this. It's a funny exercise on how to route around all the limitations for the text records
that are on Route 53 and all the quotas and all the, you know, like there's a lot of stuff going
on there. So I think from an architectural perspective and a code perspective, it's a
fun thing to look at. I'm a big fan of when people take ideas that I've come up with more or less on
the fly, just trying to be ridiculous and turn it into
something. I do appreciate the fact that it does come with a horrible warning of, yeah, this is
designed to be funny. This is not intentionally aimed at something you should run your insurance
company on top of. And I do find that increasingly as my audience gets larger, and I mean that in
terms of raw numbers, not volume, the problem that I keep smacking into is that people don't always have context.
I saw a Reddit post a couple of years ago after I started the Route 53 as a database gag and saw someone asking, well, okay, I get it's designed to be a gag, but would this actually work in production?
At which point I had to basically go crashing in.
Yeah, this actually happened once at a company I worked at, which is where the joke came from.
Here are the problems you're going to have with it.
It's a nice idea.
There are better ways in 2023 to skin that cat.
You probably do not want to do it.
And if you do, I disclaim all responsibility, et cetera, et cetera.
Just it's, otherwise there's always the danger of people taking you too seriously.
Speaking of being taken seriously,
during the day, you are the founder at OnGres. What do you folks do over there?
OnGres means on Postgres. So we are essentially a Postgres-specialized shop. We offer both professional services, which is 24/7 support, monitoring, consulting. And we do a lot of,
for example, migrations from other
databases, especially Oracle, to Postgres.
But we also develop software for the Postgres ecosystem.
Right now, for example, we have developed the fully open source StackGres project, a stack
of components on top of Postgres, which is what you essentially need to run Postgres
on Kubernetes for production-quality workloads.
So Postgres, Postgres, Postgres.
I'm known in many environments as the Postgres guy.
And if you say Postgres three times,
I typically pop up and answer whatever question is there.
I find that increasingly over the past few years,
there has been a significant notable shift
as far as, I guess, the zeitgeist
or what most of the community centralizes around
when it comes to databases.
Back when I used to touch production systems in anger
and, oh, I was oh so angry,
I found that MySQL was sort of the default database engine
that most people went with.
These days, it seems that almost anything Greenfield
that I see starts with Postgres,
or as I insist on calling it, Postgresqueel.
And as a result, I find that that seems to have become a de facto standard kind of out of nowhere.
Was I asleep at the wheel, or have I missed something that's been happening all along?
What is the deal here? Well, this is definitely the case. Postgres is becoming the de facto
standard for, especially as you said, greenfield deployments, and not only Postgres itself, but also Postgres-compatible and Postgres-derived projects.
If you look at Google's cloud offering, now they've added a compatibility layer with Postgres.
And if you look at AlloyDB, it's only compatible with Postgres and not MySQL. It's kind of
Aurora's equivalent, and it's only Postgres there, not Postgres and MySQL. A lot of databases are adding a Postgres compatibility layer, the
wire protocol, so you can use Postgres drivers. So Postgres as a wider ecosystem, yes, it's becoming
the de facto standard for new developments and many migrations. Why is that? I think it's a
combination of factors, most of them being that, and pardon me, all MySQL fans out there, but I believe that
Postgres is more technically advanced, has more features, more completeness of SQL, and more
capabilities, therefore, in general, compared to MySQL. And it has a strong reputation, very solid,
very reliable, never corrupting data. You might think it performs a little bit better, a little
bit worse, but in reality,
what you care about is your data being there all the time.
And that it's stable
and it's rock solid. You can throw stones
at Postgres, it will keep running. You could
really configure it badly
and make it go slow, but it will still
work there. It will still be there when you need it.
So I think it's
the thing that cannot go wrong.
If you choose Postgres, you're very likely
not wrong. If you look for a new fancy database for a fancy new project, things may or may not
work. But if you go Postgres, it's essentially the Swiss army knife of today's modern databases.
If you use Postgres, you may get 80% of the performance or 80% of what your really new
fancy database will do, but it will work for almost any workload,
any use case that you may have.
And you may not need specialized databases
if you just stick to Postgres,
so you can standardize on it.
On some level, it seems that there are two diverging philosophies,
and which one is the right path invariably seems to be tied
to whoever is telling the story wants to wind
up selling in various ways. There's the idea
of a general purpose database where
oh, great, it's one of those you can
have in any color you want as long as it's black style
of approach, where everything should wind
up living in a database,
a specific database. And then you have the
Cambrian explosion of purpose-built databases
where that's sort of the AWS approach, where it
feels like the DBA job of the future is deciding which of Amazon's 40 managed database services by
then are going to need to be used for any given workload. And that doesn't seem like it's
necessarily the right approach either on some level. It feels like it's a spectrum. Where do
you land on it? So let me speak actually about Postgres extensibility. Postgres has an extensibility
mechanism called extensions, not a super original name, which is essentially like plugins. Think of
your browser's plugins that augment functionality. And it's surprisingly powerful. And you can take
Postgres and transform it into something else. So Postgres has a lot of functionality built into the core,
JSON support, for example,
but then you have extensions for GraphQL, you have extensions for sharding Postgres, you have
extensions for time series, you have extensions for geo, for anything that you can almost think of.
So in reality, once you use these extensions, you can get, you know, very close to what a specialized
purpose-built database may get, maybe, you know, as I said before, like 80% of the way there,
but, you know, at the cost of just standardizing everything on Postgres.
So it really depends where you are at.
If you are planning to run everything as managed services,
you may not care that much because someone is managing them for you.
I mean, from a developer perspective,
you still need to learn these 48 or however many APIs, right?
But if you're going to run things on your own, then consolidating technologies is a very interesting approach. And in this case, Postgres is an excellent home for that approach.
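As a concrete flavor of what "standardizing on Postgres" can look like, here is a minimal sketch using psycopg2. The connection details, table, and column names are invented for the example, and whether a given extension is available depends entirely on what is installed on your server (pg_trgm ships with Postgres contrib; something like PostGIS is a separate install).

```python
# Sketch of Postgres extensibility: enable an extension and query JSON, all in one engine.
# Connection string, table, and data are placeholders for illustration only.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=app user=postgres host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")    # fuzzy text search, ships in contrib
    # cur.execute("CREATE EXTENSION IF NOT EXISTS postgis")  # geo, only if installed on the server

    cur.execute("CREATE TABLE IF NOT EXISTS events (id serial PRIMARY KEY, payload jsonb)")
    cur.execute("INSERT INTO events (payload) VALUES (%s)",
                [Json({"kind": "signup", "plan": "free"})])

    # Built-in JSONB operators: no separate document database needed for this kind of query.
    cur.execute("SELECT count(*) FROM events WHERE payload->>'kind' = %s", ["signup"])
    print(cur.fetchone()[0])
```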
One of the things that I think has been, how do I put this in a way that isn't going to actively
insult people who have gone down certain paths on this? There's an evolution, I think, of how you wind up interacting with databases.
At least that is my impression of it. And let's be clear, I, in my background as a production
engineer, I tended to focus on things that were largely stateless, like web servers. Because when
you break the web servers, as I tended to do, we all have a good laugh. We reprovision them because
they're stateless and life generally goes on. Let my aura get too close to the data warehouse and we
don't have a company anymore. So people learn pretty damn quick not to let me near things like
that. So my database experience is somewhat minimal. But having built a bunch of borderline
horrifying things in my own projects and seeing what's happened with the joy of technical debt as
software projects turn into something larger and then have to start scaling,
there are a series of common mistakes it seems that people make their first time out.
Such as they tend to assume in every case that they're going to be using a very specific database
engine. Or they, well, this is just a small application. Why would I ever need to separate
out my reads from my writes, which becomes a
significant scaling problem down the road, and so on and so on. And then you have people who decide
that, you know, CAP theorem doesn't really apply. That's not really a real thing. We should just
turn it on globally. Google says it doesn't matter anymore. And well, that's adorable.
But it's those things that you tend to be really cognizant of the second time,
because the first time, you mess it up and you wind up feeling embarrassed by it.
Do you think it's possible for people to learn from the mistakes of others?
Or is this the sort of thing that everyone has to find out for themselves once they really feel the pain of getting it wrong?
It actually surprises me that I don't see a lot of due diligence when selecting a database technology.
And databases tend to be centerpieces,
maybe not anymore with these more microservices-oriented architectures, that's a little bit of a
joke here. But essentially, what I mean is that a database should be a serious choice to make,
and you shouldn't take it lightly. And I see people not doing a lot of due diligence and
picking technologies just because they're claiming to be faster or because they're cool or because
they're the new thing, right? But then it comes with a technical debt.
So in reality, databases are extremely powerful software.
I had a professor at university who said
they're the most complex software in the world, second only to compilers.
Whether you agree with that or not,
databases are packed with functionality.
And it's not a smart decision to ignore that and just reinvent the wheel on your application side.
So leverage your database capabilities, learn your database capabilities, and pick your database based on what capabilities you want to have, rather than rewriting them in your application with bugs and inefficiencies, right? One of these examples could be, and there's no pun intended here, MongoDB and the schema-less approach, them being the champion of this schema-less approach.
Schema-less is a good fit for certain cases, but not for all the cases.
And it's a kind of a trade-off.
Like, you can start writing data faster, but when you query the data, then the application is going to need to figure out, oh, but is there this key present here or not?
Oh, there's a sub-document here, and which fields can I query there?
So you start creating versions of your objects, depending on the evolution of your business logic and so on and so
forth. So in the end, you are shifting to the application a lot of business logic that, with another
database, say Postgres, could have been done by the database itself. So you start faster, but you grow
slower. And this is a trade-off that some people have to make. And I'm
not saying which one is better. It really depends on your use case, but it's something that people should be aware of, in my opinion.
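To make that trade-off concrete (this is an illustration, not anything Álvaro prescribes): a JSONB column keeps the flexible, schema-less write path, while a couple of constraints push the "is this key even present?" checks back into the database instead of into every application that reads the data. Table, constraint, and key names here are invented for the example.

```python
# Illustration of pushing validation into the database instead of the application:
# a flexible JSONB document, but with the keys the business actually relies on enforced by Postgres.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=app user=postgres host=localhost")  # placeholder connection
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        id  serial PRIMARY KEY,
        doc jsonb NOT NULL,
        -- the keys the business relies on are enforced here, not in every client
        CONSTRAINT must_have_customer CHECK (doc ? 'customer_id'),
        CONSTRAINT non_negative_total CHECK ((doc->>'total')::numeric >= 0)
    )
""")
cur.execute("INSERT INTO orders (doc) VALUES (%s)",
            [Json({"customer_id": 42, "total": "19.99", "notes": "free-form is still fine"})])
conn.commit()

try:
    # This one fails at write time instead of surprising a reader months later.
    cur.execute("INSERT INTO orders (doc) VALUES (%s)", [Json({"notes": "no customer_id"})])
    conn.commit()
except psycopg2.errors.CheckViolation:
    conn.rollback()
    print("rejected by the database, not by application code")
```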
While I've got you here, you've been a somewhat
surprising proponent for something that I would have assumed was a complete non-starter,
don't do this. But again, I haven't touched either of these things in anger, at least
in living memory. You have been suggesting that it is not completely ridiculous to run Postgres
on top of Kubernetes. I've always taken the perspective that anything stateful would
generally be better served by having a robust API, the thing in Kubernetes talks to. In an AWS
context, in other words, oh great, you use RDS and talk to that
as your persistent database service
and then have everything in the swirling maelstrom
that is Kubernetes that no one knows what's going on with.
You see a different path.
Talk to me about that.
Yeah, actually, I've given some talks about why I say
that you should primarily run Postgres on Kubernetes.
Like, just as I'm saying that Postgres should be your default database option,
I also say that Kubernetes, Postgres on Kubernetes, henceforth,
should be the default deployment model, unless you have strong reasons not to.
There's this conventional wisdom, which has become already just a myth,
that Kubernetes is not for stateful workloads.
It is. It wasn't many years ago. It is today.
No question. And we can get into the technical details, but essentially, it is safe and good
for that. But I would go even further. It actually could be much better than other
environments because Kubernetes essentially is an API. And this API allows really, really high
levels of automation to be created. It can
automate compute, can automate storage, can automate networking at a level that is not even
possible with other virtualization environments of some years ago. So databases are not just
day one or day zero operations, like deploying them and forgetting about them. You need to
maintain them. You need to perform operations. You need to perform vacuums and repacks and upgrades
and a lot of things that you need to do with your database. And those operations in Kubernetes can
be automated. Even if you run on RDS, you cannot run automated repack or a vacuum of the database.
You can do automated upgrades, but not the other ones. You cannot automate a benchmark. I mean,
you could do it all yourself, but it's not provided.
On Kubernetes, this can be provided.
For example, the open source operator
that we developed for this,
StackGres, automates all the operations
that I mentioned,
and many more are coming down the road.
So it is safer from a stateful perspective.
Your data will not get lost.
Behind the scenes, if you're wondering,
data will go, by default, to EBS volumes.
You can change that, but with the default,
even if the nodes die, the data will remain on the EBS volumes. It can automate even more things than you can
automate today in other environments. And it's safe from that perspective. It also heals: typically,
a failing node is healed much faster than in other environments. So there's a lot of advantages
for running on Kubernetes. But on the particular case of Postgres, and if you compare it to managed services,
there's additional reasons for that. If you move to Kubernetes because you believe Kubernetes is
bringing advantages to your company, to your velocity, compatibility with production
environments, there's a lot of reasons to move to Kubernetes. If you're moving everything to Kubernetes
except for your database, well, you're not enjoying all those advantages and reasons why you decided to move to Kubernetes.
Go all in, if that's the case, to leverage all these advantages.
But on top of that, if you look at a managed service like RDS, for example, there are certain extensions that are not available.
And you may want to have all the extensions that Postgres developers use.
So here you have complete freedom, you can own essentially your environment.
So there's multiple reasons for that.
But the main one is that it's safe from a state perspective.
And you can get to higher levels of automation than you get today in non-Kubernetes environments.
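Neither host nor guest spells out what the operator pattern looks like in practice, so here is a rough sketch using the Kubernetes Python client: declare a Postgres cluster as a custom resource and let the operator handle provisioning, replication, and day-two operations. The SGCluster field names below are recalled from the StackGres documentation and should be treated as assumptions, not a verified copy of the current CRD schema.

```python
# Rough sketch of the operator pattern: declare a Postgres cluster declaratively and let
# the operator do the rest. The SGCluster fields are assumptions from memory, not a
# verified copy of the StackGres CRD schema; check the StackGres docs for the real spec.
from kubernetes import client, config

config.load_kube_config()

cluster = {
    "apiVersion": "stackgres.io/v1",
    "kind": "SGCluster",
    "metadata": {"name": "demo-db", "namespace": "default"},
    "spec": {
        "instances": 2,                                       # primary plus one replica
        "postgres": {"version": "15"},
        "pods": {"persistentVolume": {"size": "10Gi"}},       # data survives node loss
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="stackgres.io", version="v1", namespace="default",
    plural="sgclusters", body=cluster,
)
```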
It also feels on some level like it makes it significantly more portable.
Because if you wind up building something on AWS and then for some godforsaken reason want to move it to another cloud provider,
again, a practice that is not highly recommended in most cases,
having to relearn what their particular database services peculiarities are,
even if they're both Postgres, let's be clear,
seems like there's enough of a discordance or a divergence between them
that you're going to find yourself
in operational hell without meaning to.
Yeah. Actually,
the first thing I would never recommend to do is
running Postgres by yourself on a bare EC2 instance.
Yes, you're going to save
costs compared to RDS.
RDS is significantly more expensive than a bare
instance, but I would never recommend
to do that yourself. You're going to pay everything else
in engineer hours and downtime. But when you talk about services that can automate things like RDS
or even more than RDS, the question is different. And there's a lot of talk recently, and you know
probably much more than me, about things like repatriation, going back on-prem or taking some
workloads on-prem or to other environments. And that's where Kubernetes may really help to move workloads across, because you're going
to change a couple of lines in one of your YAML files, and that's it, right?
So it definitely helps in this regard.
This episode is sponsored in part by our friends at Strata.
Are you struggling to keep up with the demands of managing and securing identity in your
distributed enterprise IT environment?
You're not alone,
but you shouldn't let that hold you back.
With Strata's Identity Orchestration Platform,
you can secure all your apps on any cloud with any IDP,
so your IT teams will never have to refactor
for identity again.
Imagine modernizing app identity in minutes
instead of months,
deploying passwordless on any tricky old app,
and achieving business resilience with always-on identity, all from one lightweight and flexible
platform. Want to see it in action? Share your identity challenge with them on a discovery call,
and they'll hook you up with a complimentary pair of AirPods Pro. Don't miss out. Visit
strata.io slash screamingcloud. That's strata.io slash screamingcloud.
You're right.
You will pay more for RDS
than you will for running your own Postgres
on top of EC2.
And as a general rule,
at certain points of scale,
I'm a staunch advocate of
your people will cost you more
than you pay for in cloud services.
But there is a tipping point of scale
where I've talked to
customers running thousands of EC2 nodes, running databases on top of them. And when I did the
simple math of, okay, if you just migrate that over to RDS, oh dear, the difference there would
mean you have to lay off an entire team of people who are managing those things. There's a clear
win economically to run your own at this point. Plus, you get the advantage of a higher degree of control. You can tweak things. You can
have maintenance happen exactly when you want it to, rather than at some point during a window,
et cetera, et cetera. There are advantages to running this stuff yourself. It feels like
there's a gray area. Now, for someone building something at small scale, yeah, go with RDS and don't think about it twice.
When is it time, from your position,
to start re-evaluating the common wisdom
that applies when you're starting out?
Well, it's not necessarily related to scale.
RDS is always a great choice.
Aurora is always a great choice.
Just go with them.
But it's also about the capabilities that you want
and where you want them.
So most of the people right now want a database as a service experience.
But again, on something like Kubernetes with operators,
you can also have that experience.
So it really depends.
Is this a greenfield development happening in containers with Kubernetes?
Then it's maybe a good reason to run Postgres on Kubernetes,
which equates, from a cost perspective, to running
on your bare instances. Other than that, for example, Postgres is a fantastic... well, what am I
going to say about Postgres, right? It's this database that I love. It's a fantastic database,
but it's not batteries included for production workloads. If you really want to take Postgres
to production from an apt-get install, you're going to take a long, long road.
You need a lot of components that don't come with Postgres, right? You need to configure them. You
need high availability. You need monitoring. You need backups. You need logs management. You need
a lot of connection pooling. None of those come with Postgres. You need to understand which
component to pick from the ecosystem, how to configure it, how to tune it, how to make them
all work together. That is a long road. So even if you reach enough scale in your teams where you have talented people,
knowledgeable about Postgres environments, who can do this, it's still probably not worth it. I would just say,
just go with a solution that packages all this, because it's really a lot of effort.
Yeah, it feels like early optimization tends to be a common mistake that people make. We talk about it in terms of software: solving, say, scaling problems that you don't have right now at your three-person startup.
But that disagrees with current orthodoxy within the technical community.
So I smile, nod, and mostly tend to stay away from that mess.
Yeah, but let me give you an example, right?
We have one of the users of StackGres, it's open source software, right?
They're a company, they have like 200, 300 developers,
and they want, for whatever reason,
each of those developers to have a dedicated Postgres cluster.
Now, if you do that with a managed service,
the cost, you can imagine what it's going to be, right?
So instead, what they do is run StackGres on Kubernetes
and run over-provisioned clusters,
because they're barely using them on average, right?
And they can give each developer
a dedicated Postgres cluster, and they can even turn them off during the weekend. So there's a
lot of use cases where you may want to do these kinds of things, but rolling your own clusters with
monitoring and high availability and connection pooling on your own for 100 developers, that is
a huge task. Changing gears slightly, I wanted to talk to you about what I've seen to be an interesting,
we'll call it balkanization, I suppose, of what used to be known as the AWS community heroes.
They started putting a whole bunch of different adjectives in front of the word hero, which
on one level, it feels like a weird thing to wind up calling community advocates, because who in the
world wants to self-identify as a hero? It feels like you're breaking your arm, patting yourself
on the back on some level.
But surprising no one, AWS remains bad at naming things.
But they have split their heroes into different areas,
serverless, community, data in your case.
And as a result, I'm starting to see
that the numbers have swelled massively.
And I think that sort of goes hand in glove
with the idea that you can no longer have one person that can wrap their head around everything that AWS does.
It's gotten too vast as a result of their product strategy being a post-it note that says yes on it.
But it does seem, at least from the outside, like there has been an attenuation of the hero community where you no longer can fit all of them in the same room on some level
to talk about shared experiences just because they're so vast and divergent from one another.
What's your take on it, given that you're actually in those rooms?
Yeah. Actually, even just within the data category, I cannot claim to be an expert on all
database technologies or data technologies within AWS, for obvious reasons; I'm definitely not,
right? So it is a challenge for everybody in the industry,
whether hero or not hero,
to keep up with the pace of innovation
and amount of services that Amazon has.
I miss those occasions a little bit.
The idea to sit together in a room,
we've done that, obviously.
And the first time, when I joined as a Data Hero,
we were in the first batch, nine people, if I remember correctly.
We were the first Data Heroes.
And we were nine to 11 people at most.
We sat with a lot of Amazon people.
We had a great time.
We learned a lot.
We shared a lot of feedback.
And that was highly valuable to me.
It's just with bigger numbers right now, we need to deal with all this.
But I don't know if this is the right path or not.
I don't think I'm the person to make the right call on whether this balkanization process is right or not,
and I definitely don't know about all the Amazon services which are not data services, right? I don't know if this is
the only option, but it kind of makes sense from the perspective that when people come to
me and say, oh, you know, could you give me an opinion or some counsel about something related to data?
I, you know, I can deal with that.
If someone asks me about some of the services, I may or may not know about them.
So at least it gives guidance to users on what to reach you about.
But I wouldn't mind also having to say, you know, I don't know about this, but I know who knows about this thing. On some level, I feel like my thesis that everything is a database if you hold it wrong
has some areas where it tends to be relatively accurate. For example, I tend to view the AWS
billing environment as being a database problem, whereas people sometimes look at me strangely,
like, it's just sending invoices. What do you mean? It's, yeah, it's an exabyte-scale data problem based upon things that AWS has said publicly about the billing system.
What do you think that runs on?
And I don't know about you, but when I try and open a, you know, terabyte-large CSV file, Excel catches fire and my computer starts to smell like burning metal.
So there's definitely a tooling story.
There's definitely a way of thinking about these things.
On some level, I feel like I'm being backed into becoming something of a data person just based on necessity.
The world changes all the time, whether we want it to or not.
I can't imagine how much work you're doing analyzing bills; they're really detailed and complicated.
I mean, we do this ourselves: we import the data into a Postgres database and run queries on the billing.
So I'm sure you can do that, and maybe you would benefit from it.
But actually, I've latched onto a topic that I like very much, which is trying to
guess the underlying architecture of some of the Amazon services. I've had some good fun
trying to do this. Let me give a couple of examples first. So for example, there's DocumentDB, right?
This MongoDB compatible service.
And there was some discussion on Hacker News
on how it's built,
because it's not using MongoDB source code
for legal reasons, for licensing reasons.
So what is it built on?
And I claim from the very beginning
that it's written on top of Postgres.
First of all, because I know this could be done, I wrote the open source software called ToroDB
that is essentially MongoDB on top of Postgres. So I know it can be done very well. But on top
of that, I found on the documentation some hints that are clearly, clearly Postgres characteristics,
like identifiers at the database level cannot be more
than 63 characters, or that the null character, the null UTF-8 character, cannot be represented.
So anyway, Amazon has never confirmed nor denied it, but I know, and I claim publicly, that
DocumentDB is based on Postgres, probably Aurora, but essentially Postgres technology behind the
scenes. Same applies to DynamoDB. This is more public, I would say, but it's also a relational database
under the hood. I mean, it's like several HTTP routing and sharding layers, but under the hood is a
modified MySQL. Could have been Postgres, whatever, but it's still a relational database also under
the hood for each of those shards. So it's a fun exercise for me trying to guess how the services
work. I've also done exercises into how serverless Postgres works, etc. So I haven't dug deep into the billing system
and what technologies are under there, and I advise no one to do that.
But okay, let me give you my bet. My bet is that there is a relational database under the hood,
I mean, or clusters of relational databases, probably sharded by customer or by groups of customers. And because we know that Amazon relied a lot, at the beginning,
on Oracle, and then they migrated, or so they claim, everything to
Postgres, I'm also going to claim it's somehow Postgres, probably Aurora Postgres, under the
hood. But I have no facts to sustain this claim. Just a wild guess.
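The 63-character detail Álvaro spotted is easy to verify against a plain Postgres instance, if you want to see the fingerprint yourself; a tiny sketch (connection details are placeholders):

```python
# Postgres identifiers are capped at 63 bytes (NAMEDATALEN - 1); longer names are silently
# truncated, which is one of the fingerprints Álvaro describes noticing in DocumentDB's docs.
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres host=localhost")  # placeholder connection
with conn, conn.cursor() as cur:
    # Casting an 80-character string to the internal "name" type shows the truncation.
    cur.execute("SELECT length(repeat('a', 80)::name)")
    print(cur.fetchone()[0])  # prints 63 on a stock Postgres instance
```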
Come to find out someone at the billing system is sitting in a room surrounded by just
acres of paper. And they're listening to this episode going, like, oh my god, we can use computers.
Yeah, it'll be great.
That'll be great.
No, I'm kidding. They're very sharp people over there. I have no idea what it is under the hood.
It's one of those areas where I just can't fathom having to process data at those volumes.
And it's the worst kind of data.
Because if they drop the ball and don't bill people for usage, no one in the world externally is going to complain.
No one is happy about the billing system.
It's, oh, good, you're here to shake me down for money.
Glorious.
The failure modes are all invisible or to the outside world's benefit.
But man, does that sound like a fun problem.
Yeah, yeah, absolutely.
And very likely they run reports on Redshift.
Oh, yeah.
I really want to thank you
for being so generous with your time.
If people want to learn more
about what you're up to
and see what other horrifying monstrosities
you've created on top of my dumb ideas,
where can people find you?
Okay, so I'm available in all usual channels.
People can find me mainly on my website.
It's very easy, aht.es.
Those are my initials.
And that's where I keep track
of all my public speaking,
talks, videos, blog posts, et cetera.
So aht.es.
But I'm also easy to find on Twitter, LinkedIn,
and my company's website, ongress.com.
Feel free to ping me anytime.
I'm really open to receive feedback, ideas,
especially if you have a crazy idea
that I may be interested in, let me know.
And we will, of course, put links to that in the show notes.
Thank you so much for your time.
I appreciate it.
I'm excited to see what you come up with next.
Okay, I may have some ideas,
but no, thank you very much for hosting me today.
It's been a pleasure.
Álvaro Hernandez, founder at OnGres.
I'm cloud economist, Corey Quinn,
and this is Screaming in the Cloud.
If you've enjoyed this podcast,
please leave a five-star review
on your podcast platform of choice.
Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice,
along with an angry, insulting comment that goes way deep into the weeds of what database system your podcast platform of choice is most likely using. If your AWS bill keeps rising and your blood pressure is doing the same,
then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business and we get to the point.
Visit duckbillgroup.com to get started. This has been a HumblePod production.
Stay humble.