Screaming in the Cloud - Keeping Life on the Internet Friction Free with Jason Frazier
Episode Date: February 16, 2022
About Jason
Jason Frazier is a Software Engineering Manager at Ekata, a Mastercard Company. Jason's team is responsible for developing and maintaining Ekata's product APIs. Previously, as a developer, Jason led the investigation and migration of Ekata's Identity Graph from AWS ElastiCache to Redis Enterprise's Redis on Flash, which brought an average savings of $300,000/yr.
Links:
Ekata: https://ekata.com/
Email: jason.frazier@ekata.com
LinkedIn: https://www.linkedin.com/in/jasonfrazier56
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Today's episode is brought to you in part by our friends at Minio,
the high-performance Kubernetes native object store that's built for the multi-cloud,
creating a consistent data storage layer for your public cloud instances,
your private cloud instances, and even your edge instances, depending upon what the heck you're defining those as, which depends probably on where you work.
Getting that unified is one of the greatest challenges facing developers and architects today.
It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload,
and the footprint to run anywhere. And that's exactly what Minio offers. With superb read speeds in excess of 360 gigs and a 100 megabyte binary that doesn't eat all the data you've got
on the system, it's exactly what you've been looking for. Check it out today at
min.io slash download and see for yourself.
That's min.io slash download, and be sure to tell them that I sent you.
This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing
DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be
used as a pivot point to get
access into your environment. They've also gone in depth on a bunch of other approaches to
how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I
sent you. That's S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.
Welcome to Screaming in the Cloud. I'm Corey Quinn. This one is a bit fun because it's a
promoted episode sponsored by our friends at Redis, but my guest does not work at Redis,
nor has he ever. Jason Frazier is a software engineering manager at Ekata, a Mastercard company, which I feel
like that should have some sort of music backstopping into it, just because large companies always
have that magic sheen on it.
Jason, thank you for taking the time to speak with me today.
Yeah, thanks for inviting me.
Happy to be here.
So other than the obvious assumption based upon the fact that Redis is kind enough to
be sponsoring this episode, I'm going to assume that you're a Redis customer at this point. But I'm sure we'll get there.
Before we do, what is Ekata? What do you folks do?
So the whole idea behind Ekata is, I mean, if you go to our website, our mission statement is,
we want to be the global leader in online identity verification. What that really means is in a more increasingly digital world, when anyone
can put anything they want into any text field they want, especially when purchasing anything
online. You really think people do that? Just go on the internet and tell lies?
I know. It's shocking to think that someone could lie about who they are online. But that's sort of
what we're trying to solve specifically in the payment space. Like,
I don't know, I want to buy a new pair of shoes online, and I enter in some information. Am I
really the person that I say I am when I'm trying to buy those shoes to prevent fraudulent
transactions? That's really the basis our company goes on: trying to reduce fraud
globally. That's fascinating, just from the perspective of you take a look at cloud vendors,
the space that I tend to hang out in, and a lot of their identity verification, whether this person is, in fact, who they claim to be, is put back onto the payment providers.
Take Oracle Cloud, which I periodically beat up, but also really enjoy aspects of their platform on, where to get to their always free tier, you have to provide a credit card.
Now, they'll never charge you anything until you affirmatively upgrade the account. But then what do you need my card for? Ah,
identity and fraud verification. So it feels like the way that everyone else handles this is, ah,
we'll make it the payment networks' problem. Well, you're now owned by Mastercard. So I sort of
assume you are what the payment networks in turn use to solve that problem. Yeah. So basically one
of our flagship products
and things that we return is sort of like a score from zero to 400 on how confident we are that
this person is who they are. And it's really about helping merchants help determine whether
they should either approve or deny or forward on a transaction to like a manual review agent,
as well as there's also another use case that's even more popular,
which is just like account creation.
As you can imagine, there's lots of bots
on everyone's favorite app or website
and things like that,
or customers offer a promotion,
like sign up and get $10.
Oh, I could probably get $10,000
if I make a thousand random accounts
and then I'll sign up with them.
But making sure that those accounts are legitimate accounts helps prevent that sort of promo abuse and
things like that. So it's also not just transactions. It's also like account openings and
stuff. Make sure that you actually have real people on your platform. The thing that always
annoyed me was the way that companies decide, oh, we're going to go ahead and solve that problem
with a captcha on it. It's no, no, I don't want to solve machine learning puzzles for Google for free
in order to sign up for something.
I am the customer here.
You're getting it wrong somewhere.
So I assume, given the fact that I buy an awful lot of stuff online,
but I don't recall ever seeing anything branded with Ekata,
that you do this behind the scenes.
It is not something that requires human interaction, by which I mean friction.
Yeah, for sure.
Yeah, yeah, it's behind the scenes.
That's exactly what I was about to segue to: friction, and trying to provide
a frictionless experience for users. In the US, it's not as common, but when you go into Europe
or anything like that, it's fairly common to get confirmations on transactions and things like that.
You may have to, I don't know, get a code texted to you and enter that online to basically say like, yes, I actually received this. The reason companies do that is for that extra bit of security
and assurance that that's actually legitimate. But obviously companies would prefer
not to have to do that because, I don't know, if I'm trying to buy something and this website makes me do something extra, but that site doesn't make me do anything extra, I'm probably going to go with
that one because it's just more convenient for me because there's less friction there.
You're obviously limited in how much you can say about this, just because it's,
here's a list of all the things we care about means that, great, you've given me a roadmap
to have things to wind up looking at. But do you have an example or two of the sort of
data that you wind up analyzing to figure out the likelihood that I'm a human versus a robot?
Yeah, for sure. I mean, it's fairly common across most payment forms. So things like you enter in
your first name, your last name, your address, your phone number, your email address. Those are
all identity elements that we look at. We have two data stores. We have our identity graph and
our identity network. The identity graph is what you would probably think of it if you think of a
web of a person and their identity,
like you have a name that's linked to a telephone, and that name is also linked to an address, but that address used to have previous people living there, so on and so forth. So the various,
what we just call identity elements are the various things we look at. And it's fairly
common on any payment form. I'm sure like if you buy something on Amazon versus eBay or whatever,
you're probably going to be asked,
what's your name?
What's your address?
What's your email address?
What's your telephone?
It's one of the most obnoxious parts
of buying things online
from websites you haven't been to before.
It's one of the genius ideas
behind Apple Pay
and the other centralized payment systems.
Oh yeah, they already know who you are.
Just click the button and it's done.
Yeah, even something as small as that.
I mean, it gets a little bit easier
with like form autocompletes and stuff.
Like, oh, just type J and it'll autocomplete everything for
me. That's not the worst of the world, but it is still some amount of annoyance and friction.
So as I look through all this, it seems like one of the key things you're trying to do,
since it's in line with someone waiting while something is spinning in their browser,
this needs to be quick. It also strikes me that this is likely not something that you're going to hit the same
people trying to identify all the time.
If so, that is a sign of fraud.
So it doesn't really seem like something can be heavily cached.
Yet you're using Redis, which tells me that your conception of how you're using it might
be different than the mental space that I put Redis into when I'm thinking about where
in this ridiculous architecture diagram is the Redis part going to go? Yeah, I mean, when everyone thinks of Redis, I mean, even before we went down this path, you always think of, oh, I need a cache, I'll just stuff it in Redis, just use Redis as a cache here and there, some small, I don't know, a few tens or hundreds of gigabytes, maybe; spin that cache up, and you're good. But we actually use Redis as
our primary data store for our identity graph, specifically for the speed that we can get.
Because if you're trying to look for a person, like let's say you're buying something for
your brother, how do we know if that's true or not? Because you have this name, you're trying
to send it to a different address. How does that make sense? But how do we get from Corey to an address?
Like, oh, maybe you used to live with your brother.
It's funny you picked that as your example.
My brother just moved to Dublin.
So it's the whole problem.
How do I get this from me to someone,
different country, different names, et cetera?
And yeah, how do you wind up mapping that
to figure out the likelihood
that it is either credit card fraud
or somebody actually trying to be, you know,
a decent brother
for once in my life. Yeah. So I mean, how it works is how you'd imagine. You start at some entry
point, which would probably be your name. Start there and say, can we match this to this person's
address that you believe you're sending to? And we can say, oh, you have a person-person relationship
like he's your brother. So it maps to him, which we can then get his address and say, oh, here's
that address. That matches what you're trying to send it to. Hey, this makes sense because you have a
legitimate reason to be sending something there. You're not just sending it to some random address
out in the middle of nowhere for no reason. Or the dropshipping scams or brushing scams.
That's the thing is every time you think you've seen it all, all you have to do is look at fraud.
That's where the real innovation seems to be happening, no matter how you slice it.
Yeah, it's quite an interesting space. I always like to say it's one of those things where
if you had the human element in it, it's not super easy, but it's generally easy to tell,
okay, that makes sense, or oh, no, that's just complete garbage. But trying to do it at scale
very fast in a general case becomes an actual substantially harder problem. It's one of those
things that people can probably do fairly well. I mean, that's why we still have manual reviews
and things like that. But trying to do it automatically on or just with computers is
much more difficult. I scammed a company out of 20 bucks is not the problem you're trying to avoid. It's the, okay, I just did that 10 million times. And now we have a different problem.
Yeah, exactly. I mean, one of the biggest losses for a lot of companies is fraudulent
transactions and chargebacks. Usually in the case of e-commerce companies, or even especially
nowadays, where as you can imagine, more people are moving to a more online world and doing
shopping online and things like that. So as more people move to online shopping, some companies
are always going to get some amount
of chargebacks on their non-fraudulent transactions. But when it happens at scale, that's when you start seeing major losses. Because not only are you issuing a chargeback, you probably sent out some product, so you're now out some physical product as well. So it's almost kind of like a
double whammy. So as I look through all this, I tend to always view Redis as more or less a key-value store.
Is that still accurate? Is that how you wind up working with it?
Or has it evolved significantly past that to the point where you can now do relational queries against it?
Yeah, so we do use Redis as a key value store because Redis is just a traditional key value store, very fast lookups.
When we first started building out our identity graph, as you can imagine, you're trying to model people to telephone addresses, your first thought is,
hey, this sounds a whole lot like a graph. That's sort of what we did quite a few years ago is let's
just put it in some graph database. But as time went on, and as it became much more important to
have lower and lower latency, we really started thinking about like, we don't really need all
the nice and shiny things that like a graph database or some sort of graph technology really offers you.
All we really need to do is I need to get from point A to point B, and that's it.
I inherited a graph database.
What's the first thing I need to do?
Well, spend six weeks in school trying to figure out exactly what the hell a graph database is, because they're challenging to wrap your head around at the best of times.
Then it just always seemed overpowered for a lot of, I don't want to say simple use cases, what
you're doing is not simple, but it doesn't seem to be leveraging the higher order advantages
that a graph database tends to offer.
It added a lot of complexity in the system.
And me and one of our senior principal engineers who's been here for a long time, we always
have a joke.
If you search our GitHub repository for, we'll say, kindly worded commit
messages, you can see a very large correlation of those types of commit messages to all the commits
to try and use a graph database from multiple years ago. It was not fun to work with, just
added too much complexity, and we just didn't need all that shiny stuff. So that's how we really just
took a step back. Like we really need to do it this way. We ended up effectively flattening the
entire graph into an adjacency list. So a key is basically some UUID to an entity. So Corey,
you'd have some UUID associated with you and the value would be whatever your information would be, as well as other UUIDs to links to the other entities.
So from that first retrieval, I can now unpack it and, oh, now I have a whole bunch of other UUIDs.
I can then query on to get that information, which then have more IDs associated with it.
That's more or less how we do our graph traversal and our graph queries.
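To make that flattened adjacency-list idea concrete, here's a minimal sketch in Python using redis-py. The key layout, field names, and traversal depth are illustrative assumptions rather than Ekata's actual schema: each entity is a UUID key whose small JSON value carries its attributes plus the UUIDs it links to, so a graph traversal is just repeated GETs.

```python
# A rough sketch (hypothetical schema) of a flattened identity graph in Redis:
# each entity is a UUID key, and its JSON value holds attributes plus the UUIDs
# of linked entities, so traversal is just repeated GETs.
import json
import uuid

import redis  # assumes the redis-py package is installed

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def put_entity(entity_id: str, attrs: dict, links: list) -> None:
    """Store one node of the adjacency list as a small JSON string."""
    r.set(entity_id, json.dumps({"attrs": attrs, "links": links}))

def traverse(start_id: str, max_hops: int = 3) -> list:
    """Walk outward hop by hop; each value unpacks into more UUIDs to fetch."""
    seen, frontier, nodes = set(), [start_id], []
    for _ in range(max_hops):
        next_frontier = []
        for entity_id in frontier:
            if entity_id in seen:
                continue
            seen.add(entity_id)
            raw = r.get(entity_id)
            if raw is None:
                continue
            node = json.loads(raw)
            nodes.append(node)
            next_frontier.extend(node["links"])
        frontier = next_frontier
    return nodes

# Usage: link a person to an address, then traverse outward from the person.
person, address = str(uuid.uuid4()), str(uuid.uuid4())
put_entity(address, {"type": "address", "city": "Dublin"}, [person])
put_entity(person, {"type": "person", "name": "Corey"}, [address])
print(traverse(person))
```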
One of the fun things about doing this sort of interview dance on the podcast, as long as I have,
is you start to pick up what people are saying by virtue of what they don't say. Earlier,
you wound up mentioning that you often use Redis for things like tens or hundreds of gigabytes,
which sort of leaves in my mind the strong implication that you're talking about something significantly larger than that. Can you disclose the scale of
data we're talking about here? Yeah, so we use Redis as our primary data store for our identity
graph and also for, or soon to be, for our identity network, which is our other database.
But specifically for our identity graph scale we're talking about, we do have some compression
added on there. But if you say uncompressed, it's about 12 terabytes of data that's compressed
with replication into about four. That's a relatively decent compression factor,
given that I imagine we're not talking about huge data sets.
Yeah, so this is actually basically driven directly by cost. If you need to store less
data, then you need less memory.
Therefore, you need to pay for less.
Sorry, just once again to shore up my longtime argument that when it comes to cloud, cost and architecture are in fact the same thing.
Please continue by all means.
I would be lying if I said that we didn't do weekly slash monthly reviews of costs.
Where are we spending costs in AWS?
How can we improve costs?
How can we cut down on costs?
How can we store less?
You are singing my song.
It is a constant discussion. But yeah, so we use ZStandard compression,
which was developed at Facebook. It's a dictionary-based compression.
And the reason we went for this is, I mean, if I say I want to compress a Word document down,
you can get very, very, very high levels of compression. It exists. It's not that interesting. Everyone does it all the time. But with this, we're talking about, so in that basically four or so terabytes
of compressed data that we have, it's something around four to four and a half billion keys and
values. And so in that, we're talking about each key value only really having anywhere between 50
and 100 bytes. So we're
not compressing very large pieces of information. We're compressing very small 50 to 100 byte JSON
values that we have UUID keys and JSON strings stored as values. So we're compressing these 50
to 100 byte JSON strings with around 70 to 80% compression. I mean, that's using Zstandard with a custom dictionary, which probably gave us the
biggest cost savings of all.
If you can reduce your data set size by 60, 70%, that's huge.
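As a rough illustration of that dictionary-based approach, here's a small sketch using the Python zstandard bindings. The sample records and dictionary size are made up; the point is that a trained dictionary is what makes values of only 50 to 100 bytes compress well, where generic compression barely helps at that size.

```python
# A minimal sketch of dictionary-based Zstandard compression on tiny JSON values.
# The sample data and 16 KB dictionary size are illustrative.
import json

import zstandard as zstd  # assumes the 'zstandard' package is installed

# Train a dictionary on a sample of representative values.
samples = [
    json.dumps({"id": f"uuid-{i}", "type": "person", "links": [f"uuid-{i + 1}"]}).encode()
    for i in range(10_000)
]
dictionary = zstd.train_dictionary(16_384, samples)

compressor = zstd.ZstdCompressor(dict_data=dictionary)
decompressor = zstd.ZstdDecompressor(dict_data=dictionary)

value = json.dumps({"id": "uuid-42", "type": "person", "links": ["uuid-43"]}).encode()
compressed = compressor.compress(value)
print(len(value), "->", len(compressed), "bytes")  # small values shrink far more with the dictionary
assert decompressor.decompress(compressed) == value
```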
Did you start off doing this on top of Redis or was this an evolution that eventually got
you there?
It was an evolution over time.
We were formerly Whitepages.
I mean, White Pages started back in the late 90s.
It really just started off as a...
You were a very early adopter of Redis.
Yeah, at that point, we had a time machine and started using it before it existed.
Always a fun story.
Recruiters seem to want that all the time.
Yeah, so when we first started, I mean, we didn't have that much data.
It was basically just one provider that gave us some amount of data.
So it was kind of just a, we just need to start something quick, get something going.
And so, I mean, we just did what most people do,
just do the simplest thing,
just stuff it all in a Postgres database
and call it good.
Yeah, it was slow,
but hey, it was back a long time ago.
People were kind of okay with waiting a little bit.
The world moved a bit slower back then.
Yeah, everything was a bit slower.
No one really minded too much.
The scale wasn't that large,
but business requirements always change over time and they evolve. And so to meet those ever-evolving
business requirements, we moved from Postgres and where a lot of the fun commit messages that I
mentioned earlier can be found is when we started working with Cassandra and Titan. That was before
my time, before I had started. But from what I understand, that was a very fun time. But then from there, that's when we really kind of just took a step
back and just said, there's so much stuff that we just don't need here. Let's really think about
this and let's try to optimize a bit more. We know our use case, why not optimize for our use case?
And that's how we ended up with the flattened graph storage, stuffing it into Redis because
everyone thought of Redis as a cache, but everyone also knows that, why is it a cache?
Because it's fast.
We need something that's very fast.
I still conceptualize it as an in-memory data store,
just because when I turned on disk persistence model
back in 2011, give or take,
it suddenly started slamming the entire data store to a halt
for about three seconds every time it did it.
It was, what's this piece of crap here?
And it was, oh, yeah, it turns out there was a regression
on Xen, which is what AWS was using as a hypervisor
back then. And, oh, yeah,
so fork became an expensive call. It took
forever to wind up running. So, oh, the obvious
lesson we take from this is, oh, yeah, Redis
is not designed to be used with disk persistence.
Wrong lesson to take from the behavior,
but it cemented, in my mind at least,
the idea that this is something that we
tend to use only as a in-memory store.
It's clear that the technology has evolved.
And in fact, I'm super glad that Redis threw you my direction to talk to about this stuff.
Because until talking to you, I was still, I've got to admit, sort of in the position of thinking of it still as an in-memory data store.
Because the fact that Redis says otherwise, well, they're envisioning it being something else.
Well, okay, marketers are going to market.
You're a customer.
It's a lot harder for me to talk smack about your approach to this thing when I see you doing it for, let's be serious here, what is a very important use case.
If identity verification starts failing open and everyone claims to be who they say they are, that's something that's visible from orbit when it comes to the macroeconomic effect. Yeah, exactly. It's actually funny because before
we moved to primarily just using Redis, before going to fully Redis, we did still use Redis,
but we used ElastiCache. We had it loaded into ElastiCache, but we also had it loaded into DynamoDB
as sort of a, I don't want this to fail because we weren't comfortable with actually using Redis as a primary database.
So we used to use ElastiCache
with a fallback to DynamoDB just in that off chance,
which sometimes it happened, sometimes it didn't.
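That belt-and-suspenders setup might look something like the sketch below: read from the ElastiCache copy first, and fall back to the DynamoDB copy on a miss or an error. The endpoint, table name, and key schema here are hypothetical, not Ekata's actual configuration.

```python
# A hedged sketch of the cache-first, DynamoDB-fallback read path described above.
# Endpoint, table name, and key schema are hypothetical.
import boto3  # assumes AWS credentials are configured
import redis

cache = redis.Redis(host="my-elasticache-endpoint", port=6379, decode_responses=True)
table = boto3.resource("dynamodb").Table("identity-graph")  # hypothetical table

def get_entity(entity_id: str):
    # First choice: the fast in-memory copy in ElastiCache.
    try:
        value = cache.get(entity_id)
        if value is not None:
            return value
    except redis.RedisError:
        pass  # cache unavailable; fall through to the durable copy

    # Fallback: DynamoDB is slower, but the query still succeeds.
    item = table.get_item(Key={"entity_id": entity_id}).get("Item")
    return item["value"] if item else None
```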
But that's when we basically just went searching
for new technologies.
And that's actually how we landed on Redis on Flash,
which kind of breaks the whole idea of Redis as an in-memory database, to where it's Redis, but it's not just an in-memory database. You also have flash-backed storage.
So you'll forgive me if I combine my day job with this side project of mine where I fix the
horrifying AWS bills for large companies. My bias as a result is to look at infrastructure
environments primarily through the lens of the AWS bill. And, oh, great, go ahead and use an enterprise offering that someone else runs, because sure, it might cost more money, but it's not showing up on the AWS bill. Therefore,
my job is done. Yeah, it turns out that doesn't actually work, or the answer to every AWS billing
problem is to migrate to Azure or to GCP. It turns out that that doesn't actually solve the problem
that you would expect. But you're obviously an enterprise customer of Redis. Does that data live
in your AWS account? Is it something you're using as their managed service and throwing over the
walls so it shows up as data transfer on your side? How is that implemented? I know they've
got a few different models. There's a couple of aspects onto how we're actually built. I mean,
so like when you have ElastiCache, you're just billed for your, I don't know, whatever nodes
you're using, cache.R5, whatever they are. I wish most people were using things that modern,
but please continue. Yeah. So you're basically just billed for whatever ElastiCache nodes you
have. You have your hourly rate. I don't know, maybe you might reserve them. But with Redis
Enterprise, the way that we're billed, there's two aspects. One is, well,
the contract that we signed that basically allows us to use their technology. We go with a managed service, a managed solution. So there's some amount that we pay them directly within some contract,
as well as the actual nodes themselves that exist in the cluster. And so basically the way that this
is set up is we effectively have a sub-account within our AWS account that Redis Labs has, or not Redis Labs,
Redis Enterprise has access to, which they deploy directly into and effectively using VPC peering.
That's how we allow our applications to talk directly to it. So we're billed directly for the actual nodes of the cluster, which are i3.8xlarge, I believe. They basically just
run EC2 instances. All of those instances, those exist on our bill. We get billed for them. We pay
for them. It's just basically some sub-account that they have access to that they can deploy into.
So we get billed for the instances of the cluster as well as whatever we pay for our
enterprise contract. So there's sort of two aspects to the actual billing of it. They are a cloud provider that provides surprisingly high performance cloud compute at a price that, well, sure, they claim it is better than AWS's pricing.
And when they say that, they mean that it's less money.
Sure, I don't dispute that.
But what I find interesting is that it's predictable.
They tell you in advance on a monthly basis what it's going to cost.
They have a bunch of advanced networking features.
They have 19 global
locations and scale things elastically, not to be confused with openly, which is apparently
elastic and open. They can mean the same thing sometimes. They have had over a million users.
Deployments take less than 60 seconds across 12 pre-selected operating systems. Or if you're one
of those nutters like me,
you can bring your own ISO
and install basically any operating system you want.
Starting with pricing as low as $2.50 a month
for Vultr Cloud Compute,
they have plans for developers and businesses of all sizes,
except maybe Amazon,
who stubbornly insists on having something of the scale
all on their own.
But you don't have to take
my word for it with an exclusive offer for you. Sign up today for free and receive $100 in credits
to kick the tires and see for yourself. Get started at vultr.com slash morning brief.
That's v-u-l-t-r dot com slash morning brief. So it's easy to sit here as an engineer, and believe me, having been one for
most of my career, I fall subject to this bias all the time, where it's, oh, you're going to
charge me a management fee to run this thing? Oh, that's ridiculous. I can do it myself instead,
because at least when I was learning in my dorm room, it was always a, well, my time is free,
but money is hard to come by. And shaking
off that perspective as my career continued to evolve was always a bit of a challenge for me.
Do you ever find yourself or your team drifting in the idea toward the direction of, well,
what are we paying for Redis Enterprise for? We could just run it ourselves with the open
source version and save whatever it is that they're charging on top of that.
Before we landed on Redis on Flash, we had that same thought, like,
why don't we just run our own Redis? And the decision to that is, well, managing such a large
cluster that's so important to the function of our business, like you effectively would have needed
to hire someone full-time to just sit there and stare at the cluster the whole time just to operate
it, maintain it, make sure things are running smoothly. And it's something that we just, we made a decision that, no, we're just going to,
we're going to go with a managed solution. It's not easy to manage and maintain clusters of that
size, especially when they're so important to business continuity. From our eyes, it was,
it was just not worth the investment for us to try and manage it ourselves, so we went with the fully
managed solution. Even when we talk about it, it's one of those, well, everyone talks about the wrong side
of it first.
Oh, it's easier if things are down, if we wind up being able to say, oh, we have a ticket
open rather than I'm on the support forum and waiting for people to get back to me.
There's a defensibility perspective.
We all just sort of sidestep past the real truth of it of, yeah, the people who are best
in the world at running and building these things
are right now working on the problem when there is one.
Yeah, they're also probably the best in the world at trying to solve what's going on.
Yeah, because that is what we're paying them to do. Oh, right.
People don't always volunteer for for-profit entities. I keep forgetting
that part of it. Yeah, I mean, we've had some very, very
fun production outages that just randomly happened because
to our knowledge, we would just look at it like, I have no idea what's going on.
And working with their support team, their DevOps team, honestly, it was a good one-week troubleshooting when we were evaluating the technology.
We accidentally halted the database for seemingly no reason.
And we couldn't possibly figure out what was going on.
We kept talking to their DevOps team. They're saying, oh, we see all these writes going on for some reason. We're like, we're not sending any writes. Why are there writes? And that was
a whole back and forth for almost a week trying to figure out what the heck was going on.
And it happened to be like a very subtle case in terms of like the, how the keys and values
are actually stored between RAM and flash and how it might swap in and out of flash. And all the way down to that level where I want to say we probably talked to
their DevOps team at least two to three times like, can you just explain this to me? Sure,
why does this happen? I didn't know this was a thing, so on and so forth. There's definitely
some things that are fairly difficult to try and debug, which definitely helps having that
enterprise level solution.
Well, that's the most valuable thing in any sort of operational experience, where,
okay, I can read the documentation and all the other things, and it tells me how it works.
Great. The real value of whether I trust something in production is whether or not I know how it
breaks. Because the one thing you want to hear when you're calling someone up is, oh yeah,
we've seen this before. This is what you do to fix it. The worst thing in the world is, oh, that's interesting. We've never seen that before
because then, oh dear Lord, we're off in the mists of trying to figure out what's going on here
while production's down. Yeah. Kind of like, what does this database do in terms of what we do?
This is what we store our identity graph in. This has the graph of people's information. If we're
trying to do identity verification for transactions or
anything for any of our products, I mean, we need to be able to query this database. It needs to be
up. We have a certain requirement in terms of uptime where we want at least like four nines
of uptime. So we also want a solution that, hey, even if it wants to break, don't break that bad. There's a difference between, oh, a node failed, and okay,
like we're good in 10-20 seconds, versus, oh, a node failed, you lost data, you need to start
reloading your data set, or you can't query this anymore. There's a very large difference between
those two. A little bit, yeah. That's also a great story to drive things across. Like really,
what is this going to cost us if we pay for the enterprise version? Great. Is it going to be
more than some extortionately large number? Because if we're down for three hours in the
course of a year, that's what we owe our customers back for not being able to deliver. So it seems to
me this is kind of a no-brainer for things like that. Yeah, exactly. And that's part of the reason,
I mean, for a lot of the things we do at Ekata, we usually go with enterprise level. And that's, it's really for
that support factor and helping reduce any potential downtime for what we have, because,
well, if we don't consider ourselves comfortable or expert level in that subject, I mean, then
yeah, if it goes down, that's, that's terrible for our customers. I mean, it's needed for literally
every single query that
comes through us. I did want to ask you, but you keep talking about the database and the cluster.
That seems like if you have a single database or a single cluster that winds up being responsible
for all of this, that feels like the blast radius of that thing going down must be enormous. Have
you done any research into breaking that out into smaller databases? What is it that's driven you
toward this architectural pattern? Yeah, so for right now, so we have actually three regions we're
deployed into. We have a copy of it in US West in AWS. We have one in EU Central 1, and we also
have one in AP Southeast 1. So we have a complete copy of this database in three separate regions,
as well as we're spread across all the available availability zones for that region.
So we try and be as multi-AZ as we can within a specific region.
So we have thought about breaking it down, but having high availability,
having multiple replication factors, having also it stored in multiple data centers
provides us at least a good level of comfortability.
Specifically in our US cluster,
with a lot of the cost savings that we got, we actually have two. We have one that literally sits idle 24/7 that we just call our backup and our standby,
where it's ready to go at a moment's notice. Thankfully, we haven't had to use it since,
I want to say, its creation about a year and a half ago. But it sits
there in that doomsday scenario. Oh my gosh, this cluster literally cannot function anymore.
Something crazy, catastrophic happened. And we can basically hot swap back into another production
ready cluster as needed, if needed. Because the really important thing is that if we broke it up
into separate databases, if one of them goes down,
that could still fail your entire query. Because what if that's the database that held
your address? We could still query you, but when we go to try and get your address, well,
there your traversal just died because you can no longer get that. So even trying to break it up
doesn't really help us too much. We can still fail the entire traversal query.
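From the application side, that hot-swap posture could be sketched roughly like this: a thin client wrapper that serves from the standby endpoint if the primary cluster stops responding. The endpoints and timeouts are assumptions for illustration, and the swap described above is an operational decision rather than an automatic per-call fallback.

```python
# A hypothetical sketch of falling back from a primary cluster to an idle standby.
# Endpoints and timeouts are made up; in practice the swap is a deliberate
# operational cutover rather than an automatic per-call fallback.
import redis

def _client(host: str) -> redis.Redis:
    return redis.Redis(host=host, port=6379, socket_timeout=0.05, decode_responses=True)

primary = _client("redis-primary.example.internal")
standby = _client("redis-standby.example.internal")

def get_entity(key: str):
    # Serve from the primary; if it's unreachable, the standby keeps queries flowing.
    try:
        return primary.get(key)
    except (redis.ConnectionError, redis.TimeoutError):
        return standby.get(key)
```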
Yeah, which makes an awful lot of sense. But again, to be clear, you've obviously put thought
into this. This goes way beyond me hearing something in passing and saying, hey, you
considered this thing. Let's be very clear here. That is the sign of a terrible junior consultant.
Well, it sounds like what you built sucked if you considered building something that didn't suck.
Oh, thanks, professor. Really appreciate your pointing that out.
It's one of those useful things.
It's like, oh, wow, we've been doing this for,
I don't know, many, many years.
It's like, oh, wow, yeah.
Haven't thought about that one yet.
So it sounds like you're relatively happy
with how Redis has worked out for you
as the primary data store.
If you were doing it all again from scratch,
would you make the same technology selection there?
Or would you go in a different direction?
Yeah, I think I'd make the same decision. I mean, we've been using Redis on Flash for,
at this point, three, maybe coming up to four years at this point. There's a reason we keep
renewing our contract and just keep continuing with them is because, to us, it just fits our
use case so well. We very much choose to continue going in with this direction and this technology.
What would you have them change as far as feature enhancements and new options being enabled there? Because remember, asking
them right now in front of an audience like this puts them in a situation where they cannot possibly
refuse. Please, how do you improve Redis from where it is now? I like how you think. That's
a fair way to describe it. There's a couple of things for optimizations that can always be done.
Specifically with Redis on Flash, there's some issue we had with storing binary keys that, to my knowledge, hasn't necessarily been completed yet, that
basically prevents us from storing as binary, which has some amount of benefit because well,
binary keys require less memory to store. When you're talking about 4 billion keys,
even if you're just saving 20 bytes a key, like you're talking about potentially hundreds of
gigabytes of savings once it all adds up. It adds up pretty quick. So that's probably one of the big things that
we've been in contact with them with about fixing that hasn't gotten there yet. The other thing is
there's a couple of random gotchas that we had to learn along the way. It does add a little bit of
complexity in our loading process. Effectively, when you first write a value into the database, it'll write it to RAM. But then once it gets flushed to Flash, the database effectively asks itself, does this value already exist in Flash? Because once it's first written, it's just in RAM, so it does a write to put it into Flash and then evicts it out of RAM. That sounds pretty innocent, but if it already exists in Flash, when it goes to evict it, it says,
hey, I need to evict this. Does it already exist in Flash? Yep. Okay, just chuck it away. It
already exists. We're good. It sounds pretty nice, but this is where we accidentally halted our
database. Once we started putting a huge amount of load on the cluster, our general throughput on peak day
is somewhere in the order of 160,000 to 200,000
Redis operations per second.
So you're starting to think of,
hey, you might be evicting 100,000 values per second into Flash.
You're talking about an added 100,000 write operations
per second into your cluster,
and that accidentally halted our database.
So the way we actually go around this is once we write our data store, we actually basically read
the whole thing once. Because if you read every single key, you pretty much guarantee to cycle
everything into Flash, so it doesn't have to do any of those writes. For right now, there is no
option to basically say that if I write... For our use case, we do very little writes except for
upfront. So it'd be super nice for our use case if we could say, hey, our write operations, no,
I want you to actually do a full write through to Flash. Because that would effectively cut our
entire database prep in half. We no longer have to do that read to cycle everything through.
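The warm-up pass he describes might look roughly like the sketch below: after the bulk load, SCAN every key and GET it once so each value has already been cycled out to flash before production traffic arrives. The batch size and pipelining are illustrative choices, not Ekata's actual loader.

```python
# A rough sketch of the post-load warm-up read: touch every key once so later
# traffic doesn't trigger a flood of RAM-to-flash eviction writes.
import itertools

import redis  # assumes the redis-py package is installed

r = redis.Redis(host="localhost", port=6379)

def warm_flash(batch_size: int = 1000) -> int:
    """Read every key once; the reads themselves cycle the values into flash."""
    keys_read = 0
    key_iter = r.scan_iter(count=batch_size)
    while True:
        batch = list(itertools.islice(key_iter, batch_size))
        if not batch:
            break
        pipe = r.pipeline(transaction=False)
        for key in batch:
            pipe.get(key)
        pipe.execute()  # results are discarded; only the reads matter
        keys_read += len(batch)
    return keys_read

print("warmed", warm_flash(), "keys")
```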
Those are probably the two big things and one of the biggest gotchas that we ran into
that maybe isn't so known. I really want to thank you for taking the time to speak with me
today. If people want to learn more, where can they find you? And I will also theorize wildly
that if you're like basically every other company out there right now, you're probably hiring on
your team too. Yeah, I very much am hiring. I'm actually hiring quite a lot right now.
So they can reach me.
My email is simply jason.frazier at ekata.com.
I unfortunately don't have a Twitter handle, but you can find me on LinkedIn.
I'm pretty sure most people have LinkedIn nowadays.
But yeah, and also feel free to reach out
if you're also interested in learning more opportunities.
Like I said, I'm hiring quite extensively.
My team specifically builds our actual product APIs that we offer to customers. So a lot of the sort of latency
optimizations that we do usually are kind of through my team in coordination with all the
other teams, since we need to build a new API with this requirement. How do we get that requirement?
Like, let's go start exploring. Excellent. I will, of course, throw a link to that in the
show notes as well.
I want to thank you
for spending the time
to speak with me today.
I really do appreciate it.
Yeah, I appreciate you having me on.
It's been a good chat.
Likewise.
I'm sure we will cross paths
in the future,
especially as we stumble
through the wide world
of, you know,
data stores and AWS
and this ecosystem
keeps getting bigger,
but somehow feels smaller
all the time.
Yeah, exactly.
And you know, we'll still be where we are,
hopefully approving all of your transactions
as they go through,
making sure that you don't run into any friction.
Thank you once again for speaking to me.
I really do appreciate it.
No problem.
Thanks again for having me.
Jason Frazier, software engineering manager at Ekata.
This has been a promoted episode
brought to us by our friends at Redis.
I'm cloud economist Corey Quinn, and this
is Screaming in the Cloud. If you've
enjoyed this podcast, please leave a five-star
review on your podcast platform of choice.
Whereas if you hated this podcast, please
leave a five-star review on your podcast
platform of choice, along with an angry,
insulting comment telling me that Enterprise
Redis is ridiculous because you can
build it yourself on a Raspberry Pi in only eight short months. If your AWS bill keeps rising and your blood
pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS.
We tailor recommendations to your business and we get to the point.
Visit duckbillgroup.com to get started.
This has been a HumblePod production. Stay humble.