Screaming in the Cloud - Keeping Life on the Internet Friction Free with Jason Frazier

Episode Date: February 16, 2022

About Jason

Jason Frazier is a Software Engineering Manager at Ekata, a Mastercard Company. Jason's team is responsible for developing and maintaining Ekata's product APIs. Previously, as a developer, Jason led the investigation and migration of Ekata's Identity Graph from AWS ElastiCache to Redis Enterprise Redis on Flash, which brought an average savings of $300,000/yr.

Links:
Ekata: https://ekata.com/
Email: jason.frazier@ekata.com
LinkedIn: https://www.linkedin.com/in/jasonfrazier56

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Today's episode is brought to you in part by our friends at MinIO, the high-performance Kubernetes-native object store that's built for the multi-cloud,
Starting point is 00:00:40 creating a consistent data storage layer for your public cloud instances, your private cloud instances, and even your edge instances, depending upon what the heck you're defining those as, which depends probably on where you work. Getting that unified is one of the greatest challenges facing developers and architects today. It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload, and the footprint to run anywhere. And that's exactly what MinIO offers. With superb read speeds in excess of 360 gigabytes per second and a 100 megabyte binary that doesn't eat all the data you've got on the system, it's exactly what you've been looking for. Check it out today at min.io slash download and see for yourself. That's min.io slash download, and be sure to tell them that I sent you.
Starting point is 00:01:30 This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They've also gone in depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That's S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense. Welcome to Screaming in the Cloud. I'm Corey Quinn. This one is a bit fun because it's a promoted episode sponsored by our friends at Redis, but my guest does not work at Redis,
Starting point is 00:02:18 nor has he ever. Jason Frazier is a software engineering manager at Akata, a MasterCard company, which I feel like that should have some sort of music backstopping into it, just because large companies always have that magic sheen on it. Jason, thank you for taking the time to speak with me today. Yeah, thanks for inviting me. Happy to be here. So other than the obvious assumption based upon the fact that Redis is kind enough to be sponsoring this episode, I'm going to assume that you're a Redis customer at this point. But I'm sure we'll get there.
Starting point is 00:02:48 Before we do, what is Akata? What do you folks do? So the whole idea behind Akata is, I mean, if you go to our website, our mission statement is, we want to be the global leader in online identity verification. What that really means is in a more increasingly digital world, when anyone can put anything they want into any text field they want, especially when purchasing anything online. You really think people do that? Just go on the internet and tell lies? I know. It's shocking to think that someone could lie about who they are online. But that's sort of what we're trying to solve specifically in the payment space. Like, I don't know, I want to buy a new pair of shoes online, and I enter in some information. Am I
Starting point is 00:03:29 really the person that I say I am when I'm trying to buy those shoes to prevent fraudulent transactions? That's really one of the basis that our company goes on is trying to reduce fraud globally. That's fascinating, just from the perspective of you take a look at cloud vendors, the space that I tend to hang out with, and a lot of their identity verification of is this person who they claim to be, in fact, is put back onto the payment providers. Take Oracle Cloud, which I periodically beat up, but also really enjoy aspects of their platform on, where to get to their always free tier, you have to provide a credit card. Now, they'll never charge you anything until you affirmatively upgrade the account. But so what do you do my card for? Ah, identity and fraud verification. So it feels like the way that everyone else handles this is, ah, we'll make it the payment networks problem. Well, you're now owned by MasterCard. So I sort of
Starting point is 00:04:18 assume you are what the payment networks in turn use to solve that problem. Yeah. So basically one of our flagship products and things that we return is sort of like a score from zero to 400 on how confident are we are that this person is who they are. And it's really about helping merchants help determine whether they should either approve or deny or forward on a transaction to like a manual review agent, as well as there's also another use case that's even more popular, which is just like account creation. As you can imagine, there's lots of bots
Starting point is 00:04:50 on everyone's favorite app or website and things like that, or customers offer a promotion, like sign up and get $10. Oh, I could probably get $10,000 if I make a thousand random accounts and then I'll sign up with them. But like make sure that those accounts are legitimate accounts. They'll prevent like that sort of promo abuse and
Starting point is 00:05:09 things like that. So it's also not just transactions. It's also like account openings and stuff. Make sure that you actually have real people on your platform. The thing that always annoyed me was the way that companies decide, oh, we're going to go ahead and solve that problem with a captcha on it. It's no, no, I don't want to solve machine learning puzzles for Google for free in order to sign up for something. I am the customer here. You're getting it wrong somewhere. So I assume, given the fact that I buy an awful lot of stuff online,
Starting point is 00:05:34 but I don't recall ever seeing anything branded with Akata, that you do this behind the scenes. It is not something that requires human interaction, by which I mean friction. Yeah, for sure. Yeah, yeah, it's behind the scenes. That's exactly what I was about to segue to is friction, is try to provide a frictionless experience for users. In the US, it's not as common, but when you go into Europe or anything like that, it's fairly common to get confirmations on transactions and things like that.
Starting point is 00:05:58 You may have to, I don't know, text or get a code text or enter that online to basically say like, yes, I actually received this. But like helping in the reason companies do that is for that like extra bit of security and assurance that that's actually legitimate. But obviously companies would like to prefer not to have to do that because I don't know if I'm trying to buy something, this website makes me do something extra. The site doesn't make me do anything extra. I'm probably going to go with that one because it's just more convenient for me because there's less friction there. You're obviously limited in how much you can say about this, just because it's, here's a list of all the things we care about means that, great, you've given me a roadmap
Starting point is 00:06:34 to have things to wind up looking at. But do you have an example or two of the sort of data that you wind up analyzing to figure out the likelihood that I'm a human versus a robot? Yeah, for sure. I mean, it's fairly common across most payment forms. So things like you enter in your first name, your last name, your address, your phone number, your email address. Those are all identity elements that we look at. We have two data stores. We have our identity graph and our identity network. The identity graph is what you would probably think of it if you think of a web of a person and their identity, like you have a name that's linked to a telephone that's in that name is also linked to an address, but that address used to have previous people living there, so on and so forth. So the various,
Starting point is 00:07:15 what we just call identity elements are the various things we look at. And it's fairly common on any payment form. I'm sure like if you buy something on Amazon versus eBay or whatever, you're probably going to be asked, what's your name? What's your address? What's your email address? What's your telephone? It's one of the most obnoxious parts
Starting point is 00:07:31 of buying things online from websites you haven't been to before. It's one of the genius ideas behind Apple Pay and the other centralized payment systems. Oh yeah, they already know who you are. Just click the button and it's done. Yeah, even something as small as that.
Starting point is 00:07:42 I mean, it gets a little bit easier with like form autocompletes and stuff. Like, oh, just type J and it's autocomplete everything for me. That's not the worst of the world, but it is still some amount of annoyance and friction. So as I look through all this, it seems like one of the key things you're trying to do, since it's in line with someone waiting while something is spinning in their browser, this needs to be quick. It also strikes me that this is likely not something that you're going to hit the same people trying to identify all the time.
Starting point is 00:08:08 If so, that is a sign of fraud. So it doesn't really seem like something can be heavily cached. Yet you're using Redis, which tells me that your conception of how you're using it might be different than the mental space that I put Redis into when I'm thinking about where in this ridiculous architecture diagram is the Redis part going to go? Yeah, I mean, like, when everyone says Redis, thinks of Redis, I mean, even before we went down this path, you always think of, oh, I need a cache, I'll just stuff in Redis, just use Redis as a cache here and there, some small, I don't know, a few tens, hundreds, gigabytes, maybe, cache spin that up, and you're good. But we actually use Redis as
Starting point is 00:08:45 our primary data store for our identity graph, specifically for the speed that we can get. Because if you're trying to look for a person, like let's say you're buying something for your brother, how do we know if that's true or not? Because you have this name, you're trying to send it to a different address. How does that make sense? But how do we get from Corey to an address? Like, oh, maybe you used to live with your brother. It's funny you picked that as your exam. My brother just moved to Dublin. So it's the whole problem.
Starting point is 00:09:12 How do I get this from me to someone, different country, different names, et cetera? And yeah, how do you wind up mapping that to figure out the likelihood that it is either credit card fraud or somebody actually trying to be, you know, a decent brother for once in my life. Yeah. So I mean, how it works is how you'd imagine. You start at some entry
Starting point is 00:09:30 point, which would probably be your name. Start there and say, can we match this to this person's address that you believe you're sending to? And we can say, oh, you have a person-person relationship like he's your brother. So it maps to him, which we can then get his address and say, oh, here's that address. That matches what you're trying to send it to. Hey, this makes sense because you have a legitimate reason to be sending something there. You're not just sending it to some random address out in the middle of nowhere for no reason. Or the dropshipping scams or brushing scams. That's the thing is every time you think you've seen it all, all you have to do is look at fraud. That's where the real innovation seems to be happening, no matter how you slice it.
Starting point is 00:10:04 Yeah, it's quite an interesting space. I always like to say it's one of those things where if you had the human element in it, it's not super easy, but it's generally easy to tell, okay, that makes sense, or oh, no, that's just complete garbage. But trying to do it at scale very fast in a general case becomes an actual substantially harder problem. It's one of those things that people can probably do fairly well. I mean, that's why we still have manual reviews and things like that. But trying to do it automatically on or just with computers is much more difficult. I scammed a company out of 20 bucks is not the problem you're trying to avoid for. It's the okay, I just did that 10 million times. And now we have a different problem.
Starting point is 00:10:43 Yeah, exactly. I mean, one of the biggest losses for a lot of companies is fraudulent transactions and chargebacks. Usually in the case on e-commerce companies, or even especially nowadays, where as you can imagine, more people are moving to a more online world and doing shopping online and things like that. So as more people move to online shopping, some companies are always going to get some amount of chargebacks on their non-fragile transactions. But when it happens at scale, that's when you start seeing many losses. Because not only are you issuing a chargeback, you probably sent out some products that you're now out some physical product as well. So it's almost kind of like a
Starting point is 00:11:21 double whammy. So as I look through all this, I tend to do always view Redis in terms of a more or less a key value store. Is that still accurate? Is that how you wind up working with it? Or has it evolved significantly past then to the point where you can now do relational queries against it? Yeah, so we do use Redis as a key value store because Redis is just a traditional key value store, very fast lookups. When we first started building out our identity graph, as you can imagine, you're trying to model people to telephone addresses, your first thought is, hey, this sounds a whole lot like a graph. That's sort of what we did quite a few years ago is let's just put it in some graph database. But as time went on, and as it became much more important to have lower and lower latency, we really started thinking about like, we don't really need all
Starting point is 00:12:03 the nice and shiny things that like a graph database or some sort of graph technology really offers you. All we really need to do is I need to get from point A to point B, and that's it. I ended a graph database. What's the first thing I need to do? We'll spend six weeks in school trying to figure out exactly what the hell a graph database is because they're challenging to wrap your head around at the best of times. Then it just always seemed overpowered for a lot of, I don't want to say simple use cases, what you're doing is not simple, but it doesn't seem to be leveraging the higher order advantages that a graph database tends to offer.
Starting point is 00:12:33 It added a lot of complexity in the system. And me and one of our senior principal engineers who's been here for a long time, we always have a joke. If you search our GitHub repository for, we'll say, kindly worded commit messages, you can see a very large correlation of those types of commit messages to all the commits to try and use a graph database from multiple years ago. It was not fun to work with, just added too much complexity, and we just didn't need all that shiny stuff. So that's how we really just took a step back. Like we really need to do it this way. We ended up effectively flattening the
Starting point is 00:13:10 entire graph into an adjacency list. So a key is basically some UUID to an entity. So Corey, you'd have some UUID associated with you and the value would be whatever your information would be, as well as other UUIDs to links to the other entities. So from that first retrieval, I can now unpack it and, oh, now I have a whole bunch of other UUIDs. I can then query on to get that information, which then have more IDs associated with it. It's more or less sort of how we do our graph traversal and query this in our graph queries. One of the fun things about doing this sort of interview dance on the podcast, as long as I have, is you start to pick up what people are saying by virtue of what they don't say. Earlier, you wound up mentioning that you often use Redis for things like tens or hundreds of gigabytes,
Starting point is 00:14:02 which sort of leaves in my mind the strong implication that you're talking about something significantly larger than that. Can you disclose the scale of data we're talking about here? Yeah, so we use Redis as our primary data store for our identity graph and also for, or soon to be, for our identity network, which is our other database. But specifically for our identity graph scale we're talking about, we do have some compression added on there. But if you say uncompressed, it's about 12 terabytes of data that's compressed with replication into about four. That's a relatively decent compression factor, given that I imagine we're not talking about huge data sets. Yeah, so this is actually basically driven directly by cost. If you need to store less
Starting point is 00:14:43 data, then you need less memory. Therefore, you need to pay for less. Sorry, just once again to short up my longtime argument that when it comes to cloud, cost and architecture are in fact the same thing. Please continue by all means. I would be lying if I said that we didn't do weekly slash monthly reviews of costs. Where are we spending costs in AWS? How can we improve costs? How can we cut down on costs?
Starting point is 00:15:03 How can we store less? You are singing my song. It is a constant discussion. But yeah, so we use ZStandard compression, which was developed at Facebook. It's a dictionary-based compression. And the reason we went for this is, I mean, if I say I want to compress a Word document down, you can get very, very, very high levels of compression. It exists. It's not that interesting. Everyone does it all the time. But with this, we're talking about, so in that, very, very high levels of compression. It exists. It's not that interesting. Everyone does it all the time. But with this, we're talking about, so in that basically four or so terabytes
Starting point is 00:15:31 of compressed data that we have, it's something around four to four and a half billion keys and values. And so in that, we're talking about each key value only really having anywhere between 50 and 100 bytes. So we're not compressing very large pieces of information. We're compressing very small 50 to 100 byte JSON values that we have UUID keys and JSON strings stored as values. So we're compressing these 50 to 100 byte JSON strings with around 70 to 80% compression. I mean, that's using Z standard with a custom dictionary, which is probably gave us the biggest cost savings of all. If you can reset your data set size by 60, 70%, that's huge.
Starting point is 00:16:15 Did you start off doing this on top of Redis or was this an evolution that eventually got you there? It was an evolution over time. We were formerly White Pages. I mean, White Pages started back in the late 90s. It really just started off as a... You were a very early adopter of Red. Yeah, at that point, we had a type machine and started using it before it existed.
Starting point is 00:16:33 Always a fun story. Recruiters seem to want that all the time. Yeah, so when we first started, I mean, we didn't have that much data. It was basically just one provider that gave us some amount of data. So it was kind of just a, we just need to start something quick, get something going. And so, I mean, we just did what most people do, just do the simplest thing, just stuff it all in a Postgres database
Starting point is 00:16:50 and call it good. Yeah, it was slow, but hey, it was back a long time ago. People were kind of okay with a little bit. The world moved a bit slower back then. Yeah, everything was a bit slower. No one really minded too much. The scale wasn't that large,
Starting point is 00:17:04 but business requirements always change over time and they evolve. And so to meet those ever-evolving business requirements, we moved from Postgres and where a lot of the fun commit messages that I mentioned earlier can be found is when we started working with Cassandra and Titan. That was before my time, before I had started. But from what I understand, that was a very fun time. But then from there, that's when we really kind of just took a step back and just said, there's so much stuff that we just don't need here. Let's really think about this and let's try to optimize a bit more. We know our use case, why not optimize for our use case? And that's how we ended up with the flattened graph storage, stuffing it into Redis because everyone thought of Redis as a cache, but everyone also knows that, why is it a cache?
Starting point is 00:17:47 Because it's fast. We need something that's very fast. I still conceptualize it as an in-memory data store, just because when I turned on disk persistence model back in 2011, give or take, it suddenly started slamming the entire data store to a halt for about three seconds every time it did it. It was, what's this piece of crap here?
Starting point is 00:18:03 And it was, oh, yeah, it turns out there was a regression on Zen, which is what AWS was using as a hypervisor back then. And, oh, yeah, so fork became an expensive call. It took forever to wind up running. So, oh, the obvious lesson we take from this is, oh, yeah, Redis is not designed to be used with disk persistence. Wrong lesson to take from the behavior,
Starting point is 00:18:20 but it cemented, in my mind at least, the idea that this is something that we tend to use only as a in-memory store. It's clear that the technology has evolved. And in fact, I'm super glad that Redis threw you my direction to talk to about this stuff. Because until talking to you, I was still, I've got to admit, sort of in the position of thinking of it still as an in-memory data store. Because the fact that Redis says otherwise because they're envisioning it being something else. Well, okay, marketer is going to market.
Starting point is 00:18:48 You're a customer. It's a lot harder for me to talk smack about your approach to this thing when I see you doing it for, let's be serious here, what is a very important use case. If identity verification starts failing open and everyone claims to be who they say they are, that's something that's visible from orbit when it comes to the macroeconomic effect. Yeah, exactly. It's actually funny because before we moved to primarily just using Redis, before going to fully Redis, we did still use Redis, but we used ElastiCache. We had it loaded into ElastiCache, but we also had it loaded into DynamoDB as sort of a, I don't want this to fail because we weren't comfortable with actually using Redis as a primary database. So we used to use ElastiCache with a fallback to DynamoDB just in that off chance,
Starting point is 00:19:32 which sometimes it happened, sometimes it didn't. But that's when we basically just went searching for new technologies. And that's actually how we landed on Redis on Flash, which is it kind of breaks the whole idea of Redis as an in-memory database to where it's Redis, but it Redis as an in-memory database to where it's Redis, but it's not just an in-memory database. You also have Flashback storage.
Starting point is 00:19:51 So you'll forgive me if I combine my day job with this side project of mine where I fix the horrifying AWS bills for large companies. My bias as a result is to look at infrastructure environments primarily through the lens of AWS Bill. And, oh, great, go ahead and use an enterprise offering that someone else runs, because sure, it might cost more money, but it's not showing up on the AWS Bill. Therefore, my job is done. Yeah, it turns out that doesn't actually work, or the answer to every AWS billing problem is to migrate to Azure or to GCP. It turns out that that doesn't actually solve the problem that you would expect. But you're obviously an enterprise customer of Redis. Does that data live
Starting point is 00:20:31 in your AWS account? Is it something you're using as their managed service and throwing over the walls so it shows up as data transfer on your side? How is that implemented? I know they've got a few different models. There's a couple of aspects onto how we're actually built. I mean, so like when you have ElastiCache, you're just built for your, I don't know, whatever nodes you're using, cache.R5, whatever they are. I wish most people were using things that modern, but please continue. Yeah. So you're basically just built for whatever ElastiCache nodes you have. You have your hourly rate. I don't know, maybe you might reserve them. But with Redis Enterprise, the way that we're built, there's two aspects. One is, well,
Starting point is 00:21:10 the contract that we signed that basically allows us to use their technology. We go with a managed service, a managed solution. So there's some amount that we pay them directly within some contract, as well as the actual nodes themselves that exist in the cluster. And so basically the way that this is set up is we effectively have a sub-acc account within our AWS account that Redis Labs has, or not Redis Labs, Redis Enterprise has access to, which they deploy directly into and effectively using VPC peering. That's how we allow our applications to talk directly to it. So we're built directly or so the actual nodes of the cluster, which are I3 8x, I believe. They basically just run EC2 instances. All of those instances, those exist on our bill. We get billed for them. We pay for them. It's just basically some sub-account that they have access to that they can deploy into.
Starting point is 00:21:56 So we get billed for the instances of the cluster as well as whatever we pay for our enterprise contract. So there's sort of two aspects to the actual billing of it. They are a cloud provider that provides surprisingly high performance cloud compute at a price that, well, sure, they claim it is better than AWS's pricing. And when they say that, they mean that it's less money. Sure, I don't dispute that. But what I find interesting is that it's predictable. They tell you in advance on a monthly basis what it's going to cost. They have a bunch of advanced networking features. They have 19 global
Starting point is 00:22:46 locations and scale things elastically, not to be confused with openly, which is apparently elastic and open. They can mean the same thing sometimes. They have had over a million users. Deployments take less than 60 seconds across 12 pre-selected operating systems. Or if you're one of those nutters like me, you can bring your own ISO and install basically any operating system you want. Starting with pricing as low as $2.50 a month for Vulture Cloud Compute,
Starting point is 00:23:15 they have plans for developers and businesses of all sizes, except maybe Amazon, who stubbornly insists on having something of the scale all on their own. But you don't have to take my word for it with an exclusive offer for you. Sign up today for free and receive $100 in credits to kick the tires and see for yourself. Get started at vulture.com slash morning brief. That's v-u-l-t-r dot com slash morning brief. So it's easy to sit here as an engineer, and believe me, having been one for
Starting point is 00:23:46 most of my career, I fall subject to this bias all the time, where it's, oh, you're going to charge me a management fee to run this thing? Oh, that's ridiculous. I can do it myself instead, because at least when I was learning in my dorm room, it was always a, well, my time is free, but money is hard to come by. And shaking off that perspective as my career continued, it was always a, well, my time is free, but money is hard to come by. And shaking off that perspective as my career continued to evolve was always a bit of a challenge for me. Do you ever find yourself or your team drifting in the idea toward the direction of, well, what are we paying for Redis Enterprise for? We could just run it ourselves with the open source version and save whatever it is that they're charging on top of that.
Starting point is 00:24:21 Before we landed on Redis on Flash, we had that same thought, like, why don't we just run our own Redis? And the decision to that is, well, managing such a large cluster that's so important to the function of our business, like you effectively would have needed to hire someone full-time to just sit there and stare at the cluster the whole time just to operate it, maintain it, make sure things are running smoothly. And it's something that we just, we made a decision that, no, we're just going to, we're going to go with a managed solution. It's not easy to manage and maintain clusters of that size, especially when they're so important to business continuity. From our eyes, it was, it was just not worth the investment for us to try and manage it ourselves and go with the fully
Starting point is 00:25:03 managed solution. Even when we talk about it, it's one of those, well, everyone talks about the wrong side of it first. Oh, it's easier if things are down, if we wind up being able to say, oh, we have a ticket open rather than I'm on the support form and waiting for people to get back to me. There's a defensibility perspective. We all just sort of sidestep past the real truth of it of, yeah, the people who are best in the world at running and building these things are right now working on the problem when there is one.
Starting point is 00:25:29 Yeah, they're also probably the best in the world at trying to solve what's going on. Yeah, because that is what we're paying them to do. Oh, right. People don't always volunteer for for-profit entities. I keep forgetting that part of it. Yeah, I mean, we've had some very, very fun production outages that just randomly happened because to our knowledge, we would just look at it like, I have no idea what's going on. And working with their support team, their DevOps team, honestly, it was a good one-week troubleshooting when we were evaluating the technology. We accidentally halted the database for seemingly no reason.
Starting point is 00:26:01 And we couldn't possibly figure out what was going on. We kept talking to their DevOps team. They're saying, oh, we see all these rights going on for some reason. And we couldn't possibly figure out what's going on. We kept talking to, we were talking to their DevOps team. They're saying, oh, we see all these rights going on for some reason. We're like, we're not sending any rights. Why is there rights? And that was a whole back and forth for almost a week trying to figure out what the heck was going on. And it happened to be like a very subtle case in terms of like the, how the keys and values are actually stored between RAM and flash and how it might swap in and out of flash. And all the way down to that level where I want to say we probably talked to their DevOps team at least two to three times like, can you just explain this to me? Sure, why does this happen? I didn't know this was a thing, so on and so forth. There's definitely
Starting point is 00:26:38 some things that are fairly difficult to try and debug, which definitely helps having that enterprise level solution. Well, that's the most valuable thing in any sort of operational experience, where, okay, I can read the documentation and all the other things, and it tells me how it works. Great. The real value of whether I trust something in production is whether or not I know how it breaks. Because the one thing you want to hear when you're calling someone up is, oh yeah, we've seen this before. This is what you do to fix it. The worst thing in the world is, oh, that's interesting. We've never seen that before because then, oh dear Lord, we're off in the mists of trying to figure out what's going on here
Starting point is 00:27:12 while production's done. Yeah. Kind of like, what does this database do in terms of what do we do? This is what we store our identity graph in. This has the graph of people's information. If we're trying to do identity verification for transactions or anything for any of our products, I mean, we need to be able to query this database. It needs to be up. We have a certain requirement in terms of uptime where we want at least like four nines of uptime. So we also want a solution that, hey, even if it wants to break, don't break that bad. There's a difference between, oh, a node failed, and okay, like we're good in 10-20 seconds, versus, oh, a node failed, you lost data, you need to start reloading your data set, or you can't query this anymore. There's a very large difference between
Starting point is 00:28:00 those two. A little bit, yeah. That's also a great story to drive things across. Like really, what is this going to cost us if we pay for the enterprise version? Great. Is it going to be more than some extortionately large number? Because if we're down for three hours in the course of a year, that's what we owe our customers back for not being able to deliver. So it seems to me this is kind of a no-brainer for things like that. Yeah, exactly. And that's part of the reason, I mean, a lot of the things we do at Akata, we usually go with enterprise level for a lot of things we do. And that's, it's really for that support factor and helping reduce any potential downtime for what we have, because, well, if we don't consider ourselves comfortable or expert level in that subject, I mean, then
Starting point is 00:28:39 yeah, if it goes down, that's, that's terrible for our customers. I mean, it's needed for literally every single query that comes through us. I did want to ask you, but you keep talking about the database and the cluster. That seems like if you have a single database or a single cluster that winds up being responsible for all of this, that feels like the blast radius of that thing going down must be enormous. Have you done any research into breaking that out into smaller databases? What is it that's driven you toward this architectural pattern? Yeah, so for right now, so we have actually three regions we're deployed into. We have a copy of it in US West and AWS. We have one in EU Central 1, and we also
Starting point is 00:29:14 have one in AP Southeast 1. So we have a complete copy of this database in three separate regions, as well as we're spread across all the available availability zones for that region. So we try and be as multi-AZ as we can within a specific region. So we have thought about breaking it down, but having high availability, having multiple replication factors, having also it stored in multiple data centers provides us at least a good level of comfortability. Specifically in our US cluster, we actually have two. We literally also, with a lot of the cost savings that we got, we actually
Starting point is 00:29:51 have two. We have one that literally sits idle 24-7 that we just call our backup and our standby, where it's ready to go at a moment's notice. Thankfully, we haven't had to use it since, I want to say, its creation about a year and a half ago. But it sits there in that doomsday scenario. Oh my gosh, this cluster literally cannot function anymore. Something crazy, catastrophic happened. And we can basically hot swap back into another production ready cluster as needed, if needed. Because the really important thing is that if we broke it up into separate databases, if one of them goes down, that could still fail your entire query. Because what if that's the database that held
Starting point is 00:30:30 your address? We could still query you, but we're going to try and get your address. And well, there your traversal just died because you can no longer get that. So even trying to break it up doesn't really help us too much. We can still fail the entire traversal query. Yeah, which makes an awful lot of sense. But again, to be clear, you've obviously put thought into this. This goes way beyond me hearing something in passing and saying, hey, you considered this thing. Let's be very clear here. That is the sign of a terrible junior consultant. Well, it sounds like what you built sucked if you considered building something that didn't suck. Oh, thanks, professor. Really appreciate your pointing that out.
Starting point is 00:31:05 It's one of those useful things. It's like, oh, wow, we've been doing this for, I don't know, many, many years. It's like, oh, wow, yeah. Haven't thought about that one yet. So it sounds like you're relatively happy with how Redis has worked out for you as the primary data store.
Starting point is 00:31:17 If you were doing it all again from scratch, would you make the same technology selection there? Or would you go in a different direction? Yeah, I think I'd make the same decision. I mean, we've been using Greta's Sunflash for, at this point, three, maybe coming up to four years at this point. There's a reason we keep renewing our contract and just keep continuing with them is because, to us, it just fits our use case so well. We very much choose to continue going in with this direction and this technology. What would you have them change as far as feature enhancements and new options being enabled there? Because remember, asking
Starting point is 00:31:48 them right now in front of an audience like this puts them in a situation where they cannot possibly refuse. Please, how do you improve Redis from where it is now? I like how you think. That's a fair way to describe it. There's a couple of things for optimizations that can always be done. I'm like, there's specifically with like Redis on Flash, there's some issue we had with storing as binary keys that, to my knowledge, hasn't necessarily been complete yet, that basically prevents us from storing as binary, which has some amount of benefit because well, binary keys require less memory to store. When you're talking about 4 billion keys, even if you're just saving 20 bytes a key, like you're talking about potentially hundreds of
Starting point is 00:32:24 gigabytes of savings once it adds up with the gigabytes of savings. It adds up pretty quick. So that's probably one of the big things that we've been in contact with them with about fixing that hasn't gotten there yet. The other thing is there's a couple of random gotchas that we had to learn along the way. It does add a little bit of complexity in our loading process. Effectively, when you first write a value into the database, it'll write it to RAM. But then once it gets flushed to Flash, the database effectively asks itself, does this value already exist in Flash? Because once it's first written, it's just written to RAM then does a write to write it into Flash and then evict it out of RAM. That sounds pretty innocent, but if it already exists in Flash, when you read it, it says, hey, I need to evict this. Does it already exist in Flash? Yep. Okay, just chuck it away. It already exists. We're good. It sounds pretty nice, but this is where we accidentally halted our database. Once we started putting a huge amount of load on the cluster, our general throughput on peak day is somewhere in the order of 160,000 to 200,000
Starting point is 00:33:28 Redis operations per second. So you're starting to think of, hey, you might be evicting 100,000 values per second into Flash. You're talking about added 100,000 write operations per second into your cluster, and that accidentally halted our database. So the way we actually go around this is once we write our data store, we actually basically read the whole thing once. Because if you read every single key, you pretty much guarantee to cycle
Starting point is 00:33:55 everything into Flash, so it doesn't have to do any of those writes. For right now, there is no option to basically say that if I write... For our use case, we do very little writes except for upfront. So it'd be super nice for our use case if we could say, hey, our write operations, no, I want you to actually do a full write through to Flash. Because that would effectively cut our entire database prep in half. We no longer have to do that read to cycle everything through. Those are probably the two big things and one of the biggest gotchas that we ran into that maybe isn't so known. I really want to thank you for taking the time to speak with me today. If people want to learn more, where can they find you? And I will also theorize wildly
Starting point is 00:34:36 that if you're like basically every other company out there right now, you're probably hiring on your team too. Yeah, I very much am hiring. I'm actually hiring quite a lot right now. So they can reach me. My email is simply jason.frasier at akata.com. I unfortunately don't have a Twitter handle or you can find me on LinkedIn. I'm pretty sure most people have LinkedIn nowadays. But yeah, and also feel free to reach out if you're also interested in learning more opportunities.
Starting point is 00:35:01 Like I said, I'm hiring quite extensively. I'm specifically the team that builds our actual product APIs that we offer to customers. So a lot of the sort of latency optimizations that we do usually are kind of through my team in coordination with all the other teams, since we need to build a new API with this requirement. How do we get that requirement? Like, let's go start exploring. Excellent. I will, of course, throw a link to that in the show notes as well. I want to thank you, throw a link to that in the show notes as well. I want to thank you for spending the time
Starting point is 00:35:27 to speak with me today. I really do appreciate it. Yeah, I appreciate you having me on. It's been a good chat. Likewise. I'm sure we will cross paths in the future, especially as we stumble
Starting point is 00:35:35 through the wide world of, you know, data stores and AWS and this ecosystem keeps getting bigger, but somehow feels smaller all the time. Yeah, exactly.
Starting point is 00:35:45 And you know, we'll still be where we are, hopefully proving all of your transactions as they go through, make sure that you don't run into any friction. Thank you once again for speaking to me. I really do appreciate it. No problem. Thanks again for having me.
Starting point is 00:35:59 Jason Frazier, software engineering manager at Akata. This has been a promoted episode brought to us by our friends at Redis. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you hated this podcast, please
Starting point is 00:36:15 leave a five-star review on your podcast platform of choice, along with an angry, insulting comment telling me that Enterprise Redis is ridiculous because you can build it yourself on a Raspberry Pi in only eight short months. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duck Bill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The duck bill group works for you, not AWS. We tailor recommendations to your business and we get to the point.
Starting point is 00:36:53 Visit duckbillgroup.com to get started. this has been a humble pod production stay humble
