Screaming in the Cloud - Would You Kindly Remind with Peter Hamilton

Episode Date: March 31, 2022

About Peter
Peter's spent more than a decade building scalable and robust systems at startups across adtech and edtech. At Remind, where he's VP of Technology, Peter pushes for building a sustainable tech company with mature software engineering. He lives in Southern California and enjoys spending time at the beach with his family.

Links:
Redis: https://redis.com/
Remind: https://www.remind.com/
Remind Engineering Blog: https://engineering.remind.com
LinkedIn: https://www.linkedin.com/in/hamiltop
Email: peterh@remind101.com

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Today's episode is brought to you in part by our friends at MinIO, the high-performance Kubernetes-native object store that's built for the multi-cloud,
Starting point is 00:00:39 creating a consistent data storage layer for your public cloud instances, your private cloud instances, and even your edge instances, depending upon what the heck you're defining those as, which depends probably on where you work. It's getting that unified is one of the greatest challenges facing developers and architects today. It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload, and the footprint to run anywhere. And that's exactly what Minio offers.
Starting point is 00:01:09 With superb read speeds in excess of 360 gigs and a 100 megabyte binary that doesn't eat all the data you've got on the system, it's exactly what you've been looking for. Check it out today at min.io slash download and see for yourself. That's min.io slash download. And be sure to tell them that I sent you. This episode is sponsored in part by our friends at Vulture, spelled V-U-L-T-R,
Starting point is 00:01:36 because they're all about helping save money, including on things like, you know, vowels. So what they do is they are a cloud provider that provides surprisingly high performance cloud compute at a price that, well, sure, they claim it is better than AWS's pricing. And when they say that, they mean that it's less money. Sure, I don't dispute that. But what I find interesting is that it's predictable. They tell you in advance on a monthly basis what it's going to cost. They have a bunch of advanced networking features. They have 19 global locations and scale things elastically, not to be confused with openly, which is apparently elastic and open. They can mean the same thing sometimes. They have had over a million users. Deployments take less
Starting point is 00:02:22 than 60 seconds across 12 pre-selected operating systems, or if you're one of those nutters like me, you can bring your own ISO and install basically any operating system you want. Starting with pricing as low as $2.50 a month for Vulture Cloud Compute, they have plans for developers and businesses of all sizes, except maybe Amazon, who stubbornly insists on having something of the scale on their own. Try Vulture today for free by visiting vulture.com slash screaming, and you'll receive $100 in credit. That's v-u-l-t-r dot com slash screaming. Welcome to Screaming in the Cloud. I'm Corey Quinn, and this is a fun episode. It is a promoted episode, which means that our friends at Redis have gone ahead and sponsored this entire episode.
Starting point is 00:03:14 I asked them, great, who are you going to send me from generally your executive suite? And they said, nah, you already know what we're going to say. We want you to talk to one of our customers. And so here we are. My guest today is Peter Hamilton, VP of Technology at Remind. Peter, thank you for joining me. Thanks, Corey. Excited to be here. It's always interesting when I get to talk to people on Promoted Guest episodes when they're a customer of the sponsor. Because to be clear, you do not work for Redis. This is one of those stories you enjoy telling,
Starting point is 00:03:49 but you don't personally have a stake in whether people love Redis, hate Redis, adopt it or not, which is exactly what I try and do on these shows. There's an authenticity to people who have in the trenches experience who aren't themselves trying to sell the thing because that is their entire job in this world. Yeah. You just presented three or four different opinions, and I guarantee we felt all of them at
Starting point is 00:04:13 different times. So let's start at the very beginning. What does Remind do? So Remind is a messaging tool for education, largely K through 12. We support about 30 million active users across the country, over 2 million teachers, making sure that every student has equal opportunities to succeed and that we can facilitate as much learning as possible. When you say messaging, that could mean a bunch of different things to a bunch of different people. Once on a lark, I wound up sitting down, this was years ago, so I'm sure the number is a woeful underestimate now, of how many AWS services I could use to send a message from me to you. And this is without going into the lunacy territory of, well, I can tag a thing and then mail it to you like a snowball edge or something. No, this is using them as intended. I think I got to 15 or 16 of them. When you say messaging, what does that mean to you? So for us, it's about communication to the end user. We will do
Starting point is 00:05:17 everything we can to deliver whatever message a teacher or a district administrator has to the user. We go through SMS, text messaging. We go through Apple and Google's push services. We go through email. We go through voice call, really pulling at all the stops we can to make sure that these important messages get out. And I can only imagine some of the regulatory pressure you almost certainly experience. It feels like it's not quite to HIPAA levels where, oh, there's a private cause of action if any of this stuff gets out. But people are inherently sensitive about communications involving their children. I always sort of knew this in a general sense.
Starting point is 00:05:52 And then I had kids myself. And, oh, yeah, suddenly I really care about those sorts of things. Yeah. One of the big challenges is you can build great systems that do the correct thing. But at the end of the day, we're relying on a teacher choosing the right recipient when they send a message. And so we've had to build a lot of processes and controls in place so that we can kind of satisfy two conflicting needs. One is to provide a clear audit log, because that's an important thing for districts to know
Starting point is 00:06:20 if something does happen, that we have clear communication. And the other is to also be able to jump in and intervene when something inappropriate or mistaken is sent out to the wrong people. Remind has always been one of those companies that has a somewhat exalted reputation in the AWS space. You folks have been early adopters of a bunch of different services, which, let's be clear, in the responsible way, not the, well, they said it on stage, time to go ahead and put everything they just listed into production, because we, for some godforsaken reason, view it as a to-do list. But you've been thoughtful about how you approach things, and you have been around as a company for a while, but you've also been
Starting point is 00:06:59 making a significant push toward being cloud-native by certain definitions of that term. So I know this sounds like a college entrance essay, but what does cloud native mean to you? So one of the big gaps, if you take an application that was written to be deployed in a traditional data center environment and just drop it in the cloud, what you're going to get is a flaky data center. Well, that's not fair. It's also going to be extremely expensive. Sorry, an expensive flaky data center. There we go. There we go.
Starting point is 00:07:30 What we've really looked at, and a lot of this goes back to our history in the earlier days, we ran on top of Heroku, and it was kind of the early days of what they call the 12-factor application. But making aggressive decisions about how you structure your architecture and application so that you fit in with some of the cloud tools architecture and application so that you fit in with some of the cloud tools that are available and that you fit in, you know, with the operating models that are out there. When you say an aggressive decision, what sort of thing are you talking about? Because when I think of being aggressive with my approach to things like AWS,
Starting point is 00:08:00 it usually involves Twitter. And I'm guessing that is not the direction you intend that to go. No, I think if you look at Twitter or Netflix or some of these players that, quite frankly, have defined what AWS is to us today through their usage patterns, not quite that. Oh, I mean using Twitter to yell at them explicitly about things, because I don't do passive aggressive. I just do aggressive. Got it. No, I think in our case, it's been plotting a very narrow path that allows us to avoid some of the bigger pitfalls. We have our sponsor here, Redis. I can talk a little bit about our usage of Redis and how that's helped us in some of these cases. One of the pitfalls you'll find with pulling a non-cloud native application, putting the cloud is state is hard to manage. If you put state on all your machines and machines go down, networks fail, all those things, you now no longer have access to that state. And we start to see a lot of problems. One of the decisions we've made is try to put as much state as we can into data stores like Redis or Postgres or something in order to decouple our
Starting point is 00:08:59 hardware from the state we're trying to manage and provide for our users so that we're more resilient to those sorts of failures. I get the sense from the way that we're having this conversation, when you talk about Redis, you mean actual Redis itself, not ElastiCache for Redis, or as I'm tending to increasingly think about AWS's services, Amazon Basics for Redis. Yeah, I mean, Amazon has launched a number of products. They have their ElastiCache, they have their new MemoryDB. There's a lot of different ways to use this. We've relied pretty heavily on Redis, previously known as Redis Labs, and their enterprise product in their cloud in order to take care of our most important data, which we just don't want to manage ourselves.
Starting point is 00:09:41 Trying to manage that on our own, using something like Elastic Cache, there's so many pitfalls, so many ways that we can lose that data. This data is important to us. By having it in a trusted place and managed by a great ops team like they have at Redis, we're able to then lean in on the other aspects of cloud native to really get as much value as we can out of AWS. I am curious. As I said, you've had a reputation as a company for a while in the AWS space of doing an awful lot of really interesting things. I mean, you have a robust
Starting point is 00:10:15 GitHub presence. You have a whole bunch of tools that have come out of our mind that are great. I've linked to a number of them over the years in the newsletter. You are clearly not afraid culturally to get your hands dirty and build things yourself, but you are using Redis Enterprise as opposed to open source Redis. What drove that decision? I have to assume it's not, wait, you mean I could get it for free as an open source project? Why didn't someone tell me? What brought you to that decision? Yeah, a big part of this is what we could call operating leverage. Building a great set of tools that allow you to get more value out of AWS is a little different story than babysitting servers all day and making sure they stay up. So if you look through most of our contributions in open source space, have really been around, here's how to expand upon these foundational pieces from AWS. Here's how to more efficiently
Starting point is 00:11:12 launch a suite of servers into an auto-scaling group. Here's our Troposphere and other pieces there. This was all before Amazon's CDK product, but really it was, here's how we can more effectively use CloudFormation to capture our infrastructure as code. And so we are not afraid in any way to invest in our tooling and invest in some of those things. But when we look at the trade-off of directly managing stateful services and dealing with all the uncertainty that comes, we feel our time is better spent working on our product and delivering value to our users and relying on partners like Redis in order to provide that stability we need. You raise a good point. An awful lot of the tools that you have put out there are the best, from my perspective, approach to working with AWS services. And that is a relatively thin layer built on top of them
Starting point is 00:12:07 with an eye toward making the user experience more polished, but not being so heavily opinionated that as soon as the service goes in a different direction, the tool becomes completely useless. You just decide to make it a bit easier to wind up working with specific environment variables or profiles rather than what appears to be the AWS UX approach of, oh, now type in your access key, your secret key, and your session token,
Starting point is 00:12:31 and we've disabled copy and paste. Go, have fun. You've really done a lot of quality of life improvements more so than you have. This is the entire system of how we do deploys start to finish. It's opinionated and sort of like a take on what Netflix did once upon a time with Asgard. It really feels like it's just the right level of abstraction. We've done a pretty good job. I will say, years later, we felt that we got it wrong a couple of times. It's been really interesting to see that, that there are times
Starting point is 00:12:59 when we say, oh, we could take these three or four services and wrap it up into this new concept of an application. And over time, we start poking holes in that four services and wrap it up into this new concept of an application. And over time, we have to start poking holes in that new layer and we start to see we would have been better served by sticking with as thin a layer as possible that enables us rather than trying to get these higher level pieces. It's remarkably refreshing to hear you say that, just because so many people love to tell the story on podcasts or on conference stages or whatever format they have of, this is what we built. And it is an aspirationally superficial story about this. They don't talk about that, well, first we went down these three wrong paths first.
Starting point is 00:13:39 It's always a, oh, yes, obviously we are smart people and we only make the correct decision. And I remember in the before time, sitting in conference talks, watching people talk about great things they've done. And I'll turn to the person next to me and say, wow, I wish I could be involved in a project like that. And they'll say, yes, so do I. And it turns out they work at the company the speaker is from. Because all of these things tend to be the most positive story. Do you have an example of something that you have done in your production environment that going back, yeah, in hindsight, would have done that completely differently? Yeah. So coming from Heroku, moving into AWS, we had a great open source project called Empire,
Starting point is 00:14:17 which kind of bridged that gap between them, but used Amazon's ECS in order to launch applications. It was actually command line compatible with the Heroku command when it first launched. So a very big commitment there. And at the time, I mean, this comes back to a point I think we were talking about earlier, where architecture, costs, infrastructure, they're all interlinked. And I'm a big fan of Conway's law, which says that an organization's structure needs to match its architecture. And I'm a big fan of Conway's law, which says that an organization's structure needs to match its architecture. And so six, seven years ago, we're a heavy growth-based company, and we are interns running around, doing all the things. And we wanted to have really strict
Starting point is 00:14:57 guardrails and a narrow set of things that our development team could do. And so we built a pretty constrained, you will launch, you will have one Docker image per ECS service. It can only do these specific things. And this allowed our development team to focus on pretty buttons on the screen and user engagement and experiments and whatnot. But as we've evolved as a company, as we've built out a more robust business, we've started to track revenue and costs of goods sold more aggressively, we've seen there's a lot of inefficient things that come out of that. One particular example was we use pgBouncer for our connection pooling to our Postgres application. In the traditional model, we had an autoscaling group for our pgBouncer, and then our autoscaling groups for the other applications would connect to
Starting point is 00:15:43 it. And we saw additional latency, we saw additional costs, and we eventually kind of tore all that down and packaged that PG Bouncer alongside the applications that needed it. And this was a configuration that wasn't available in our first pass. It was something we intentionally did not provide to our development team. And we had to unwind that. And when we did, we saw better performance, we saw better performance. We saw better cost efficiency, all sorts of benefits that we care a lot about now that we didn't care about as much many years ago.
Starting point is 00:16:12 It sounds like you're describing some semblance of an internal platform where instead of letting all of your engineers effectively, well, here's the console. Ideally, you use some form of infrastructure as code. Good luck, have fun. You effectively gate access to that. Is that something that you're still doing
Starting point is 00:16:28 or have you taken a different approach? So our primary gate is our infrastructure as code repository. If you want to make a meaningful change, you open up a PR, got to go through code review. You need people to sign off on it. Anything that's not there may not exist tomorrow. There's no guarantees. And we've gone around occasionally just shut random servers down that people spun up in our account. And sometimes people are a little grumpy about it, but you
Starting point is 00:16:53 really need to enforce that culture that we have to go through the correct channels and we have to have this cohesive platform, as you said, to support our development efforts. So you're a messaging service in education. So whenever I do a little bit of digging into backstories of companies and what has made, I guess, an impression, you look for certain things and explicit dates are one of them, where on March 13th of 2020, your business changed just a smidgen. What happened other than the obvious, we never went outside for two years? So if we roll back a week, you know, that's March 13th. So if we roll back a week, we're looking at March 6th. On that day, we sent out about 60 million messages over all of our different mediums, text, email, push notifications.
Starting point is 00:17:45 On March 13th, that was 100 million. And then a few weeks later, on March 30th, that was 177 million. And so our traffic effectively tripled over the course of those three weeks. And yeah, that's quite a ride, let me tell you. The opinion that a lot of folks have who've not gotten to play in sophisticated distributed systems is, well, what's the hard part there? You have an auto-scaling group just spin up three times the number of servers in that fleet and problem solved. What's challenging? A lot. But what did you find that the pressure points were? So I love that example that your auto-scaling group will just work. By default, Amazon's auto scaling groups only support a thousand backends. So when your auto scaling group goes from 400 backends to 1200, things break and not in ways that you would have
Starting point is 00:18:36 expected. You start to learn things about how database systems provided by Amazon have limits other than CPU and memory. And they're clearly laid out that there's network bandwidth limits and things you have to worry about. We had a pretty small team in that time, and we got in this cadence where every Monday morning we would wake up at 4 a.m. Pacific because as part of the pandemic, our traffic shifted. So our East Coast users would be most active in the morning
Starting point is 00:19:02 rather than the afternoon. And so at about 7 a.m. on the East Coast is when be most active in the morning rather than the afternoon. And so at about 7 a.m. on the East Coast is when everyone came online. And we had our Monday morning crew there and just looking to see where the next pain point was going to be. And we'd have Monday, walk through it all. Monday afternoon, we'd meet together. We'd come up with our three or four hypotheses on what will break if our traffic doubles again. And we'd spend the rest of that next week addressing those the best we could and repeat for the next Monday. And we did this for three, four, five weeks in a row. And finally, it stabilized.
Starting point is 00:19:33 But yeah, it's all the small little things. The things you don't know about, the limits and places you don't recognize that just catch up to you. And you need to have a team that can move fast and adapt quickly. You've been using Redis for six, seven years, something along those lines. As an enterprise offering, you've been working with the same vendor who provides this managed service for a while now. What have been the fruits of that relationship? What is the value that you see by continuing to have a long-term relationship with vendors? Because let's be serious, most of us don't stay in jobs that long, let alone work with the same vendor.
Starting point is 00:20:08 Yeah. So coming back to the March 2020 story, many of our vendors started to see some issues here that various services weren't scaled properly. We made a lot of phone calls to a lot of vendors and working with them. And I'm very impressed with how Redis Labs at the time was able to respond. We hopped on a call. They said, here's what we think we need to do. We'll go ahead and do this. We'll sort this out in a few weeks and figure out what this means for your contract. We're here to help and support in this pandemic because we recognize how this is affecting everyone around the world. And so I think when you get in those deeper relationships, those long-term relationships,
Starting point is 00:20:47 it is so helpful to have that trust, to have a little bit of that give when you need it in times of crisis and that they're there and willing to jump in right away. There's a lot to be said for having those working relationships before you need them. So often, I think that a lot of
Starting point is 00:21:05 engineering teams just don't talk to their vendors to a point where they may as well be strangers. You'll see this most notably, because at least I feel it most acutely, with AWS service teams. They'll do a whole kickoff when the enterprise support deal is signed three years ago past, and both the AWS team and the customer's team have completely rotated since then, and they may as well be strangers. Being able to have that relationship to fall back on in those really weird, really, honestly, high stress moments has been one of those things where I didn't see the value myself until the first time I went through a hairy situation where I found that that was useful.
Starting point is 00:21:41 And now it's, oh, I'm now biased instead for, oh, I can fit into the free tier of this service. No, no, I'm going to pay and become a paying customer. I'd rather be a customer that can have that relationship and pick up the phone than someone whining at people in a forum somewhere of, hey, I'm a free user and I'm having some problems with production, just never felt right to me. Yeah, there's nothing worse than calling your account
Starting point is 00:22:05 rep and being told, oh, I'm not your account rep anymore. Somehow you missed the email, you missed who it was prior to COVID. And we saw this a couple of many, many years ago. One of the things about Remind is every back to school season, our traffic 10Xs in about three weeks. And so we're used to emergencies happening and unforeseen things happening. And we plan their year and try to do capacity planning and everything. But we've been around the block a couple of times. And so we have a pretty strong culture now
Starting point is 00:22:34 of leaning in hard with our support reps. We have them in our Slack channels, our AWS team we meet with often, our Redis labs, we have them on Slack as well. We're constantly talking about databases that may or may not be labs, we have them on Slack as well. We're constantly talking about databases that may or may not be performing as we expect them to. They're an extension of our team. We have an incident, we get paged. If it's related to one of those services, we hit them in Slack immediately and have them start checking on the back end while we're checking on our side.
Starting point is 00:22:58 So one of the biggest takeaways I wish more companies would have is that when you are dependent upon another company to effectively run your production infrastructure, they are no longer your vendor. They're your partner, whether you want them to be or not. And approaching it with that perspective really pays dividends down the road. Yeah. One of the cases you get when you've been at a company for a long time and been in a relationship for a long time is growing together is always an interesting approach. And seeing sometimes there's some painful points. Sometimes you're on an old legacy version of their product that you were literally the last customer on, and you got to work with them to move off of. But you were there six years ago when they were just starting out,
Starting point is 00:23:46 and they've seen how you've grown, you've seen how they've grown, and you've kind of been able to marry that experience together in a meaningful way. still dreaming of deploying apps instead of hello world demos, allow me to introduce you to Oracle's always free tier. It provides over 20 free services and infrastructure, networking, databases, observability, management, and security. And let me be clear here, it's actually free. There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself, all while gaining the networking, load balancing, and storage resources that somehow never quite make it into most free tiers needed to support the application that you want to build. With
Starting point is 00:24:40 Always Free, you can do things like run small-scale applications or do proof-of-concept testing without spending a dime. You know that I always like to put asterisk next to the word free. This is actually free, no asterisk. Start now. Visit snark.cloud slash oci-free. That's snark.cloud slash oci-free. Redis is, these days, a data platform. Back once upon a time, I viewed it as more of a caching layer, and I admit that the capabilities of the platform have significantly advanced since those days when I viewed fairly through the lens of cache. But one of the
Starting point is 00:25:20 interesting parts is that neither one of those use cases, in my mind, blends particularly well with heavy use of spot fleets. But you're doing exactly that. What are you folks doing over there? Yeah, so as I mentioned earlier, coming back to some of the 12-factor app design, we heavily rely on Redis as sort of a distributed heap. One of our challenges of delivering all these messages is every single message has its in-flight state. Here's the content. Here's who we sent it to.
Starting point is 00:25:54 We wait for them to respond. On a traditional application, you might have one big server that stores it all in memory and you get the incoming request and you match things up. By moving all that state to Redis, all of our workers, all of our application servers, we know they can disappear at any point in time. We use Amazon's spot instances in their spot fleet for all of our production traffic, every single web service, every single worker that we have runs on this infrastructure. And we would not be able to do that if we didn't have a reliable and robust place to store this data that is in-flight and currently being accessed.
Starting point is 00:26:29 So we'll have a couple hundred gigs of data at any point in time in a Redis database just representing in-flight work that's happening on various machines. It's really neat seeing Spot Fleets being used as something more than a theoretical possibility. It's something I've always been very interested in, obviously, given the potential cost savings. They approach cheap as free in some cases. But it turns out, we talked earlier about the idea of being cloud native versus the rickety expensive data center in the cloud. And an awful lot of applications are simply not built in a way that, yeah, we're just going to randomly turn off a subset of your systems, ideally with two minutes of notice, but all right, have fun with that. And a lot of times, it just becomes a complete non-starter, even for stateless workloads, just based upon how
Starting point is 00:27:21 all of these things are configured. It is really interesting to watch a company that has an awful lot of responsibility that you've been entrusted with embraces that mindset. It's a lot more rare than you'd think. Yeah. And again, sometimes we overbuild things and sometimes we go down paths that may have been a little excessive, but it really comes down to your architecture. It's not just having everything running on spot. It's making effective use of SQS and other queuing products at Amazon to provide checkpointing abilities. And so you know that should you lose an instance, you're only going to lose a few seconds of productive work on that particular workload and be able to kick off where you left off. It's properly using auto-scaling groups.
Starting point is 00:28:07 From the financial side, there's all sorts of weird quirks you'll see. The spot market has a wonderful set of dynamics where the big instances are much, much cheaper per CPU than the small ones are on the spot market. And so structuring things in a way that you can co-locate different workloads onto the same hosts and hedge against that host going down by spreading across multiple availability zones. I think there's definitely a point where having enough workload, having enough scale allows you to take advantage of these things. But it all comes down to the architecture and design that really enables it. So you've been using Redis for longer than I think many of our listeners have been in tech. And the key distinguishing points for me between someone who is an advocate for a technology and
Starting point is 00:28:57 someone who's a zealot or a pure critic is they can identify use cases for which it's great and use cases for which it is not likely to be a great experience. In your time with Redis, what have you found that it's been great at? And what are some areas that you would encourage people to consider more carefully before diving into it? So we like to joke that five, six years ago, most of our development process was,
Starting point is 00:29:22 I've hit a problem, can I use Redis to solve that problem? And so we've tried every solution possible with Redis. We've done all the things. We have a number of very complicated Lua scripts that are managing different keys in an atomic way. Some of these have been more successful than others, for sure. Right now, our biggest philosophy is
Starting point is 00:29:41 if it is data we need quickly and it is data that is important to us, we put it in Enterprise Redis, the cloud product from Redis. Other use cases, there's a dozen things that you can use for a cache. Redis is great for a cache. Memcache does a decent job as well. You're not going to see a meaningful difference between those sorts of products. Where we've struggled a little bit has been when we have essentially relational data that we need fast access to.
Starting point is 00:30:08 And we're still trying to find the clear path forward here because you can do it and you can have atomic updates and you can kind of simulate some of the ACID characteristics you would have in a relational database, but it adds a lot of complexity and adds a lot of overhead to our team as we're continuing
Starting point is 00:30:25 to develop these products, to extend them, to fix any bugs we might have in there. And so we're kind of recalibrating a bit. And some of those workloads are moving to other data stores where they're more appropriate. But at the end of the day, if it's data that we need fast and it's data that's important, we're sticking with what we got here because it's been working pretty well. It sounds almost like you started off with the mindset of one database for a bunch of different use cases and you're starting to differentiate into purpose built databases for certain things. Or is that not entirely accurate? There's a little bit of that. And I think coming back to some of our tooling, as we kind of jumped on a bit of the microservice bandwagon, we would see here's a small service
Starting point is 00:31:06 that only has a small amount of data that needs to be stored. It wouldn't make sense to bring up an RDS or Aurora Postgres instance for that. Let's just store it in an easy store like Redis. And some of those cases have been great. Some of them have been a little problematic.
Starting point is 00:31:20 And so as we've invested in our tooling to make all of our databases accessible and make it less of a weird trade-off between what the product needs, what we can do right now, and what we want to do long-term and reduce that friction, we've been able to be much more deliberate about the data source that we choose in each case. It's very clear that you're speaking with a voice of experience on this, where this is not something that you just woke up and figured out. One last area I want to go into with you is when I asked you what it is you care about primarily as an engineering leader and as you look at serving your customers as well, you effectively had a dual answer, almost off the cuff, of stability and security. I find the two of those things are deeply intertwined in most of the conversations I have, but they're
Starting point is 00:32:13 rarely called out explicitly in quite the way that you do. Talk to me about that. Yeah. So in our wild journey, stability has always been a challenge. And we've always been in early startup mode where you're constantly pushing, what can we ship? How quickly can we ship it? And in our particular space, we feel that this communication that we foster between teachers and students and their parents is incredibly important. And is a thing that we take very, very seriously. And so a couple of years ago, we were trying to create this balance and create not just the language that we could talk about on a podcast like this, but really recognizing
Starting point is 00:32:56 that framing these concepts out to our company internally, to our engineers, to help them to think as they're building a feature, what are the things they should think about? What are the concerns beyond the product spec? To work with our marketing and sales team to help them to understand why we're making these investments that may not get a particular feature out by X date, but it's still a worthwhile investment. And so from the security side, we've really focused on building out robust practices and robust controls that don't necessarily lock us into a particular standard like PCI compliance or things like that, but really focusing on the maturity of our company and our culture as we go forward. And so we're in a place now, we are ISO 27001.
Starting point is 00:33:43 We're heading into our third year. We leaned in hard on our disaster recovery processes. We leaned in hard on our bug bounties, pen tests, kind of found this incremental approach that day one, I remember we turned on our bug bounty and it was a scary day as the reports kept coming in. But we take on one thing at a time and continue to build on it and make it an essential part of how we build systems. It really has to be built in. It feels like security is not something that can be slapped on as an afterthought, however much companies try to do that.
Starting point is 00:34:17 Especially, again, as we started this episode with, you're dealing with communication with people's kids. That is something that people have remarkably little sense of humor around, and rightfully so. Seeing that there is as much, if not more, care taken around security than there is stability is generally the sign of a well-run organization. If there's a security lapse, I expect certain vendors to rip the power out of their data centers rather than run in an insecure fashion. And your job done correctly, which clearly you have gotten to, means that you never have to make that decision because you've approached this the right way from the beginning. Nothing's perfect, but there's always the idea of actually caring about it being the first step. Yeah.
Starting point is 00:35:02 And the other side of that was talking about stability. And again, it's avoiding the either-or situation between those two, stability and security. We work on our cost of goods sold and our operating leverage and other aspects of our business. And in every single one of them,
Starting point is 00:35:22 our co-number one priorities are stability and security. And if it costs us a bit more money, if it takes our dev team a little longer, there's not a choice at that point. We're doing the correct thing. Saving money is almost never the primary objective of any company that you really want to be dealing with. Something bizarre is going on. Yeah. Our philosophy on any cost reduction has been this should have zero negative impact to our stability. If we do not feel we can safely do this, we won't.
Starting point is 00:35:49 And coming back to the spot instance piece, that was a journey for us. And, you know, we tested the waters a bit and we got to a point, we worked very closely with Amazon's team, and we came to that conclusion that we can safely do this. And we've been doing it for over a year and seen no adverse effects. Yeah. And a lot of shops, I've talked to folks about, well, let me go into a consulting project. Okay, there's a lot of things that could have been done before we got here. Why hasn't any of that been addressed? And the answer is, well, we tried to save money once and it caused an outage.
Starting point is 00:36:17 And then we weren't allowed to save money anymore. And here we are. And I absolutely get that perspective. It's a hard balance to strike. It always is. Yeah. The other aspect where stability and security kind of intertwine is you can think about security as infosec and our systems and locking things down.
Starting point is 00:36:34 But at the end of the day, why are we doing all that? It's for the benefit of our users. And Remind is a communication platform, so the safety and security of our users depends on us being up and available so that teachers can reach out to parents with important communication. Things like attendance, things like natural disasters or lockdowns or any number of difficult situations schools find themselves in. This is part of why we take the stewardship that we have so seriously: being up and protecting a user's data just has such a huge impact on education in this country.
Starting point is 00:37:21 You're serving a need that I would not shy away from classifying what you do fundamentally as critical infrastructure. And that is always a good conversation to have. It's nice being able to talk to folks who are doing things that you can unequivocally look at and say, this is a good thing. Yeah. And around 80% of public schools in the US are using Remind in some capacity. And so we're not a product that's used in just a few specific regions; it's all across the board. One of my favorite things about working at Remind is meeting people and telling them where I work and they recognize it. They say, oh, I have that app. I use that app. I love it. And I spent years in ads before this.
Starting point is 00:38:05 And I've been there and no one ever told me they were glad to see an ad. That was never the case. And it's been quite a rewarding experience coming in every day. And as you said, being part of this critical infrastructure, that's a special thing.
Starting point is 00:38:20 I look forward to installing the app myself as my eldest prepares to enter public school in the fall. So now at least I'll have a hotline of exactly where to complain when I didn't get the attendance message. Because, you know, there's no customer quite like a whiny customer. They're still customers. Happy to have them. True. We tend to be.
Starting point is 00:38:38 I want to thank you for taking so much time out of your day to speak with me. If people want to learn more about what you're up to, where's the best place to find you? So from an engineering perspective at Remind, we have our blog, engineering.remind.com. If you want to reach out to me directly, I'm on LinkedIn. It's a good place to find me. Or you can just reach out over email directly, peterh at remind101.com. And we will put all of that into the show notes. Thank you so much for your time. I appreciate it. Thanks, Corey. Peter Hamilton, VP of Technology at Remind. This has been a promoted episode brought to us by our friends at Redis.
Starting point is 00:39:14 And I'm cloud economist, Corey Quinn. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry and insulting comment that you will then hope that Remind sends out to 20 million students all at once. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point.
Starting point is 00:39:58 Visit duckbillgroup.com to get started. This has been a HumblePod production. Stay humble.
