Screaming in the Cloud - Making an Affordable Event Data Solution with Seif Lotfy

Episode Date: October 19, 2023

Seif Lotfy, Co-Founder and CTO at Axiom, joins Corey on Screaming in the Cloud to discuss how and why Axiom has taken a low-cost approach to event data. Seif describes the events that led to him helping co-found a company, and explains why the team wrote all their code from scratch. Corey and Seif discuss their views on AWS pricing, and Seif shares his views on why AWS doesn't have to compete on price. Seif also reveals some of the exciting new products and features that Axiom is currently working on.

About Seif
Seif is the bubbly Co-founder and CTO of Axiom, where he has helped build the next generation of logging, tracing, and metrics. His background is at Xamarin and Deutsche Telekom, and he is the kind of deep technical nerd that geeks out on white papers about emerging technology and then goes to see what he can build.

Links Referenced:
Axiom: https://axiom.co/
Twitter: https://twitter.com/seiflotfy

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Welcome to Screaming in the Cloud. I'm Corey Quinn.
Starting point is 00:00:34 This promoted guest episode is brought to us by my friends, and soon to be yours, over at Axiom. Today, I'm talking with Seif Lotfy, who's the co-founder and CTO of Axiom. Seif, how are you? Hey, Corey. I am very good. Thank you. It's pretty late here, but it's worth it. I'm excited to be on this interview. How are you today? I'm not dead yet. It's weird. I see you at a bunch of different conferences, and I keep forgetting that you do, in fact, live half a world away. Is the entire company based in Europe? I mean, where are you folks? Where do you start? Where do you stop geographically?
Starting point is 00:01:11 Let's start there. Everyone dives right into product. No, no, no. I want to know where in the world people sit, because apparently that's the most important thing about a company in 2023. Unless you ask Zoom, because they're undoing whatever they did. We're from New Zealand all the
Starting point is 00:01:26 way to San Francisco and everything in between. So we have people in Egypt and Nigeria, all around Europe, all around the US, and the UK, if you don't consider it Europe anymore. Yeah, it really depends. There's a lot of unfortunate naming that needs to get changed in the wake of that. But enough about geopolitics. Let's talk about industry politics. I've been a fan of Axiom for a while, and I was somewhat surprised to realize how long it had been around, because I only heard about you folks a couple years back. What is it you folks do? Because I know how I think about what you're up to, but you've also gone through some messaging iteration, and it is a near certainty that I am behind the times. Well, at this point, we just define ourselves as the best home for event data. So Axiom is the best home for event data. We try to deal with everything that is event-based, so time series.
Starting point is 00:02:16 So we can talk metrics, logs, traces, etc. And right now, predominantly serving engineering and security. And we're trying to be, or we are, the first cloud-native time-series platform to provide streaming, search, reporting, and monitoring capabilities. And we're built from the ground up, by the way. Like, we didn't actually,
Starting point is 00:02:35 we're not using Parquet or any of these things. We built everything completely from the ground up. When I first started talking to you folks a few years back, there were two points that really stood out to me. And I know at least one of them still holds true. The first is that at the time, you were primarily talking about log data. Just send all your logs over to Axiom, the end. And that was a message simple enough that I could understand it, frankly.
Starting point is 00:03:01 Because back when I was slinging servers around and breaking half of them, logs were effectively how we kept track of what was going on where. These days, it feels like everything has been repainted with a very broad brush called observability. And the takeaway from most company pitches has been, you must be smarter than you are to understand what it is that we're up to.
Starting point is 00:03:23 In some cases, you scratch below the surface and realize that, no, they have no idea what they're talking about either. And they're really hoping you don't call them on that. It's packaging. Yeah, it is packaging. And that's important. It's literally packaging. If you look at it, traces and logs, these are events. It's a timestamp and data with it, right? Even metrics, all the way to that point. And a good example now, everybody's jumping on OTel. For me, OTel is nothing else but a different structure for time series, for different types of time series, and that can be used differently, right? Or at least, not used differently, but you can leverage it differently. The other thing that you did that was interesting, and is, I think, a lot more sustainable as far as moats go rather than things
Starting point is 00:04:05 that can be changed on a billboard or whatnot is your economic position. And your pricing has changed around somewhat, but I ran a number of analyses on your cost that you were passing on to customers. And my takeaway was that it was a little bit more expensive to store data for logs in Axiom than it was to store it in S3, but not by much. And it just blew away the price point of everything else focused around logs, including AWS. You're paying 50 cents a gigabyte to ingest CloudWatch logs data over there. Other companies are charging multiples of that. And Cisco recently bought Splunk for $28 billion because it was cheaper than paying their annual Splunk bill.
Starting point is 00:04:52 How did you get to that price point? Is this just a matter of everyone else being greedy, or have you done something different? We looked at it from the perspective of, so there's the three L's of logging. I forgot the name of the person at Netflix who talked about that. But basically, it's low cost, low latency, large scale, right? And you'll never be able to fulfill all three of them. And we decided to work on low cost and large scale. And in terms of low latency, we won't be as low as others, like ClickHouse. But we are low enough.
Starting point is 00:05:28 Like, we're fast enough. The idea is to be fast enough, because in most cases, I don't want to compete on milliseconds. I think if the user can see his data in two seconds, he's happy. Or three seconds, he's happy. I'm not going to be at one to two seconds and make the cost exponentially higher because I'm one second faster than the other. And that's, I think, the way we approached this from day one. And from day one, we also started utilizing object storage. We have our own compressions, our own encodings, et cetera, from day one, too.
Starting point is 00:06:00 And we still stick to that. That's why we never converted to other existing things like Parquet. Also because we are schema-on-read, which Parquet doesn't really allow you to do. But other than that, from day one, we wanted to save costs by also being coordination-free. So it just has to be coordination-free, right? Because then we don't run a shitty Kafka.
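Seif's schema-on-read point can be sketched in a few lines of Python. The event fields below are invented for illustration, and this is not Axiom's actual implementation; the idea is just that raw events are stored as-is, and a field's type and presence are interpreted only at query time rather than being fixed at ingest:

```python
import json

# Schema-on-write (Parquet-style) fixes columns and types at ingest time;
# schema-on-read keeps raw events and interprets fields when a query runs.
raw_events = [
    '{"_time": 1, "level": "error", "latency_ms": 120}',
    '{"_time": 2, "level": "info"}',                       # field absent entirely
    '{"_time": 3, "level": "error", "latency_ms": "95"}',  # field arrives as a string
]

def read_latency(raw):
    """Interpret 'latency_ms' at read time, coercing to int when present."""
    event = json.loads(raw)
    value = event.get("latency_ms")
    return int(value) if value is not None else None

latencies = [read_latency(r) for r in raw_events]
print(latencies)  # [120, None, 95]
```

Note that the second and third events would be awkward under a fixed schema: one is missing the column, the other has the wrong type, yet schema-on-read handles both at query time.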
Starting point is 00:06:18 Like, honestly, for a lot of the logs companies who are running a Kafka in front of it, the Kafka tax reflects in the bill that you're paying them. What I found fun about your pricing model is it gets to a point that, for any reasonable workload, how much to log, or what to log, or sample, or keep everything, is no longer an investment decision.
Starting point is 00:06:39 It's just, just go ahead and handle it. And that was originally what you wound up building out. Increasingly, it seems like you're not just the place to send all the logs to, which, to be honest, I was excited enough about that. That was replacing one of the projects I did a couple of times myself, which is building highly available fault-tolerant rsyslog clusters in data centers. Okay, great. You've gotten that on lock. The economics are great. I don't have to worry about that anymore. And then you started adding interesting things on top of it.
Starting point is 00:07:13 Analyzing things, replaying events that happen to other players, etc., etc. It almost feels like you're not just a storage depot, but you also can forward certain things on under a variety of different rules or guises and format them as whatever on the other side is expecting them to be. So there's a story about integrating with other observability vendors, for example, and only sending the stuff that's germane and relevant to them, since everyone loves to charge by ingest. Yeah. So we did this one thing called Endpoints, number one. Endpoints was the beginning, where we said, let's let people send us data using whatever API they like using, let's say Elasticsearch, Datadog, Honeycomb, Loki, whatever. And we will just take that data in and multiplex it back to them. So that's how part of it started. This allows customers to see how we compare to others. But then we took it a bit further, and now, it's still closed, invite-only,
Starting point is 00:08:02 but we have pipelines, code-named Pipelines, which allows you to send data to us, and we will keep it as a source of truth. Then, given specific rules, we can ship it anywhere, to a different destination, right? And this allows you to, on the fly, send specific filtered things out to, I don't know, a different vendor, or even to S3, or you can send it to Splunk. But at the same time, because we have all your data, you can go back in the past if an incident happens and replay that completely into a different product. I would say that there's a definite approach to observability from the perspective that every company tends to visualize stuff a little bit differently. And one of the promises of OTel that I'm seeing as it grows is the idea of, oh, I can send different parts of what I'm seeing off to different providers.
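The Pipelines flow Seif describes (keep one full copy as the source of truth, then apply rules to fan filtered subsets out to other destinations) can be sketched roughly in Python. The rule format and destination names here are hypothetical, not Axiom's real API:

```python
def route(events, rules):
    """Return a mapping of destination -> the events matching that rule.

    The full `events` list remains the source of truth; each destination
    only receives the subset its predicate selects.
    """
    routed = {dest: [] for dest, _ in rules}
    for event in events:
        for dest, predicate in rules:
            if predicate(event):
                routed[dest].append(event)
    return routed

events = [
    {"level": "error", "service": "api"},
    {"level": "info", "service": "api"},
    {"level": "error", "service": "worker"},
]

# Hypothetical rules: only errors go to a paid alerting vendor
# (which charges by ingest), while everything lands in cheap archive storage.
rules = [
    ("alert-vendor", lambda ev: ev["level"] == "error"),
    ("s3-archive", lambda ev: True),
]

routed = route(events, rules)
print(len(routed["alert-vendor"]))  # 2
print(len(routed["s3-archive"]))    # 3
```

The replay scenario from the conversation falls out of the same shape: since the source of truth keeps every event, you can re-run `route` later with new rules and ship the past into a different product.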
Starting point is 00:08:58 But the instrumentation story for OTel is still very much emerging. Logs are kind of eternal, and the only real change we've seen to logs over the past decade or so has been that instead of just being plain text, where positional parameters would define what was what (if it's in this column, it's an IP address, and if it's in this column, it's a return code, and that just wound up being ridiculous), now you see them having schemas. They are structured in a variety of different ways, which, okay, makes it a little harder to just cat a file together and pipe it to grep, but there are trade-offs that make it worth it, in my experience. This is one of those transitional products that not only is great once you get to where you're going, from my playing with it, but also meets you where you already are to get started, because everything you've got is emitting logs somewhere, whether you know it or not.
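Corey's contrast between positional plain-text logs and structured ones is easy to make concrete. A tiny sketch, with an invented log line and field names:

```python
import json

# Old style: meaning comes purely from position. Column 0 is an IP,
# column 1 a return code, column 2 a path, by convention only.
plain_line = "192.0.2.1 404 /index.html"
ip, code, path = plain_line.split()

# Structured style: each field is named, so nothing is positional guessing,
# at the cost of no longer being trivially grep-able.
structured_line = '{"ip": "192.0.2.1", "status": 404, "path": "/index.html"}'
record = json.loads(structured_line)

assert record["ip"] == ip          # same information, self-describing
print(record["status"])            # 404, as an int rather than the string "404"
```

The trade-off mentioned in the conversation shows up even here: the plain line yields `"404"` as a string you have to interpret, while the structured record carries a typed, named field.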
Starting point is 00:09:47 Yes. And that's why we picked up on OTel, right? Like, one of the first things we now support, we have an OTel endpoint natively, as a first-class citizen, because we wanted to build this experience around OTel in general. Whether we like it or not, and there are more reasons to like it, OTel is a standard that's going to stay, and it's going to move us forward. I think OTel will have the same effect, if not bigger, as StatsD back in the day. But now it's gone beyond just metrics, to metrics, logs, and traces. Traces is, for me, very interesting, because I think OTel is the first one to push it in a standard way.
Starting point is 00:10:23 There were several attempts to make standardized logs, but I think traces were something that OTel really pushed into a proper standard that we can follow. It annoys me that everybody uses different bits and pieces of it and adds something to it, but I think it's also because it's not that mature yet,
Starting point is 00:10:39 so people are trying to figure out how to deliver the best experience and package it in a way that it's actually interesting for the user. What I've found is that there's a lot that's in this space that is just simply noise. Whenever I spend a protracted time period working on basically anything, and I'm still confused by the way people talk about that thing months or years later, I'm starting to get the realization that maybe I'm not the problem here. And I don't mean this to be insulting, but one of the things I've loved about you folks is I've always understood what you're saying.
Starting point is 00:11:15 Now, you could hear that as, oh, you mean we talk like simpletons? No, it means what you're talking about resonates with at least a subset of the people who have the problem you solve. That's not nothing. Yes. We tried really hard, because one of the things we tried to do was actually bring observability to people for whom it's not always part of their day-to-day. So we tried to bring it to Vercel developers, right,
Starting point is 00:11:39 by doing a Vercel integration. And all of a sudden now they have their logs, and they have metrics, and they have some traces. So all of a sudden they're doing the observability work, or they have actual observability, for their Vercel-based, Next.js-based product. And we try to meet the people where they are. So we try to, instead of actually telling people you should send us data, I mean, that's what they do now.
Starting point is 00:12:02 We try to find, okay, what product are you using and how can we grab data from there and send it to us to make your life easier? You see that we did that with Vercel, we did that with Cloudflare. AWS, we have extensions, Lambda extensions, et cetera, but we're doing it for more things. For Netlify, it's a one-click integration too.
Starting point is 00:12:18 And that's what we're trying to do to actually make the experience and the journey easier. I want to change gears a little bit because something that we spent a fair bit of time talking about, it's why we became friends, I think anyway, is that we have a shared appreciation for several things. One of which, most notable to anyone around us, is whenever we hang out, we greet each other effusively and then immediately begin complaining about costs of cloud services. What is your take on the way that clouds charge for things?
Starting point is 00:12:49 I know it's a bit of a leading question, but it's core and foundational to how you think about Axiom as well as how you serve customers. They're ripping us off. I'm sorry. It's just, the amount of money they make. It's crazy. I would love to know what margins they have.
Starting point is 00:13:08 That's a big question. What are the margins they have at AWS right now? Across the board, it's something around 30 to 40%. Last time I looked at it. That's a lot too. Well, that's also across the board of everything, to be clear. It is very clear that some services are subsidized by other services, as it should be.
Starting point is 00:13:24 If you start charging me per IAM call, we're done. And also, I mean, the machine learning stuff, they won't be doing that much on top of it right now, right? Else nobody will be using it. But data transfer? Yeah, there's a significant upcharge on that, but I hear you. I would moderate it a bit. I don't think
Starting point is 00:13:39 that I would say that it's necessarily an intentional ripoff. My problem with most cloud services that they offer is not usually that they're too expensive, though there are exceptions to that, but rather that the dimensions are unpredictable in advance. So you run something for a while, then see what it costs. From where I sit, if a customer uses your service, and then at the end of that usage is surprised by how much it costs them, you kind of screwed up. Look, if they can make egress free, like you saw how Cloudflare just did the egress of R2 free, because I am still stuck with AWS
Starting point is 00:14:11 because, let's face it, for me, it is still my favorite cloud, right? Cloudflare is my next favorite because of all the features they're trying to develop and the pace they're picking up, the pace at which they're trying to catch up. But again, one of the biggest things I liked is R2, and R2 egress is free. Now, that's interesting, right? But I never saw anything coming back from AWS on S3 for that. Like, you know, I think Amazon is so comfortable because, from a product perspective, they're simple. They have the tools, et cetera. The UI is not the flashiest one, but you know what you're doing, right? The CLI is not the flashiest one, but you know what you're doing.
Starting point is 00:14:46 It is so cool that they don't really need to compete with others yet. And I think they're still dominantly the biggest cloud out there. I think you know more than me about that, but I think they are the biggest one right now in terms of data volume, like how many customers are using them.
Starting point is 00:15:02 And even in terms of profiles of people using them, it varies so much. I know a lot of the Microsoft Azure people who are using it are using it because they come from enterprises that have always been Microsoft, very Microsoft-friendly. And eventually Microsoft also came up in Europe in all these different weird ways. But I sometimes feel ripped off by AWS, because I see Cloudflare trying to reduce their prices and AWS just looking like, yeah, you're not a threat to us, so we'll just keep the prices as they are. I have it on good authority from folks who know
Starting point is 00:15:37 that there are reasons behind the economic structures of both of those companies, based on the primary direction the traffic flows, and the rest. But across the board, they've done such a poor job of articulating this that, frankly, I think the confusion is on them to clear up, not us. True, true. And the reason I picked R2 and S3 to compare there, and not look at Workers and Lambdas, is because R2 is S3-compatible from an API perspective, right? So they're giving me something that I already use.
Starting point is 00:16:09 Everything else I'm using, I'm using inside Amazon. So it's in a VPC, but just the idea. Let me dream. Let me dream that S3 egress will be free at some point. I can dream. That's like Christmas. It's better than Christmas. What I'm surprised about is how reasonable your pricing is
Starting point is 00:16:26 in turn. You wind up charging on a basis of ingest, which is basically the only thing that really makes sense for how your company is structured. But it's predictable in advance. The free tier is what, 500 gigs a month of ingest? And before people think, oh, that doesn't sound like a lot, I encourage you to just go back and think how much data that really is in the context of logs for any toy project. Well, our production environment spits out way more than that. Yes. And by the word production that you just used, you probably shouldn't be using a free trial of anything as your critical path observability tooling. Become a customer, not a user. I'm a big believer in that philosophy personally.
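For a back-of-the-envelope sense of how much data 500 GB of monthly ingest really is, assume an average log line of around 200 bytes; that figure is an illustrative guess, not something from the conversation:

```python
free_tier_bytes = 500 * 10**9  # 500 GB/month of ingest
avg_line_bytes = 200           # assumed average size of one log line

lines_per_month = free_tier_bytes // avg_line_bytes
lines_per_second = lines_per_month / (30 * 24 * 3600)  # sustained rate

print(f"{lines_per_month:,} lines/month")  # 2,500,000,000 lines/month
print(f"~{lines_per_second:,.0f} lines/sec sustained")
```

Even at a generous 200 bytes per line, that is 2.5 billion log lines a month, roughly a thousand lines per second around the clock, which is indeed far beyond any toy project.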
Starting point is 00:17:06 For all of my toy projects that are ridiculous, this is ample. People always tend to overestimate how much logs they're going to be sending. So there's one thing. What you said is right. People already have something going on. They already know how much logs
Starting point is 00:17:19 they'll be sending around. But then eventually they're sending too much. And that's why they're back here talking to us, like, we want to try your tool, but we'll be sending more than that. So if you don't like our pricing, go find something else, because I think we're the cheapest out there right now. If there is one that is less expensive, I'm unaware of it. And I've been looking, let's be clear. That's not just me saying, well,
Starting point is 00:17:42 nothing has skittered across my desk in this space. No, no, no. Hey, hey, where's Corey? We're friends. Loyalty. Exactly. If you find something, you tell me. Oh, if I find something, I'll tell everyone. No, no, you tell me first. Tell me in a nice way so I can reduce the prices on my side. This is how we start a price war industry-wide. I would love to see it. But there's enough channels that we share at this point, across different Slacks and messaging apps, that you should be able to ping me if you find one.
Starting point is 00:18:14 Also, get me the name of the CEO and the CTO while you're at it. And where they live. Yes, yes, of course. The implications will be awesome. No, no, no, that was you, not me. That was your suggestion. Exactly. I will not. Before we turn into a bit of an old thud and blunder,
Starting point is 00:18:31 let's talk about something else that I'm curious about here. You've been working on Axiom for something like seven years now. You come from a world of databases and events and the like. Why start a company in the model of Axiom? Even back then when I looked around, my big problem with the entire observability space could never have been described as, you know what we need? More companies that do exactly this. What was it that you saw that made you say, yeah, we're going to start a company? Because that sounds easy. So I'll be very clear. I'm not going to sugarcoat this.
Starting point is 00:19:12 We kind of got in a position where we Forrest Gumped our way into it. And by that, I mean, we came from a company where we were dealing with logs. We actually wrote an event tool, a crash analytics tool, for a company. But then we ended up wanting to use stuff like Datadog, but we didn't have the budget for that, because Datadog was killing us. So we ended up hosting our own Elasticsearch, and it cost us more to maintain our Elasticsearch cluster for the logs than to actually maintain our own little infrastructure for the crash events, when we were getting like one billion crashes a month at that point. So eventually, that was the first burn. And then you had alert fatigue.
Starting point is 00:19:49 And then you had consolidating events and timestamps and whatnot. The whole thing just seemed very messy. So after that company got sold, we started off by saying, okay, let's go work on a new self-hosted version of Datadog where we do metrics and logs. And then that didn't go as well as we thought it would. But because we were self-hosted, we wanted to keep costs low, so from day one
Starting point is 00:20:17 we were working on making it stateless and working against object storage. And this is kind of how it started. Then we realized, oh, we can host this and make it scale, and it won't cost us that much. So we did that, and it started gaining more attention. But the reason we started this was we wanted to build a self-hosted version of Datadog that is not costly, and we ended up doing software as a service. I mean, you can still come and self-host it, but you'll have to pay money for it, like proper money.
Starting point is 00:20:46 But we do a SaaS version of this. And instead of trying to be a self-hosted Datadog, we are now trying to compete or we are competing with Datadog. Is the technology that you've built this on top of actually that different from everything else out there? Or is this effectively what you see in a lot of places? Oh yeah, we're just going to manage Elasticsearch for you because that's annoying. Do you have anything that
Starting point is 00:21:09 distinguishes you from, I guess, the rest of the field? Yeah, so just very bluntly, I think Scuba was the first thing that started standing out. And then Honeycomb came onto the scene, and they started building something based on Scuba, on the principles of Scuba. Then one of the authors of the actual Scuba reached out to me when I told him I was trying to build something, and he gave me some ideas, and I started building that. And from day one, I said, okay, everything in S3. All queries have to be serverless, so all the queries run on functions. There are no real disks. It's just all in S3 right now. And the biggest achievement we got to lower our cost was to get rid of Kafka and have, let's say, behind the scenes, we have our own coordination-free mechanism, but the idea is not to
Starting point is 00:21:58 actually have to use Kafka at all and thus reduce the cost incredibly. In terms of technology, no, we don't use Elasticsearch. We wrote everything from the ground up from scratch. Even the query language. We have our own query language that's modeled after Kusto, KQL by Microsoft. So everything we have is built absolutely from the ground up. And no Elastic. I'm not using Elastic anymore.
Starting point is 00:22:21 Elastic is a horror for me. Absolute horror. People love the API, but no, I've never met anyone who likes managing Elasticsearch or OpenSearch or whatever we're calling your particular flavor of it. It is a colossal pain. It is subject to significant trade-offs,
Starting point is 00:22:39 regardless of how you work with it. And Amazon's managed offering doesn't make it better. It makes it worse in a bunch of ways. And the green status of Elasticsearch is a myth. You only see it once. The first time you start that cluster, that's when the Elasticsearch cluster is green. After that, it's just orange or red. And you know what? I'm happy when it's orange. Elasticsearch kept me up for so long. And we had actually a very interesting situation where we had Elasticsearch running on Azure, on Windows machines. And with
Starting point is 00:23:10 those servers, sorry. And I'd have to log in every day. You remember what it's called? RP something. What was it called? RDP? Remote Desktop Protocol? Or something else? Yeah, where you have to log in. It's actually a visual thing, and you have to go in visually and say, please don't restart.
Starting point is 00:23:26 Every day, I'd have to do that. Please don't restart. Please don't restart. And there were a lot of weird issues. And also, at that point, Azure would decide to disconnect the pod once you tried to bring in a new pod. All these weird things were happening back then. So eventually, you'd end up with a split-brain situation.
Starting point is 00:23:41 I'm talking 2013-14. So it was back in the day when Elasticsearch was very young. And so that was just a bad start for me. I will say that Azure is the most cost-effective cloud because their security is so clown shoes. You can just run whatever you want in someone else's account.
Starting point is 00:23:55 It's free to you. Problem solved. Don't tell people how we save costs, okay? I love that. Don't tell people how we do that. Like, Corey, come on. You're exposing me here. Let me tell you one thing, though.
Starting point is 00:24:08 Elasticsearch is the reason I literally used a shock collar or a shock bracelet on myself every time it went down, which was almost every day. Instead of having PagerDuty ring, like ring my phone, and, you know, I'd wake up and my partner back then would wake up, I bought a Bluetooth collar off of Alibaba that would tase me every time I got a notification, regardless of the notification. So some things were false alarms, but I got tased for at least two, three weeks before I gave up. Every night I'd wake up to a full discharge.
Starting point is 00:24:43 I would never hook myself up to a shocker tied to outages. Even if I owned the company, there are pleasant ways to wake up, unpleasant ways to wake up, and even worse. So you're getting shocked so that someone else can wind up effectively driving the future of the business. You're more or less the monkey that gets shocked awake to go ahead and fix the thing that just broke.
Starting point is 00:25:07 Well, the fix to that was moving from Azure to AWS without telling anybody. That got us in a lot of trouble. Again, it wasn't my company. They didn't notice that you did this or it caused a lot of trouble because suddenly nothing worked where they thought it would? No, no. Everything worked fine on AWS. That's how my love story began. But they didn't notice for like six months. That's kind of amazing. That was fantastic. We rewrote everything from C Sharp to Node.js
Starting point is 00:25:32 and moved everything away from Elasticsearch, started using Redshift, Redis, and you name it. We went AWS all the way, and they didn't even notice. We took the budget from another department to start filling that in. But we cut the cost from $100,000 down to like $40,000, and then eventually down to $30,000 a month. That's more than a little wild.
Starting point is 00:25:54 Oh, God. Yeah, good times. Good times. Next time, just ask Neil to tell you the full story about this. I can't go into details in this podcast. I think I'll get in trouble. I didn't sign anything, though. Those are the best stories.
Starting point is 00:26:07 But no, I hear you. I absolutely hear you. Seif, I really want to thank you for taking the time to speak with me. If people want to learn more, where should they go? So, axiom.co, not .com, .co. That's where they can learn more about Axiom.
Starting point is 00:26:23 And other than that, I think I have a Twitter somewhere. And if you know how to write my name, it's just one word and you'll find me on Twitter. We will put that all in the show notes. Thank you so much for taking the time to speak with me. I really appreciate it. Dude, that was awesome.
Starting point is 00:26:38 Thank you, man. Seif Lotfy, co-founder and CTO of Axiom, who has brought this promoted guest episode our way. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment that one of these days I will get around to aggregating in some horrifying custom homebrew logging system, probably built on top of rsyslog.
Starting point is 00:27:11 If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duck Bill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duck Bill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started.
