Screaming in the Cloud - Making an Affordable Event Data Solution with Seif Lotfy
Episode Date: October 19, 2023. Seif Lotfy, Co-Founder and CTO at Axiom, joins Corey on Screaming in the Cloud to discuss how and why Axiom has taken a low-cost approach to event data. Seif describes the events that led to him helping co-found a company, and explains why the team wrote all their code from scratch. Corey and Seif discuss their views on AWS pricing, and Seif shares his views on why AWS doesn't have to compete on price. Seif also reveals some of the exciting new products and features that Axiom is currently working on. About Seif: Seif is the bubbly Co-founder and CTO of Axiom, where he has helped build the next generation of logging, tracing, and metrics. His background is at Xamarin and Deutsche Telekom, and he is the kind of deep technical nerd that geeks out on white papers about emerging technology and then goes to see what he can build. Links Referenced: Axiom: https://axiom.co/ Twitter: https://twitter.com/seiflotfy
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
This promoted guest episode is brought to us by my friends, and soon to be yours, over at Axiom.
Today, I'm talking with Seif Lotfy, who's the co-founder and CTO of Axiom. Seif, how are you?
Hey, Corey. I am very good. Thank you. It's pretty late here, but it's worth it. I'm excited to be on this interview. How are you today?
I'm not dead yet. It's weird. I see you at a bunch of different conferences, and I keep forgetting that you do, in fact, live half a world away.
Is the entire company based in Europe?
I mean, where are you folks?
Where do you start?
Where do you stop geographically?
Let's start there.
Everyone dives right into product.
No, no, no.
I want to know where in the world people sit,
because apparently that's the most important thing about a company in 2023.
Unless you ask Zoom,
because they're undoing whatever they did.
We're from New Zealand all the
way to San Francisco and everything in between. So we have people in Egypt and Nigeria, all around
Europe, all around the US, and the UK, if you don't consider it Europe anymore. Yeah, it really
depends. There's a lot of unfortunate naming that needs to get changed in the wake of that. But enough about geopolitics. Let's talk about industry politics.
I've been a fan of Axiom for a while, and I was somewhat surprised to realize how long it had been around, because I only heard about you folks a couple years back.
What is it you folks do? Because I know how I think about what you're up to, but you've also gone through some messaging iteration, and it is a near certainty that I am behind the times.
Well, at this point, we just define ourselves as the best home for event data.
So Axiom is the best home for event data.
We try to deal with everything that is event-based, so time series.
So we can talk metrics, logs, traces, etc.
And right now, predominantly serving engineering and security.
And we're trying to be, or we are,
the first cloud-native time-series platform
to provide streaming, search, reporting,
and monitoring capabilities.
And we're built from the ground up, by the way.
Like, we're not using Parquet or any of these things.
Everything is completely built from the ground up.
When I first started talking to you folks a few years back,
there were two points to me that really stood out.
And I know at least one of them still holds true.
The first is that at the time, you were primarily talking about log data.
Just send all your logs over to Axiom, the end.
And that was a simple message that was simple enough that I could understand it, frankly.
Because back when I was slinging servers around and breaking half of them,
logs were effectively how we kept track
of what was going on where.
These days, it feels like everything has been repainted
with a very broad brush called observability.
And the takeaway from most company pitches has been,
you must be smarter than you are
to understand what it is that we're up to.
In some cases, you scratch below the surface and realize that, no, they have no idea what they're talking about either. And they're
really hoping you don't call them on that. It's packaging. Yeah, it is packaging. And that's
important. It's literally packaging. If you look at it, traces and logs, these are events.
It's a timestamp and data with it, right? Even metrics is all the
way to that point. And a good example now, everybody's jumping on OTel. For me, OTel is nothing else but a different
structure for time series, for different types of time series, and that can be used differently,
right? Or at least not used differently, but you can leverage it differently.
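That framing — logs, metrics, and traces as the same underlying shape, a timestamp plus data — can be sketched as a single event type. This is a hypothetical illustration of the idea, not Axiom's actual data model:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class Event:
    """One unified shape: a timestamp plus arbitrary structured data."""
    timestamp: datetime
    data: dict[str, Any] = field(default_factory=dict)

now = datetime.now(timezone.utc)

# A log line, a metric sample, and a trace span all fit the same shape;
# only the keys inside `data` differ.
log = Event(now, {"level": "error", "message": "disk full"})
metric = Event(now, {"name": "cpu.usage", "value": 0.73})
span = Event(now, {"trace_id": "abc123", "span_id": "def456", "duration_ms": 42})
```

Seen this way, "logs vs. metrics vs. traces" becomes a question of which fields you put in the data, not of fundamentally different storage systems.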
The other thing that you did that was interesting and is a lot, I think, more sustainable as far as moats go rather than things
that can be changed on a billboard or whatnot is your economic position. And your pricing has
changed around somewhat, but I ran a number of analyses on your cost that you were passing on
to customers. And my takeaway was that it was a little bit more expensive to store data for logs in Axiom
than it was to store it in S3, but not by much.
And it just blew away the price point of everything else focused around logs, including AWS.
You're paying 50 cents a gigabyte to ingest CloudWatch logs data over there.
Other companies are charging multiples of that. And Cisco recently
bought Splunk for $28 billion because it was cheaper than paying their annual Splunk bill.
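For a sense of scale on that 50-cents-a-gigabyte figure, here is a back-of-the-envelope comparison. The rates are illustrative list prices (us-east-1, roughly as of this episode); always check the current AWS pricing pages:

```python
# Back-of-the-envelope comparison: what a month of logs costs to ingest
# into CloudWatch Logs vs. what the same bytes cost to simply sit in S3.
CLOUDWATCH_INGEST_PER_GB = 0.50   # CloudWatch Logs ingestion, $/GB
S3_STANDARD_PER_GB_MONTH = 0.023  # S3 Standard storage, $/GB-month

def monthly_cost(gb_per_month: float) -> dict:
    """Cost of one month of logs under each model, in dollars."""
    return {
        "cloudwatch_ingest": gb_per_month * CLOUDWATCH_INGEST_PER_GB,
        "s3_storage": gb_per_month * S3_STANDARD_PER_GB_MONTH,
    }

costs = monthly_cost(1_000)  # 1 TB of logs per month
print(costs)  # ingest alone runs ~20x the cost of storing the raw bytes
```

The point is not the exact numbers but the ratio: per-GB ingest pricing dwarfs raw storage pricing, which is the gap Axiom's economics exploit.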
How did you get to that price point? Is this just a matter of everyone else being greedy,
or have you done something different? We looked at it from a perspective of,
so there's the three L's of logging. I forgot the name of the person at Netflix who talked about that.
But basically, it's low cost, low latency, large scale, right?
And you'll never be able to fulfill all three of them.
And we decided to work on low cost and large scale.
And in terms of low latency, we won't be as low as others, like ClickHouse.
But we are low enough.
Like we're fast enough.
The idea is to be fast enough because in most cases, I don't want to compete on milliseconds.
I think if the user can see his data in two seconds, he's happy.
Or three seconds, he's happy.
I'm not going to be like one to two seconds and make the cost exponentially higher because I'm one second faster than the other.
And that's, I think, the way we approach this from day one.
And from day one, we also started utilizing the existence of object storage.
We have our own compressions, our own encodings, et cetera, from day one, too.
And we still stick to that.
That's why we never converted to other existing things like Parquet.
Also because we are schema-on-read,
which Parquet doesn't really allow you to do.
But other than that,
from day one, we want to save costs by also making coordination free.
So it just has to be coordination free, right?
Because then we don't run a shitty Kafka.
Like honestly, a lot of the logs companies
who are running a Kafka in front of it,
the Kafka tax reflects in the bill
that you're paying for them.
What I found fun about your pricing model
is it gets to a point that for any reasonable workload,
how much to log or what to log or sample or keep everything
is no longer an investment decision.
It's just, just go ahead and handle it.
And that was originally what you wound up building out.
Increasingly, it seems like you're
not just the place to send all the logs to, which, to be honest, I was excited enough about that.
That was replacing one of the projects I did a couple of times myself, which is building highly
available fault-tolerant rsyslog clusters in data centers. Okay, great. You've gotten that on lock.
The economics are great. I don't have to worry about that anymore.
And then you started adding interesting things on top of it.
Analyzing things, replaying events to other providers, etc., etc. It almost feels like you're not just a storage depot, but you also can forward certain things on under a variety of different rules or guises
and format them as whatever on the other side is expecting them to be. So there's a story about integrating with other observability vendors, for example,
and only sending the stuff that's germane and relevant to them since everyone loves to charge
by ingest. Yeah. So we did this one thing called Endpoints, number one. Endpoints was the
beginning, where we said, let's let people send us data using whatever API they like using, let's say Elasticsearch, Datadog,
Honeycomb, Loki, whatever. And we will just take that data in and multiplex it back to them.
So that's how part of it started. This allows customers to see how we
compare to others. But then we took it a bit further, and now, it's still closed, invite-only,
but we have Pipelines, code name Pipelines, which allows you to
send data to us and we will keep it as a source of truth. Then, given specific rules, we can
ship it anywhere to a different destination, right? And this allows you to, on the fly, send
specific filtered things out to, I don't know, a different vendor, or even to S3, or you can send it to
Splunk. But at the same time, because we have all your data, you can go back in the past if
an incident happens and replay that completely into a different product.
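The routing idea Seif describes can be sketched as a tiny filter-and-forward loop. This is a hypothetical illustration; the destination names and rule shape are made up, not Axiom's actual Pipelines API:

```python
from typing import Any, Callable

# A rule pairs a predicate with a destination name. Every event lands in
# the source of truth; matching events are additionally forwarded.
Rule = tuple[Callable[[dict[str, Any]], bool], str]

rules: list[Rule] = [
    (lambda e: e.get("level") == "error", "vendor_x"),        # errors to a vendor
    (lambda e: e.get("service") == "billing", "s3_archive"),  # billing data to S3
]

def route(event: dict[str, Any]) -> list[str]:
    """Return every destination this event should be shipped to."""
    destinations = ["source_of_truth"]  # everything is always kept here
    for predicate, destination in rules:
        if predicate(event):
            destinations.append(destination)
    return destinations

print(route({"level": "error", "service": "billing"}))
# ['source_of_truth', 'vendor_x', 's3_archive']
```

Because the full stream is retained in the source of truth, changing a rule later (or replaying history into a new destination after an incident) only means re-running events through `route`.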
I would say that there's a definite approach to observability from the perspective of every company tends to visualize stuff a little bit differently.
And one of the promises of OTEL that I'm seeing as it grows is the idea of, oh, I can send different parts of what I'm seeing off to different providers.
But the instrumentation story for OTEL is still very much emerging. Logs are kind of eternal, and the only real change we've seen to logs over the past decade or so has been instead of just being plain text and their
positional parameters would define what was what, if it's in this column, it's an IP address, and
if it's in this column, it's a return code, and that just wound up being ridiculous. Now you see
them having schemas. They are structured in a variety of different ways, which, okay, it's a little
harder to wind up just catting a file together and piping it to grep, but there are trade-offs
that make it worth it, in my experience. This is one of those transitional products that not only
is great once you get to where you're going, from my playing with it, but also it meets you where
you already are to get started, because everything you've got is emitting logs somewhere, whether you know it or not.
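The shift Corey describes — from positional plain-text fields to structured, schema'd logs — looks roughly like this. The field names are illustrative, not any particular vendor's schema:

```python
import json

# Old style: positional plain text. Column 1 is an IP, column 2 a return
# code, and only convention tells you which is which.
plain_line = "203.0.113.7 500 GET /checkout"
ip, status, method, path = plain_line.split()

# New style: structured JSON. Fields are named, so tools (and humans)
# no longer depend on column position.
structured_line = '{"ip": "203.0.113.7", "status": 500, "method": "GET", "path": "/checkout"}'
record = json.loads(structured_line)

assert record["ip"] == ip
assert record["status"] == int(status)

# Harder to grep naively, but trivially filterable by field:
errors = [r for r in [record] if r["status"] >= 500]
print(len(errors))  # 1
```

That filterability-by-field is the trade-off Corey mentions: you give up `cat | grep` simplicity and get queries that don't break when someone reorders the columns.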
Yes.
And that's why we picked up on OTEL, right?
Like one of the first things we now support, we have an OTEL endpoint natively or as a first class citizen because we wanted to build this experience around OTEL in general.
Whether we like it or not, and there's more reasons to like it, OTEL is a standard that's going to stay, and it's going to move us forward.
I think OTel will have the same effect, if not bigger, as StatsD back in the day.
But now it went beyond just metrics, to metrics, logs, and traces.
Traces is, for me, very interesting,
because I think OTEL is the first one to push it in a standard way.
There were several attempts to make standardized
logs, but I think Traces was something
that OTel really pushed
into a proper standard
that we can follow. It annoys me that
everybody uses the different bits and pieces of it
and adds something to it, but
I think it's also because it's not that mature yet
so people are trying to figure out
how to deliver the best experience and
package it in a way that it's actually interesting for the user.
What I've found is that there's a lot that's in this space that is just simply noise.
Whenever I spend a protracted time period working on basically anything, and I'm still confused by the way people talk about that thing months or years later,
I'm starting to get
the realization that maybe I'm not the problem here. And I don't mean this to be insulting,
but one of the things I've loved about you folks is I've always understood what you're saying.
Now, you could hear that as, oh, you mean we talk like simpletons? No, it means what you're
talking about resonates with at least a subset of the people who have the problem you solve.
That's not nothing.
Yes.
We tried really hard because one of the things we tried to do
was actually bring our observability to people who are not always busy
or it's not part of their day-to-day.
So we tried to bring it to Vercel developers, right,
by doing a Vercel integration.
And all of a sudden now they have their logs and they have metrics
and they have some traces.
So all of a sudden they're doing the observability work
or they have actual observability for their Vercel-based, Next.js-based product.
And we try to meet the people where they are.
So we try to, instead of actually telling people you should send us data,
I mean, that's what they do now.
We try to find, okay, what product are you using
and how can we grab data from there
and send it to us to make your life easier?
You see that we did that with Vercel,
we did that with Cloudflare.
AWS, we have extensions, Lambda extensions, et cetera,
but we're doing it for more things.
For Netlify, it's a one-click integration too.
And that's what we're trying to do
to actually make the experience and the journey easier.
I want to change gears a little bit because something that we spent a fair bit of time
talking about, it's why we became friends, I think anyway, is that we have a shared
appreciation for several things.
One of which, most notable to anyone around us, is whenever we hang out, we greet each
other effusively and then immediately begin complaining about costs of cloud services.
What is your take on the way that clouds charge for things?
I know it's a bit of a leading question,
but it's core and foundational to how you think about Axiom
as well as how you serve customers.
They're ripping us off.
I'm sorry.
They're just, the amount of money they make.
It's crazy.
I would love to know what margins they have.
That's a big question.
What are the margins they have at AWS right now?
Across the board, it's something around 30 to 40%.
Last time I looked at it.
That's a lot too.
Well, that's also across the board of everything, to be clear.
It is very clear that some services are subsidized by other services,
as it should be.
If you start charging me per
IAM call, we're done.
And also, I mean, the machine learning stuff,
they won't be doing that much on top of it right now,
right? Else nobody will be using it.
But data transfer? Yeah, there's a significant
upcharge on that, but I hear you.
I would moderate it a bit. I don't think
that I would say that it's necessarily an
intentional ripoff. My problem with most
cloud services that they offer is not usually that they're too expensive, though there are exceptions
to that, but rather that the dimensions are unpredictable in advance. So you run something
for a while, then see what it costs. From where I sit, if a customer uses your service, and then
at the end of that usage is surprised by how much it costs them, you kind of screwed up.
Look, if they can make egress free,
like you saw how Cloudflare just did the egress of R2 free, because I am still stuck with AWS
because let's face it, for me, it is still my favorite cloud, right? Cloudflare is my next
favorite because of all the features they're trying to develop and the pace they're picking,
the pace they're trying to catch up with. But again, one of the biggest things I liked is R2
and R2 egress is free. Now, that's interesting, right? But I never saw anything coming back
from AWS on S3 for that. Like, you know, I think Amazon is so comfortable because, from a product
perspective, they're simple. They have the tools, et cetera. And the UI is not the flashiest one, but you
know what you're doing, right? The CLI is not the flashiest one,
but you know what you're doing.
It is so cool that they don't really need
to compete with others yet.
And I think they're still dominantly
the biggest cloud out there.
I think you know more than me about that,
but I think they are the biggest one right now
in terms of data volume,
like how many customers are using them.
And even in terms of profiles of people using them,
it varies so much. I know a lot of the Microsoft Azure people who are using it are using it because they come from enterprises that have been always Microsoft, very Microsoft friendly.
And eventually Microsoft also came in Europe in all these different weird ways. But I feel sometimes ripped off by the AWS
because I see Cloudflare trying to reduce their prices
and AWS just looking like,
yeah, you're not a threat to us,
so we'll just keep the prices as they are.
I have it on good authority from folks who know
that there are reasons behind the economic structures
of both of those companies
based in terms of the primary direction,
the traffic flows and the rest. But across the board, they've done such a poor job of
articulating this that frankly, I think the confusion is on them to clear up, not us.
True, true. And the reason I picked R2 and S3 to compare there, and not look at Workers and
Lambda, is because R2 is S3-compatible from an API perspective, right?
So they're giving me something that I already use.
Everything else I'm using, I'm using inside Amazon.
So it's in a VPC, but just the idea.
Let me dream.
Let me dream that S3 egress will be free at some point.
I can dream.
That's like Christmas.
It's better than Christmas.
What I'm surprised about is how reasonable your pricing is
in turn. You wind up charging on a basis of ingest, which is basically the only thing that really
makes sense for how your company is structured. But it's predictable in advance. The free tier is
what, 500 gigs a month of ingest? And before people think, oh, that doesn't sound like a lot,
I encourage you to just go back and think how much data that really is in the context of logs
for any toy project. Well, our production environment spits out way more than that.
Yes. And by the word production that you just used, you probably shouldn't be using a free
trial of anything as your critical path observability tooling. Become a customer,
not a user. I'm a big believer in that philosophy personally.
For all of my toy projects that are ridiculous,
this is ample.
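To put that 500 GB free tier in perspective, here is some quick arithmetic. The 200-byte average line size is an illustrative guess; real log lines vary widely:

```python
# How many log lines fit in 500 GB a month, assuming ~200 bytes per line?
free_tier_gb = 500
avg_line_bytes = 200  # illustrative guess; real lines vary widely

lines_per_month = free_tier_gb * 1_000_000_000 // avg_line_bytes
lines_per_second = lines_per_month / (30 * 24 * 3600)

print(f"{lines_per_month:,} lines/month")          # 2,500,000,000 lines/month
print(f"~{lines_per_second:,.0f} lines/second sustained")
```

Under those assumptions, that is roughly 2.5 billion lines a month, or close to a thousand lines per second sustained — far beyond any toy project.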
People always tend to overestimate
how much logs they're going to be sending.
So there's one thing.
What you said is right.
People already have something going on.
They already know how much logs
they'll be sending around.
But then eventually they're sending too much.
And that's why we're back here
and they're talking to us like, we want to try your tool, but we'll be sending more than that.
So if you don't like our pricing, go find something else. Because I think we're the
most competitive, the cheapest out there right now.
If there is one that is less expensive, I'm unaware of it.
And I've been looking, let's be clear. That's not just me saying, well,
nothing has skittered across my desk. No, no, no, I know this space. Hey, hey, where's Corey? We're friends. Loyalty. Exactly. If you find
something, you tell me. Oh, if I find something, I'll tell everyone. No, no, you tell me first.
Tell me in a nice way so I can reduce the prices on my side. This is how we start a price
war industry-wide. I would love to see it. But there's
enough channels that we share
at this point across different slacks
and messaging apps that you should be able to
ping me if you find one.
Also, get me the name of the CEO and the CTO
while you're at it. And where they live.
Yes, yes, of course. The entire
implications will be awesome.
That was you, not me.
That was your suggestion.
Exactly. I will not.
Before we turn into a bit of an old thud and blunder,
let's talk about something else that I'm curious about here.
You've been working on Axiom for something like seven years now.
You come from a world of databases and events and the like.
Why start a company in the model of Axiom? Even back then when I looked around,
my big problem with the entire observability space could never have been described as, you know what we need? More companies that do exactly this. What was it that you saw that
made you say, yeah, we're going to start a company? Because that sounds easy.
So I'll be very clear.
I'm not going to sugarcoat this.
We kind of got in a position where we Forrest Gump'd our way into it.
And by that, I mean, we came from a company where we were dealing with logs.
We actually wrote an event, a crash analytics tool for a company. But then we ended up wanting to use stuff like Datadog but we didn't
have the budget for that because Datadog was killing us. So we ended up hosting our own Elasticsearch,
and it cost us more to maintain our Elasticsearch cluster for the logs
than to actually maintain our own little infrastructure for the crash events when we were
getting like one billion crashes a month at this point. So eventually we just, that was the first burn.
And then you had alert fatigue.
And then you had consolidating events and timestamps and whatnot.
The whole thing just seemed very messy.
So we started off, after some company got sold, we started off by saying,
okay, let's go work on a new self-hosted version of Datadog
where we do metrics and logs.
And then that didn't go as well as we thought it would,
but we ended up, because from day one we were working on it,
because we were self-hosted, so we wanted to keep costs low,
we were working on making it stateless and work against object store.
And this is kind of how it started.
Then we realized, oh, we can host this and
make it scale and it won't cost us that much. So we did that and that started gaining more attention.
But the reason we started this was we wanted to start a self-hosted version of Datadog that is
not costly. And we ended up doing a software as a service. I mean, you can still come and self-host
it, but you'll have to pay money for it,
like proper money for that.
But we do a SaaS version of this.
And instead of trying to be a self-hosted Datadog,
we are now trying to compete
or we are competing with Datadog.
Is the technology that you've built this on top of
actually that different from everything else out there?
Or is this effectively what you see in a lot of places?
Oh yeah, we're just going to manage Elasticsearch for you because that's annoying. Do you have anything that
distinguishes you from, I guess, the rest of the field? Yeah, so very just bluntly, like I think
Scuba was the first thing that started standing out. And then Honeycomb came into the scene and
they started building something based on Scuba, on the principles of Scuba.
Then one of the authors of the actual Scuba reached out to me when I told him I'm trying to build something, and he gave me some ideas, and I started building that. And from day
one, I said, okay, everything in S3. All queries have to be serverless, so all the queries run on
functions. There's no real disks. It's just all in S3 right now. And the biggest
achievement we made to lower our cost was to get rid of Kafka and have, let's say,
behind the scenes, we have our own coordination-free mechanism, but the idea is not to
actually have to use Kafka at all and thus reduce the cost incredibly. In terms of technology, no,
we don't use Elasticsearch. We wrote everything from the ground up from scratch.
Even the query language.
We have our own query language that's modeled after Kusto,
KQL by Microsoft.
So everything we have is built absolutely from the ground up.
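For readers unfamiliar with Kusto, a KQL-style query reads as a pipeline of operators. The snippet below is generic KQL-flavored syntax shown as a Python string for illustration; Axiom's own dialect, modeled after KQL, may differ in details:

```python
# A KQL-flavored query, shown as a string purely for illustration.
# Each "|" stage feeds the next, much like shell pipes:
# filter -> aggregate -> sort -> limit.
query = """
logs
| where status >= 500 and _time > ago(1h)
| summarize errors = count() by service
| order by errors desc
| take 10
"""

# Split out the pipeline stages after the source table name.
stages = [line.strip() for line in query.strip().splitlines()[1:]]
print(stages[0])  # | where status >= 500 and _time > ago(1h)
```

The appeal of the model is that each stage is independently readable, which is a large part of why Kusto-style languages have caught on for log and event querying.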
And no Elastic.
I'm not using Elastic anymore.
Elastic is a horror for me.
Absolute horror.
People love the API, but no,
I've never met anyone who likes managing Elasticsearch
or OpenSearch or whatever we're calling
your particular flavor of it.
It is a colossal pain.
It is subject to significant trade-offs,
regardless of how you work with it.
And Amazon's managed offering doesn't make it better.
It makes it worse in a bunch of ways. And the green status of Elasticsearch is a myth. You only see it once. The first time you
start that cluster, that's when the Elasticsearch cluster is green. After that, it's just orange
or red. And you know what? I'm happy when it's orange. Elasticsearch kept me up for so long.
And we had actually a very interesting situation where we had Elasticsearch running
on Azure, on Windows
machines. On Windows
Servers, sorry. And I'd have to
log in every day. You remember
what's it called? RP
something. What was it called?
RDP? Remote Desktop Protocol? Or something else?
Yeah, where you have to log in.
It's actually a visual thing and you have to go in
and visually go in and say, please don't restart.
Every day, I'd have to do that.
Please don't restart.
Please don't restart.
And it was a lot of weird issues.
And also, at that point, Azure would decide to disconnect the pod
once you try to bring in a new pod.
And all these weird things were happening back then.
So eventually, you end up with a split-brain situation.
I'm talking 2013-14.
So it was back in the day when Elasticsearch was very young.
And so that was just a bad start for me.
I will say that Azure
is the most cost-effective cloud
because their security is so clown shoes.
You can just run whatever you want
in someone else's account.
It's free to you.
Problem solved.
Don't tell people how we save costs, okay?
I love that.
Don't tell people how we do that.
Like, Corey, come on.
You're exposing me here.
Let me tell you one thing, though.
Elasticsearch is the reason I literally used a shock collar or a shock bracelet on myself every time it went down, which was almost every day.
Instead of having PagerDuty ring, like, ring my phone and, you know, I'd wake up and my partner back then would wake up, I bought a Bluetooth collar off of Alibaba
that would tase me every time I got a notification,
regardless of the notification.
So some things were false alarm,
but I got tased for at least two, three weeks
before I gave up.
Every night I'd wake up to a full discharge.
I would never hook myself up to a shocker tied to outages.
Even if I owned a company,
there are pleasant ways to wake up,
unpleasant ways to wake up, and even worse.
So you're getting shocked for someone else
can wind up effectively driving the future of the business.
You're more or less the monkey that gets shocked awake
to go ahead and fix the thing that just broke.
Well, the fix to that was moving from Azure to AWS without telling
anybody. That got us in a lot of trouble. Again, it wasn't my company.
They didn't notice that you did this or it caused a lot of trouble because suddenly nothing worked
where they thought it would? No, no. Everything worked fine on AWS. That's how my love story
began. But they didn't notice for like six months.
That's kind of amazing.
That was fantastic.
We rewrote everything from C Sharp to Node.js
and moved everything away from Elasticsearch,
started using Redshift, Redis, and you name it.
We went AWS all the way and they didn't even notice.
We took the budget from another department
to start filling that in.
But we cut the cost from $100,000 down to like $40,000
and then eventually down to $30,000 a month.
That's more than a little wild.
Oh, God.
Yeah, good times.
Good times.
Next time, just ask Neil to tell you the full story about this.
I can't go into details in this podcast.
I think I'll get in trouble.
I didn't sign anything, though.
Those are the best stories.
But no, I hear you.
I absolutely hear you.
Seif, I really want to thank you
for taking the time to speak with me.
If people want to learn more,
where should they go?
So axiom.co. Not .com, .co.
That's where they learn more about Axiom.
And other than that,
I think I have a Twitter somewhere.
And if you know how to write my name,
it's just one word and you'll find me on Twitter.
We will put that all in the show notes.
Thank you so much for taking the time to speak with me.
I really appreciate it.
Dude, that was awesome.
Thank you, man.
Seif Lotfy, co-founder and CTO of Axiom,
who has brought this promoted guest episode our way.
I'm cloud economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of
choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast
platform of choice, along with an angry comment that one of these days I will get around to
aggregating in some horrifying custom homebrew logging system, probably built on top of rsyslog.
If your AWS bill keeps rising and your blood pressure is doing the same,
then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business, and we get to the point.
Visit duckbillgroup.com to get started.