The Infra Pod - A new S3 is coming, for every data system to be built on S3! Chat with Ovais from Tigris Data
Episode Date: April 15, 2024. Ian and Tim sat down with Ovais Tariq from Tigris Data, who's building a new S3 that's globally distributed by default. Listen in if you are intrigued about why we need a new S3, and what the world would look like if we see everything built on object storage!
Transcript
Hey, welcome back.
We're at yet another Infra Deep Dive podcast,
and we're ready to rock and roll.
I'm Tim from Essence VC, and let's go, Ian.
Let's go. You're ready for this rocket ship to take off.
This episode, I'm so excited.
We're going to be talking about S3 and the future of object storage.
With the CEO and co-founder of Tigris Data, which just came out of stealth. Could you introduce yourself? Yeah, my name is Ovais Tariq. I'm co-founder and CEO of Tigris Data. We have
been around for slightly over two years. Prior to that, I used to head storage infrastructure at
Uber. And my two other co-founders are also from Uber. They were also working on similar stuff with me at Uber.
So, you know,
all three of us spent some of our time there and learn about the problems at
scale from a storage perspective. And that's how Tigris started.
I actually met Tim in another podcast about a year ago.
So it's great to be back, but talking about something different.
Previously it was the, I think, Open Source founder.
Yeah, it was like your database thing.
Yeah.
Yeah, so going back to what you were asking me about the background.
So my background is storage infrastructure in general as well.
And I've been doing that for more than 17 years.
Six years of that was spent at Uber working on scaling the storage infrastructure over there.
But once that work was done,
then me and my co-founders decided to build something on our own and solve scale challenges from a storage perspective outside.
We initially started off building a distributed real-time data platform,
building a transactional OLTP system.
The goal there was to consolidate as much of the real-time use cases as possible.
So, OLTP transactional workloads,
real-time search-related use cases,
and then streaming-related use cases.
So, that's what we're initially working on.
That's what I talked to him about when we met the last time.
But then we decided to go one level deeper,
down or lower on the storage side
and focus more on the object storage piece
because we had been seeing a lot of architectural changes
happening over time where the storage is becoming
more and more crucial and important,
more functionality is getting pushed down,
which is interesting because that's how it used to be
in the past pre-cloud.
In the world of SAN and DRBD,
a lot of the reliability, replication,
durability pieces were handled by storage, but then things moved up the layer
and databases were handling more of replication.
Or the application layer was handling more of the replication,
you know, data distribution piece.
But, you know, I'm excited that in the cloud-based world,
we are going back to kind of making the foundation
more feature-rich.
What is the architectural change you see that's happening that made you say,
hey, we're going to build S3 API-compatible SaaS layer, like, infra layer?
Like, what is it that got you to say, this is worth the bet,
like, this is the bet we want to make?
Walk us through how you arrived at building what you're building.
Yeah, so one of the insights we had is around the complexity of
persistent systems that get built.
So whether it is the database
that we built at Uber or database that
we're building at Tigris
or it is the search platform that
was built at Uber or search platform that's being built
outside, there are certain
parts to it that always
needs to be done. So for example,
sharding of the data,
distributing the data to support a scale out,
replicating the data for availability and reliability.
These are things that are needed for every persistent system
and every system ends up building it themselves.
Writing a highly consistent replication engine
is a complex thing to do,
but every persistent system seems to be doing that by itself.
So we were thinking about why not push it down to storage?
If you look at some of the semantics that object storage provides and some of the guarantees they provide in object storage,
if we just talk about S3 within a single region concept, S3 does provide durability.
S3 does provide replication.
S3 shards the data so that it can scale out.
That is why systems will be built on top of it.
But there are some fundamental pieces that are still missing.
And when you're building a persistent system,
you still need to handle some of the similar complexities regardless.
So one of the things that comes to mind is multi-regional reliability,
replication, and data distribution.
That is something that S3 doesn't provide you.
So while S3 provides you all of the guarantees in one of the regions,
you now have to build a system that can replicate the data
across multiple regions, distribute the data across multiple regions.
So we thought that why not pull that complexity within storage as well.
So storage by default is smart enough to provide you multi-region replication
and data distribution that is optimized for the access patterns.
And if we do that, then we remove another layer of complexity
from the persistent system that we've built on top.
So imagine if you're building Kafka and you're building Kafka on top of object storage.
And because storage itself is multiregional, you don't have to do anything on the Kafka
layer, right?
You by default get that multiregional Kafka just because of the storage providing you
these semantics.
And same can apply to, you know, if you're building a database on top of object storage,
if the object storage itself is multiregional,
then you get those multiregion capabilities.
You don't have to write a replication engine
to replicate your data from the region.
So that's a multiregion part of it,
which we thought is such a common requirement these days
that we need to pull it down
and make it part of the storage platform.
So that was one part.
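To make that first part concrete, here is a minimal sketch, assuming nothing about Tigris's or WarpStream's internals, of a Kafka-style log whose segments are stored as plain objects in an S3-compatible store. The endpoint URL and bucket name are placeholders. The point is that the log layer carries no replication code at all: if the storage underneath is multi-region, segments become readable from other regions for free.

```python
# A minimal sketch of delegating durability and replication to object storage.
# The endpoint URL and bucket are hypothetical; any S3-compatible store would do.
import json
import boto3

s3 = boto3.client("s3", endpoint_url="https://objects.example.com")  # placeholder endpoint
BUCKET = "log-segments"  # placeholder bucket


def append_segment(topic: str, base_offset: int, records: list[dict]) -> str:
    """Persist a batch of records as one immutable segment object.

    Note there is no replication logic here: if the underlying object store is
    multi-region, this segment becomes readable elsewhere without the log layer
    doing anything extra.
    """
    key = f"{topic}/{base_offset:020d}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records).encode())
    return key


def read_segments(topic: str):
    """List and read segments back in offset order, from whichever region serves us."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"{topic}/")
    for obj in sorted(resp.get("Contents", []), key=lambda o: o["Key"]):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        yield from json.loads(body)
```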
The second part was around more higher level use cases,
not necessarily database,
but there are several use cases
that involve storing data in objects
than tagging them in some way
or having some rich metadata about those objects.
But because object stores do not have rich querying capabilities, you end up storing
that metadata outside in a different system, such as, you know, a database. Then you are responsible
for maintaining the consistency of the objects themselves and the metadata stored outside.
So again, thinking about simplifying things and, you know, improving the experience, why not pull that within object storage as well?
So object storage provides you the querying capabilities where you can query the metadata.
You don't need a separate database just to be able to query the object storage in an efficient way.
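As an illustration of the pain point described above, here is a rough sketch of the status quo, with an SQLite table standing in for the external metadata database; the table layout and helper names are illustrative only. The application has to write the object and the metadata row separately and own the consistency between the two.

```python
# Sketch of the dual-write burden when metadata lives outside the object store.
import sqlite3
import boto3

s3 = boto3.client("s3")  # or any S3-compatible endpoint
db = sqlite3.connect("metadata.db")
db.execute("CREATE TABLE IF NOT EXISTS objects (key TEXT PRIMARY KEY, label TEXT, size INTEGER)")


def put_with_metadata(bucket: str, key: str, data: bytes, label: str) -> None:
    s3.put_object(Bucket=bucket, Key=key, Body=data)
    try:
        db.execute("INSERT OR REPLACE INTO objects VALUES (?, ?, ?)", (key, label, len(data)))
        db.commit()
    except Exception:
        # If the metadata write fails, the object exists but is invisible to
        # metadata queries -- exactly the consistency burden that pushing
        # querying into the object store would remove.
        s3.delete_object(Bucket=bucket, Key=key)  # best-effort compensation
        raise


def find_by_label(label: str) -> list[str]:
    return [row[0] for row in db.execute("SELECT key FROM objects WHERE label = ?", (label,))]
```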
That's awesome.
So, you know, one of the things you mentioned is, like, you know, S3 does an awful lot out of the box,
but if it did have multi-regionality, it would enable these new types of apps.
And I assume the type of app we're talking about is, like, apps being built at the edge, right?
Like the fundamental issue we've had to date is kind of your app has been constrained to, like, the AWS region,
the GCP region, and getting it to support multi-region is always very, very, very complicated.
It wasn't supported by the underlying primitives.
So now we have this new burgeoning layer at the edge that enables us to start pushing parts of the app. One of those is
Fly. Another one would be Vercel. I'm curious to understand what is it that you're building that
actually enables the multiregionality? What is the thing that you figured out that enables you to
make S3 multi-region, hidden behind an S3-compatible API? And as a result of building that,
how does it unlock more of these edge use cases?
Yeah.
So like you said, applications today are more global in nature.
So having a multi-region storage
makes that type of architecture very simplified.
But going back to what we are doing to make this happen,
essentially it is two things. One of the things
is, you know, just replication. Similar to how you have database replication, the same concept
comes here. This is more storage replication or more, you know, lower-level physical replication
where you're replicating the pages and not necessarily logical data. So that is one piece
that is absolutely needed for multi-region. And then the other part is tied to efficiency as well.
You cannot replicate all the data everywhere.
So if you're using an object store and you're storing a petabyte of data,
you cannot just replicate petabyte everywhere, right?
So it cannot be some form of very simplified replication mechanism.
It has to be something that understands the request pattern
and then uses that to distribute the data.
So that is how we achieve the global replication piece
in an efficient way.
We figure out what the request pattern is,
and based on that, we only replicate the data
that needs to be replicated.
That's what I would say is the simplified principle
in our sense, that only replicate the data
that needs to be replicated
to whichever region where the data is accessed.
That's the simplified version of what the implementation is doing.
And then outside of that, once we figure out what data needs to be replicated,
that's just implementing the replication engine to take care of that.
One cool thing that's connected to it is that
you can actually write anywhere as well.
So data distribution should not just happen after the fact, meaning after the data is written.
It should also happen when the data is coming in.
So Tigris, as an object storage, is active-active. So you can write in Sydney, you can write in San Jose, and it can store the data close to the user.
But then that's where the fancy part comes in,
that now when the same data is referenced somewhere else
or needed somewhere else,
it's able to figure that out and move the data over there.
So think of it as an active-active storage system
that stores data close to the user,
but then distributes the data intelligently
based on the request pattern.
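To make "replicate only what needs to be replicated" tangible, here is a toy model, emphatically not Tigris's actual engine: an object gets copied to a region only after that region has asked for it a few times. The threshold and the copy_object_to_region helper are purely illustrative.

```python
# Toy access-pattern-driven replication decision, for illustration only.
from collections import defaultdict

REPLICATION_THRESHOLD = 3  # illustrative knob

access_counts: dict[tuple[str, str], int] = defaultdict(int)  # (key, region) -> request count
replicas: dict[str, set[str]] = defaultdict(set)              # key -> regions holding a copy


def record_access(key: str, region: str, home_region: str) -> None:
    """Count where an object is requested from and replicate it lazily."""
    replicas[key].add(home_region)  # the object always lives where it was written
    access_counts[(key, region)] += 1
    if region not in replicas[key] and access_counts[(key, region)] >= REPLICATION_THRESHOLD:
        copy_object_to_region(key, region)  # placeholder for the real data movement
        replicas[key].add(region)


def copy_object_to_region(key: str, region: str) -> None:
    print(f"replicating {key} to {region}")
```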
Yeah, that's awesome.
I think, you know, we have a ton of questions, especially
in the same line of what Ian has asked,
about the trade-offs
that need to be made. When I read
your blog post about the philosophy,
what resonates really well is the two-hour
problem that you mentioned from Kurt.
It's a beautiful way to summarize.
You just want to take something
that takes forever down
to something simple... It's not going to be seconds, but definitely less than two hours to be able to solve something really complicated.
And I feel like storage is one of those things, you can make it very complicated pretty quickly.
Because it's such a low-level infra, there's a couple of things you have to make right.
It sounds like you need to know what needs to be replicated.
And there's going to be consistency, there's going to be latency, there's a lot of things.
And that's why traditionally, storage
is hard to make easy
to use while
having all the knobs.
What kind of trade-offs are we trying to make
to make the storage
piece easy to use so it becomes a
two-hour problem truly?
And it's not just for toy data.
Because you can do simple things in two hours
with no configuration whatsoever.
But if you're truly global,
you truly have a bunch of data,
it's hard to imagine something
actually becomes a two-hour problem.
How do you do it?
Yeah, yeah.
When it comes to the trade-off,
whether it is toy data or a database,
the trade-offs are the same.
It's availability and consistency.
The semantics that we provide are single-region
strong consistency, or serializable consistency,
and multi-region eventual consistency.
So the trade-off we are making there
to making this global is that
if you're writing to the same object
in multiple locations,
eventually they are going to converge
into the same value,
but you're not going to pay the penalty at write time. That's the choice or the balance that we have made from a consistency
perspective is, you know, eventual consistency globally and strong consistency in a single
region. And that kind of trade-off helps us with the latency part, but from someone who is using
storage, that is something that they have to, of course, keep in mind because nothing comes for free.
So if we were to make it strongly consistent globally, then of course, at write time, you have to pay the latency of replicating the data to multiple locations.
And then the write latency would suffer quite a bit.
That's, I would say, the fundamental choice we made from a consistency perspective.
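In client terms, those semantics might look roughly like the sketch below: a reader in the region you wrote to can expect read-after-write, while a reader in another region should tolerate a short convergence window. The endpoints and bucket name are placeholders.

```python
# Sketch of what strong-in-region / eventual-across-regions means for a client.
import time
import boto3
from botocore.exceptions import ClientError

writer = boto3.client("s3", endpoint_url="https://objects.example.com")  # placeholder
reader = boto3.client("s3", endpoint_url="https://objects.example.com")  # imagine: another region


def read_eventually(client, bucket: str, key: str, timeout_s: float = 10.0) -> bytes:
    """Poll until the object is visible, as a cross-region reader may need to."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return client.get_object(Bucket=bucket, Key=key)["Body"].read()
        except ClientError as err:
            if err.response["Error"]["Code"] != "NoSuchKey" or time.monotonic() > deadline:
                raise
            time.sleep(0.2)


writer.put_object(Bucket="demo", Key="greeting.txt", Body=b"hello")
# Same region: strongly consistent, so the write is immediately readable.
assert writer.get_object(Bucket="demo", Key="greeting.txt")["Body"].read() == b"hello"
# Another region: eventually consistent, so poll defensively.
print(read_eventually(reader, "demo", "greeting.txt"))
```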
The other thing is actually tied to the beauty of object storage API.
The reason storage systems were so complicated in the past
is because they were trying to support the POSIX API,
and doing that, you know,
through a distributed file system is really hard.
And that's what enabled S3 to be successful,
or provide the kind of service it provided
by having very simple API, right?
So for example, if you compare Ceph to S3,
Ceph is trying to be everything from a storage perspective.
It's trying to be a file system as well,
a POSIX file system as well as an object store.
And that is what makes the system very complicated.
So we actually think that we are fortunate
that we are S3 API compatible,
or that there's some standard
around object store APIs
which are fundamentally
very simple APIs.
That automatically reduces
the complexity for us
building these things.
The API itself enables us
because of the simplified
constructs it provides.
I mean, these are
simplified constructs.
It's not as feature-rich
as a file system,
but those who are
building applications on top of it understand it.
They are not expecting it to be a file system.
So that reduces a lot of complexity on the storage side
that comes into play when we would have to deal with data replication
and consistency.
And outside of that, we took the trade-off of favoring latency
over consistency for global needs.
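For reference, the "very simple API" being credited here really is just a handful of verbs. A minimal example with boto3 against a generic S3-compatible endpoint (placeholder URL and bucket):

```python
# The core object-storage verbs: put, get, list, delete.
import boto3

s3 = boto3.client("s3", endpoint_url="https://objects.example.com")  # any S3-compatible store

s3.put_object(Bucket="demo", Key="reports/2024-04.csv", Body=b"id,value\n1,42\n")
data = s3.get_object(Bucket="demo", Key="reports/2024-04.csv")["Body"].read()
listing = s3.list_objects_v2(Bucket="demo", Prefix="reports/")
s3.delete_object(Bucket="demo", Key="reports/2024-04.csv")

# No open/seek/rename/fsync, no directory semantics, no in-place partial writes --
# which is what keeps a distributed implementation tractable compared with POSIX.
```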
So what type of apps do you enable people to build?
Like with this work that you've done, the new API,
if I'm a developer, pitch me on why I would use Tigris.
What is it today? What do I use it for?
And there's different types of developers.
You've got people building apps, you've got people building infra,
like WarpStream and the new versions of Kafka
and the new streaming tech and stuff.
Help us understand the value you bring to the app builders.
What are you able to do today out of the box?
Yeah, so if you are an infra person, if you're building something infra and you care about multi-region semantics,
then that's something that you don't need to handle on your application layer.
That's something that's handled by the storage layer. So as I gave an example of building a multi-region
Kafka streaming system, you can rely on the storage itself making the data available in a different
region. So that cuts down a lot of complexity that you would have to deal with on your side.
The second thing that we provide and that is connected to the
multi-region nature of the product is caching of the data close to the user. So that improves the
performance for workloads overall. So any application that cares about performance and
low latency use cases, which pretty much all applications should, then they would benefit
from it. And that means that applications
that have low latency requirements,
which could not be built directly on object store in the past,
can now be built on top of object store.
And without having to deal with CDN
and consistency issues that come with a CDN.
If you're building a global application
that needs to store data close to the user,
it could be, you know, a logging
platform or an observability platform, or it could be storing assets close to the user, or it could
be about, you know, running training jobs everywhere, wherever GPUs are available. All of those use
cases, which are global in nature, become accessible based on the features that we provide.
I'd love to understand, like, you built this great tech,
you've got this stuff you're doing,
you're enabling Fly.io.
What comes next?
You built, like, an S3 API,
but do you build higher-level APIs?
Like, what does the platform become over time?
The general challenge when you go and just build,
like, a generic S3 bucket API is you solve multi-regionality,
and that enables, like,
you just solve these core problems
for developers that are trying
to build global apps
and that's incredible.
But is there more that needs
to come out of the box
to turn it into a more full,
like do you go horizontal?
Do you go vertical?
Like help us understand,
like where do you think
this goes from here?
Yeah, we would still continue
going horizontal in terms
of adding more feature sets.
So S3 API is great to start with.
And, you know, it's a well-known API, kind of like the standard these days. And going with it also
makes it easier for people to adopt Tigris as well as not have a vendor lock-in problem.
But it is constraining in nature. So one of the things that I mentioned previously is that what
if you wanted to have more richer functionality
on top of your object store,
meaning that you want to stream the latest 10 objects
or you want to select objects based on some kind of metadata
or you want to store embeddings and then query the embeddings.
Like if you want to have any type of querying functionality,
that would require building something higher level.
That cannot simply be done by extending the S3 API. That would require us to build another API
on top of it that enables this functionality. But all of the things that I mentioned around
metadata querying capabilities, making small object access fast, or making it fast to access
random byte ranges within large objects to
continue to improve the performance.
Some of these things do not require API changes, but they are horizontal in nature, meaning
if we further improve the performance for small byte ranges, small objects, that applies
horizontally to all workloads.
But metadata querying capabilities and improving those capabilities or being
able to attach new type of metadata
to the objects,
that requires changing the
API or that requires
extending the API from what it is today.
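One of the performance levers mentioned, random byte ranges within large objects, needs no API change at all: GetObject already accepts an HTTP Range header. A small example, with placeholder endpoint, bucket, and key:

```python
# Fetch only part of a large object instead of downloading the whole thing,
# e.g. a single page or the footer of a Parquet file.
import boto3

s3 = boto3.client("s3", endpoint_url="https://objects.example.com")  # placeholder

resp = s3.get_object(
    Bucket="demo",
    Key="datasets/events.parquet",
    Range="bytes=1048576-2097151",  # bytes 1 MiB .. 2 MiB - 1
)
chunk = resp["Body"].read()
print(len(chunk), resp["ResponseMetadata"]["HTTPStatusCode"])  # expect 1048576 and 206
```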
So one thing I think
Fly did, and I think it's very relevant
for you, is they own their own
hardware for the most part.
Because to keep their margins and to be more efficient,
I mean, there's a bunch of decisions why they want to do that.
You know, number one thing I think about storage is the cost.
You know, storing a whole lot of files, storing a whole lot of data.
I mean, S3, I guess the incumbent here,
has run so much physical capacity at this point
to make that work for such a long time.
I guess, you know, as your startup,
you probably won't be trying to run as much hardware.
Are you running your own data centers
to try to do your cost efficiency
and those kinds of things as well?
Kind of like the Fly philosophy.
I'm just curious how you think about
when it comes to actually running the storage themselves.
Yes.
For a service like us,
or for anyone building a platform,
there is no way that you can have good margins
if you run on AWS.
It's not going to be possible.
And what's going to happen is what happened to Heroku.
That's what's going to happen, right?
And as you were mentioning,
from a storage perspective,
hardware optimization is so important that you need your own hardware. But what we are doing right now is we are partnered
with Fly. We are running on Fly hardware. And the goal is to continue partnering with them.
We think of Fly as EC2 as a service. So they provide us with, you know, low-level machine
infrastructure that we use to build our platform
on top of it, including running Kubernetes and a bunch of other things. So we would continue doing
that. But of course, we would need to figure out the hardware optimization part together with them
that will allow us to make profit, essentially, or be profitable. The hardware piece is definitely
important. There's a lot of innovation that has happened on the storage side as well over the last few years.
For example, if you just look at SSDs,
QLC has greatly reduced the dollar per terabyte.
Fast storage costs will continue going down.
Hard disk costs will continue going down.
Then there's a bunch of other innovations
that are happening on the storage side,
including controller-level compression of data
that is getting written or transparent compression of data.
So we will be able to utilize all of that
only by having custom hardware.
That's definitely something that we'll have to do.
But we'll continue to partner with Fly on doing it.
There's a part of the infrastructure
that Fly has figured out,
so it doesn't make sense for us to go down there and do that.
We'd rather focus on building
slightly higher level
product that we are building.
So are there like specific communities
you think you like targeting
with your release now?
We are mainly focusing on
application developers right now.
We haven't focused yet
on big data use cases.
Actually, if I take a step back, when I classify the use cases of storage, I think of it as three distinct use cases.
One is, of course, apps.
Like if you're building an application, you need to store assets or logs.
You know, that's what I classify as application use cases.
Then there's big data use cases.
If you think of Snowflake and other systems.
And then the third one I think of is the AI-related use cases.
There's a lot of unstructured data that needs to be processed and accessed and stored because AI needs it.
So I classify the use cases in three distinct types this way.
So we are focusing on enabling the application developers.
And Fly has more than a quarter million developers
and we have made our product available to them.
And we are seeing really good adoption
in terms of people using us to power their applications.
So that's one specific use case that we are targeting.
We also have AI-related use cases using Tigris.
So we'll continue powering them
and also adding more and more features
to enable those use cases to succeed.
We haven't yet started looking into
big data-related use cases
because that's another beast
and there's another set of optimizations
that we can do on the storage side to enable it.
All right, so now we want to jump into
our favorite section called a spicy future.
Spicy futures.
I think working on another S3 is already pretty spicy, personally,
but I want you to maybe even go further.
What is your belief or spicy take on storage?
And what do you believe should happen in the next three to five years that's not obvious?
In the next three or five years, we will see all persistent systems basically being built on top
of object storage. We still see a mix of systems, especially on the database side, built using
local disks or handling their own storage. In the future, I don't see any of that happening.
I see the databases more as high-level applications.
Whether it is databases or any other systems,
they'll be focusing more on the API and the developer experience,
and the rest of the storage infrastructure piece
will just be handled by object storage.
That's what I think it will be in the future.
And I don't think it will take three to five years.
I think that in the next two to three years, we'll see that.
It's a fascinating question to kind of move in the future.
Because I mean, we are seeing people already building,
like WarpStream was on our pod earlier.
It was a fun chatter how they've done things
and there's more systems like that.
But also I always feel like there's a fundamental question.
I feel like everybody is fighting the S3 interface.
Like what you can do is so much based on
how easy it is to do certain things on top of S3 as well.
Because I feel like S3 has evolved,
but not changed drastically, I would say,
in the last 10 or 15 years.
It's still mostly the same mechanism.
You can add a bunch of tags and do certain things now,
but it still feels not as
intuitive. I personally feel like there's an S3 API and there's a POSIX, right? You have to pick
one of the two extreme ends. And that's sort of like, there's nothing in between. Do you feel like
there is a need for almost like a new interface? Does making it possible for everybody to build on top of S3, or any globally
distributed storage, require almost like a rethinking of how you actually interact with the storage directly?
Of course, there is rethinking needed on the API side. I don't think we can go all the way
to POSIX extreme. I don't think that's going to work in a distributed environment, and that's what the cloud-native
environment is, essentially a distributed environment.
It would have to be somewhere in the middle.
There is
some level of functionality that S3 provides
through its APIs. That
is not going to change. But beyond
that, yes, one of the
key things that's missing is transactions
or transactional APIs.
That, I think, would have to come. And then querying
the objects, that would require changes to the API. So that
would definitely have to happen. The changes to the API would definitely have to happen.
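As a sketch of why transactional APIs keep coming up: with plain S3 semantics, updating two related objects is two independent PUTs, and a failure between them leaves readers with a mixed view. The object layout and names below are illustrative only.

```python
# Two related PUTs with no atomicity -- the gap a transactional object API would close.
import json
import boto3

s3 = boto3.client("s3", endpoint_url="https://objects.example.com")  # placeholder


def publish_version(bucket: str, doc: dict, version: int) -> None:
    # Step 1: write the new immutable version.
    s3.put_object(Bucket=bucket, Key=f"docs/v{version}.json", Body=json.dumps(doc).encode())
    # Step 2: flip the pointer that readers follow. If the process dies between
    # the two PUTs, readers keep seeing the old version -- harmless here, but
    # there is no way to make the pair atomic with the current API.
    s3.put_object(Bucket=bucket, Key="docs/latest.json", Body=str(version).encode())
```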
But I would see it more as evolution of the API. And I don't see it going all the way to
POSIX.
I think all the companies, maybe you or Cloudflare,
have the opportunity to, like,
redefine the object store layer, though, right?
Like, I think one of the points is
maybe we don't have the breadth
or depth of the POSIX API
for file access, right, for file systems.
But, like, certainly, you know,
in the new cloud world,
S3's API has become the new, you know,
file system API.
Does it have the same breadth and depth?
Does it have all the same features?
Not exactly.
But certainly, if you look
at the new version, like the way we built
our cloud apps and the new infra coming,
like S3 is the default API.
Do you see this as your opportunity
to drive this API forward?
I guess you're kind of stuck with this sort of
translation layer problem, but
there's a new POSIX API already, right?
In some way, it's just hyper-constrained.
Yeah, the reason I don't say
it is a new POSIX API
is because of the challenges I've seen
and the nightmares I've seen
on distributed file systems for POSIX APIs.
I just want to not even
think about that.
But yeah, you're right.
S3 APIs have become the standard.
When we say that every system is going to focus more on the compute and higher level abstractions
and what experience they provide to the user rather than focus on doing storage things,
all of that is going to get pushed down to object storage.
That, of course, means that the API has to evolve.
And I see this as an opportunity for us.
So we're already thinking about how to improve on the API, as I was talking before,
just to add more metadata querying capabilities.
We'll have to have a higher-level API.
The other challenge, though, is that you have to start somewhere.
And the reason for starting from S3 compatible API is because that is the standard right now.
That's one.
The other thing is that it gives confidence to people, especially if you're a new company like us.
Isn't it easier for you to adopt
if the API is something you already know about?
Normally, you have to learn the new system,
you have to learn the new API.
So it has to be a natural progression for us
where we continue supporting the S3 API,
but then we work on a higher level API
based on the use cases we are seeing now
or the use cases we have seen in the past,
build a higher level API that is more functional
and that can support the use cases of systems
in the future in a better way.
I think fundamentally, there's usually different kinds
of developers building different things, right?
We talk about CDN and apps,
and we almost treat them as somewhat static
to maybe a little bit dynamic content,
but it's not really high transactional, high QPS, that kind of thing.
When we talk about the WarpStreams or everybody else
who are building systems on top of S3,
that requires so much more nuance.
And they went to great lengths to build a brand new system on top of S3
because it requires a re-architecture.
I think it's a really interesting sort of like state we're in. And you have the opportunity to
kind of redesign this. There is also an opportunity for you in the short term to help system developers
trying to build on top of you. And what would that be? Is there anything specific you want to do to empower them,
and how?
Application use cases are going to be different.
And people who are building platforms on top,
like WarpStream or OnNion or other folks,
their requirements are going to be different for sure.
So our focus is on building a general-purpose object storage system,
but bringing an elevated level of experience by initially
just improving the API, but eventually making fundamental changes to the API.
So that does require us working with the platform folks as well.
So we are actually working with some database platform folks as well.
And that would enable us to cross-check the kind of API changes or functionality we were
thinking of enhancing against what they need. So yeah, definitely both are the focus.
But as I was saying before, we do need to have this broader adoption first, which requires us
to be general purpose in nature and make sure we cover all the typical use cases.
And at the same time, we are working with the platform folks as well, and even with
some AI platforms as well, because they also have some distinct requirements from object
stores.
So all of that combined will result in a much enhanced API.
The workloads are different.
So the type of API needs that they are going to have is going to be very different.
Yeah, totally.
That makes a lot of sense.
I'm really curious, you know,
what's the future of how we build apps?
Like if I'm building a CRUD app,
I'm building something with an LLM, whatever,
what does it look like, right?
What do you believe the future is and why?
You know, certainly you've made a bet here
in starting a company around building multi-regional
S3. Paint a picture for us what the future looks like and why you believe that to be true.
So from app development perspective, the future is going to be global in nature.
We are already seeing global app adoption with Fly and Vercel and Cloudflare. We'll see more
applications being built that way, and that again ties in with our thesis around requiring multi-region applications.
We will see LLMs more integrated, and we'll
continue to see that. We'll probably see more higher-level frameworks
that make it even easier to build applications, so you're focusing more on
the business logic and not so much on application orchestration and
other things that you have to deal with today.
So application development, to me, will become more higher level in nature.
And the platforms underneath, whether it is the compute platform
or it is the storage platform, they will have to deal with the general
infrastructure pieces and make it essentially infrastructure-free
for the applications.
It's a really good take. And I'm really excited about all the future developments in this space.
So where do people find more about you and even get to play and try to use Tigris Data?
At tigrisdata.com, folks can find more about us.
And from there, there's a Getting Started link, which will allow them to access the platform.
It's very easy to get started. You just need to have a Fly account, and that's all.
Amazing. Thank you so much. It's been such a pleasure. And I learned so much
about what you're building today and why it is absolutely necessary.
Awesome. Thank you so much for having me here. It was a pleasure to be here. And,
you know, a lot of thoughts came to mind
based on the questions that you were asking.