The Infra Pod - Future of File Storage for AI (Chat with Hunter, CEO of Archil)
Episode Date: September 8, 2025

Ian (Keycard) and Tim (Essence VC) engage in an insightful discussion with Hunter Leath, CEO of Archil. The episode delves into the motivations behind Archil, a new data startup focused on revolutionizing data storage for cloud applications. Hunter explains the limitations of existing data storage paradigms like S3 and traditional block storage, advocating for a new approach that leverages SSDs and custom protocols to offer high-performance, infinite storage that can support modern workloads, including AI and CI/CD applications. The conversation also touches on the trust and complexity of implementing such a system, the future vision for file storage, and practical use cases ranging from serverless Jupyter notebooks to large-scale CI/CD operations.

Chapters:
00:17 The Gap in Cloud Data Storage
01:48 Understanding Unstructured Data
02:41 Building Modern Data Systems
04:15 Challenges and Innovations in File Storage
11:12 Targeting CI/CD and AI Workloads
23:00 The Future of File Storage
Transcript
Welcome to the InfraPod.
This is Tim from Essence VC, and Ian, let's go.
Hey, this is Ian Livingston, lover of new data things.
Couldn't be more excited to have the CEO and founder of Archil, Hunter Leath, on the pod today.
What in the world convinced you to say, hey, you know what we need?
Another data startup.
Well, you know, it was not my intention to do a startup, actually.
I just progressively became radicalized working closely with engineering, seeing what they needed, and found this huge gap that no one was building and had to do it,
basically.
So what's the gap, help us understand what you saw?
So fundamentally, data storage in the cloud has not changed since the launch of S3.
The cloud got us on-demand servers and on-demand object storage, but the devices that are
attached to these servers are fundamentally still hard drives and SSDs, technology from the 70s and the 80s.
And the cloud is supposed to be this infinite on-demand place.
So why does everyone have to plan for the capacity of data attached to each server,
move data onto each server, and do all that synchronization themselves?
We think that the right way to approach this is with a primitive that gives people infinite storage
attached to their servers
that automatically synchronizes
from the data sources
where data is supposed to live.
That's incredibly interesting.
And so when you say data is supposed to live,
you mean like what type of data
are you talking about here?
Are we talking about like a binary blob file?
Are we talking about like Salesforce and CRM systems?
Like what's data mean in this context?
So for me, because I've spent too long
working on file systems,
when I talk about data, I talk about unstructured data.
So stuff that's not sitting in databases. This is a ton of image files that you might be using for computer vision. This could
be a ton of PDFs that you're throwing into a model for RAG. All of these kinds of data
that people generally want to store in S3, but have to download to a server to use.
Okay, cool. And I mean, I think this is, like, a very interesting statement to put out into the world.
I just would love to understand, like, how the hell do you make this actually happen, right?
Like, this is sort of the 2000-era block storage device, you know, HPC cluster-style solution, right?
So I'd love to understand, like, how do you build those types of systems for the modern era and the modern cloud data landscape?
Well, I think that there have been a lot of attempts over the past 20 years to build something that does this, that exposes object storage locally.
And for the most part, those attempts have failed
because either they don't expose that data
in a way that's truly POSIX compatible,
so that regular applications like FFmpeg or kdb
or even SQLite can really use it,
or they are using object storage
as the next hop from the machine.
And object storage notoriously is very slow
compared to EBS, block devices,
NVMe. And so what we've done is, kind of like you mentioned, built a SAN of sorts, where we
manage a large fleet of servers that have NVMe solid state drives attached to them. And then we've
built a custom protocol that allows us to communicate from our customers' instances to this
SAN in a block-like, high-speed, high-performance way that gets you those local-like speeds.
Interesting.
So I mean, like I have a thousand questions, right?
Because underneath this is like, you know, you have data locality questions.
You have, like, how do you even get, like, the downstream data represented in the file format questions?
I have questions about caching.
I have questions about file format.
I have endless questions.
Like what is the fundamental thing that we've been missing from doing this, right?
Ignoring, like, the ecosystem problem, but, like, is there some type of fundamental
breakthrough or something that has made this the moment where we can say, you know,
what? Now we can create sort of, like, a new unified
file system that maps better to, like, how computing works today versus how computing worked in,
you know, the 1970s when Unix and co. came around. Frankly, no. I wish that I could come on and tell
you that we've made this incredible breakthrough that has enabled this. And this has actually
been very disappointing to my friends and family, hearing that I'm going on this journey and that
there's not some magical invention.
But ultimately, it is going back to file storage as the primitive protocol to use.
And the problem is that historically, NFS and these file storage protocols have this bad rap
about their performance and chattiness that makes them unsuitable for cloud workloads today.
I spent eight years on an AWS product called the Elastic File System,
which is effectively infinite NFS storage in the cloud.
And during that time,
I was able to really analyze how people used the file system
and how NFS the protocol works.
And the big leap for us was realizing
that we could build something that kept the file semantics of NFS.
So you could continue to understand what was stored,
do the synchronization on the back end,
in a native format to a place like S3
and provide the infinite capacity
that ultimately block devices can't do
because they are designed to have a finite capacity
with some small semantic changes
that allow us to sidestep
many of the performance problems
that NFS has historically had.
But ultimately, these techniques are as old as time.
What do you think is a reason
that no one has done
this before, right? Like, if we've always had these primitives, we've got POSIX file systems,
like, what has been the primary blocker? Has it been, you know, some cultural shift in the same way
that, like, every 10 years we reinvent the stack and everything's different? Or, you know,
was there some type of, like, hey, we had to wait for a certain amount of, like, data to, like,
be online and have those standards sort of, like, kind of coalesce to something that makes this
more of a tractable problem? If you were to, you know, sit down and kind of say the why now,
What is the why now?
I think that there are two reasons that this hasn't happened before.
One track of reasons is just around the difficulty of building something like this,
where fundamentally building a block storage device is not so scary.
I can store a block and retrieve a block.
Building an object storage device is not so scary.
Building an NFS device is also not so scary.
These are understood problems that the industry has coalesced around
over the past 40 or 50 years.
So combining them in this novel way
is something that is not natural
to a lot of these traditional storage providers.
And then I think the second trend that we see
that makes this necessary
is, to your point,
the rising size of data sets
that people have to deal with on a daily basis.
Whereas in 2010,
maybe you would have a database that had a million records,
like Facebook had some scale,
Twitter had some scale, but it was really localized.
And now there are so many companies that are building data infrastructure
dealing with things in the petabytes or even exabytes,
which requires extremely careful consideration,
that this now becomes a very important problem for a lot of people to solve.
I have so many questions, sir, because I've been a user of EFS for quite some time.
Actually, I had to build another file system caching layer for the startup I worked at, called PyTorch Lightning,
you know, just to be able to actually figure out how to cache data for machine learning and AI workloads,
mostly for performance.
And I feel like, like I said, it's not a new problem.
This is a problem that existed forever.
But the nature of the data usage patterns, I think,
and the sort of number of people doing this new pattern, definitely has changed now.
I don't think we had that many people trying to, like, load feature data in PyTorch
and then trying to cache some level of it
and trying to make it as fast as possible
and you don't even know what data set you would iterate from.
And I'm very curious because when you build a generic file system,
you're trying to figure out how to do the best sort of trade-offs, right?
Because every file system unfortunately has trade-offs all the time.
And why POSIX is good, but also bad sometimes,
is, like, the trade-offs are not always the most suitable for everybody.
So everybody trying to build a system on top of these file systems,
we all have to make a choice.
Okay, do I play it simple?
Do I eat the cost of the file system?
Or do I have to kind of like do something else, right?
And from your point of view, because you worked at Netflix, right?
I saw you worked on EFS.
Working at Netflix versus working on EFS is so different, right?
Because Netflix, you have much more special purpose.
You know what you're using it for.
When you work on a product, you kind of know what your customers are using it for,
but you kind of have to make a choice, right?
Now you're building a company again, right?
You're not working on any specialized product at Netflix as a user anymore, so how do you know what kind of users you are trying to focus on now?
You know, because like the access patterns, the scale and what do you even test for?
How do I make choices?
There's a lot of choices to make here.
I'm very curious, like, what is the typical patterns you're really optimized for?
So for folks that know that side of the world, like, they have this in mind, right?
Thinking of what you're doing here.
Yeah, I think that that's a good point.
And it was extremely interesting moving to Netflix from EFS.
I had been a product manager at AWS for three years doing this file system thing, talking to
customers, and I remember being at Netflix for about four weeks and thinking that my entire
worldview had changed, just being able to now be on the purchase side of AWS as opposed
to the selling side.
And I also think you're right about choices.
One of the things that is challenging about doing a startup is that the amount of time we
have to do engineering is so limited compared to the places like AWS and Netflix.
And so while we ultimately want to become this general purpose data storage that replaces
the block storage layer in the cloud, we do have to prioritize who we're able to serve
in the short term. So today, we're looking at a couple of different areas. One area is CICD
companies, which are running GitHub Actions, doing caching.
Traditionally, these workloads are very poorly served by shared storage
because they have so many small files.
And anyone who's ever used NFS before knows it does not work well with lots of small files.
We also see a lot of enterprises trying to quickly build infrastructure
to support their AI organizations.
And traditionally, this infrastructure is very stateful.
You have to download very large models and work with very large training sets,
which requires either expensive hardware or some kind of custom caching solution
or researchers to get really good at synchronizing data manually to a place like S3.
And so we're able to come into these companies and offer them a way to build
serverless Jupyter notebooks or serverless training or serverless inference on large data sets
because we take care of moving the data to the instances.
Interesting.
We just had Blacksmith on our pod earlier
to talk about CICD and their specialized hardware
and stuff like that.
I'm very curious.
Maybe talk about more about the CICD problem.
Because, you know, as engineers, we all use CICD, right?
And you kind of inherently understand
the file access patterns are not super-fast writes,
typically, unless you're doing crazy stress testing on the database or something.
But most of the time, it's more reading than writing, I assume, depending on what you're doing.
But a lot of dependencies loading, right?
A lot of, like, just getting environments set up.
And then you're executing a bunch of things, which writes some stuff, but you're not usually
spending the majority of the time writing, right?
And can you talk about, like, why did you pick CICD as the first use case in mind?
Is it because you want to speed things up?
That actually is a very big pain point.
And also, like, what's the typical file system thing that requires more attention?
Like, what's been the particular type of problem on a file system?
I think small files is a good point.
But, like, what other things do you have to keep in mind to solve a CICD problem really well?
Yeah.
So it's an interesting problem because I think it manifests in so many different ways across the industry.
For example, many of the CICD companies that we're working with today
are effectively doing Docker builds.
And the problem with that is you have to have this cache
of all of these layers that appears on every instance where you're doing the build.
That is actually the same problem as running a container platform at a company
and trying to deal with cold starts when you have an extremely large layer
that you need to download that delays starts.
Or you have a layer where someone's packaged the node modules directory into the layer,
and suddenly it takes all this time to unzip,
leading to latency in starting containers.
So these companies that are doing CICD
are facing these same problems
because they're dealing with the node modules directory.
And you're right, it is a very read-heavy workload
of how do we put, not to just focus on node modules,
but how do we put that node modules directory
somewhere centrally so that we can fan out
and run the build in parallel
across many instances
sharing that same package cache.
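To make the fan-out pattern concrete, here is a minimal sketch (an illustration of the idea, not how any particular CICD provider or Archil implements it): runners key a shared cache directory by the hash of the lockfile, the first runner populates it once, and every other runner just links the warm node modules directory into its workspace.

```python
# Hypothetical sketch of the fan-out pattern described above: many CI runners
# share one package cache keyed by the dependency lockfile, so only the first
# runner pays the install cost. Paths and commands are illustrative only, and
# races between concurrent runners are ignored in this sketch.
import hashlib
import os
import shutil
import subprocess

SHARED_CACHE = "/mnt/shared/npm-cache"   # e.g. a mounted shared file system

def lockfile_key(path: str) -> str:
    """Content-address the dependency set so identical lockfiles share a cache entry."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def ensure_node_modules(workspace: str) -> None:
    key = lockfile_key(os.path.join(workspace, "package-lock.json"))
    cached = os.path.join(SHARED_CACHE, key, "node_modules")

    if not os.path.isdir(cached):
        # First runner to see this lockfile installs once into the shared cache.
        subprocess.run(["npm", "ci"], cwd=workspace, check=True)
        os.makedirs(os.path.dirname(cached), exist_ok=True)
        shutil.copytree(os.path.join(workspace, "node_modules"), cached)
    else:
        # Every other runner just points its workspace at the warm cache.
        target = os.path.join(workspace, "node_modules")
        if not os.path.exists(target):
            os.symlink(cached, target)
```

The point is that once the cache lives on a shared, POSIX-compatible device, the link step is all a runner pays; there is no per-instance download.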
Is this only an optimization that applies to CICD,
or is it an optimization that you've carried into runtime as well,
for, like, container platforms?
And I'm kind of interested why you chose CICD versus potentially like other places,
especially given the fact that a lot of the data stack today is on top of like object storage.
I think that it's about who has the most pain.
And when we come back to like we have such limited engineering
resources. It's not that we don't plan to address all of these things in the next 12 months,
let's say. It's about where are we in the next 60 days and how can we best get there?
And because these CICD companies are dealing with this problem and ultimately have no choices
that work for storage, that's where we decided to start. But yes, I think what is great about
being a storage provider or a data provider in some sense is that the leverage in the stack at this place is immense.
So when we talk about doing these optimizations for CICD
and you asked about carrying them over to runtime,
yes, that just works because our storage is designed to do that.
And then when you ask about how do we focus on
maybe larger data sets, which are stored in object storage
that need to come down, that's something that we also do
pretty well and we'll continue to get better at over time.
So it's a good place for us to start,
but it's not where we'll be ending.
Gotcha. Okay.
And it'd be interesting to understand.
So are you going and selling to individual companies using CICD or directly to the CICD providers?
To providers.
Generally, in all cases, we're dealing with the layer of companies who have enough data that these problems manifest.
So either CICD providers or, not a customer, but an investor, Modal, like that kind of company,
which is dealing with the data movement on behalf of their customers,
we can solve the large data problem for them.
That's fascinating.
And so I'm curious because, like I said,
I was a user of these kind of stuff before.
And I feel like there's always a dilemma around how transparent this feels, right?
Because I want a file system that just works as POSIX, right?
I just want to write.
I just want to read.
I just want to list files.
I don't want to have extra things.
But when we were building a solution, I remember for ourselves, it was so important that we knew, hey, this is node module data that we want to be caching.
And I don't always want to, like, make you go through, like, 10 runs to figure that out, right?
Because then you really are sacrificing 10 runs latency to kind of like notice a pattern.
And so oftentimes, you know, like, the kernel has cache hints, right?
Everybody has like certain like ways to tag things to give hints to the underlying systems.
I'm curious on your side, how do you think about this, like, read-heavy workload?
Because I saw you have a caching layer on top.
That's really, like, a big part of how you get the performance.
You can't cache everything, right?
Because you just really blow up your cache sizes.
Is there certain things you are trying to just observe from the workloads
to make sure your caching doesn't go out of bounds that often?
Or are you trying to get the user to tell you certain things?
Like, how do you think of that trade-off here?
Because I think traditionally, in the NFS world, it's just writes, right?
Just reads.
And you sort of just figure it out with very poor, I would argue, pattern matching here.
So how do you think about that problem in general for you?
Yeah, I think that that is effectively the core thing that we need to solve as a company.
And our job, when we create a product, is how we create something that has progressive complexity
for our users that they can opt into as needed.
So I'm not sure when you plan to put this up,
but I believe in August we're planning to launch something called disk.new,
which is going to be very simple:
any individual developer can log in,
you can attach an S3 bucket,
you can attach a Hugging Face model,
and you get access to a shared device
that has all of this data on it for your entire application to use.
But then for larger companies or people who have more performance needs,
we want to be able to expose the ability for them to tell us what's going on.
Maybe they tell us that they're building a Docker layer cache at a certain part of the file system,
or they tell us that a certain part of the file system is an extremely large, immutable reference data set.
And knowing that semantic information allows us to do performance optimizations that other providers can't do.
And this is one of the key reasons why it's so important for us to do this at the file level.
Because a block device, of course, doesn't know what's stored where or semantically what's happening, but a file system does.
And so being able to, for example, in a MosaicML kind of run, have the application tell the file system what data is needed in the future allows us to almost preemptively do
the caching and send it to the client early, such that they're not waiting on us.
And there's a million optimizations like that that we hope to explore.
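As one concrete, standards-based example of the kind of hint being described here, POSIX already lets an application announce which byte ranges it will need soon via posix_fadvise; a remote-backed file system that sees hints like this can start pulling data from object storage before the read arrives. This is only an illustration of the hinting idea, not Archil's API:

```python
# Illustrative only: a training data loader announces upcoming reads with
# posix_fadvise(WILLNEED), a standard POSIX hint. A remote-backed file system
# could use hints like this to prefetch from object storage ahead of the reader.
import os

def read_samples(path: str, ranges: list[tuple[int, int]]) -> list[bytes]:
    fd = os.open(path, os.O_RDONLY)
    try:
        # Tell the file system which (offset, length) ranges we'll need next.
        for offset, length in ranges:
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)

        # By the time we actually read, a prefetching layer may already have
        # the bytes cached locally instead of fetching them on demand.
        samples = []
        for offset, length in ranges:
            samples.append(os.pread(fd, length, offset))
        return samples
    finally:
        os.close(fd)
```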
That's super cool.
I think actually that is probably one of the hardest things to find in the market back
then when I was doing this, because most file systems, like the Lustres of the world,
are so academic-focused almost.
Like, it's built for larger labs kind of feel.
And even the companies using these products are very bespoke type of thing.
They don't really roll out everywhere, you know, because, like, file system is not something you want to mess with, usually.
Like, it's really, really bad.
Everything goes wrong.
So I'm very curious, given that I know you've worked on EFS for quite a while, right?
You've done this before.
But I assume any customer here, it's like, oh, here's a brand new file system for you to try.
You've never heard of us.
You don't know who we are.
You know, go, go, go YOLO.
It's never that easy to trust a file system because it's fundamentally where your data sits.
And any corruption or any bad operation that happens here is very,
very tricky. What is sort of your path to show that you are ready? I don't think that's
an easy task at all. Do you have to do a certain type of testing? How do you earn trust from your
customers to start early with you guys? Yeah, I think that these are all great points. And if you
look even at what DeepSeek released with 3FS a couple of months ago, it's something very
exciting, but ultimately it is so specific to what AI model training needs, that it can't be used
for all of these things. And one of the interesting comments that I got when I spoke to a large
GPU cloud that uses VAST under the hood is that their customers love VAST for storing
these reference data sets, but they keep running into problems because the developers are
trying to pip install packages on it, and it's not made for that. So this is why I think being
able to go farther than other storage providers by effectively switching based on the directory
and based on what customers have told you, the performance of the file system is what's going
to make the best experience for the end user. Now, you also asked about trust, and I think that
that is the most important problem for storage, database, infrastructure companies,
in general. And ultimately, the only way to earn trust is by doing the right thing over and over
again. This is one of the reasons why I mentioned CICD as a starting point for us, because when
you're dealing with caching workloads or you're dealing with node modules, well, customers
are less fearful about putting what is ultimately a package cache on a new file system. And by earning trust
in that domain, we're allowed to then earn trust with more and more production critical workloads
until ultimately we're able to run anything.
But it's something that takes time.
I mean, I think trust is like the most, I mean, in any infrastructure situation,
trust is the key enabler, right?
Like, if you can't trust it, it doesn't matter.
And then there's a layer of things below it.
I'm curious, how does the rise of, like, ephemeral AI workloads kind of play into the way that you think about
the future of this type of layer you're trying to build?
But how are those differentiated? Does it make things like what you're building more needed?
Like, help us understand how you fit into that broader sea change that's going on, right?
Like, broadly speaking, AI will effectively force us to terraform our stacks.
We're going to go through a large up-leveling, and a bunch of new technology is going to be brought in as a part of that.
I'm curious to understand how what you're working on fits into that story.
I think that it is very clear that more applications are going to be written in the next five years than in the past 20 years. And I think if you look
at these platforms like Replit and, like, even Neon most recently, the core infrastructure that
these new applications need is something that is serverless and scales to zero. And what I'm so
excited about for our company and for our customers is that this kind of infrastructure has
never existed for unstructured file storage.
S3 does it for inactive, almost archive data,
but there's nothing that does it for the active drive
that's attached to your server.
And we hope to fill that void for the AI applications
that are being written.
The other interesting thing that I saw recently
is Fly.io made a post about the robots
that were coming to take over their stack
and how they've seen more usage from AI agents
in the past six months than humans.
And there is this one paragraph
that speaks about how
as people at fly.io,
they imagine that every customer just wants to store data
in a Postgres database.
And that is like the fundamental building block of applications.
But the agents that they see
prefer to store data in files
because ultimately it is a more stable,
easier to use interface that's accessible anywhere.
And so if we can provide the serverless, unstructured POSIX file storage for AI agents,
we expect to become the building block of this next generation of applications.
That's very interesting.
And I mean, ultimately at the end of the day,
it's like iteration speed, caching, and decentralization are like the core
tenets of any like AI workload, right?
I mean, everything's going to be hyper-ephemeral.
Everything needs to be very fast because your validation loop has to be very quick.
And you also, you know, data centrality,
and then also reducing the number of tool choices, right?
Like, it's very easy to build an agent or, you know,
some type of autonomous workload that can understand POSIX versus, like,
having it understand like a thousand different APIs, right?
Like, the tool surface area there is actually quite simple,
especially if the file format upon which that data is revealed is actually quite basic, right?
Like, this is kind of part of how ultimately this next generation of AI
actually represents a massive layer of consolidation and reduction of moat and destruction of moat,
more than anything else.
So I'm kind of curious, like, what do you think,
what's this going to do to the future of people's stacks?
Like, what does the future of buying and building
and creating software look like
as a result of some of these sea changes?
Well, I think from the AI trend at large,
obviously people have many strong opinions
about what's going to happen over the next couple of years.
My personal belief is that we're not going to get to a place
where every person becomes a developer.
We've talked about that in the late 2010s, ultimately it didn't pan out.
I think most people don't want to write their own software,
and most people want to offload this ability to make decisions about design
and maintainability and operations to someone else.
So I think that continues to be the case.
And then when you ask about what the teams who are building software need,
I think, like you said, it's how do we build these ephemeral, stateless apps easier and easier by using tools like serverless databases,
object storage.
I think our databases become more interesting and complex, like these vector databases
and graph databases.
And then ultimately, all of that data needs to be stored somewhere, which hopefully will
be us.
We might not have touched on this a little bit, but maybe it'll be clearer to tell folks, because I think
a lot of people, when they think of S3 and mounting locally, they think of all these
open source projects as available for us, you know, to use, like, S3FS and some variants of it, right?
Like, there is, like, existing projects out there to help you do that.
But obviously, it's not, it doesn't solve all problems.
So maybe talk, can you talk about, like, the problems exist?
Why can't I just use an off-the-shelf open source project to, like, just mount S3?
And what do you provide that really is, like, a big 10x difference for folks?
I think that the adapters that you see today fall into one of two camps.
You have adapters which are POSIX compliant, which would be things like JuiceFS or ObjectiveFS,
that store data in a format that you cannot read in S3.
So ultimately, it is a price play.
And then there are adapters which are not POSIX compliant.
This would be things like Mountpoint for S3, or things
like S3FS or Goofys, which are closer, but not quite there.
And as a result, you can't run applications or databases on top of them.
We want to build something that is both POSIX compliant
and allows users to have that data natively in S3 so that they own it
and can use it with the rest of their S3 ecosystem.
And in addition to that, by providing
this middle layer of SSDs,
we're able to do it with much higher performance
than any of these existing tools
that are ultimately just libraries
that you run on your instance that then talk to S3
and suffer the penalties of that for every request,
especially if that data is being accessed multiple times
across multiple instances.
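For listeners who want to see what "closer, but not quite there" means in practice, here is a tiny, generic probe (my sketch, not tied to any particular product) that exercises operations a fully POSIX-compliant mount handles but many object-storage adapters reject or emulate expensively: in-place byte overwrites, appends, and renames.

```python
# A quick, generic probe of POSIX behaviors that object-storage adapters often
# lack: overwriting bytes in place, appending, and renaming. Point it at a
# directory on the mount you want to test. Sketch for illustration only.
import os

def probe_posix(mount_dir: str) -> None:
    path = os.path.join(mount_dir, "probe.bin")

    # Write an initial file.
    with open(path, "wb") as f:
        f.write(b"A" * 1024)

    # 1. Overwrite a few bytes in the middle (objects must be replaced whole).
    with open(path, "r+b") as f:
        f.seek(512)
        f.write(b"BBBB")

    # 2. Append without rewriting the rest of the file.
    with open(path, "ab") as f:
        f.write(b"C" * 16)

    # 3. Rename; on a flat object keyspace this is typically a copy plus a delete.
    renamed = os.path.join(mount_dir, "probe-renamed.bin")
    os.rename(path, renamed)

    with open(renamed, "rb") as f:
        data = f.read()
    assert data[512:516] == b"BBBB" and len(data) == 1024 + 16
    os.remove(renamed)
```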
And so since you've been working on this layer,
I'm very curious, to solve this problem well,
you know, we talked about we have to
figure out what is the right trade-offs and, you know, there's a lot of infrastructure in the
middle here, like SSDs and the caching. There's a lot of things to do here. What is maybe a
challenge that you have to solve here that you didn't even have to solve back in Netflix
or EFS times? That's maybe a unique infrastructure challenge for your career.
Well, I think that whenever you combine multiple storage systems, so this would be our SSDs
and then something like S3, it becomes extremely important to make sure that changes are replicated
in the sort of replica system in the correct order and in the correct way to avoid things like
corruption so that users who are trying to access data from both sides get something like
a consistent view.
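To give a flavor of what "replicated in the correct order" means here, a toy sketch (my illustration of the general technique, not Archil's design): journal each change with a sequence number on the fast side, and have a background replicator apply entries to the slower copy strictly in that order, so nobody ever observes a later change without the earlier ones.

```python
# Toy illustration of the ordering concern (not Archil's design): every change
# gets a monotonically increasing sequence number in a local journal, and a
# background replicator applies entries to the slower copy (e.g. S3) strictly
# in order, so a reader of the replica never sees change N+1 without change N.
import itertools
import queue
import threading

_seq = itertools.count(1)
_journal: "queue.Queue[tuple[int, str, bytes]]" = queue.Queue()
_lock = threading.Lock()

def record_write(key: str, data: bytes) -> int:
    """Fast path: assign a sequence number and journal the change atomically."""
    with _lock:
        seq = next(_seq)
        _journal.put((seq, key, data))
    return seq

def replicate_forever(apply_to_replica) -> None:
    """Drain the journal strictly in sequence order; never skip or reorder."""
    expected = 1
    while True:
        seq, key, data = _journal.get()
        assert seq == expected, "a gap here would mean an inconsistent replica"
        apply_to_replica(key, data)  # e.g. an object-store put, retried until durable
        expected += 1

# Example (hypothetical replica writer):
# threading.Thread(target=replicate_forever, args=(put_to_s3,), daemon=True).start()
```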
And these are non-trivial problems that we did not have to solve at AWS because we were
building a primary storage system that was EFS. And we did not have to solve at Netflix
because we were dealing with specialty infrastructure that was very purpose built to the problem
at hand. So as always, you know, when you make things generic and when you combine things
and when you add performance requirements, the complexity explodes in how you build these
things. And I'm very thrilled with what the team has been able to put together in such a
short amount of time that we can get out to our customers. Very cool. All right, we want to jump
into our favorite section of this podcast called a spicy future. Spicy futures. So obviously,
tell us something that you believe that you think most people don't believe yet. So, you know,
my spicy take is that file storage, and this may be surprising based on what I've said, file storage is the future.
It is the future storage interface that I think is going to take over the world, more so than S3 and more so than blocks.
All right, I guess we have to elaborate more here.
I think maybe not everybody understands the differences between blocks
and files and S3, if you're not into the storage space much.
So give us a little bit of a rundown.
Like, what is the problem of files versus, I guess, object storage here, right?
And why do you think file is the future?
Like, what's the fundamental thing files give you that's better?
And why would it just continue to be the better choice for folks?
So the three fundamental kinds of storage that providers and clouds sell are object storage,
which is like S3, and allows users the ability to create an immutable,
entire object, and then have a pointer on the side that has a key that allows you to then
reference that object later. S3 does a good job of making it look like a file system, but
ultimately there's no such thing as directories in S3. It's just a flat key space, and
there's no ability to overwrite like a single byte within an object. You would have to
re-upload an entire 10-gigabyte object. A block storage device is one that works like a hard drive
that's attached to a computer.
It has a fixed array of blocks.
And the API that it exposes is
read block at this offset and write block at this offset.
And then, of course, the Linux kernel, or the Mac kernel,
or the Windows kernel, provides this ability
to add a file system on top of a block device,
so it's easy to see what's going on,
and everything is built on this file abstraction at this point.
But notably, you can't take a hard drive
and plug it into multiple machines at one time.
That would cause corruption.
And also, these hard drives have a finite size
because they are ultimately just an array of blocks.
On the other hand, file storage is the idea
that you expose an API that actually matches
all of the things that you can do on files and folders.
Create file, delete file, rename file,
create folder, write to the middle of a file,
read from the middle of the file.
It's a much richer and much more complex interface to implement.
But I think it allows you to gain this 10x improvement in power
because it allows you to unlock both the ability for the server
to have infinite capacity like S3.
It unlocks the ability to run with all of the tools,
the binaries that run on Linux today, if you're POSIX compatible.
And I think it unlocks one of these core tenets of Linux that everything is at the end of the day a file.
The Linux kernel exposes, through the dev file system, the sys file system, the proc file system,
all of these control structures for the machine that make the file system the universal interface
to everything that you might need to do.
And where I think we come in and where I think the world is going is how we make the file system
a universal interface to all of the world's data.
How do we make the file system a universal interface to something like a vector database,
where we can create the database for you on the side if you put all of the files in a special
folder for us?
And again, I think there are hundreds of things that the file system could be used for
in this way that vastly simplifies the life of developers.
if we have a layer that is flexible enough to expose them to the developers.
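To make that contrast concrete, here is a small sketch (bucket, key, and file names are made up; the boto3 calls are the standard S3 API): changing a few bytes in the middle of an object means downloading and re-uploading the whole thing, and a "rename" on the flat keyspace is really a copy plus a delete, while on a POSIX file both are single in-place calls.

```python
# Contrast sketch: object-store semantics vs. POSIX file semantics.
# Bucket/key/file names are hypothetical; the boto3 calls are the standard API.
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"

# --- Object storage: no partial overwrite, no real rename -------------------
def object_update_middle(key: str, new_bytes: bytes, offset: int) -> None:
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()   # download it all
    body = body[:offset] + new_bytes + body[offset + len(new_bytes):]
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)              # re-upload it all

def object_rename(old_key: str, new_key: str) -> None:
    # The keyspace is flat, so "rename" is a server-side copy plus a delete.
    s3.copy_object(Bucket=BUCKET, Key=new_key,
                   CopySource={"Bucket": BUCKET, "Key": old_key})
    s3.delete_object(Bucket=BUCKET, Key=old_key)

# --- File storage: both operations are single in-place calls ----------------
def file_update_middle(path: str, new_bytes: bytes, offset: int) -> None:
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, new_bytes, offset)   # overwrite just those bytes
    finally:
        os.close(fd)

def file_rename(old_path: str, new_path: str) -> None:
    os.rename(old_path, new_path)          # atomic on a POSIX file system
```

The object-side functions have to move the full object over the network; the file-side functions touch only the bytes involved.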
Interesting.
I mean, I'm sitting here thinking,
I've been thinking a lot about object storage and OLTP workloads just in the last week.
Is your vision that, like, this enables us to actually build true OLTP databases on top of object storage?
That's right.
Is that the idea?
And can you, for the audience, can you help people understand, like, today with existing object storage APIs, why
that's hard? Or maybe I don't understand, but, like, how does this open up more
transactional workloads versus analytics workloads that could live on top of object
storage? Well, if you look at what the company Neon had to do to make OLTP
possible on top of object storage: they had to fork Postgres, they had to write
their own storage server, then they had to run a fleet of SSDs to stash that
data. Object storage on its own doesn't work because, like
we spoke about with Mountpoint or S3FS or Goofys, it's too slow for what transactional workloads need.
So for writes, you need a place that's very, very fast, like a local disk, to put them.
And for reads, you need to have the ability to predict what the user needs and then bring it into that fast storage before the user is waiting on it.
So yes, this layer becomes a critical component for companies like Databricks and Snowflake and Neon in order to
get the economics that they need to offer either their OLAP or OLTP databases to their
customers, and that's the layer that we make no longer necessary.
Interesting.
And then, like, what type of speedups and use cases do you think we get from the ability
to, like, modify a file directly instead of having to do replacements?
Like, is it both a conceptual, like, simplicity from, like, a file standpoint, like, the ability
to modify, append only?
And then also, I guess the other component of my question is: what use cases become possible
that weren't, or what use cases are possible today but actually are inadvisable, that become
highly, like, that give that, like, 10x, where it's like, okay, we could have done it before, but, like, it
would have been awful, so, like, now we can really do it and this is amazing? I think that the
idea around serverless Jupyter notebooks is effectively that shape of problem, where today, if
you're a company like Hex, what you have to do is run a Jupyter notebook,
give users some amount of storage space on a local device,
because a Jupyter notebook environment ultimately is a local file system,
give them the ability to download data to that environment
so they can use it frequently.
But then at some point, the user is no longer using the Jupyter notebook,
and you need to shut it down.
And the question remains, what do you do with that storage?
Do you snapshot it and upload it to S3?
That takes time.
It takes time when you shut down that storage, and it
also takes time when you try to start it again. Do you leave it up to the researchers
and tell them that the drive is ephemeral? That also can work, but researchers generally don't
like interfacing with S3 and manually synchronizing the data themselves. So if we build something
that is so POSIX compliant that you could run Linux on it, then any of these stateful applications
that you might run on an instance immediately become serverless, because we have
a storage device that doesn't charge our customers when people aren't using it.
That data flows to S3.
So if a researcher stops using a Jupyter notebook, the company can just shut down the instance
and know that the data is both safe and stored in a way that's not crazy expensive.
And so as a result, do you think, like, what do you think this does to the future of, like,
people's stacks, right?
Like, does this mean we end up with, like, really fat clients and no servers that talk directly
to, like, these APIs? Or do the servers that do exist, like, actually, they're not
really stateful, they're basically a proxy, a pass-through, in the same way that something
like WarpStream is for Kafka compatibility on top of object storage? Like, what's
the future of infrastructure look like, assuming this layer exists? What's that
mean for us? Yeah, I think that's right. Like, I think that it is a world in which servers
don't have to stay around if nobody's using them.
It's a world in which the Lambda function as a service model works,
but you don't have to rewrite your application into a function.
You just launch a serverless full Linux box,
and then you can run software as complex as an ERP
in a way that scales down to zero when nobody's using it.
That's real cool.
So I guess maybe the last question
for me is: I think this sort of, like, future of files is super intriguing. What is the way for folks
to even get started? Like, are you ready for folks to even get that magic in some little fashion?
Is it truly trying it in a CICD way? Is it using one of the CICD products you've already
partnered with? Like, where do we get hands on this sort of, like, glimpse of the future, you know?
So again, depending on when this airs, I would love for people to go to Archil.com and sign up for
our wait list. Tell us a little bit about the products that you're trying to build,
the servers, the applications you're trying to run, and we will get back to you to understand
more and make sure that we're a good fit for you. But in mid or early August, we're planning to
launch that disk.new experience, which will allow customers to go out and try a file system
that is maybe not as high performance as we're able to give our enterprise customers on the
back end, but just a taste of what it's like to be able to do anything on top of object
storage in a stateless way.
Cool. All right.
I think we'll make sure this comes out in August time to make that happen.
But disk.new sounds like a very cool way to actually get to run AI stuff
at least more performant than what's possible, right?
With, like, even just open source options out there.
So super cool.
I guess, beyond asking about the usage of disk.new:
Where can people even find you?
Do you have social channels?
What are a typical way for folks to hear more about your spicy hot takes?
Where do we get more of that?
Oh, yeah.
I'm always posting about files on Twitter.
So you can follow me on Twitter at J.H. Leath.
I'm on LinkedIn posting content about this as well.
But yeah, feel free to follow.
Send me angry messages.
If you don't like files, I totally understand it.
The people I like talking to most are the people who think this is a terrible idea.
So if you're out there listening and you've heard this podcast and you think this is insane, please reach out to me.
I would love to talk to you more.
So for all the angry storage people, Hunter is here for you.
That's right.
And all sorts of questions.
Awesome.
Thanks so much for being on our pod.
This was super fun, and thank you for being here.
Thanks so much for having me.
Awesome.
Thank you.