The Infra Pod - Future of File Storage for AI (Chat with Hunter, CEO of Archil)
Episode Date: September 8, 2025

Ian (Keycard) and Tim (Essence VC) engage in an insightful discussion with Hunter Leath, CEO of Archil. The episode delves into the motivations behind Archil, a new data startup focused on revolutionizing data storage for cloud applications. Hunter explains the limitations of existing data storage paradigms like S3 and traditional block storage, advocating for a new approach that leverages SSDs and custom protocols to offer high-performance, infinite storage that can support modern workloads, including AI and CI/CD applications. The conversation also touches on the trust and complexity of implementing such a system, the future vision for file storage, and practical use cases ranging from serverless Jupyter notebooks to large-scale CI/CD operations.

Chapters:
00:17 The Gap in Cloud Data Storage
01:48 Understanding Unstructured Data
02:41 Building Modern Data Systems
04:15 Challenges and Innovations in File Storage
11:12 Targeting CI/CD and AI Workloads
23:00 The Future of File Storage
Transcript
Welcome to the InfraPod.
This is Tim from Essence VC, and Ian, let's go.
Hey, this is Ian Livingston, lover of new data things.
Couldn't be more excited to have the CEO and founder of Archil, Hunter Leath, on the pod today.
What in the world convinced you to say, hey, you know what we need?
Another data startup.
Well, you know, it was not my intention to do a startup, actually.
I just progressively became radicalized working closely with engineering, seeing what they needed, and found this huge gap that no one was building and had to do it,
basically.
So what's the gap, help us understand what you saw?
So fundamentally, data storage in the cloud has not changed since the launch of S3.
The cloud got us on-demand servers and on-demand object storage, but the devices that are
attached to these servers are fundamentally still hard drives and SSDs, technology from the 70s and the 80s.
And the cloud is supposed to be this infinite on-demand place.
So why does everyone have to plan for the capacity of data attached to each server,
move data onto each server, and do all that synchronization themselves?
We think that the right way to approach this is with a primitive that gives people infinite storage
attached to their servers
that automatically synchronizes
from the data sources
where data is supposed to live.
That's incredibly interesting.
And so when you say data is supposed to live,
you mean like what type of data
are you talking about here?
Are we talking about like a binary blob file?
Are we talking about like Salesforce and CRM systems?
Like what's data mean in this context?
So for me, because I've spent too long
working on file systems,
when I talk about data, I talk about unstructured data.
So stuff that's not sitting in databases. This is a ton of image files that you might be using for computer vision. This could
be a ton of PDFs that you're throwing into a model for RAG. All of these kinds of data
that people generally want to store in S3, but have to download to a server to use.
Okay, cool. And I mean, I think this is, like, a very interesting statement to put out into the world.
I just would love to understand, like, how the hell do you make this actually happen, right?
Like, this is sort of the 2000-era block storage device, you know, HPC cluster-style solution, right?
So I'd love to understand, like, how do you build those types of systems for the modern era and the modern cloud data landscape?
Well, I think that there have been a lot of attempts over the past 20 years to build something that does this, that exposes object storage locally.
And for the most part, those attempts have failed
because either they don't expose that data
in a way that's truly POSIX compatible,
so that regular applications like FFmpeg or kdb
or even SQLite can really use it,
or they are using object storage
as the next hop from the machine.
And object storage notoriously is very slow
compared to EBS, block devices,
NVMe. And so what we've done is, kind of like you mentioned, built a SAN of sorts, where we
manage a large fleet of servers that have NVMe solid state drives attached to them. And then we've
built a custom protocol that allows us to communicate from our customers' instances to this
SAN in a block-like, high-speed, high-performance way that gets you those local-like speeds.
Interesting.
So I mean, like I have a thousand questions, right?
Because underneath this is like, you know, you have data locality questions.
You have, like, how do you even get, like, the downstream data represented in the file format questions?
I have questions about caching.
I have questions about file format.
I have endless questions.
Like what is the fundamental thing that we've been missing from doing this, right?
Ignoring, like, the ecosystem problem, but, like, is there some type of fundamental
breakthrough or something that has made this the moment where we can say, you know,
what? Now we can create sort of, like, a new unified
file system that maps better to, like, how computing works today versus how computing worked in,
you know, the 1970s when Unix and co. came around. Frankly, no. I wish that I could come on and tell
you that we've made this incredible breakthrough that has enabled this. And this has actually
been very disappointing to my friends and family, hearing that I'm going on this journey and that
there's not some magical invention.
But ultimately, it is going back to file storage as the primitive protocol to use.
And the problem is that historically, NFS and these file storage protocols have this bad rap
about their performance and chattiness that makes them unsuitable for cloud workloads today.
I spent eight years on an AWS product called the Elastic File System,
which is effectively infinite NFS storage in the cloud.
And during that time,
I was able to really analyze how people used the file system
and how NFS the protocol works.
And the big leap for us was realizing
that we could build something that kept the file semantics of NFS.
So you could continue to understand what was stored,
do the synchronization on the back end,
in a native format to a place like S3
and provide the infinite capacity
that ultimately block devices can't do
because they are designed to have a finite capacity
with some small semantic changes
that allow us to sidestep
many of the performance problems
that NFS has historically had.
But ultimately, these techniques are as old as time.
What do you think is a reason
that no one has done
this before, right? Like, if we've always had these primitives, we've got POSIX file systems,
like, what has been the primary blocker? Has it been, you know, some cultural shift in the same way
that, like, every 10 years we reinvent the stack and everything's different? Or, you know,
was there some type of, like, hey, we had to wait for a certain amount of, like, data to, like,
be online and have those standards sort of, like, kind of coalesce to something that makes this
more of a tractable problem? If you were to, you know, sit down and kind of say the why now,
What is the why now?
I think that there are two reasons that this hasn't happened before.
One track of reasons is just around the difficulty of building something like this,
where fundamentally building a block storage device is not so scary.
I can store a block and retrieve a block.
Building an object storage device is not so scary.
Building an NFS device is also not so scary.
These are understood problems that the industry has coalesced around
over the past 40 or 50 years.
So combining them in this novel way
is something that is not natural
to a lot of these traditional storage providers.
And then I think the second trend that we see
that makes this necessary
is, to your point,
the rising size of data sets
that people have to deal with on a daily basis.
Whereas in 2010,
maybe you would have a database that had a million records,
like Facebook had some scale,
Twitter had some scale, but it was really localized.
And now there are so many companies that are building data infrastructure
dealing with things in the petabytes or even exabytes,
which requires extremely careful consideration,
that this now becomes a very important problem for a lot of people to solve.
I have so many questions, sir, because I've been a user of EFS for quite some time.
Actually, I had to build another file system caching layer for the startup I worked at, called PyTorch Lightning,
you know, just to be able to actually figure out how to cache data for machine learning and AI workloads,
mostly for performance.
And I feel like, like I said, it's not a new problem.
This is a problem that existed forever.
But the nature of the data usage patterns, I think,
and the sort of number of people doing this new pattern, definitely has changed now.
I don't think we had that many people trying to, like, load feature data in PyTorch
and then trying to cache some level of it
and trying to make it as fast as possible
and you don't even know what data set you would iterate from.
And I'm very curious because when you build a generic file system,
you're trying to figure out how to do the best sort of trade-offs, right?
Because every file system unfortunately has trade-offs all the time.
And why POSIX is good, but also bad sometimes,
is, like, the trade-offs are not always the most suitable for everybody.
So everybody trying to build a system on top of these file systems,
we all have to make a choice.
Okay, do I play it simple?
Do I eat the cost of the file system?
Or do I have to kind of like do something else, right?
And from your point of view, because you worked at Netflix, right?
I saw you worked on EFS.
Working at Netflix versus working on EFS is so different, right?
Because Netflix, you have much more special purpose.
You know what you're using it for.
When you work on a product, you kind of know what your customers are using it for,
but you kind of have to make a choice, right?
Now you're building a company again, right?
You're not working on any specialized product at Netflix as a user anymore, so how do you know what kind of users you are trying to focus on now?
You know, because like the access patterns, the scale and what do you even test for?
How do I make choices?
There's a lot of choices to make here.
I'm very curious, like, what is the typical patterns you're really optimized for?
So for folks that know that side of the world, like, they have this in mind, right?
Thinking of what you're doing here.
Yeah, I think that that's a good point.
And it was extremely interesting moving to Netflix from EFS.
I had been a product manager at AWS for three years doing this file system thing, talking to
customers, and I remember being at Netflix for about four weeks and thinking that my entire
worldview had changed, just being able to now be on the purchase side of AWS as opposed
to the selling side.
And I also think you're right about choices.
One of the things that is challenging about doing a startup is that the amount of time we
have to do engineering is so limited compared to the places like AWS and Netflix.
And so while we ultimately want to become this general purpose data storage that replaces
the block storage layer in the cloud, we do have to prioritize who we're able to serve
in the short term. So today, we're looking at a couple of different areas. One area is CICD
companies, which are running GitHub Actions, doing caching.
Traditionally, these workloads are very poorly served by shared storage
because they have so many small files.
And anyone who's ever used NFS before knows it does not work well with lots of small files.
We also see a lot of enterprises trying to quickly build infrastructure
to support their AI organizations.
And traditionally, this infrastructure is very stateful.
You have to download very large models and work with very large training sets,
which requires either expensive hardware or some kind of custom caching solution
or researchers to get really good at synchronizing data manually to a place like S3.
And so we're able to come into these companies and offer them a way to build
serverless Jupyter notebooks or serverless training or serverless inference on large data sets
because we take care of moving the data to the instances.
Interesting.
We just had Blacksmith on our pod earlier
to talk about CICD and their specialized hardware
and stuff like that.
I'm very curious.
Maybe talk about more about the CICD problem.
Because, you know, as engineers, we all use CICD, right?
And you kind of inherently understand
the file access patterns are not super-fast writes,
typically, unless you're doing crazy stress testing on the database or something.
But most of the time, it's more reading than writing, I assume, depending on what you're doing.
But a lot of dependencies loading, right?
A lot of, like, just getting environments set up.
And then you're executing a bunch of things, which writes some stuff, but you're not usually
spending the majority of the time writing, right?
And can you talk about, like, why did you pick CICD as the first use case in mind?
Is it because you want to speed things up?
That actually is a very big pain point.
And also, like, what's the typical file system thing that requires more attention?
Like, what's been the particular type of problem on a file system?
I think small files is a good point.
But, like, what other things do you have to keep in mind to solve a CICD problem really well?
Yeah.
So it's an interesting problem because I think it manifests in so many different ways across the industry.
For example, many of the CICD companies that we're working with today
are effectively doing Docker builds.
And the problem with that is you have to have this cache
of all of these layers that appears on every instance where you're doing the build.
That is actually the same problem as running a container platform at a company
and trying to deal with cold starts when you have an extremely large layer
that you need to download that delays starts.
Or you have a layer where someone's packaged the node modules directory into the layer,
and suddenly it takes all this time to unzip,
leading to latency in starting containers.
So these companies that are doing CICD
are facing these same problems
because they're dealing with the node modules directory.
And you're right, it is a very read-heavy workload
of how do we put, not to just focus on node modules,
but how do we put that node modules directory
somewhere centrally so that we can fan out
and run the build in parallel
across many instances
sharing that same package cache.
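To make the fan-out pattern concrete, here is a minimal sketch (an illustration of the idea, not how any particular CICD provider or Archil implements it): runners key a shared cache directory by the hash of the lockfile, the first runner populates it once, and every other runner just links the warm node modules directory into its workspace.

```python
# Hypothetical sketch of the fan-out pattern described above: many CI runners
# share one package cache keyed by the dependency lockfile, so only the first
# runner pays the install cost. Paths and commands are illustrative only, and
# races between concurrent runners are ignored in this sketch.
import hashlib
import os
import shutil
import subprocess

SHARED_CACHE = "/mnt/shared/npm-cache"   # e.g. a mounted shared file system

def lockfile_key(path: str) -> str:
    """Content-address the dependency set so identical lockfiles share a cache entry."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def ensure_node_modules(workspace: str) -> None:
    key = lockfile_key(os.path.join(workspace, "package-lock.json"))
    cached = os.path.join(SHARED_CACHE, key, "node_modules")

    if not os.path.isdir(cached):
        # First runner to see this lockfile installs once into the shared cache.
        subprocess.run(["npm", "ci"], cwd=workspace, check=True)
        os.makedirs(os.path.dirname(cached), exist_ok=True)
        shutil.copytree(os.path.join(workspace, "node_modules"), cached)
    else:
        # Every other runner just points its workspace at the warm cache.
        target = os.path.join(workspace, "node_modules")
        if not os.path.exists(target):
            os.symlink(cached, target)
```

The point is that once the cache lives on a shared, POSIX-compatible device, the link step is all a runner pays; there is no per-instance download.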
Is this only an optimization that applies to CICD,
or is it an optimization that you've carried into runtime as well,
for, like, container platforms?
And I'm kind of interested why you chose CICD versus potentially like other places,
especially given the fact that a lot of the data stack today is on top of like object storage.
I think that it's about who has the most pain.
And when we come back to like we have such limited engineering
resources. It's not that we don't plan to address all of these things in the next 12 months,
let's say. It's about where are we in the next 60 days and how can we best get there?
And because these CICD companies are dealing with this problem and ultimately have no choices
that work for storage, that's where we decided to start. But yes, I think what is great about
being a storage provider or a data provider in some sense is that the leverage in the stack at this place is immense.
So when we talk about doing these optimizations for CICD
and you asked about carrying them over to runtime,
yes, that just works because our storage is designed to do that.
And then when you ask about how do we focus on
maybe larger data sets, which are stored in object storage
that need to come down, that's something that we also do
pretty well and we'll continue to get better at over time.
So it's a good place for us to start,
but it's not where we'll be ending.
Gotcha. Okay.
And it'd be interesting to understand.
So are you going and selling to individual companies using CICD or directly to the CICD providers?
To providers.
Generally, in all cases, we're dealing with the layer of companies who have enough data that these problems manifest.
So either CICD providers or, not a customer, but an investor, Modal, like that kind of company,
which is dealing with the data movement on behalf of their customers,
we can solve the large data problem for them.
That's fascinating.
And so I'm curious because, like I said,
I was a user of these kind of stuff before.
And I feel like there's always a dilemma around how transparent this feels, right?
Because I want a file system that just works as POSIX, right?
I just want to write.
I just want to read.
I just want to list files.
I don't want to have extra things.
But when we were building a solution, I remember for ourselves, it was so important that we knew, hey, this is node module data that we want to be caching.
And I don't always want to, like, make you go through, like, 10 runs to figure that out, right?
Because then you really are sacrificing 10 runs latency to kind of like notice a pattern.
And so oftentimes, you know, like, the kernel has cache hints, right?
Everybody has like certain like ways to tag things to give hints to the underlying systems.
I'm curious on your side, how do you think about this, like, read-heavy workload?
Because I saw you have a caching layer on top.
That's really, like, a big part of how you get the performance.
You can't cache everything, right?
Because you just really blow up your cache sizes.
Is there certain things you are trying to just observe from the workloads
to make sure your caching doesn't go out of bounds that often?
Or are you trying to get the user to tell you certain things?
Like, how do you think of that trade-off here?
Because I think traditionally, in the NFS world, it's just writes, right?
Just reads.
And you sort of just figure it out with very poor, I would argue, pattern matching here.
So how do you think about that problem in general for you?
Yeah, I think that that is effectively the core thing that we need to solve as a company.
And our job, when we create a product, is how we create something that has progressive complexity
for our users that they can opt into as needed.
So I'm not sure when you plan to put this up,
but I believe in August we're planning to launch something called disk.new,
which is going to be very simple:
any individual developer can log in,
you can attach an S3 bucket,
you can attach a Hugging Face model,
and you get access to a shared device
that has all of this data on it for your entire application to use.
But then for larger companies or people who have more performance needs,
we want to be able to expose the ability for them to tell us what's going on.
Maybe they tell us that they're building a Docker layer cache at a certain part of the file system,
or they tell us that a certain part of the file system is an extremely large, immutable reference data set.
And knowing that semantic information allows us to do performance optimizations that other providers can't do.
And this is one of the key reasons why it's so important for us to do this at the file level.
Because a block device, of course, doesn't know what's stored where or semantically what's happening, but a file system does.
And so being able to, for example, in a MosaicML kind of run, have the application tell the file system what data is needed in the future allows us to almost preemptively do
the caching and send it to the client early, such that they're not waiting on us.
And there's a million optimizations like that that we hope to explore.
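As one concrete, standards-based example of the kind of hint being described here, POSIX already lets an application announce which byte ranges it will need soon via posix_fadvise; a remote-backed file system that sees hints like this can start pulling data from object storage before the read arrives. This is only an illustration of the hinting idea, not Archil's API:

```python
# Illustrative only: a training data loader announces upcoming reads with
# posix_fadvise(WILLNEED), a standard POSIX hint. A remote-backed file system
# could use hints like this to prefetch from object storage ahead of the reader.
import os

def read_samples(path: str, ranges: list[tuple[int, int]]) -> list[bytes]:
    fd = os.open(path, os.O_RDONLY)
    try:
        # Tell the file system which (offset, length) ranges we'll need next.
        for offset, length in ranges:
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)

        # By the time we actually read, a prefetching layer may already have
        # the bytes cached locally instead of fetching them on demand.
        samples = []
        for offset, length in ranges:
            samples.append(os.pread(fd, length, offset))
        return samples
    finally:
        os.close(fd)
```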
That's super cool.
I think actually that is probably one of the hardest things to find in the market back
then when I was doing this, because most file systems, like the Lustres of the world,
are so academic-focused almost.
Like, it's built for larger labs kind of feel.
And even the companies using these products are very bespoke type of thing.
They don't really roll out everywhere, you know, because, like, file system is not something you want to mess with, usually.
Like, it's really, really bad.
Everything goes wrong.
So I'm very curious, given that I know you've worked on EFS for quite a while, right?
You've done this before.
But I assume any customer here, it's like, oh, here's a brand new file system for you to try.
You've never heard of us.
You don't know who we are.
You know, go, go, go YOLO.
It's never that easy to trust a file system because it's fundamentally where your data sits.
And any corruption or any bad operation that happens here is very,
very tricky. What is sort of your path to show that you are ready? I don't think that's
an easy task at all. Do you have to do a certain type of testing? How do you earn trust from your
customers to start early with you guys? Yeah, I think that these are all great points. And if you
look even at what DeepSeek released with 3FS a couple of months ago, it's something very
exciting, but ultimately it is so specific to what AI model training needs, that it can't be used
for all of these things. And one of the interesting comments that I got when I spoke to a large
GPU cloud that uses VAST under the hood is that their customers love VAST for storing
these reference data sets, but they keep running into problems because the developers are
trying to pip install packages on it, and it's not made for that. So this is why I think being
able to go farther than other storage providers by effectively switching based on the directory
and based on what customers have told you, the performance of the file system is what's going
to make the best experience for the end user. Now, you also asked about trust, and I think that
that is the most important problem for storage, database, infrastructure companies,
in general. And ultimately, the only way to earn trust is by doing the right thing over and over
again. This is one of the reasons why I mentioned CICD as a starting point for us, because when
you're dealing with caching workloads or you're dealing with node modules, well, customers
are less fearful about putting what is ultimately a package cache on a new file system. And by earning trust
in that domain, we're allowed to then earn trust with more and more production critical workloads
until ultimately we're able to run anything.
But it's something that takes time.
I mean, I think trust is like the most, I mean, in any infrastructure situation,
trust is the key enabler, right?
Like, if you can't trust it, it doesn't matter.
And then there's a layer of things below it.
I'm curious, how does the rise of, like, ephemeral AI workloads kind of play into the way that you think about
the future of this type of layer you're trying to build?
But how are those differentiated? Does it make things like what you're building more needed?
Like, help us understand how you fit into that broader sea change that's going on, right?
Like, broadly speaking, AI will effectively force us to terraform our stacks.
We're going to go through a large up-leveling, and a bunch of new technology is going to be brought in as a part of that.
I'm curious to understand how what you're working on fits into that story.
I think that it is very clear that more applications are going to be written in the next five years than in the past 20 years. And I think if you look
at these platforms like Replit and, like, even Neon most recently, the core infrastructure that
these new applications need is something that is serverless and scales to zero. And what I'm so
excited about for our company and for our customers is that this kind of infrastructure has
never existed for unstructured file storage.
S3 does it for inactive, almost archive data,
but there's nothing that does it for the active drive
that's attached to your server.
And we hope to fill that void for the AI applications
that are being written.
The other interesting thing that I saw recently
is Fly.io made a post about the robots
that were coming to take over their stack
and how they've seen more usage from AI agents
in the past six months than humans.
And there is this one paragraph
that speaks about how
as people at fly.io,
they imagine that every customer just wants to store data
in a Postgres database.
And that is like the fundamental building block of applications.
But the agents that they see
prefer to store data in files
because ultimately it is a more stable,
easier to use interface that's accessible anywhere.
And so if we can provide the serverless, unstructured POSIX file storage for AI agents,
we expect to become the building block of this next generation of applications.
That's very interesting.
And I mean, ultimately at the end of the day,
it's like iteration speed, caching, and decentralization are like the core
tenets of any like AI workload, right?
I mean, everything's going to be hyper-ephemeral.
Everything needs to be very fast because your validation loop has to be very quick.
And you also, you know, data centrality,
and then also reducing the number of tool choices, right?
Like, it's very easy to build an agent or, you know,
some type of autonomous workload that can understand POSIX versus, like,
having it understand like a thousand different APIs, right?
Like, the tool surface area there is actually quite simple,
especially if the file format upon which that data is revealed is actually quite basic, right?
Like, this is kind of part of how ultimately this next generation of AI
actually represents a massive layer of consolidation and reduction of moat and destruction of moat,
more than anything else.
So I'm kind of curious, like, what do you think,
what's this going to do to the future of people's stacks?
Like, what does the future of buying and building
and creating software look like
as a result of some of these sea changes?
Well, I think from the AI trend at large,
obviously people have many strong opinions
about what's going to happen over the next couple of years.
My personal belief is that we're not going to get to a place
where every person becomes a developer.
We've talked about that in the late 2010s, ultimately it didn't pan out.
I think most people don't want to write their own software,
and most people want to offload this ability to make decisions about design
and maintainability and operations to someone else.
So I think that continues to be the case.
And then when you ask about what the teams who are building software need,
I think, like you said, it's how do we build these ephemeral, stateless apps easier and easier by using tools like serverless databases,
object storage.
I think our databases become more interesting and complex, like these vector databases
and graph databases.
And then ultimately, all of that data needs to be stored somewhere, which hopefully will
be us.
We might not have touched on this a little bit, but maybe it'll be clearer to tell folks, because I think
a lot of people, when they think of S3 and mounting locally, they think of all these
open source projects as available for us, you know, to use, like, S3FS and some variants of it, right?
Like, there is, like, existing projects out there to help you do that.
But obviously, it's not, it doesn't solve all problems.
So maybe talk, can you talk about, like, the problems exist?
Why can't I just use an off-the-shelf open source project to, like, just mount S3?
And what do you provide that really is, like, a big 10x difference for folks?
I think that the adapters that you see today fall into one of two camps.
You have adapters which are POSIX compliant, which would be things like JuiceFS or ObjectiveFS,
that store data in a format that you cannot read in S3.
So ultimately, it is a price play.
And then there are adapters which are not POSIX compliant.
This would be things like Mountpoint for S3, or things
like S3FS or Goofys, which are closer, but not quite there.
And as a result, you can't run applications or databases on top of them.
We want to build something that is both POSIX compliant
and allows users to have that data natively in S3 so that they own it
and can use it with the rest of their S3 ecosystem.
And in addition to that, by providing
this middle layer of SSDs,
we're able to do it with much higher performance
than any of these existing tools
that are ultimately just libraries
that you run on your instance that then talk to S3
and suffer the penalties of that for every request,
especially if that data is being accessed multiple times
across multiple instances.
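For listeners who want to see what "closer, but not quite there" means in practice, here is a tiny, generic probe (my sketch, not tied to any particular product) that exercises operations a fully POSIX-compliant mount handles but many object-storage adapters reject or emulate expensively: in-place byte overwrites, appends, and renames.

```python
# A quick, generic probe of POSIX behaviors that object-storage adapters often
# lack: overwriting bytes in place, appending, and renaming. Point it at a
# directory on the mount you want to test. Sketch for illustration only.
import os

def probe_posix(mount_dir: str) -> None:
    path = os.path.join(mount_dir, "probe.bin")

    # Write an initial file.
    with open(path, "wb") as f:
        f.write(b"A" * 1024)

    # 1. Overwrite a few bytes in the middle (objects must be replaced whole).
    with open(path, "r+b") as f:
        f.seek(512)
        f.write(b"BBBB")

    # 2. Append without rewriting the rest of the file.
    with open(path, "ab") as f:
        f.write(b"C" * 16)

    # 3. Rename; on a flat object keyspace this is typically a copy plus a delete.
    renamed = os.path.join(mount_dir, "probe-renamed.bin")
    os.rename(path, renamed)

    with open(renamed, "rb") as f:
        data = f.read()
    assert data[512:516] == b"BBBB" and len(data) == 1024 + 16
    os.remove(renamed)
```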
And so since you've been working on this layer,
I'm very curious, to solve this problem well,
you know, we talked about we have to
figure out what is the right trade-offs and, you know, there's a lot of infrastructure in the
middle here, like SSDs and the caching. There's a lot of things to do here. What is maybe a
challenge that you have to solve here that you didn't even have to solve back in Netflix
or EFS times? That's maybe a unique infrastructure challenge for your career.
Well, I think that whenever you combine multiple storage systems, so this would be our SSDs
and then something like S3, it becomes extremely important to make sure that changes are replicated
in the sort of replica system in the correct order and in the correct way to avoid things like
corruption so that users who are trying to access data from both sides get something like
a consistent view.
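To give a flavor of what "replicated in the correct order" means here, a toy sketch (my illustration of the general technique, not Archil's design): journal each change with a sequence number on the fast side, and have a background replicator apply entries to the slower copy strictly in that order, so nobody ever observes a later change without the earlier ones.

```python
# Toy illustration of the ordering concern (not Archil's design): every change
# gets a monotonically increasing sequence number in a local journal, and a
# background replicator applies entries to the slower copy (e.g. S3) strictly
# in order, so a reader of the replica never sees change N+1 without change N.
import itertools
import queue
import threading

_seq = itertools.count(1)
_journal: "queue.Queue[tuple[int, str, bytes]]" = queue.Queue()
_lock = threading.Lock()

def record_write(key: str, data: bytes) -> int:
    """Fast path: assign a sequence number and journal the change atomically."""
    with _lock:
        seq = next(_seq)
        _journal.put((seq, key, data))
    return seq

def replicate_forever(apply_to_replica) -> None:
    """Drain the journal strictly in sequence order; never skip or reorder."""
    expected = 1
    while True:
        seq, key, data = _journal.get()
        assert seq == expected, "a gap here would mean an inconsistent replica"
        apply_to_replica(key, data)  # e.g. an object-store put, retried until durable
        expected += 1

# Example (hypothetical replica writer):
# threading.Thread(target=replicate_forever, args=(put_to_s3,), daemon=True).start()
```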
And these are non-trivial problems that we did not have to solve at AWS because we were
building a primary storage system that was EFS. And we did not have to solve at Netflix
because we were dealing with specialty infrastructure that was very purpose built to the problem
at hand. So as always, you know, when you make things generic and when you combine things
and when you add performance requirements, the complexity explodes in how you build these
things. And I'm very thrilled with what the team has been able to put together in such a
short amount of time that we can get out to our customers. Very cool. All right, we want to jump
into our favorite section of this podcast called a spicy future. Spicy futures. So obviously,
tell us something that you believe that you think most people don't believe yet. So, you know,
my spicy take is that file storage, and this may be surprising based on what I've said, file storage is the future.
It is the future storage interface that I think is going to take over the world, more so than S3 and more so than blocks.
All right, I guess we have to elaborate more here.
I think maybe not everybody understands the differences between blocks
and files and S3, if you're not into the storage space much.
So give us a little bit of a rundown.
Like, what is the problem of files versus, I guess, object storage here, right?
And why do you think file is the future?
Like, what's the fundamental thing files give you that's better?
And why would it just continue to be the better choice for folks?
So the three fundamental kinds of storage that providers and clouds sell are object storage,
which is like S3, and allows users the ability to create an immutable,
entire object, and then have a pointer on the side that has a key that allows you to then
reference that object later. S3 does a good job of making it look like a file system, but
ultimately there's no such thing as directories in S3. It's just a flat key space, and
there's no ability to overwrite like a single byte within an object. You would have to
re-upload an entire 10-gigabyte object. A block storage device is one that works like a hard drive
that's attached to a computer.
It has a fixed array of blocks.
And the API that it exposes is
read block at this offset and write block at this offset.
And then, of course, the Linux kernel, or the Mac kernel,
or the Windows kernel, provides this ability
to add a file system on top of a block device,
so it's easy to see what's going on,
and everything is built on this file abstraction at this point.
But notably, you can't take a hard drive
and plug it into multiple machines at one time.
That would cause corruption.
And also, these hard drives have a finite size
because they are ultimately just an array of blocks.
On the other hand, file storage is the idea
that you expose an API that actually matches
all of the things that you can do on files and folders.
Create file, delete file, rename file,
create folder, write to the middle of a file,
read from the middle of the file.
It's a much richer and much more complex interface to implement.
But I think it allows you to gain this 10x improvement in power
because it allows you to unlock both the ability for the server
to have infinite capacity like S3.
It unlocks the ability to run with all of the tools,
the binaries that run on Linux today, if you're POSIX compatible.
And I think it unlocks one of these core tenets of Linux that everything is at the end of the day a file.
The Linux kernel exposes, through the dev file system, the sys file system, the proc file system,
all of these control structures for the machine that make the file system the universal interface
to everything that you might need to do.
And where I think we come in and where I think the world is going is how we make the file system
a universal interface to all of the world's data.
How do we make the file system a universal interface to something like a vector database,
where we can create the database for you on the side if you put all of the files in a special
folder for us?
And again, I think there are hundreds of things that the file system could be used for
in this way that vastly simplifies the life of developers.
if we have a layer that is flexible enough to expose them to the developers.
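To make that contrast concrete, here is a small sketch (bucket, key, and file names are made up; the boto3 calls are the standard S3 API): changing a few bytes in the middle of an object means downloading and re-uploading the whole thing, and a "rename" on the flat keyspace is really a copy plus a delete, while on a POSIX file both are single in-place calls.

```python
# Contrast sketch: object-store semantics vs. POSIX file semantics.
# Bucket/key/file names are hypothetical; the boto3 calls are the standard API.
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"

# --- Object storage: no partial overwrite, no real rename -------------------
def object_update_middle(key: str, new_bytes: bytes, offset: int) -> None:
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()   # download it all
    body = body[:offset] + new_bytes + body[offset + len(new_bytes):]
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)              # re-upload it all

def object_rename(old_key: str, new_key: str) -> None:
    # The keyspace is flat, so "rename" is a server-side copy plus a delete.
    s3.copy_object(Bucket=BUCKET, Key=new_key,
                   CopySource={"Bucket": BUCKET, "Key": old_key})
    s3.delete_object(Bucket=BUCKET, Key=old_key)

# --- File storage: both operations are single in-place calls ----------------
def file_update_middle(path: str, new_bytes: bytes, offset: int) -> None:
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, new_bytes, offset)   # overwrite just those bytes
    finally:
        os.close(fd)

def file_rename(old_path: str, new_path: str) -> None:
    os.rename(old_path, new_path)          # atomic on a POSIX file system
```

The object-side functions have to move the full object over the network; the file-side functions touch only the bytes involved.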
Interesting.
I mean, I'm sitting here thinking,
I've been thinking a lot about object storage and OLTP workloads just in the last week.
Is your vision that, like, this enables us to actually build true OLTP databases on top of object storage?
That's right.
Is that the idea?
And can you, for the audience, can you help people understand, like, today with existing object storage APIs, why
that's hard? Or maybe I don't understand, but, like, how does this open up more
transactional workloads versus analytics workloads that could live on top of object
storage? Well, if you look at what the company Neon had to do to make OLTP
possible on top of object storage: they had to fork Postgres, they had to write
their own storage server, then they had to run a fleet of SSDs to stash that
data. Object storage on its own doesn't work because, like
we spoke about with Mountpoint or S3FS or Goofys, it's too slow for what transactional workloads need.
So for writes, you need a place that's very, very fast, like a local disk, to put them.
And for reads, you need to have the ability to predict what the user needs and then bring it into that fast storage before the user is waiting on it.
So yes, this layer becomes a critical component for companies like Databricks and Snowflake and Neon in order to
get the economics that they need to offer either their OLAP or OLTP databases to their
customers, and that's the layer that we make no longer necessary.
Interesting.
And then, like, what type of speedups and use cases do you think we get from the ability
to, like, modify a file directly instead of having to do replacements?
Like, is it both a conceptual, like, simplicity from, like, a file standpoint, like, the ability
to modify, append only?
And then also, I guess the other component of my question is: what use cases become possible
that weren't, or what use cases are possible today but actually are inadvisable, that become
highly, like, that give that, like, 10x, where it's like, okay, we could have done it before, but, like, it
would have been awful, so, like, now we can really do it and this is amazing? I think that the
idea around serverless Jupyter notebooks is effectively that shape of problem, where today, if
you're a company like Hex, what you have to do is run a Jupyter notebook,
give users some amount of storage space on a local device,
because a Jupyter notebook environment ultimately is a local file system,
give them the ability to download data to that environment
so they can use it frequently.
But then at some point, the user is no longer using the Jupyter notebook,
and you need to shut it down.
And the question remains, what do you do with that storage?
Do you snapshot it and upload it to S3?
That takes time.
It takes time when you shut down that storage, and it
also takes time when you try to start it again. Do you leave it up to the researchers
and tell them that the drive is ephemeral? That also can work, but researchers generally don't
like interfacing with S3 and manually synchronizing the data themselves. So if we build something
that is so POSIX compliant that you could run Linux on it, then any of these stateful applications
that you might run on an instance immediately become serverless, because we have
a storage device that doesn't charge our customers when people aren't using it.
That data flows to S3.
So if a researcher stops using a Jupyter notebook, the company can just shut down the instance
and know that the data is both safe and stored in a way that's not crazy expensive.
And so as a result, do you think, like, what do you think this does to the future of, like,
people's stacks, right?
Like, does this mean we end up with, like, really fat clients and no servers that talk directly
to, like, these APIs? Or do the servers that do exist, like, actually, they're not
really stateful, they're basically a proxy, a pass-through, in the same way that something
like WarpStream is for Kafka compatibility on top of object storage? Like, what's
the future of infrastructure look like, assuming this layer exists? What's that
mean for us? Yeah, I think that's right. Like, I think that it is a world in which servers
don't have to stay around if nobody's using them.
It's a world in which the Lambda function as a service model works,
but you don't have to rewrite your application into a function.
You just launch a serverless full Linux box,
and then you can run software as complex as an ERP
in a way that scales down to zero when nobody's using it.
That's real cool.
So I guess maybe the last question
for me is: I think this sort of, like, future of files is super intriguing. What is the way for folks
to even get started? Like, are you ready for folks to even get that magic in some little fashion?
Is it truly trying it in a CICD way? Is it using one of the CICD products you've already
partnered with? Like, where do we get hands on this sort of, like, glimpse of the future, you know?
So again, depending on when this airs, I would love for people to go to Archil.com and sign up for
our wait list. Tell us a little bit about the products that you're trying to build,
the servers, the applications you're trying to run, and we will get back to you to understand
more and make sure that we're a good fit for you. But in mid or early August, we're planning to
launch that disk.new experience, which will allow customers to go out and try a file system
that is maybe not as high performance as we're able to give our enterprise customers on the
back end, but just a taste of what it's like to be able to do anything on top of object
storage in a stateless way.
Cool. All right.
I think we'll make sure this comes out in August time to make that happen.
But disk.new sounds like a very cool way to actually get to run AI stuff
at least more performant than what's possible, right?
With, like, even just open source options out there.
So super cool.
I guess, beyond asking about the usage of disk.new:
Where can people even find you?
Do you have social channels?
What are a typical way for folks to hear more about your spicy hot takes?
Where do we get more of that?
Oh, yeah.
I'm always posting about files on Twitter.
So you can follow me on Twitter at J.H. Leath.
I'm on LinkedIn posting content about this as well.
But yeah, feel free to follow.
Send me angry messages.
If you don't like files, I totally understand it.
The people I like talking to most are the people who think this is a terrible idea.
So if you're out there listening and you've heard this podcast and you think this is insane, please reach out to me.
I would love to talk to you more.
So for all the angry storage people, Hunter is here for you.
That's right.
And all sorts of questions.
Awesome.
Thanks so much for being on our pod.
This was super fun, and thank you for being here.
Thanks so much for having me.
Awesome.
Thank you.