The Infra Pod - Why do we need a cloud native data lake for geospatial data? Chat with cofounders of Earthmover (Ryan and Joe)
Episode Date: January 13, 2025. Future of Data in Climate and Earth Sciences. In this episode of The Infra Pod, Tim and Ian sat down with the cofounders of Earthmover, Ryan and Joe, to discuss the cloud-native multidimensional data lake they're building for climate and geospatial data. They delve into the role of advanced data handling methods, multidimensional array data, the impact of cloud-native technologies, and how these influence various sectors ranging from insurance to environmental resilience. 00:00 Introduction 00:09 Excitement for Earthmover 00:44 Ryan's Background 03:46 Joe's Involvement 07:54 Definition and Value of Earth Science Data 10:08 Industry Applications 14:37 Challenges and Traditional Methods 19:19 Cloud-Native Solutions 27:09 Data Management in the Cloud 35:05 Spicy Future! 40:46 Contact and Conclusion
Transcript
Well, welcome to the InfraPod.
This is Tim from Essence and Ian. Let's go.
Man, I am crazy excited.
I think I said every pod, so I don't know how valuable that is at this point.
But we're excited.
We have Ryan and Joe from EarthMover here today to talk about the future of data for climate and earth sciences,
geoengineering, and basically everything about
the analog world in which we live. So I couldn't be more excited to learn this. This is very novel
and new and also super cool, but also unique. So Ryan, tell us a little about yourself. Give us
a little background and we'd love to learn more about you and about your company and how you
started. Yeah, awesome. Super excited to be here. Thanks so much for the invitation, you guys.
Yeah, so I'm Ryan Abernathey. I'm the CEO and co-founder of Earthmover. And my background,
the way I got into this world of startups and data is really through my career as a scientist.
So I am a computational oceanographer and climate
modeler. What that means is I spent most of my career trying to understand how the ocean works
and how it interacts with the rest of the earth system through the use of theory, you know, basic physics, observational data, and hardcore numerical simulation.
And that work over the course of my career became increasingly data intensive up to the
point where we were now looking at data archives and file sets that were measured in petabyte
size. And something happened. And I just switched to
that lower level meta problem of how are we going to process all of this data? How are we going to
analyze all of this data? How are we actually going to make sense of all these petabytes that
are coming to us at an accelerating rate within earth system science? That's the basic journey that I've been on. And from a sort of
infrastructure point of view, that's taken me from the world of like traditional academic research
computing environment, which is basically like a hodgepodge of servers, you know, on-prem in the
basement and national level supercomputing facilities, so like traditional HPC technology
to the cloud and trying to figure out how are we going to work with this type of data in the cloud.
And so, you know, the overall motivation for our work is that this domain of scientific and engineering work that involves large earth system
data sets is extremely painful for the practitioners who are working there. There are
very few of these sort of high-level services and databases and tooling that make data analytics easy and pleasant to a data scientist or a
developer. And there are tons of tech debt and old fashioned architecture and inefficient workflows.
And in the research world, that's a problem for, you know, grad students and postdocs who are just
used to having to deal with that pain as the cost of doing business. But as we see an increasing number of commercial organizations, you know, trying to build products and innovative
solutions around climate, environmental resilience, those problems are now becoming business problems,
and they have a real cost for these type of companies. And so that's creating a market and
a demand for solutions. And that's where Earthmover comes in, basically.
That's really cool.
Joe, so how did you get involved with Ryan and this endeavor around Earthmover to try and solve this?
How did I get sucked into this whole thing?
Earth sciences and all the fun that's related to data.
Yeah, so Ryan and I share part of our origin story here in the sense that we're both trained scientists. So, you know, Ryan's an oceanographer. I did a PhD in civil engineering, but focused on climate model development and climate modeling, really trying to understand what the future climate is going to mean for, you know, the water cycle, the built environment, the places that people live, and that sort of stuff.
And yeah, so the part of the story that overlaps is I was also generating terabytes to petabytes of data early in my research career, and I very quickly came to the conclusion that, like, we don't have the tools for this amount of data. I can't see myself just trudging through data like this; we need to do something about this.
So I somewhat through happenstance got connected with some open source projects and started working on one in particular called xarray, which is how I met Ryan.
And, you know, xarray kind of forms the basis of a lot of this.
So xarray is an open source Python project.
Think of it like pandas, but for multidimensional arrays.
So it's multidimensional arrays, metadata, indexes. And so I started working on that. And over time, the impact of working on open source and building tools around data trumped my desire for a career as an academic. And I just got pretty addicted to having that sort of real-time impact on the work of many more people.
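(For listeners who haven't used xarray, here is a minimal sketch of the data model Joe is describing: a labeled multidimensional array with named dimensions, coordinates, and metadata. The variable names and values below are invented purely for illustration.)

```python
# A tiny xarray example: a labeled 3-D array (time x lat x lon),
# analogous to a pandas DataFrame but generalized to N dimensions.
import numpy as np
import pandas as pd
import xarray as xr

temps = xr.DataArray(
    data=np.random.rand(4, 3, 5),                    # fake temperature values
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2024-01-01", periods=4),
        "lat": [10.0, 20.0, 30.0],
        "lon": [0.0, 30.0, 60.0, 90.0, 120.0],
    },
    attrs={"units": "degC"},                         # metadata travels with the array
)

# Label-based selection and reductions, like pandas but across N dimensions.
print(temps.sel(lat=20.0, lon=60.0))                 # time series at one grid point
print(temps.mean(dim="time"))                        # 2-D map of the time mean
```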
So I started working on xarray.
We started a project called the Pangeo Project, which we may come back to, you know, from an architecture perspective.
I think it's pretty important here.
Yeah, so Ryan and I worked together for a bunch of years doing that.
And then right before I started, you know, we started Earthmover.
I was at a nonprofit called Carbon Plan where we were doing climate science and climate policy under one roof.
And so we had a team of both policy people and scientists and data engineers trying to move as quickly as possible to use data and data science to answer policy relevant questions.
And there I was like full practitioner, just like, you know, any of these companies that are trying
to operationalize these data pipelines. And it's really then that I convinced myself that, like,
there are missing infrastructure layers in the stack for us as a society to be able to use this
kind of data to, you know, to enact the change we want
to see. So that convinced me to start the company. And yeah, we can go into any of those pieces
as we get into this. I think it'd be really helpful for the audience because it's such,
you know, to be honest, when I met you two, I was like, I knew a good amount of the space
because I actually at one point worked on catastrophe modeling for insurance companies. So I know a lot about HPC clusters and using
volumetric data to try and generate a statistical distribution of what the
probable outcome is, doing floodplain analysis and all these things to basically
discover, hey, if we take this reinsurance contract,
are we going to go bankrupt? That's basically the reinsurance industry.
It's the industry that insures the
insurers. And ultimately, at the end of the day, it's your last resort insurer. And so that comes
down to, can we basically understand when we're going to make money on this contract? Are we going
to go bankrupt? Because it's a zero or one outcome in reinsurance, effectively, at the end of the
day. It's like, you either make money or the company's over on most contracts. So can you
help our listeners? And I think Tim
and I understand, you know, today, what are the in-industry, revenue-producing places where, you know, this earth science data, right, climate data or science data, is used to generate money, or places where it could be? Because I think the first problem in this domain is really having people better understand, one, why it matters from a theoretical or research perspective, but more importantly, why does this matter from an industry perspective, and what industries are most impacted by this problem space?
Yeah, it's a great question. Maybe I'll take a minute to sharpen our definition of what type of data we're talking about, and then, you know, maybe Joe can talk a little bit about some of the work they did at Carbon Plan, maybe around carbon markets and carbon accounting and stuff like that.
But the type of data that we are talking about from a technical perspective is really characterized as large multidimensional array data sets, aka tensors, as they are called in the machine learning world, right? And so that's really the characteristic of this data
that is so particular and unique
and that makes it ill-suited for most existing databases
and data systems.
And it's not quite the same thing as geospatial data, right?
Which is a kind of much more established category.
Certainly has a big overlap,
but it's a lower-level concept.
It's not specific necessarily to any one domain.
You have tons of data like this, for example,
generated by microscopes, right?
That scan cell volumes or by seismic instruments
that are imaging the interior of the earth
looking for oil, not to mention astronomy.
There's not really a lot of commercial applications
for astronomy data, but you know, it's the same data model, right? Massive images stacked in time, stacked across other
dimensions, three, four, five dimensional tensors. That's why it's so easy to get petabytes. What
produces that type of data? Well, most imaging systems or scanning systems, sensor systems
produce that type of data. Simulations use these type of grids to model,
you know, any kind of physical system using, say, finite volume grids and discretization.
And then traditionally, they are dumped out into a zoo of different file formats with like HDF5
probably being the most common one. But you know, every domain has got its own sort of bespoke
custom file formats. And so that's sort of the type of data
we're talking about in this, you know, climate and weather and environmental data space. It's
really largely going to be satellite data and simulation data from weather and climate models.
And so that's sort of setting the stage. Yeah, maybe I'll kick it over to Joe and he can tell
you what like a data pipeline looks like to go from the raw data to like something that has actual business value.
Yeah. So, yeah, I'll talk about two core application areas in the kind of climate and weather space.
You know, the first is around carbon. If you follow the climate crisis at all, like you understand that like we have a carbon problem in the atmosphere right now.
Like we're putting more carbon in the atmosphere than the land and ocean can take up. And there's really two like areas of work in this category.
So you have the first category is like tracking emissions. And so, you know, this is like asking,
where's that carbon coming from? And how much is it? And that's a remarkably hard thing to track.
A reference point here is the Climate Trace Project.
So climatetrace.org would be a good reference for folks.
But that project's trying to use satellites to estimate emissions and essentially the
carbon budget using satellites alone.
And it's a super ambitious and tricky problem.
But when we were at Carbon Plan, we tried to do the forest sector.
So just understand from the change of carbon in trees, where is there a sink and where is there a source?
And that required, you know, processing, you know, many, many, many terabytes of satellite images, doing machine learning,
and then ultimately generating a prediction based on, you know, some inference of, you know, where carbon is being taken up and where it's not.
That's on the emission side.
There's also on the removal side, people trying to do the same data science for carbon markets.
I could pay you to not cut down your forest so that forest can continue to sequester carbon.
And there we have to have good validation.
We have to have good tracking.
And you have to be really clear about the provenance of how you made decisions.
And so, again, like data science comes in really heavily there and being able to process data at scale and accurately is really important.
The second area I'll mention, which is even probably more relevant to folks, is on the risk side. So how do we estimate the future risk of
extreme weather, floods, heat waves, et cetera? And there we need climate models and essentially
projections of how the future is going to, you know, how a warmer future is going to impact
weather and long-term climate. And to do that, we ingest data from HPC centers that have
put out, you know, many petabytes of data, that data gets stored in the cloud. And we have to run
that through, again, essentially machine learning pipelines or process-based models to try to
understand where those applications are. And there are companies out there today that will sell you
climate risk information. You want to know for your house what the risk of forest fire or flood is, like you can go out there and buy that
information. And it's an emerging market, but it's one where there's, you know, quite a bit of
long-term economic value to us understanding where those risks are. I want to just like add
one more real quick use case, which is we were talking with Ian before we started recording about this. Since we've started
the company, we've actually seen the most demand sort of organic pull for what we're working on
from folks working with weather data in sort of commodities and energy space. And it's the same
shape of data. It's, you know, weather forecast data has the same basic shape, these big tensors, you know, gridded fields of predictions, but on a much shorter timescale, and therefore much more relevant to the business cycle. We also see a lot of movement on AI weather forecasting; there was just a new DeepMind forecast paper
out in Nature yesterday, a new ensemble forecast. So there's a ton of innovation happening in weather
forecasting. And we're kind of finding ourselves right in the middle of that, helping weather data
move more efficiently through the cloud. I mean, that use case makes tons of sense to me.
Everyone knows, if you've watched The Social Network, about Eduardo Saverin making 300 grand or whatever off oil futures speculation, right? This stuff drives markets, and increasingly, as we have a more fractious world as a result of a more volatile climate, this is really important. And also, I think the idea of, like, the astronomical data is really quite interesting as well, and I can only imagine the applications for volumetric data in space, as an example, and prediction of, like, satellite movement or collision detection or, you know, all these things; they're hugely important. Then even things like, I know from my own experience, things like aquaculture, a lot of this comes down to, there's a lot of volumetric data associated in those fields as well.
I'm quite curious, you know, there was a world where we didn't have this type of data, right?
What's the traditional way that people have gone about trying to solve this problem?
We couldn't have petabytes of data, but we still tried to build forecast models.
And so I'm curious, what's the driver in industry to move towards the usage of more
higher volume data formats?
So like, you know, you're using these sort of tensor formats you discussed.
I assume it's about accuracy of the prediction, but it would be really great to understand,
like, is it that we couldn't do any of these things before
or that we were doing them
and the results were just really slow and terrible
and doing it this way gives us significantly higher accuracy
in the same way that LLMs
have given us significantly better solution
for natural language.
Like, what's the sort of delta in industry from,
okay, we were doing this really bad slow thing.
It kind of gave us like a 60% solution. And now this new thing opens up all these new opportunities.
And I assume there's a lot of components here, but it also sounds like there's also been
a shift as well occurring in the traditional way that people approach this problem space.
Yeah, it's a great question. And a bunch
of things are wrapped up into understanding how this tech space has evolved
over the last 10 years.
So to narrow the scope, like let's just talk about weather data, right?
Because it's like a fairly constrained problem.
Weather forecasts are created primarily by the government.
Public sector agencies like NOAA, the National Weather Service, or the European Centre for Medium-Range Weather Forecasts in Europe. It's something that we've been doing for 50 years as a society, investing in public sector weather forecasting
as a common good for society.
It's an incredible achievement technologically
that we can do this.
And people love to complain about weather forecasts,
but if you really look at the data
and you quantitatively assess the skill of those forecasts,
you can see them just getting better and better and better, you know, year on year. And that has massive value for
society. Now, how are those forecasts made? Well, mostly the government operates programs to collect
weather data, raw observations through things like weather balloons or satellites or on the
ground weather stations. But those are just data.
They don't make any prediction.
And so those are fed through a data assimilating numerical atmospheric model.
And that runs on a dedicated supercomputer that's, you know,
owned and operated by the National Weather Service that only does this.
And it churns out these weather forecasts every hour, constantly. And
that basically dumps out files in a format called GRIB, which, you know, if your listeners have
heard of GRIB files, congratulations, you're a weather nerd. Most people never have. But
if this is your world, you know exactly what it is and how painful it is. And those are put on
an FTP server. And that's where essentially the
public sector stops. And then there's a whole sort of ecosystem of downstream vendors and service
providers who take those and try and translate them into more useful information. So they consume
those files. And ultimately, those go into like the app that's on your phone when you like, you know, look at a weather forecast.
Right now, traditionally, all this processing chain is very much associated with like the HPC world.
That's where the files originate. That's where most of the practitioners in the domain are trained.
And so, you know, this is the world of like MPI and, you know, like batch processing and big expensive file systems and things like that.
So one big thing that is changing is like the ubiquity of the cloud in just about any startup or really any business.
Like they want to use the cloud and there's a real impedance mismatch between this sort of FTP server based way of distributing data and the modern cloud-native stack.
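(As a rough sketch of the legacy flow Ryan describes: a GRIB file is pulled from a public server and decoded locally before anything can be queried. The file name and variable name below are placeholders, and this assumes xarray with the cfgrib engine installed.)

```python
# Sketch of the traditional workflow: download one GRIB file, decode it locally,
# then extract the value you actually wanted. Paths and variable names are hypothetical.
import xarray as xr

# Step 1: fetch a single forecast file (in practice, from an FTP/HTTP server), e.g.
# urllib.request.urlretrieve("https://example.gov/forecasts/model.t00z.grib2", "forecast.grib2")

# Step 2: decode the GRIB file (requires the cfgrib engine and eccodes).
ds = xr.open_dataset("forecast.grib2", engine="cfgrib")

# Step 3: the classic "give me the forecast at point X" query.
point = ds["t2m"].sel(latitude=37.77, longitude=237.58, method="nearest")
print(point.values)

# Repeating this file by file across a multi-decade archive is where
# the FTP-plus-files pattern starts to hurt.
```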
And then the other big driver, I would just say data analytics in general
and AI in particular, there's real value you can get from having all the data.
So the traditional paradigm for weather data is like,
well, I'm just going to want this one point.
Give me the forecast data at point X.
And there's dozens of vendors out there who will have some API you can query and give you that data at point X.
But if you say, actually, I'm a commodities trader and I'd like 20 years of all the forecast history, all variables, all pressure levels for the entire record.
And I don't care if it's a petabyte of data.
I know how to make millions of dollars a month on this data, and I need it to train my AI system. There's no good answer to that,
really, from that legacy stack. And so that's where the more modern cloud-native approach to
Earth system data comes in. Cool. So I think we really want to talk about what is this cloud-native
thingy we're talking about here.
And also, I think it's going to be really interesting because a lot of what I can read from your website or open source is really about this multidimensional array sort of file format.
And just looking deeper into it, there are some conscious choices we have to make that are a little bit different.
It looks like there's chunking, ordering, compression, ways of doing parallel reads, uncoordinated writes with Icechunk. I think for our infra folks and
engineers out there, I never really dealt with this kind of thing before. I worked on Spark,
worked on Kafka, worked on a lot of processing, but not this kind of thing. What is the most
unique part of this scientific data computing stack that's very different than your traditional
like in parallel distributed storage or computing in general?
I'll grab this one and then maybe Joe can talk about some of the tools that are used
in this space.
Fundamentally, like what we're trying to advocate and pursue at Earthmover and in our
community more broadly is that there is this data model that deserves to be considered like structured data,
but that isn't like relational tabular data, right? If you look at like the data space more broadly,
not focusing on this problem space, you basically see like, well, you've got basically two choices.
You've got structured data and there you've got your rows and columns in a table and you could
put it in a Parquet file, or you could put it in Snowflake, or you could put it in Postgres, but everyone knows what they mean, structured data.
Then you've got unstructured data, which is, in its most generic form, just a bunch of files somewhere that you write some sort of custom script to loop over.
There's maybe something in between semi-structured data, but I don't really know what that means.
Is it JSON or something? I don't know.
What's the real definition of structured data, though? It's that it's a more compact representation that
leverages the inherent structure in the data. Like you could take all your tables and store them as
JSON or something like that, but it would be slow. It would be inefficient. It wouldn't use storage
well. But because there's this structure inherent in the data, you can use that to optimize both the
storage and the compute on top of that data. And that's what databases do. And they have a very
specific data model that they work with. Our big hypothesis is that there exists a data model also
for a lot of scientific and engineering domains that isn't the relational model, but it's still
highly structured. And the best way to describe it is, let's just consider
a data set of temperature at every point on the earth over time. This is a sort of classic cube,
right? And there's a certain, you know, just one of those slices you can think of as an image,
but then it can be very high resolution, and then bring in a time dimension. So this is a
three-dimensional array, right? And you've got
orthogonal indexes on each of those dimensions. Maybe you've got latitude on one, you've got
longitude on another, and then you've got, you know, time along the third, right? And, you know,
you could take that data and squash it down to be tabular, right? You could make a row for latitude
or a column for latitude, column for longitude, column for time, and then your temperature value.
You will have just 4x'd the volume of data you have to store because you're throwing
away that inherent multidimensional structure in the data.
And you've destroyed other information within that data, such as the connectivity between
pixels, you know, that's inherent in, say, an image. And so, you know, what the
multidimensional array data model offers is a much more efficient way to both store and query
that type of data that leverages this inherent multidimensional structure. And so when we think
about our stack, it starts from storage. How do we store this data effectively
such that we can provide good access to it and support the sort of query patterns that
workflows require? And then two, how do we like orchestrate computing on top of that in a way that
is fast and performant and can scale out? I don't know if that's answering your question, Tim, but
like that's kind of where I start for a person who's unfamiliar. Why do we need something at all? Like, why not just use
Parquet and, like, DuckDB? Like, it's, you know, the standard stack, right? And the answer
is that it doesn't really fit. And if you try and use those tools, you pay a big price.
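(A quick back-of-the-envelope sketch of the blow-up Ryan describes when a cube is flattened into rows. The grid sizes and byte widths are illustrative assumptions, not numbers from the episode.)

```python
# Why flattening a (time x lat x lon) cube into a table inflates storage:
# every row repeats its coordinates, while the array layout stores them once.
# Illustrative sizes: a quarter-degree global grid, hourly for one year.
n_lat, n_lon, n_time = 721, 1440, 24 * 365

value_bytes = 8                                      # one float64 temperature per cell

# Array layout: values only, plus three small 1-D coordinate vectors.
array_bytes = n_lat * n_lon * n_time * value_bytes
array_bytes += (n_lat + n_lon + n_time) * 8

# Table layout: lat, lon, time, temperature repeated on every row.
row_bytes = 4 * 8
table_bytes = n_lat * n_lon * n_time * row_bytes

print(f"array layout: {array_bytes / 1e9:.1f} GB")
print(f"table layout: {table_bytes / 1e9:.1f} GB")   # roughly 4x larger, before compression
```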
Yeah. And so maybe to get a bit more specific: what are the biggest differences when it comes to, like, access patterns, or some sort of limitations that existing tools really hit?
Because I think I understand that there's an inherent difference in the data type and maybe some access patterns, but I'm probably not very intuitively able to understand, like, okay, maybe there's one core example here. Because given that there is, like, multidimensional data,
sometimes we hear these terms multidimensional,
like how many dimensions, how big a data set?
Is everybody only accessing everything at once
or only small chunks of it, right?
Like I think there's like some particular canonical example
maybe we can talk about that,
hey, we need a new stack from the storage up.
And also there's probably things you can probably use, right?
You don't have to probably rewrite the whole nine yards of everything,
but like what are the core things you have to actually go in deeper
that allows your users to get the benefits of what you're doing?
So maybe something like that will be helpful.
Let's talk about two queries then that you might want to make
against the data cube that Ryan was using as his
illustration. So you have temperature over the globe and over time. So it's latitude by longitude
by time. Those are your three axes on your cube. And, you know, there's two end member queries that
are not particularly large, but are orthogonal. So the first one is you might want to extract a map.
Everybody is familiar with looking at, you know, maps of our planet. So that's taking one slice out of this cube and now we're going to
visualize it or, you know, calculate some statistics on it or something like that. So that's the
temperature for today. The second query is I might want to pull a time series of just all the
temperature over time, you know, for this forecast or for this climate model or whatever from, you know, the pixel that's over San Francisco. And I want to drill down through
that cube in an orthogonal dimension relative to that map. And so, you know, thinking about how
that would work in a tabular database is I think trickier, right? So basically like by flattening
that data, our data, like it's not to say we couldn't do it, but we're going to have to stride through the whole database in both cases.
There's not really a way for us to have all the data that we want right next to each other.
That inherent structure and how the data is stored is broken.
And so, yeah, then we start talking about, like, how would we arrange that? And that's when we get into, you know, data formats that are built around arrays. The basic architecture of how these work, and this applies to Zarr, which is what we work on a lot, but it also applies to HDF and other formats out there, is that you're going to take that cube, you're going to break it into pieces, those pieces we typically call chunks, and then we're going to compress those chunks individually.
And then when we go to extract that time series point,
we only have to read the parts that we intersect with,
and we don't have to read all the other parts that are also stored in that format.
So you don't even have to scan through those in that example.
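(A hedged sketch of Joe's two end-member queries against a chunked Zarr store; the store path, variable name, and coordinates below are made up for illustration.)

```python
# Two orthogonal queries against the same chunked cube (time x lat x lon).
# With chunked storage like Zarr, each query only reads the chunks it touches.
import xarray as xr

# Open lazily: only metadata is read at this point.
ds = xr.open_zarr("s3://example-bucket/global-temperature.zarr")   # hypothetical store

# Query 1: one map slice, a single time step over the whole globe.
todays_map = ds["temperature"].sel(time="2024-06-01")
print(todays_map.mean().compute())       # reads only the chunks covering that time step

# Query 2: a time series drilled through the cube at one location (e.g. San Francisco).
sf_series = ds["temperature"].sel(lat=37.77, lon=-122.42, method="nearest")
print(sf_series.compute())               # reads only the chunks intersecting that column

# How each query performs depends on the chunk shape chosen at write time,
# e.g. chunks like {"time": 24, "lat": 180, "lon": 360} balance the two patterns.
```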
I think it'd be helpful to, if you can, maybe put some numbers on it. Like, traditionally, if we were to use what exists to try to solve this, you know, it costs this much or takes this long; what's the optimization as a result? Like, does this move a problem that was unsolvable to something that's solvable? Or does it just make a problem that was kind of solvable into something that is now significantly cheaper, so we can scale it up and do more? It might be both. It would be great to understand, as a result of this,
now we can do these things.
To answer that, it's worth looking at the landscape
and looking at different teams and asking,
what are they using?
What are the technology options for a team in this space?
And we've been contrasting this
to the more mainstream data warehouse space.
So why not just store all this data in BigQuery and Snowflake, right?
And that's a conversation we often have with customers, right?
And so I was on with a customer this morning, and they were telling us about what does their
pipeline look like to get weather data into Snowflake?
Oh, no, they were using BigQuery.
They were using BigQuery, right?
So they have a process that is downloading
GRIB files into their object storage,
and then they have an ingestion script
that is taking those and loading them into Snowflake.
And then they query the data back from Snowflake,
and they, sorry, I keep saying Snowflake, BigQuery.
They query the data back from BigQuery,
and they are, like, at their wits' end trying to get this to work and be scalable and performant. The speed of data ingestion that they can achieve through this approach can barely
keep up with the rate at which the forecasts are generated. Then when it comes back, it comes time
to actually query the data back. It's very brittle in how they can use it.
Like as Joe mentioned, they've optimized the data for time series retrieval.
But then if they want to pull back images and do more of a sort of computer vision-based machine learning pipeline on the data, it's not really possible to reconstruct that.
And so what we propose to a team like that is, look, you don't need to have one copy of these files
on object storage
and then one version of this data in a data warehouse.
You can achieve really low latency queries
purely just by storing your data in object storage
in an appropriate format,
and particularly in this case,
using the Zarr data model,
using our Arraylake platform
to store and manage that data.
And they can immediately
deduplicate the fact that they have the same data replicated in two data systems, and they can lower
the cost of making those queries really dramatically. And they can keep their data pipelines
way more maintainable, right? And so that's the sort of business value that teams can see by
adopting this approach to data management when it fits the
data model. I have to say, though, most folks we talk to are just not even at all considering the
data warehouse option. Like most of them are just kind of assuming that they're going to use a very
traditional sort of batch processing based approach to this data that doesn't support interactive, you know, on the fly queries whatsoever, and just requires a very slow and sort of very, you know, prescriptive
data processing pipeline to do anything with these type of data, because they see them as
these sort of opaque file formats, right. And so by providing a more database style experience for
this data, where you can just flexibly query it and visualize it, but leveraging just purely data storage and object storage, it just unlocks so much more efficiency for the users.
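(A minimal sketch of the "one copy, in object storage" pattern Ryan is describing: write the gridded data once as chunked, compressed Zarr and query it directly, rather than maintaining a second copy in a warehouse. The bucket, file names, and chunk sizes are placeholders, and this shows plain Zarr rather than Arraylake itself.)

```python
import xarray as xr

# Assume `forecast` is an xarray.Dataset of gridded fields with dims
# (time, latitude, longitude), e.g. decoded upstream from incoming GRIB files.
forecast = xr.open_dataset("decoded_forecast.nc")        # hypothetical input file

# Chunking is the main tuning knob: it determines which queries are cheap later.
forecast.chunk({"time": 1, "latitude": 256, "longitude": 256}).to_zarr(
    "s3://example-bucket/forecasts/latest.zarr", mode="w"
)

# Consumers read straight from the bucket; only the requested chunks are fetched.
ds = xr.open_zarr("s3://example-bucket/forecasts/latest.zarr")
point = ds["t2m"].sel(latitude=37.77, longitude=237.58, method="nearest").compute()
one_map = ds["t2m"].sel(time=ds["time"][0]).compute()    # a full map for one time step
```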
Very cool.
So given that your building is such a lower level that I think probably, like I said, most people are just doing batch processing and things like that.
And also because you have a particular storage format that requires people to sort of store your data into.
What has been like the biggest challenges so far
or like lessons you guys learned
to try to get people to adopt this?
Because this is fundamentally,
I can't just think of like just files I store, right?
You know, I need to think about this
as a way to store files and think about the whole stack with the files. But there's always trade- store, right? You know, I need to think about this as a way to store files
and think about the whole stack with the files.
But there's always trade-offs, right?
When this is a very new file format,
what happens to my visualization tools?
What happened to all my scripts I wrote?
What happened to all the other people
that want to read the same data, right?
It doesn't naturally comes with this
like existing ecosystem, I would assume.
So how do you guys think about
getting people to adopt this, and how do you, like, make it so everybody is able to read this as well?
So you want to grab that?
I don't know if I'm going to have a perfect answer here, so Ryan can take part too. It's a good question. I mean, we have some tailwinds and also some headwinds
here. And so, you know, some of the tailwinds there is that, you know, we've built the platform
on top of an open source format called Zarr. And what we're seeing out there is if an organization
is trying to get out of this, like, I'm just stuck with a big bucket of, you know, opaque files that
I don't have much control over.
They first reach for Zarr.
And so it's really emerged as kind of the leading way to store this data in the cloud in a cloud-native way.
Now, what we've learned is, yes, there are integration problems with that.
And so we're building some tooling that helps bridge those integration gaps.
So maybe I'll come back to that. The biggest
problem that we see right now is just there's a lot of legacy tooling out there. And so, you know,
trying to explain to people how they can do the old, you know, the thing that they're trying to
achieve using, you know, a similar approach to what they're doing today, but with this new cloud
native architecture is, you know, one of the big challenges for us.
But again, I think we have open source tailwinds on that one.
We haven't talked much about what the platform Earthmover looks like just yet, but we started with the storage layer.
So that's something that we call Arraylake.
Think of it as a data lake platform for Zarr data.
And on top of that, we're building the ability to expose that data through
essentially a compute engine. And that compute engine has been designed specifically to integrate
with existing tools. So this is more or less in response to the fact that we understand that
there are legacy tools out there. There's a data visualization tool, a dashboard tool that already
exists. It expects data through a particular format. And it turns out it's actually quite easy to take your data in Arraylake and expose it
through, you know, tiles, you know, your slippy map, Google Maps-style feeling thing. And
so that's one of the sort of things we've had to do to address those integration problems.
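(A hedged, generic sketch of the kind of integration Joe mentions: exposing a slice of a gridded dataset over HTTP as an image that existing map or dashboard tools can consume. This is not Earthmover's actual API; the endpoint, dataset path, and variable name are invented.)

```python
# Minimal "array slice as an image over HTTP" service, for illustration only.
import io

import matplotlib
matplotlib.use("Agg")                                 # headless rendering
import matplotlib.pyplot as plt
import xarray as xr
from fastapi import FastAPI
from fastapi.responses import Response

app = FastAPI()
ds = xr.open_zarr("s3://example-bucket/temperature.zarr")   # hypothetical Zarr store

@app.get("/map.png")
def render_map(time: str, lat_min: float, lat_max: float, lon_min: float, lon_max: float):
    # Lazily select one time step and a bounding box; only those chunks are read.
    # Assumes the latitude coordinate is stored in ascending order.
    tile = ds["t2m"].sel(time=time).sel(lat=slice(lat_min, lat_max), lon=slice(lon_min, lon_max))
    fig, ax = plt.subplots(figsize=(4, 4), dpi=64)
    ax.imshow(tile.values, origin="lower", cmap="viridis")
    ax.axis("off")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight", pad_inches=0)
    plt.close(fig)
    return Response(content=buf.getvalue(), media_type="image/png")
```

A real tile server would handle projections and XYZ tile math, but the core idea is the same: slice the array, render it, and hand it to tools that already speak images.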
Yeah. I mean, I think in terms of the biggest challenges, I would just say that like in our space, there just like aren't a lot of like products period, like managed data services and databases that are, you know, that folks are used to.
So like, it's just not a familiar conversation to be having with certain types of developers, like, hey, like, maybe you guys should just use a managed service for this rather than like just building all your own data system from scratch.
Right. And whereas I think in other more mature technology domains, it's like just more accepted that there's yeah, like there's certain problems that like an engineering team doesn't have to solve themselves.
Like you're not going to write your own database. You're just going to use, you know, RDS or, you know, like it's more standard.
And so most of the teams we talk to, there's sort of like two end members of like participants.
There's like people kind of more like me and Joe
who came out of like academia
and are not used to having any tools
or services or support at all
and just write like, you know, spaghetti code.
And then there's like people
from like the more traditional engineering space
who like know the sort of mainstream set of tools.
And so we're trying to kind of like
help them meet in the middle.
But yeah, just having that conversation
and helping people to understand the value for them
from leveraging the managed service
and, you know, the relationship with the open source
and, you know, how all of it fits together
in an architecture.
We're still figuring out how to tell that story
in a way that resonates the most
with like the customers we're talking to.
Let's go into a section we call Spicy Future.
Tell us what you believe that most other folks don't believe yet in sort of this
infra space.
And given you're in a particular space, I'm actually really curious, what's your hot
take here?
I have very many hot takes.
I'll start with one that maybe is, I don't know if it will seem hot or not, but I think the cloud as we know it is actually ending and becoming much more decentralized.
We are seeing a huge reluctance to be locked in to a particular cloud provider and folks really making strong
choices around that. We see people chasing GPUs wherever they can get the cheapest GPUs
and consequently really worrying about egress fees. Our customers typically have up to petabytes
of data. And so, you know, the fear of that sort of data gravity is super real.
And then we see all these like innovative new compute providers popping up all over the Internet.
So, you know, in that context, we're really excited about more sort of decentralization of how data are stored.
In particular, we've been partnering with this company Tigris that you might know about that is offering sort of decentralized,
globally available, zero-egress-fee,
S3 compatible object storage.
And we think that is like really compatible with the future we see where people are doing computing.
Maybe they're even going back on-prem
and just building a GPU cluster.
And so we think cloud data platforms
have to work in that context
and can't just assume everyone is in AWS or Google Cloud.
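(A hedged sketch of what "S3-compatible, anywhere" looks like in practice: the same Zarr-reading code pointed at a non-AWS endpoint via s3fs. The endpoint URL, bucket, and credentials are placeholders, not any particular provider's actual configuration.)

```python
# Reading the same Zarr data from any S3-compatible object store,
# just by swapping the endpoint; nothing here assumes AWS.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(
    key="ACCESS_KEY",                                 # placeholder credentials
    secret="SECRET_KEY",
    client_kwargs={"endpoint_url": "https://storage.example.com"},  # any S3-compatible provider
)

store = s3fs.S3Map("example-bucket/temperature.zarr", s3=fs)
ds = xr.open_zarr(store)
print(ds)
```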
Is that spicy or not? I don't know. Maybe everyone knows this by now.
Well, we'll start with the basic level. I guess, Joe, do you have your hot take here?
Well, I mean, I think I'll take it to the other end.
You know, that was focused on the architecture broadly of cloud. You know,
I'm going to take it to the data side and just, you know, really underscore this idea that like
today we're talking about petabytes of data, you know, and a handful of customers,
and we're headed to an exascale future, not just for exascale's sake, but because, like, the questions
that are going to be answered, the efficiencies in our society that are going to be gained are like so fundamental to where we need to go as a society that it's worth going there.
Maybe this conversation feels niche today, but like if we're going to achieve this sort of ability to respond to the kind of existential crises that we have, whether they're climate or the next, you know, public health emergency or whatever it is, like we have to get much, much better using the data we
have. And so I'm really excited about thinking about like what that means from the architecture
perspective of what we're doing here. And like, what do we have to do to go from where we're at
today to the next order of magnitude and scale and efficiency for doing that? You got another one,
Ryan? Well, yeah, I got a whole backlog of spicy takes
that I could keep spitting out
as long as you guys will listen.
But, you know, this one may be not coming as a surprise,
but like, I do not think like data is solved.
What I mean by that is in a certain like corner of like,
you know, the Twitter and the blogosphere,
there's this sort of sense that like data has been solved
basically by like Apache Arrow and Parquet and that ecosystem of technologies. And I think it's
a fantastic ecosystem of technologies and has had such an impact. But the data model of Arrow,
you know, which is at the core of so much of this, it doesn't accommodate a lot of
important science and engineering problems, right? It's not rich enough to support, for example,
these sort of volumetric scenarios or, you know, things related to microimaging and genomics and,
you know, so we're very bullish on like deep tech and the impact of AI in really, really hard and complex scientific
problems. And so we think there's still more work to be done on the core foundations of the data
stack, not to take anything away from all that it has accomplished. And on the contrary,
it's like very inspirational for how like that sort of foundational technology can empower and
enable so many cool things to be built. So the fun thing
about being in this space right now is there's just a lot of really great ideas you can copy
and solve problems. And people have sort of charted the path for what it looks like.
But we get to adopt them and adapt them to sort of slightly different data domains and models.
I think that's absolutely correct. And I think oftentimes we sit back and it's really easy to think about the fact that,
you know, the problems we've really solved in data are like business data problems.
And at the end of the day, it's like stuff I can store in a row table format that's structured,
that's approaching maybe solved.
But like what we're also witnessing, especially with this recent innovation
with large language models,
but more broadly, things like attention,
the fact that we've got billion-dollar
potential AI training runs right around the corner,
this explosion of GPUs,
is broadly a broadening of the digitization effort.
So more things that were previously analog
or only could exist in the lab
because
we didn't have the compute capacity or we didn't have a way to generate value from the data set,
right? Because we didn't have the model technology or we didn't have the data that's captured,
right? So now we can capture image data, we can process in different ways. We have all these new
ways to work with different types of data that we couldn't before to create value from it. And so
I 100% agree with you. It's a new day. And the first wave of computing was all about
making business move faster, which is basically trading a bunch of business data around. And the
next wave is we're seeing things like self-driving cars, image recognition, use cases all over the
place with body cams and all this other stuff. There's tons of opportunity as what is currently
analog becomes digital and accessible. And so Ryan and Joe, it's been great having you.
Where can people find out more about EarthMover and what you're working on?
Yeah, you can definitely check out our website, earthmover.io,
and you can hear about our platform there.
We're also really excited about our new open-source project called IceChunk.
You can think of it as Apache Iceberg for tensor data at massive scale.
This is sort of the new open core storage format for the platform.
We didn't get to talk about it much today, but it does some amazing things.
And it is all written in Rust with just a light Python layer on top.
And people can check that out at icechunk.io.
And we're definitely looking for feedback and input you know, input from the community there.
Amazing. Well, thank you so much, Ryan and Joe.
It's been great having you and I'm sure we'll end up having you back to talk
more about all the cool stuff you're working on.
Thank you guys.
Very good. Cool. Yeah. Thanks, both.