The Infra Pod - Why do we need a cloud native data lake for geospatial data? Chat with cofounders of Earthmover (Ryan and Joe)
Episode Date: January 13, 2025. Future of Data in Climate and Earth Sciences. In this episode of The Infra Pod, Tim and Ian sat down with the cofounders of Earthmover, Ryan and Joe, to discuss the cloud-native multidimensional data lake they're building for climate and geospatial data. They delve into the role of advanced data handling methods, multidimensional array data, the impact of cloud-native technologies, and how these influence various sectors ranging from insurance to environmental resilience. 00:00 Introduction 00:09 Excitement for Earthmover 00:44 Ryan's Background 03:46 Joe's Involvement 07:54 Definition and Value of Earth Science Data 10:08 Industry Applications 14:37 Challenges and Traditional Methods 19:19 Cloud-Native Solutions 27:09 Data Management in the Cloud 35:05 Spicy Future! 40:46 Contact and Conclusion
Transcript
Well, welcome to the InfraPod.
This is Tim from Essence and Ian. Let's go.
Man, I am crazy excited.
I think I said every pod, so I don't know how valuable that is at this point.
But we're excited.
We have Ryan and Joe from EarthMover here today to talk about the future of data for climate and earth sciences,
geoengineering, and basically everything about
the analog world in which we live. So I couldn't be more excited to learn this. This is very novel
and new and also super cool, but also unique. So Ryan, tell us a little about yourself. Give us
a little background and we'd love to learn more about you and about your company and how you
started. Yeah, awesome. Super excited to be here. Thanks so much for the invitation, you guys.
Yeah, so I'm Ryan Abernathey. I'm the CEO and co-founder of Earthmover. And my background,
the way I got into this world of startups and data is really through my career as a scientist.
So I am a computational oceanographer and climate
modeler. What that means is I spent most of my career trying to understand how the ocean works
and how it interacts with the rest of the earth system through the use of theory, you know, basic physics, observational data, and hardcore numerical simulation.
And that work over the course of my career became increasingly data intensive up to the
point where we were now looking at data archives and file sets that were measured in petabyte
size. And something happened. And I just switched to
that lower level meta problem of how are we going to process all of this data? How are we going to
analyze all of this data? How are we actually going to make sense of all these petabytes that
are coming to us at an accelerating rate within earth system science? That's the basic journey that I've been on. And from a sort of
infrastructure point of view, that's taken me from the world of like traditional academic research
computing environment, which is basically like a hodgepodge of servers, you know, on-prem in the
basement and national level supercomputing facilities, so like traditional HPC technology
to the cloud and trying to figure out how are we going to work with this type of data in the cloud.
And so, you know, the overall motivation for our work is that this domain of scientific and engineering work that involves large earth system
data sets is extremely painful for the practitioners who are working there. There are
very few of these sort of high-level services and databases and tooling that make data analytics easy and pleasant to a data scientist or a
developer. And there are tons of tech debt and old fashioned architecture and inefficient workflows.
And in the research world, that's a problem for, you know, grad students and postdocs who are just
used to having to deal with that pain as the cost of doing business. But as we see an increasing number of commercial organizations, you know, trying to build products and innovative
solutions around climate, environmental resilience, those problems are now becoming business problems,
and they have a real cost for these type of companies. And so that's creating a market and
a demand for solutions. And that's where Earthmover comes in, basically.
That's really cool.
Joe, so how did you get involved with Ryan and this endeavor around Earthmover to try and solve this?
How did I get sucked into this whole thing?
Earth sciences and all the fun that's related to data.
Yeah, so Ryan and I share part of our origin story here in the sense that we're both trained scientists. So, you know, Ryan's an oceanographer. I did a PhD in civil engineering, but focused on climate model development and climate modeling, really trying to understand what the future climate is going to mean for, you know, the water cycle, the built environment, the places that people live, and that sort of stuff.
And yeah, so the part of the story that overlaps is I was also generating terabytes to petabytes of data early in my research career, and I very quickly came to the conclusion that, like, we don't have the tools for this amount of data. I can't see myself just trudging through data like this; we need to do something about this.
So I somewhat through happenstance got connected with some open source projects and started working on one in particular called xarray, which is how I met Ryan.
And, you know, xarray kind of forms the basis of a lot of this.
So xarray is an open source Python project.
Think of it like pandas, but for multidimensional arrays.
So it's multidimensional arrays, metadata, indexes. And so I started working on that. And over time, the impact of working on open source and building tools around data trumped my desire for a career as an academic. And I just got pretty addicted to having that sort of real-time impact on the work of many more people.
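(For listeners who haven't used xarray, here is a minimal sketch of the data model Joe is describing: a labeled multidimensional array with named dimensions, coordinates, and metadata. The variable names and values below are invented purely for illustration.)

```python
# A tiny xarray example: a labeled 3-D array (time x lat x lon),
# analogous to a pandas DataFrame but generalized to N dimensions.
import numpy as np
import pandas as pd
import xarray as xr

temps = xr.DataArray(
    data=np.random.rand(4, 3, 5),                    # fake temperature values
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2024-01-01", periods=4),
        "lat": [10.0, 20.0, 30.0],
        "lon": [0.0, 30.0, 60.0, 90.0, 120.0],
    },
    attrs={"units": "degC"},                         # metadata travels with the array
)

# Label-based selection and reductions, like pandas but across N dimensions.
print(temps.sel(lat=20.0, lon=60.0))                 # time series at one grid point
print(temps.mean(dim="time"))                        # 2-D map of the time mean
```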
So I started working on xarray.
We started a project called the Pangeo Project, which we may come back to, you know, from an architecture perspective.
I think it's pretty important here.
Yeah, so Ryan and I worked together for a bunch of years doing that.
And then right before I started, you know, we started Earthmover.
I was at a nonprofit called Carbon Plan where we were doing climate science and climate policy under one roof.
And so we had a team of both policy people and scientists and data engineers trying to move as quickly as possible to use data and data science to answer policy relevant questions.
And there I was like full practitioner, just like, you know, any of these companies that are trying
to operationalize these data pipelines. And it's really then that I convinced myself that, like,
there are missing infrastructure layers in the stack for us as a society to be able to use this
kind of data to, you know, to enact the change we want
to see. So that convinced me to start the company. And yeah, we can go into any of those pieces
as we get into this. I think it'd be really helpful for the audience because it's such,
you know, to be honest, when I met you two, I was like, I knew a good amount of the space
because I actually at one point worked on catastrophe modeling for insurance companies. So I know a lot about HPC clusters and using
volumetric data to try and generate a statistical distribution of what the
probable outcome is, doing floodplain analysis and all these things to basically
discover, hey, if we take this reinsurance contract,
are we going to go bankrupt? That's basically the reinsurance industry.
It's the industry that insures the
insurers. And ultimately, at the end of the day, it's your last resort insurer. And so that comes
down to, can we basically understand when we're going to make money on this contract? Are we going
to go bankrupt? Because it's a zero or one outcome in reinsurance, effectively, at the end of the
day. It's like, you either make money or the company's over on most contracts. So can you
help our listeners? And I think Tim
and I understand, you know, today, what are the in-industry, revenue-producing places where, you know, this earth science data, right, climate data or science data, is used to generate money, or places where it could be? Because I think the first problem in this domain is really having people better understand, one, why it matters from a theoretical or research perspective, but more importantly, why does this matter from an industry perspective, and what industries are most impacted by this problem space?
Yeah, it's a great question. Maybe I'll take a minute to sharpen our definition of what type of data we're talking about, and then, you know, maybe Joe can talk a little bit about some of the work they did at Carbon Plan, maybe around carbon markets and carbon accounting and stuff like that.
But the type of data that we are talking about from a technical perspective is really characterized as large multidimensional array data sets, aka tensors, as they are called in the machine learning world, right? And so that's really the characteristic of this data
that is so particular and unique
and that makes it ill-suited for most existing databases
and data systems.
And it's not quite the same thing as geospatial data, right?
Which is a kind of much more established category.
Certainly has a big overlap,
but it's a lower-level concept.
It's not specific necessarily to any one domain.
You have tons of data like this, for example,
generated by microscopes, right?
That scan cell volumes or by seismic instruments
that are imaging the interior of the earth
looking for oil, not to mention astronomy.
There's not really a lot of commercial applications
for astronomy data, but you know, it's the same data model, right? Massive images stacked in time, stacked across other
dimensions, three, four, five dimensional tensors. That's why it's so easy to get petabytes. What
produces that type of data? Well, most imaging systems or scanning systems, sensor systems
produce that type of data. Simulations use these type of grids to model,
you know, any kind of physical system using, say, finite volume grids and discretization.
And then traditionally, they are dumped out into a zoo of different file formats with like HDF5
probably being the most common one. But you know, every domain has got its own sort of bespoke
custom file formats. And so that's sort of the type of data
we're talking about in this, you know, climate and weather and environmental data space. It's
really largely going to be satellite data and simulation data from weather and climate models.
And so that's sort of setting the stage. Yeah, maybe I'll kick it over to Joe and he can tell
you what like a data pipeline looks like to go from the raw data to like something that has actual business value.
Yeah. So, yeah, I'll talk about two core application areas in the kind of climate and weather space.
You know, the first is around carbon. If you follow the climate crisis at all, like you understand that like we have a carbon problem in the atmosphere right now.
Like we're putting more carbon in the atmosphere than the land and ocean can take up. And there's really two like areas of work in this category.
So you have the first category is like tracking emissions. And so, you know, this is like asking,
where's that carbon coming from? And how much is it? And that's a remarkably hard thing to track.
A reference point here is the Climate Trace Project.
So climatetrace.org would be a good reference for folks.
But that project's trying to use satellites to estimate emissions and essentially the
carbon budget using satellites alone.
And it's a super ambitious and tricky problem.
But when we were at Carbon Plan, we tried to do the forest sector.
So just understand from the change of carbon in trees, where is there a sink and where is there a source?
And that required, you know, processing, you know, many, many, many terabytes of satellite images, doing machine learning,
and then ultimately generating a prediction based on, you know, some inference of, you know, where carbon is being taken up and where it's not.
That's on the emission side.
There's also on the removal side, people trying to do the same data science for carbon markets.
I could pay you to not cut down your forest so that forest can continue to sequester carbon.
And there we have to have good validation.
We have to have good tracking.
And you have to be really clear about the provenance of how you made decisions.
And so, again, like data science comes in really heavily there and being able to process data at scale and accurately is really important.
The second area I'll mention, which is even probably more relevant to folks, is on the risk side. So how do we estimate the future risk of
extreme weather, floods, heat waves, et cetera? And there we need climate models and essentially
projections of how the future is going to, you know, how a warmer future is going to impact
weather and long-term climate. And to do that, we ingest data from HPC centers that have
put out, you know, many petabytes of data, that data gets stored in the cloud. And we have to run
that through, again, essentially machine learning pipelines or process-based models to try to
understand where those applications are. And there are companies out there today that will sell you
climate risk information. You want to know for your house what the risk of forest fire or flood is, like you can go out there and buy that
information. And it's an emerging market, but it's one where there's, you know, quite a bit of
long-term economic value to us understanding where those risks are. I want to just like add
one more real quick use case, which is we were talking with Ian before we started recording about this. Since we've started
the company, we've actually seen the most demand sort of organic pull for what we're working on
from folks working with weather data in sort of commodities and energy space. And it's the same
shape of data. It's, you know, weather forecast data has the same basic shape, these big tensors, you know, gridded fields of predictions, but on a much shorter timescale, and therefore much more relevant to the business cycle. We also see a lot of movement on AI weather forecasting; there was just a new DeepMind forecast paper
out in Nature yesterday, a new ensemble forecast. So there's a ton of innovation happening in weather
forecasting. And we're kind of finding ourselves right in the middle of that, helping weather data
move more efficiently through the cloud. I mean, that use case makes tons of sense to me.
Everyone knows, if you've watched The Social Network, about Eduardo Saverin making 300 grand or whatever off oil futures speculation, right? This stuff drives markets, and increasingly, as we have a more fractious world as a result of a more volatile climate, this is really important. And also, I think the idea of, like, the astronomical data is really quite interesting as well, and I can only imagine the applications for volumetric data in space, as an example, and prediction of, like, satellite movement or collision detection or, you know, all these things; they're hugely important. Then even things like, I know from my own experience, things like aquaculture, a lot of this comes down to, there's a lot of volumetric data associated in those fields as well.
I'm quite curious, you know, there was a world where we didn't have this type of data, right?
What's the traditional way that people have gone about trying to solve this problem?
We couldn't have petabytes of data, but we still tried to build forecast models.
And so I'm curious, what's the driver in industry to move towards the usage of more
higher volume data formats?
So like, you know, you're using these sort of tensor formats you discussed.
I assume it's about accuracy of the prediction, but it would be really great to understand,
like, is it that we couldn't do any of these things before
or that we were doing them
and the results were just really slow and terrible
and doing it this way gives us significantly higher accuracy
in the same way that LLMs
have given us significantly better solution
for natural language.
Like, what's the sort of delta in industry from,
okay, we were doing this really bad slow thing.
It kind of gave us like a 60% solution. And now this new thing opens up all these new opportunities.
And I assume there's a lot of components here, but it also sounds like there's also been
a shift as well occurring in the traditional way that people approach this problem space.
Yeah, it's a great question. And a bunch
of things are wrapped up into understanding how this tech space has evolved
over the last 10 years.
So to narrow the scope, like let's just talk about weather data, right?
Because it's like a fairly constrained problem.
Weather forecasts are created primarily by the government.
Public sector agencies like NOAA, the National Weather Service, or the European Centre for Medium-Range Weather Forecasts in Europe. It's something that we've been doing for 50 years as a society, investing in public sector weather forecasting
as a common good for society.
It's an incredible achievement technologically
that we can do this.
And people love to complain about weather forecasts,
but if you really look at the data
and you quantitatively assess the skill of those forecasts,
you can see them just getting better and better and better, you know, year on year. And that has massive value for
society. Now, how are those forecasts made? Well, mostly the government operates programs to collect
weather data, raw observations through things like weather balloons or satellites or on the
ground weather stations. But those are just data.
They don't make any prediction.
And so those are fed through a data assimilating numerical atmospheric model.
And that runs on a dedicated supercomputer that's, you know,
owned and operated by the National Weather Service that only does this.
And it churns out these weather forecasts every hour, constantly. And
that basically dumps out files in a format called GRIB, which, you know, if your listeners have
heard of GRIB files, congratulations, you're a weather nerd. Most people never have. But
if this is your world, you know exactly what it is and how painful it is. And those are put on
an FTP server. And that's where essentially the
public sector stops. And then there's a whole sort of ecosystem of downstream vendors and service
providers who take those and try and translate them into more useful information. So they consume
those files. And ultimately, those go into like the app that's on your phone when you like, you know, look at a weather forecast.
Right now, traditionally, all this processing chain is very much associated with like the HPC world.
That's where the files originate. That's where most of the practitioners in the domain are trained.
And so, you know, this is the world of like MPI and, you know, like batch processing and big expensive file systems and things like that.
So one big thing that is changing is like the ubiquity of the cloud in just about any startup or really any business.
Like they want to use the cloud and there's a real impedance mismatch between this sort of FTP server based way of distributing data and the modern cloud-native stack.
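(As a rough sketch of the legacy flow Ryan describes: a GRIB file is pulled from a public server and decoded locally before anything can be queried. The file name and variable name below are placeholders, and this assumes xarray with the cfgrib engine installed.)

```python
# Sketch of the traditional workflow: download one GRIB file, decode it locally,
# then extract the value you actually wanted. Paths and variable names are hypothetical.
import xarray as xr

# Step 1: fetch a single forecast file (in practice, from an FTP/HTTP server), e.g.
# urllib.request.urlretrieve("https://example.gov/forecasts/model.t00z.grib2", "forecast.grib2")

# Step 2: decode the GRIB file (requires the cfgrib engine and eccodes).
ds = xr.open_dataset("forecast.grib2", engine="cfgrib")

# Step 3: the classic "give me the forecast at point X" query.
point = ds["t2m"].sel(latitude=37.77, longitude=237.58, method="nearest")
print(point.values)

# Repeating this file by file across a multi-decade archive is where
# the FTP-plus-files pattern starts to hurt.
```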
And then the other big driver, I would just say data analytics in general
and AI in particular, there's real value you can get from having all the data.
So the traditional paradigm for weather data is like,
well, I'm just going to want this one point.
Give me the forecast data at point X.
And there's dozens of vendors out there who will have some API you can query and give you that data at point X.
But if you say, actually, I'm a commodities trader and I'd like 20 years of all the forecast history, all variables, all pressure levels for the entire record.
And I don't care if it's a petabyte of data.
I know how to make millions of dollars a month on this data, and I need it to train my AI system. There's no good answer to that,
really, from that legacy stack. And so that's where the more modern cloud-native approach to
Earth system data comes in. Cool. So I think we really want to talk about what is this cloud-native
thingy we're talking about here.
And also, I think it's going to be really interesting because a lot of what I can read from your website or open source is really about this multidimensional array sort of file format.
And just looking deeper into it, there are some conscious choices we have to make that are a little bit different.
It looks like there's chunking, ordering, compression, ways of doing parallel reads, uncoordinated writes with Icechunk. I think for our infra folks and
engineers out there, I never really dealt with this kind of thing before. I worked on Spark,
worked on Kafka, worked on a lot of processing, but not this kind of thing. What is the most
unique part of this scientific data computing stack that's very different than your traditional
like in parallel distributed storage or computing in general?
I'll grab this one and then maybe Joe can talk about some of the tools that are used
in this space.
Fundamentally, like what we're trying to advocate and pursue at Earthmover and in our
community more broadly is that there is this data model that deserves to be considered like structured data,
but that isn't like relational tabular data, right? If you look at like the data space more broadly,
not focusing on this problem space, you basically see like, well, you've got basically two choices.
You've got structured data and there you've got your rows and columns in a table and you could
put it in a Parquet file, or you could put it in Snowflake, or you could put it in Postgres, but everyone knows what they mean, structured data.
Then you've got unstructured data, which is, in its most generic form, just a bunch of files somewhere that you write some sort of custom script to loop over.
There's maybe something in between semi-structured data, but I don't really know what that means.
Is it JSON or something? I don't know.
What's the real definition of structured data, though? It's that it's a more compact representation that
leverages the inherent structure in the data. Like you could take all your tables and store them as
JSON or something like that, but it would be slow. It would be inefficient. It wouldn't use storage
well. But because there's this structure inherent in the data, you can use that to optimize both the
storage and the compute on top of that data. And that's what databases do. And they have a very
specific data model that they work with. Our big hypothesis is that there exists a data model also
for a lot of scientific and engineering domains that isn't the relational model, but it's still
highly structured. And the best way to describe it is, let's just consider
a data set of temperature at every point on the earth over time. This is a sort of classic cube,
right? And there's a certain, you know, just one of those slices you can think of as an image,
but then it can be very high resolution, and then bring in a time dimension. So this is a
three-dimensional array, right? And you've got
orthogonal indexes on each of those dimensions. Maybe you've got latitude on one, you've got
longitude on another, and then you've got, you know, time along the third, right? And, you know,
you could take that data and squash it down to be tabular, right? You could make a row for latitude
or a column for latitude, column for longitude, column for time, and then your temperature value.
You will have just 4x'd the volume of data you have to store because you're throwing
away that inherent multidimensional structure in the data.
And you've destroyed other information within that data, such as the connectivity between
pixels, you know, that's inherent in, say, an image. And so, you know, what the
multidimensional array data model offers is a much more efficient way to both store and query
that type of data that leverages this inherent multidimensional structure. And so when we think
about our stack, it starts from storage. How do we store this data effectively
such that we can provide good access to it and support the sort of query patterns that
workflows require? And then two, how do we like orchestrate computing on top of that in a way that
is fast and performant and can scale out? I don't know if that's answering your question, Tim, but
like that's kind of where I start for a person who's unfamiliar. Why do we need something at all? Like, why not just use
Parquet and, like, DuckDB? Like, it's, you know, the standard stack, right? And the answer
is that it doesn't really fit. And if you try and use those tools, you pay a big price.
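(A quick back-of-the-envelope sketch of the blow-up Ryan describes when a cube is flattened into rows. The grid sizes and byte widths are illustrative assumptions, not numbers from the episode.)

```python
# Why flattening a (time x lat x lon) cube into a table inflates storage:
# every row repeats its coordinates, while the array layout stores them once.
# Illustrative sizes: a quarter-degree global grid, hourly for one year.
n_lat, n_lon, n_time = 721, 1440, 24 * 365

value_bytes = 8                                      # one float64 temperature per cell

# Array layout: values only, plus three small 1-D coordinate vectors.
array_bytes = n_lat * n_lon * n_time * value_bytes
array_bytes += (n_lat + n_lon + n_time) * 8

# Table layout: lat, lon, time, temperature repeated on every row.
row_bytes = 4 * 8
table_bytes = n_lat * n_lon * n_time * row_bytes

print(f"array layout: {array_bytes / 1e9:.1f} GB")
print(f"table layout: {table_bytes / 1e9:.1f} GB")   # roughly 4x larger, before compression
```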
Yeah. And so maybe to get a bit more specific: what are the biggest differences when it comes to, like, access patterns, or some sort of limitations that existing tools really hit?
Because I think I understand that there's an inherent difference in the data type and maybe some access patterns, but I'm probably not very intuitively able to understand, like, okay, maybe there's one core example here. Because given that there is, like, multidimensional data,
sometimes we hear these terms multidimensional,
like how many dimensions, how big a data set?
Is everybody only accessing everything at once
or only small chunks of it, right?
Like I think there's like some particular canonical example
maybe we can talk about that,
hey, we need a new stack from the storage up.
And also there's probably things you can probably use, right?
You don't have to probably rewrite the whole nine yards of everything,
but like what are the core things you have to actually go in deeper
that allows your users to get the benefits of what you're doing?
So maybe something like that will be helpful.
Let's talk about two queries then that you might want to make
against the data cube that Ryan was using as his
illustration. So you have temperature over the globe and over time. So it's latitude by longitude
by time. Those are your three axes on your cube. And, you know, there's two end member queries that
are not particularly large, but are orthogonal. So the first one is you might want to extract a map.
Everybody is familiar with looking at, you know, maps of our planet. So that's taking one slice out of this cube and now we're going to
visualize it or, you know, calculate some statistics on it or something like that. So that's the
temperature for today. The second query is I might want to pull a time series of just all the
temperature over time, you know, for this forecast or for this climate model or whatever from, you know, the pixel that's over San Francisco. And I want to drill down through
that cube in an orthogonal dimension relative to that map. And so, you know, thinking about how
that would work in a tabular database is I think trickier, right? So basically like by flattening
that data, our data, like it's not to say we couldn't do it, but we're going to have to stride through the whole database in both cases.
There's not really a way for us to have all the data that we want right next to each other.
That inherent structure and how the data is stored is broken.
And so, yeah, then we start talking about, like, how would we arrange that? And that's when we get into, you know, data formats that are built around arrays. The basic architecture of how these work, and this applies to Zarr, which is what we work on a lot, but it also applies to HDF and other formats out there, is that you're going to take that cube, you're going to break it into pieces, those pieces we typically call chunks, and then we're going to compress those chunks individually.
And then when we go to extract that time series point,
we only have to read the parts that we intersect with,
and we don't have to read all the other parts that are also stored in that format.
So you don't even have to scan through those in that example.
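(A hedged sketch of Joe's two end-member queries against a chunked Zarr store; the store path, variable name, and coordinates below are made up for illustration.)

```python
# Two orthogonal queries against the same chunked cube (time x lat x lon).
# With chunked storage like Zarr, each query only reads the chunks it touches.
import xarray as xr

# Open lazily: only metadata is read at this point.
ds = xr.open_zarr("s3://example-bucket/global-temperature.zarr")   # hypothetical store

# Query 1: one map slice, a single time step over the whole globe.
todays_map = ds["temperature"].sel(time="2024-06-01")
print(todays_map.mean().compute())       # reads only the chunks covering that time step

# Query 2: a time series drilled through the cube at one location (e.g. San Francisco).
sf_series = ds["temperature"].sel(lat=37.77, lon=-122.42, method="nearest")
print(sf_series.compute())               # reads only the chunks intersecting that column

# How each query performs depends on the chunk shape chosen at write time,
# e.g. chunks like {"time": 24, "lat": 180, "lon": 360} balance the two patterns.
```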
I think it'd be helpful to, if you can, maybe put some numbers on it. Like, traditionally, if we were to use what exists to try to solve this, you know, it costs this much or takes this long; what's the optimization as a result? Like, does this move a problem that was unsolvable to something that's solvable? Or does it just make a problem that was kind of solvable into something that is now significantly cheaper, so we can scale it up and do more? It might be both. It would be great to understand, as a result of this,
now we can do these things.
To answer that, it's worth looking at the landscape
and looking at different teams and asking,
what are they using?
What are the technology options for a team in this space?
And we've been contrasting this
to the more mainstream data warehouse space.
So why not just store all this data in BigQuery and Snowflake, right?
And that's a conversation we often have with customers, right?
And so I was on with a customer this morning, and they were telling us about what does their
pipeline look like to get weather data into Snowflake?
Oh, no, they were using BigQuery.
They were using BigQuery, right?
So they have a process that is downloading
GRIB files into their object storage,
and then they have an ingestion script
that is taking those and loading them into Snowflake.
And then they query the data back from Snowflake,
and they, sorry, I keep saying Snowflake, BigQuery.
They query the data back from BigQuery,
and they are, like, at their wits' end trying to get this to work and be scalable and performant. The speed of data ingestion that they can achieve through this approach can barely
keep up with the rate at which the forecasts are generated. Then when it comes back, it comes time
to actually query the data back. It's very brittle in how they can use it.
Like as Joe mentioned, they've optimized the data for time series retrieval.
But then if they want to pull back images and do more of a sort of computer vision-based machine learning pipeline on the data, it's not really possible to reconstruct that.
And so what we propose to a team like that is, look, you don't need to have one copy of these files
on object storage
and then one version of this data in a data warehouse.
You can achieve really low latency queries
purely just by storing your data in object storage
in an appropriate format,
and particularly in this case,
using the Zarr data model,
using our Arraylake platform
to store and manage that data.
And they can immediately
deduplicate the fact that they have the same data replicated in two data systems, and they can lower
the cost of making those queries really dramatically. And they can keep their data pipelines
way more maintainable, right? And so that's the sort of business value that teams can see by
adopting this approach to data management when it fits the
data model. I have to say, though, most folks we talk to are just not even at all considering the
data warehouse option. Like most of them are just kind of assuming that they're going to use a very
traditional sort of batch processing based approach to this data that doesn't support interactive, you know, on the fly queries whatsoever, and just requires a very slow and sort of very, you know, prescriptive
data processing pipeline to do anything with these type of data, because they see them as
these sort of opaque file formats, right. And so by providing a more database style experience for
this data, where you can just flexibly query it and visualize it, but leveraging just purely data storage and object storage, it just unlocks so much more efficiency for the users.
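(A minimal sketch of the "one copy, in object storage" pattern Ryan is describing: write the gridded data once as chunked, compressed Zarr and query it directly, rather than maintaining a second copy in a warehouse. The bucket, file names, and chunk sizes are placeholders, and this shows plain Zarr rather than Arraylake itself.)

```python
import xarray as xr

# Assume `forecast` is an xarray.Dataset of gridded fields with dims
# (time, latitude, longitude), e.g. decoded upstream from incoming GRIB files.
forecast = xr.open_dataset("decoded_forecast.nc")        # hypothetical input file

# Chunking is the main tuning knob: it determines which queries are cheap later.
forecast.chunk({"time": 1, "latitude": 256, "longitude": 256}).to_zarr(
    "s3://example-bucket/forecasts/latest.zarr", mode="w"
)

# Consumers read straight from the bucket; only the requested chunks are fetched.
ds = xr.open_zarr("s3://example-bucket/forecasts/latest.zarr")
point = ds["t2m"].sel(latitude=37.77, longitude=237.58, method="nearest").compute()
one_map = ds["t2m"].sel(time=ds["time"][0]).compute()    # a full map for one time step
```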
Very cool.
So given that your building is such a lower level that I think probably, like I said, most people are just doing batch processing and things like that.
And also because you have a particular storage format that requires people to sort of store your data into.
What has been like the biggest challenges so far
or like lessons you guys learned
to try to get people to adopt this?
Because this is fundamentally,
I can't just think of like just files I store, right?
You know, I need to think about this
as a way to store files and think about the whole stack with the files. But there's always trade- store, right? You know, I need to think about this as a way to store files
and think about the whole stack with the files.
But there's always trade-offs, right?
When this is a very new file format,
what happens to my visualization tools?
What happened to all my scripts I wrote?
What happened to all the other people
that want to read the same data, right?
It doesn't naturally comes with this
like existing ecosystem, I would assume.
So how do you guys think about
getting people to adopt this, and how do you, like, make it so everybody is able to read this as well?
So you want to grab that?
I don't know if I'm going to have a perfect answer here, so Ryan can take part too. It's a good question. I mean, we have some tailwinds and also some headwinds
here. And so, you know, some of the tailwinds there is that, you know, we've built the platform
on top of an open source format called Zarr. And what we're seeing out there is if an organization
is trying to get out of this, like, I'm just stuck with a big bucket of, you know, opaque files that
I don't have much control over.
They first reach for Zarr.
And so it's really emerged as kind of the leading way to store this data in the cloud in a cloud-native way.
Now, what we've learned is, yes, there are integration problems with that.
And so we're building some tooling that helps bridge those integration gaps.
So maybe I'll come back to that. The biggest
problem that we see right now is just there's a lot of legacy tooling out there. And so, you know,
trying to explain to people how they can do the old, you know, the thing that they're trying to
achieve using, you know, a similar approach to what they're doing today, but with this new cloud
native architecture is, you know, one of the big challenges for us.
But again, I think we have open source tailwinds on that one.
We haven't talked much about what the platform Earthmover looks like just yet, but we started with the storage layer.
So that's something that we call Arraylake.
Think of it as a data lake platform for Zarr data.
And on top of that, we're building the ability to expose that data through
essentially a compute engine. And that compute engine has been designed specifically to integrate
with existing tools. So this is more or less in response to the fact that we understand that
there are legacy tools out there. There's a data visualization tool, a dashboard tool that already
exists. It expects data through a particular format. And it turns out it's actually quite easy to take your data in Arraylake and expose it
through, you know, tiles, you know, your slippy map, Google Maps-style feeling thing. And
so that's one of the sort of things we've had to do to address those integration problems.
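(A hedged, generic sketch of the kind of integration Joe mentions: exposing a slice of a gridded dataset over HTTP as an image that existing map or dashboard tools can consume. This is not Earthmover's actual API; the endpoint, dataset path, and variable name are invented.)

```python
# Minimal "array slice as an image over HTTP" service, for illustration only.
import io

import matplotlib
matplotlib.use("Agg")                                 # headless rendering
import matplotlib.pyplot as plt
import xarray as xr
from fastapi import FastAPI
from fastapi.responses import Response

app = FastAPI()
ds = xr.open_zarr("s3://example-bucket/temperature.zarr")   # hypothetical Zarr store

@app.get("/map.png")
def render_map(time: str, lat_min: float, lat_max: float, lon_min: float, lon_max: float):
    # Lazily select one time step and a bounding box; only those chunks are read.
    # Assumes the latitude coordinate is stored in ascending order.
    tile = ds["t2m"].sel(time=time).sel(lat=slice(lat_min, lat_max), lon=slice(lon_min, lon_max))
    fig, ax = plt.subplots(figsize=(4, 4), dpi=64)
    ax.imshow(tile.values, origin="lower", cmap="viridis")
    ax.axis("off")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight", pad_inches=0)
    plt.close(fig)
    return Response(content=buf.getvalue(), media_type="image/png")
```

A real tile server would handle projections and XYZ tile math, but the core idea is the same: slice the array, render it, and hand it to tools that already speak images.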
Yeah. I mean, I think in terms of the biggest challenges, I would just say that like in our space, there just like aren't a lot of like products period, like managed data services and databases that are, you know, that folks are used to.
So like, it's just not a familiar conversation to be having with certain types of developers, like, hey, like, maybe you guys should just use a managed service for this rather than like just building all your own data system from scratch.
Right. And whereas I think in other more mature technology domains, it's like just more accepted that there's yeah, like there's certain problems that like an engineering team doesn't have to solve themselves.
Like you're not going to write your own database. You're just going to use, you know, RDS or, you know, like it's more standard.
And so most of the teams we talk to, there's sort of like two end members of like participants.
There's like people kind of more like me and Joe
who came out of like academia
and are not used to having any tools
or services or support at all
and just write like, you know, spaghetti code.
And then there's like people
from like the more traditional engineering space
who like know the sort of mainstream set of tools.
And so we're trying to kind of like
help them meet in the middle.
But yeah, just having that conversation
and helping people to understand the value for them
from leveraging the managed service
and, you know, the relationship with the open source
and, you know, how all of it fits together
in an architecture.
We're still figuring out how to tell that story
in a way that resonates the most
with like the customers we're talking to.
Let's go into a section we call Spicy Future.
Tell us what you believe that most other folks don't believe yet in sort of this
infra space.
And given you're in a particular space, I'm actually really curious, what's your hot
take here?
I have very many hot takes.
I'll start with one that maybe is, I don't know if it will seem hot or not, but I think the cloud as we know it is actually ending and becoming much more decentralized.
We are seeing a huge reluctance to be locked in to a particular cloud provider and folks really making strong
choices around that. We see people chasing GPUs wherever they can get the cheapest GPUs
and consequently really worrying about egress fees. Our customers typically have up to petabytes
of data. And so, you know, the fear of that sort of data gravity is super real.
And then we see all these like innovative new compute providers popping up all over the Internet.
So, you know, in that context, we're really excited about more sort of decentralization of how data are stored.
In particular, we've been partnering with this company Tigris that you might know about that is offering sort of decentralized,
globally available, zero-egress-fee,
S3 compatible object storage.
And we think that is like really compatible with the future we see where people are doing computing.
Maybe they're even going back on-prem
and just building a GPU cluster.
And so we think cloud data platforms
have to work in that context
and can't just assume everyone is in AWS or Google Cloud.
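(A hedged sketch of what "S3-compatible, anywhere" looks like in practice: the same Zarr-reading code pointed at a non-AWS endpoint via s3fs. The endpoint URL, bucket, and credentials are placeholders, not any particular provider's actual configuration.)

```python
# Reading the same Zarr data from any S3-compatible object store,
# just by swapping the endpoint; nothing here assumes AWS.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(
    key="ACCESS_KEY",                                 # placeholder credentials
    secret="SECRET_KEY",
    client_kwargs={"endpoint_url": "https://storage.example.com"},  # any S3-compatible provider
)

store = s3fs.S3Map("example-bucket/temperature.zarr", s3=fs)
ds = xr.open_zarr(store)
print(ds)
```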
Is that spicy or not? I don't know. Maybe everyone knows this by now.
Well, we'll start with the basic level. I guess, Joe, do you have your hot take here?
Well, I mean, I think I'll take it to the other end.
You know, that was focused on the architecture broadly of cloud. You know,
I'm going to take it to the data side and just, you know, really underscore this idea that like
today we're talking about petabytes of data, you know, and a handful of customers,
and we're headed to an exascale future, not just for exascale's sake, but because, like, the questions
that are going to be answered, the efficiencies in our society that are going to be gained are like so fundamental to where we need to go as a society that it's worth going there.
Maybe this conversation feels niche today, but like if we're going to achieve this sort of ability to respond to the kind of existential crises that we have, whether they're climate or the next, you know, public health emergency or whatever it is, like we have to get much, much better using the data we
have. And so I'm really excited about thinking about like what that means from the architecture
perspective of what we're doing here. And like, what do we have to do to go from where we're at
today to the next order of magnitude and scale and efficiency for doing that? You got another one,
Ryan? Well, yeah, I got a whole backlog of spicy takes
that I could keep spitting out
as long as you guys will listen.
But, you know, this one may be not coming as a surprise,
but like, I do not think like data is solved.
What I mean by that is in a certain like corner of like,
you know, the Twitter and the blogosphere,
there's this sort of sense that like data has been solved
basically by like Apache Arrow and Parquet and that ecosystem of technologies. And I think it's
a fantastic ecosystem of technologies and has had such an impact. But the data model of Arrow,
you know, which is at the core of so much of this, it doesn't accommodate a lot of
important science and engineering problems, right? It's not rich enough to support, for example,
these sort of volumetric scenarios or, you know, things related to microimaging and genomics and,
you know, so we're very bullish on like deep tech and the impact of AI in really, really hard and complex scientific
problems. And so we think there's still more work to be done on the core foundations of the data
stack, not to take anything away from all that it has accomplished. And on the contrary,
it's like very inspirational for how like that sort of foundational technology can empower and
enable so many cool things to be built. So the fun thing
about being in this space right now is there's just a lot of really great ideas you can copy
and solve problems. And people have sort of charted the path for what it looks like.
But we get to adopt them and adapt them to sort of slightly different data domains and models.
I think that's absolutely correct. And I think oftentimes we sit back and it's really easy to think about the fact that,
you know, the problems we've really solved in data are like business data problems.
And at the end of the day, it's like stuff I can store in a row table format that's structured,
that's approaching maybe solved.
But like what we're also witnessing, especially with this recent innovation
with large language models,
but more broadly, things like attention,
the fact that we've got billion-dollar
potential AI training runs right around the corner,
this explosion of GPUs,
is broadly a broadening of the digitization effort.
So more things that were previously analog
or only could exist in the lab
because
we didn't have the compute capacity or we didn't have a way to generate value from the data set,
right? Because we didn't have the model technology or we didn't have the data that's captured,
right? So now we can capture image data, we can process in different ways. We have all these new
ways to work with different types of data that we couldn't before to create value from it. And so
I 100% agree with you. It's a new day. And the first wave of computing was all about
making business move faster, which is basically trading a bunch of business data around. And the
next wave is we're seeing things like self-driving cars, image recognition, use cases all over the
place with body cams and all this other stuff. There's tons of opportunity as what is currently
analog becomes digital and accessible. And so Ryan and Joe, it's been great having you.
Where can people find out more about EarthMover and what you're working on?
Yeah, you can definitely check out our website, earthmover.io,
and you can hear about our platform there.
We're also really excited about our new open-source project called IceChunk.
You can think of it as Apache Iceberg for tensor data at massive scale.
This is sort of the new open core storage format for the platform.
We didn't get to talk about it much today, but it does some amazing things.
And it is all written in Rust with just a light Python layer on top.
And people can check that out at icechunk.io.
And we're definitely looking for feedback and input you know, input from the community there.
Amazing. Well, thank you so much, Ryan and Joe.
It's been great having you and I'm sure we'll end up having you back to talk
more about all the cool stuff you're working on.
Thank you guys.
Very good. Cool. Yeah. Thanks, both.