Orchestrate all the Things - Building AI for Earth with Clay: The intelligence platform transforming Geospatial data analysis. Featuring Clay Executive Director Bruno Sánchez
Episode Date: June 11, 2025

How a rocket scientist turned entrepreneur created the "ChatGPT for Earth data" using transformers and satellite imagery. Bruno Sánchez is a rocket scientist with a somewhat deviant trajectory. An astrophysicist by training, he used the tools of his trade - mathematics and science - at the broadest possible scale: the universe. At some point, however, his focus switched to using those same tools for more down to earth goals. Sánchez had a stint at the World Bank, where as a member of interdisciplinary teams he helped make sense of geospatial data. Then he realized the core of what he was doing was mapping, which prompted him to launch a company called Mapbox, providing online maps on the web. This experience brought another realization for Sánchez - that we have so much data about Earth that we don't really know how to use it: "We know what are the trees in the world. We know what are the forests in the world. It's just a matter of processing [data] properly", as he put it. So when he got the opportunity to attempt to put all of that together in the same data center and in one workbench, he went for it. That was the Planetary Computer project at Microsoft, and Sánchez loved it. Then, ChatGPT happened. Sánchez noted that the T in ChatGPT - the transformer - was an architecture that seemed to work great for modalities such as text, images, and audio, but no one seemed to be using it for earth data. So he decided to give it a try. He built a team, raised funds, created a non-profit, and built an open source model using open data. And this is how Clay was born. Read the article published on Orchestrate all the Things here: https://linkeddataorchestration.com/2025/06/11/building-ai-for-earth-with-clay-the-intelligence-platform-transforming-geospatial-data-analysis/
Transcript
Welcome to Orchestrate All The Things.
I'm George Anadiotis and we'll be connecting the dots together.
Stories about technology, data, AI and media and how they flow into each other,
shaping our lives.
Bruno Sanchez is a rocket scientist with a somewhat deviant trajectory.
An astrophysicist by training, he used the tools of his trade, mathematics and science,
on the broadest possible scale, the universe. At some point however, his focus switched to using those same tools for more down to
earth goals.
Sanchez had a stint at the World Bank, where as a member of interdisciplinary teams, he
helped make sense of geospatial data.
Then he realized the core of what he was doing was mapping, which prompted him to launch
a company called Mapbox, providing online maps on the web.
This experience brought another realization for Sanchez, that we have so much data about
Earth that we don't really know how to use it.
So when he got the opportunity to attempt to pull all of that together in the same data
center and in one workbench, he went for it.
That was the Planetary Computer Project at Microsoft and Sanchez loved it.
This is what happened next.
And then, ChatGPT happened.
And then, we realized that the T in ChatGPT,
the transformer, was an architecture of AI
that seemed exceedingly amazing in understanding text,
understanding images, understanding audio, transcriptions,
all of those things, but no one was doing Earth.
That question of how many trees are in the world,
how many forests, how many fires,
that modality, people were not working on that.
So we decided to give it a try.
And then the transformer was open source,
so let's put it together.
We raised $4 million to do that from philanthropists.
And we made it, and it's great.
And it's amazing, it works.
It's incredible.
It's orders of magnitude faster, cheaper,
and better than anything else we've ever seen, which
is kind of exactly the same thing that happened with text and images and audio. It proves
again that this T of ChatGPT, the transformer, is an amazing human invention. But the point
that we then realized is that, huh, this is amazing, but you need to be an expert in geospatial
and you need to be an expert
in AI to use that file, to use that AI.
If you truly want to unlock the power of knowing how many trees are being cut or forest or
erosion or construction, all of those things, you cannot only do the AI.
You have to do a service, a product, that makes it exceedingly easy to use.
That is the question.
I don't know.
We're building it now.
We have a demo, and people can just, if they Google Clay,
they will find it.
explore.madewithclay.org.
We can see a demo of that.
It's a map.
It's a map where you click places,
and it allows you to find things.
But then we ask ourselves, is it a map because it deserves to be a map or because I am used
to maps because I'm in this industry and I want a map?
We are thinking maybe the way to maximize the utility of Clay is not to be a map.
Maybe it's also a chat interface.
Maybe it's just a column or a spreadsheet
when people are working on things.
We don't know.
We don't know what that is.
I hope you will enjoy this.
If you like my work on Orchestrate all the Things,
you can subscribe to my podcast,
available on all major platforms.
My self-published newsletter is
also syndicated on Substack, HackerNoon,
Medium and DZone, or follow Orchestrate all the Things on your social media of choice.
If someone asked me what role I feel I have, or what objective, I would say I'm a scientist.
I think curiosity has driven a lot of my professional life. And my background is very technical.
I did a PhD in astrophysics in Germany, at the Max Planck.
And then I went to the US to use satellites,
sorry, to use telescopes on rockets.
So that line of being a rocket scientist:
every single boss I ever had after that,
at some point they said that it doesn't take a rocket scientist, but I was one of those. Even though people probably mean rocket
engineering, not rocket science. I did rocket science, I did astrophysics, and I absolutely
loved being able to understand the complexity of the universe. There's something uncanny about the fact
that the processes in the Earth and in the universe are so beautiful,
and that we get brains and intelligence
enough to understand them.
Or we think we understand it.
To some degree, we understand it.
And to me, it was always a source of deep, not only gratitude, but also deep pleasure,
intellectual pleasure.
And at the same time, I started to feel that all of those tools that I had gained, all
of those mathematical, logical abstractions, like hypotheses, all of those tools I had,
I was applying them in the universe. So why not in issues on Earth, issues on society,
issues on social development or economic development?
So I switched.
And I had this amazing position during this postdoc.
And I wanted to do climate change.
And this could be a very long answer.
But I'll try to keep it short so we can double down.
Basically it was really hard because I was framed as an astrophysicist.
And so why would someone want me to do things on climate change, or to do things on health?
I interviewed at the World Bank and other places.
And it was always this answer: this is a very impressive CV, but we don't have a use for you. We are much more likely to hire an MBA
who has fewer years of training for doing technical things
than an astrophysicist who has more years of training
on things that are more complex.
We can talk about why and the biases and perceptions,
but long story short, I managed to break the barrier.
And I managed to demonstrate that my skills
on mathematics were very horizontal and very applicable. I might not know and I did not know
about climate change, I did not know about economy, I did not know politics, but working together
with those profiles, we were an amazing combo. So I worked together with experts in those fields and
we developed a model on climate change. Then I did go to the World Bank,
and I worked in the office of the president
of the World Bank, Jim Kim.
And my role was always: hey, there are these issues,
there are these technical drivers or data drivers
or satellite images or whatever thing it is.
Come, Bruno, can you help us understand
what's the there there?
What is the thing?
Working with other people, experts in economy or
experts in other things. And we did that. And it was amazing. And then to follow the arc,
I left the World Bank to go to a developing country to try to understand those issues, not from
D.C., Washington, D.C., and not by business class, but on the ground. So it was not long, but I did live for two months in Bhutan,
helping in the remotest places of Bhutan.
That was an amazing experience.
And after that, I went back to work on satellites.
I realized a lot of the things I was doing was mapping.
And I helped launch a company called Mapbox, which
is online maps on the web.
And I loved it.
And it made me realize that we have so much data about Earth,
we don't really know how to use it.
We know what are the trees in the world.
We know what are the forests in the world.
It's just a matter of processing it properly.
So I got the opportunity to build a planetary computer
at Microsoft, which was the attempt
to put all of those things together
into the same data center, into the same workbench.
And it was amazing, it was a great opportunity.
And then, ChatGPT happened.
And then, we realized that the T in ChatGPT,
the transformer, was an architecture of AI that seemed exceedingly amazing in understanding
text, understanding images, understanding audio, transcriptions, all of those things,
but no one was doing Earth.
That question of how many trees are in the world, how many forests, how many fires, that
modality, people were not working on that. So we decided to give
it a try. And because we thought it deserved to be a nonprofit, we created a nonprofit
to make the model completely open source, only using open data so that we can make the
model open license. So there's no questions about what data you use, what are the IPs
or the licenses of the train model
if you only use open data, which we have tons of petabytes on.
I knew that because I had built a planetary computer, right?
I had to put all that data, those more than 50 petabytes.
And then the transformer was open source,
so let's put it all together.
We raised $4 million to do that from philanthropists.
And we made it.
And it's great. And it's amazing.
It works. It's incredible. It's orders of magnitude faster, cheaper, and better than anything else
we've ever seen, which is kind of exactly the same thing that happened with text and images and audio.
It proves again that this T of ChatGPT, the transformer, is an amazing human invention.
And then, just to make the whole story short and go to the end, we realized that
it's not the only one.
There are others.
I'm happy to talk about other models similar to Clay, like Prithvi, the
Terraformer, and others.
But the point that we then realized is that, huh, this is amazing, but you need to be an
expert in geospatial and you need to be an expert in AI
to use that file, to use that AI.
There are maybe, I don't know, 80,000 people in the world
who fulfill those criteria.
If you truly want to unlock the power of knowing
how many trees are being cut or forest or erosion
or construction, all of those things,
you cannot only do the AI.
You have to do a service, a product,
that makes it exceedingly easy to use.
The ChatGPT of GPT, like anyone can work with the chatbot.
So what is the equivalent for that?
And because that is a product, not a model,
it made less sense to us to raise philanthropy money.
Because to me, it makes sense to raise philanthropy money
for something that is get money, do the output.
And if there is no money, well, there's no more output.
But the output is there.
The model is there.
It's there if we disappear.
That file is there, and anyone who knows how to use it
will get amazing results.
We lift the tide for everyone's boat.
We make it easy for everyone who knows how to do it.
But if you provide the service and you are a nonprofit, you have to constantly raise
funds to deliver those services because they cost money.
It's the OPEX, right?
So, we didn't want to do that constant battle
that a lot of nonprofits have to do
that ends up being a little bit like a consulting company
or a project space company.
It was really nice when we had this money
to deliver this model, we made it, we did it, go on.
Let's move on.
So we decided to make it a for-profit.
That's Legend.
So Clay is the nonprofit, Clay is the model,
Legend is the company whose service
is to use Clay or others to make those things easy.
If you know what to do, go ahead, use it.
If you want the easy way to do it,
or the thing at scale, that's Legend.
And that's kind of the reason I'm here
and the north star has always been the same,
which is how to use deep technology
for impact in society using different tools like non-profits, for-profits, startups,
Microsoft, World Bank. I feel I've tried to follow that compass all the time.
Okay, indeed. I have to say it was much longer than the typical answer people give. Yes, sorry.
That's okay. I'm not saying that as a bad thing necessarily, just a fact.
So you already included in your opening statement a lot of things that we were going to touch on a little bit later anyway.
You already mentioned Clay, and that was obviously going to be my second question.
And that's the main reason it's what piqued my attention, basically. I just saw recently that it was released. Perhaps
it's actually been around for longer. I'm not sure, you can tell us.
Roughly.
Okay. And so what is Clay? You can give the long answer. I can give the short answer
based on my understanding after you give yours.
So the monologue before was kind of the context of why we do this thing. So now
we need to go into the details of what it actually is, what it does, right? So at the end of the day, Clay is an architecture, it's a processor, it takes
satellite images or plane images or drone images, it takes any kind of image and then
understands what is there. If you have a plane, if you have crops,
if you have water, you have boats,
you have ways to count them, to calculate with them.
I'm being a bit abstract on purpose
because the power of this tool
is not that it's able to count cars.
It's not that it's able to predict crop yields.
It's that it's able to do all of those things.
That is the power of this transformer thing, the power of the invention we did, or OpenAI
did with research from others: this architecture allows you to do a lot of the compute,
a lot of the thinking, ahead of time and for everyone.
So when you work with clay, you take those
images and if you want to count cars, it gives you the answer. But if you want to predict
crop yields of the same place that also has cars, you can do that equally fast because
the model has understood how to see things, how to see semantics. So that's what Clay is. It's like ChatGPT creates words. We create
semantics, or understand images. Actually, that was going to be, you know,
the three-word explainer that I would give for people. If somebody asked me,
it would be something like that: the ChatGPT equivalent for image data,
with a little bit of a footnote there
because I'm not exactly sure
if it only works for image data.
I believe I did see somewhere in your documentation
mentioning things like sensor data and other types.
So it's a little bit, it's a little bit different.
So it's that, but with a little twist
that makes it, in my opinion, way more interesting. Because you can go to ChatGPT and give it images. You can ask about
deforestation and you can ask questions over there to ChatGPT. But what those things do
is the equivalent of reading the Lonely Planet, the travel book of Colombia, and saying you've
been to Colombia. When you ask the capital of Spain to
ChatGPT and it says Madrid,
it is basically because it has read that that is the thing.
But what Clay does is it takes the images,
the rasters, in a technical way, of any satellite,
visual, infrared, synthetic aperture radar,
all of those things, and it understands them.
When you ask where the fires are,
it's not what it has read about it,
it's that it has seen that fire in these images.
And it can point to where that fire is.
If you are looking for solar panels,
it's not that it has read where they were.
It's not that it has images of solar panels.
It has visited virtually those locations
and it knows what's there.
So it's kind of the librarian of Earth
where you can index and you can ask questions
and it's not just going by the words
of someone talking about it;
it's that it has visited those places.
It's the Marco Polo, if you want.
Yeah.
Of course, there are some similarities, some
parallels, let's say, but there are also very important differences, obviously, this being
the most important. The training, the data set that you used to train, as you mentioned already
in the beginning, you have basically used a gazillion of open data images
to train that data, to train that model.
And as I have been able to figure out just by reading
what is available out there, it seems to me like the key
to make that work is actually embeddings,
because you used embeddings to transcribe, let's say, the meaning,
quote, meaning of every image into text. So what I'm wondering, well, first of all,
again, for the benefit of people who may be listening, you and I know what an embedding is,
but that's not necessarily the case. So if you can make like a one minute introduction to embeddings, and then I'm very, very curious.
What exactly did you use?
What function, let's say, did you use to do those embeddings?
So when we train Clay,
what we do, for those who understand the words,
is a masked autoencoder. What
it means is that you take the image, which
is a certain size with red, green, blue, all of those colors, and you force the system to compress
it into just a bunch of numbers.
You just give it a very narrow space
to store the information, a summary of the image.
It's like if you ask an expert to just tweet
what is the content of an image.
You only have one tweet.
But then there is the other half of the model.
So the first half is doing that from image to embedding, which is that thing.
And then the other half of the model only has access to that tweet, to that vector,
and its job is to reconstruct the image.
And then the function that makes all of this work is that I am going to evaluate the work of all
of those parameters by the difference between the input
that generates the embedding and the output made
with the embedding.
So that in itself already tells you
that the embedding is by necessity a compression.
It's a lossy compression because it's not perfect,
but it's a compression of the contents of the image.
And then we go one step more, which is that we remove,
when we do that embedding, when we give the image,
the input image to the model, we remove part of it.
So we remove half the face or like randomly crop images.
And we made the
vector summary of the image with those holes. But then we ask it to reconstruct the entire
image. So it needs to understand that if I see half a face, there's probably the other
half there. If I see half a glass, there's probably another glass on the other side. So we are forcing the model to understand by compression, but also by context.
By the fact that, yes, I don't see these things, but they might be there.
That's the masked autoencoder, which allows us to not need human labels or to not need to
supervise the models, because we can generate a lot of training data by
taking all of those images, cropping holes in places, and asking it to do that. That was a little bit more than a minute.
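To make the masked autoencoder he describes concrete, here is a minimal, hypothetical PyTorch sketch of the idea: hide part of an image, compress what remains into a small embedding, reconstruct the full image from that embedding alone, and use the reconstruction error as the training signal. It is illustrative only and is not Clay's actual architecture or code; all sizes and data are made up.

```python
# Minimal masked-autoencoder sketch in PyTorch (illustrative only, not Clay's real code).
# An encoder squeezes a partially masked image into a small embedding ("one tweet"),
# a decoder tries to rebuild the full image from that embedding alone, and the
# reconstruction error against the original image is the training loss.
import torch
import torch.nn as nn

IMG = 64          # toy image size (pixels per side)
CHANNELS = 4      # e.g. red, green, blue plus one infrared band
EMBED_DIM = 128   # the "tweet": a small vector summarising the image

encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(CHANNELS * IMG * IMG, 512), nn.ReLU(),
    nn.Linear(512, EMBED_DIM),             # image -> embedding
)
decoder = nn.Sequential(
    nn.Linear(EMBED_DIM, 512), nn.ReLU(),
    nn.Linear(512, CHANNELS * IMG * IMG),  # embedding -> reconstructed image
)
optimizer = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

def training_step(images: torch.Tensor, mask_ratio: float = 0.5) -> float:
    """images: (batch, CHANNELS, IMG, IMG). Randomly hide pixels, reconstruct everything."""
    mask = (torch.rand(images.shape[0], 1, IMG, IMG) > mask_ratio).float()
    masked = images * mask                       # the model only sees the unmasked parts
    embedding = encoder(masked)                  # lossy compression of what it saw
    rebuilt = decoder(embedding).view_as(images)
    loss = ((rebuilt - images) ** 2).mean()      # compared against the *full* image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Random tensors stand in for satellite chips here:
batch = torch.rand(8, CHANNELS, IMG, IMG)
print(training_step(batch))
```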
Right, so how many rounds of training have you done so far? Or, to ask it in
a different way, how many different versions of the model have you released so far?
Three. So we had the core idea early on, before we got funding, and it's basically the model
exists for text, as I was saying, and even exists for images. So to go from images to
Earth basically is a few tweaks of, hey, an Earth image has a location, has a time,
has an instrument, has a few things that are special. So, we did that thing. We did a small
scale kind of to prove if it works. Then we got the funding. We did the version one knowing that
this would work, knowing we had access to these compute GPUs to train. And then the latest version, what we call 1.5,
which is a bigger model, like more parameters,
more numbers and way more data,
like billions of little patches of images
and a lot of training doing the patching
all of this masking thing.
That was, to our knowledge,
the largest training in the open (because if someone does it not in the open, we don't know),
the largest training in the open for
an Earth model.
Okay. And have you potentially seen already how people use it out there?
Well, first of all, before we get to that point, there's probably a better
question to ask: how can people use it? You already mentioned in the beginning
that well, yes, I mean, in theory, if you have the model out there, you could use it,
but it takes certain expertise to do that. So I think you have created different ways
for people to use it as well.
Clay is a technical resource. You need to know how to run a Python notebook,
at least. You need to know some geospatial AI. So we do have tutorials to guide you to do
that. We have done some of those applications. We have demonstrated that Clay is able to do
land cover, which is when you get a map and you need to know what's water, what's forest,
what's grass, all of those classes. We are able to create those maps
extremely quickly with a very high precision.
The regression was 90.92, if I remember correctly.
Also to calculate the amount of biomass in a forest,
the amount of biomass above ground is kind of
how much carbon is there on those trees.
So typically all of these operations
take a lot of computer vision, pixel math,
all of these things; to do it with Clay is just an embedding.
So it's much faster.
We know that can be done.
We also know we can detect ships,
we can detect aquaculture,
these are real cases of people doing it.
But the thing is that because the model is open
without limitation, we don't know what people are doing.
We only know what they tell us.
So we know people who are using the Clay model
for many use cases.
So we know it's useful, but the point is that
it's a technical resource.
So if we really want to deliver the promise
of understanding Earth,
then we need to make it extremely easy to use.
And that's when we decided to wrap it with a service.
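As an aside, the land cover workflow he describes might look roughly like the sketch below: a lightweight classifier fitted on per-chip embeddings instead of heavy per-pixel computer vision. The data here is synthetic and merely stands in for real Clay embeddings and labels; this is not Clay's tutorial code.

```python
# Hypothetical sketch of a downstream task on precomputed embeddings: fit a small
# land-cover classifier on per-chip embedding vectors. Random vectors stand in for
# real embeddings, so the score here is around chance; with real embeddings the
# classes would be separable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 128))       # stand-in for embeddings of 2000 image chips
y = rng.integers(0, 4, size=2000)      # e.g. 0=water, 1=forest, 2=grass, 3=built-up

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # lightweight head, no GPU needed
print("held-out accuracy:", clf.score(X_test, y_test))
```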
Okay, so how do people use the services?
Is there like a text interface?
Is it like an application?
Is it like a web page?
How does it work?
That is the question.
I don't know, we're building it now.
We have a demo and people can just,
if they Google Clay, they will find it.
Explore.madewithclay.org.
We can see a demo of that.
It's a map.
It's a map where you click places
and it allows you to find things.
But then we ask ourselves,
is it a map because it deserves to be a map
or because I am used to maps
because I'm in this industry and I want a map.
We are thinking maybe the way to maximize the utility of Clay
is not to be a map.
Maybe it's also a chat interface.
Maybe it's just a column or a spreadsheet
when people are working on things.
We don't know.
We don't know what that is.
Okay, okay. Fair enough.
Yeah, to be honest with you, just because possibly I was guided that way, let's say, because of the analogy we made with ChatGPT,
somehow in my mind, I felt, well, okay, you're using embeddings, so there must be a way to express what you want to get out of the model in text.
But apparently, that's not necessarily the case.
Yeah. So when we do those embeddings, when we do that compression, at no step are we
asking it to explain it in text, English, Spanish, or whatever language it is. We are
asking the model to do whatever it wants to find those vectors and those relationships. But it's also true that we have a whole other tool
to understand things, which is text,
which is words and relations between words.
We can say forest, but you can also say eucalyptus,
and the concepts are similar because there's some relations.
So it is obvious that if we are able
to combine the power of text models
with the power of earth models,
it would unlock much more.
And that's what we are working on now
to figure out how we can align the embeddings of text
with the embeddings of earth.
And we have some proof of concepts.
It's not easy at all, but yeah, it's obvious
that it's the future.
Okay. Okay. So you're basically looking into getting into the next level, let's say, which
is having a multimodal model.
Yes. Yes. But here's the question that we ask ourselves. When you train a text model, I would argue it takes a lot of effort.
Because text could be of anything.
Not only any language, but you have text of reality and also text of fiction or text of
conspiracy or text of fake stuff like imaginary things.
But when you train on Earth, you only have one Earth.
It's large.
It's marvelously complex.
But there is no fake data.
There is no fiction on Earth data.
So just by that logic, it seems that training
an Earth model is far easier than training a text model
because the possibilities are smaller.
So to me, that means the multimodal model that
includes Earth is not going to pay much attention
to the Earth's modality, because the text is the harder one.
So I don't think that a multimodal that includes Earth
is going to be as good as an Earth-only model that then learns to understand the text model.
I would argue that this depends on the type of dataset
that you choose to train on text.
I mean, sure.
If you take the entire web and things from social media
and whatever, then yes, you're going to get lots of junk. But there are very
well curated and very selective data sets out there that you can possibly use to train
more meaningfully.
We are doing that. We are taking OpenStreetMap, which is like Google Maps, for people who don't
know. It's just a map that you can then query and use. So we are taking the text of OpenStreetMap of the locations of things
as the text training for mapping.
Okay. And that's very interesting, actually. I would not imagine that you would do that,
but when you say it, it makes kind of sense. And the way that you
combine against those things is based
on coordinates or just visual pattern recognition? So, right now it is kind of naive. So, we take the
image and we do the clay thing and we end up with embedding. And then we take the same image and we
ask OpenStreetMap, what's there? What do you have that's there? And OpenStreetMap says, oh, there is
a desert, there's a river, there's a port, there's a
parking, there's a gas station, whatever, all those texts.
And then we use a model of text to make an embedding of that.
So then we take the embedding of the text of that place and the Clay embedding of the
same place, and we figure out how we can align them so that they minimize the losses of the differences
when they try to recreate each other,
or when you find similars:
similar things in the Clay embeddings
should be similar things
in the text embeddings.
Because they are encoding the same thing
just in different ways.
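A rough sketch of what that alignment could look like in practice, assuming you already have image embeddings and text embeddings for the same locations: a learned projection into a shared space, trained with a generic contrastive loss that pulls matching pairs together. The dimensions, temperature, and random data below are invented, and this is not Clay's actual method.

```python
# Generic contrastive alignment of Earth-image embeddings and text embeddings of what
# OpenStreetMap says is at the same place. Matching (image, text) pairs sit on the
# diagonal of the similarity matrix and are pulled together; everything else is
# pushed apart. All sizes and inputs are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

IMG_DIM, TXT_DIM, SHARED = 768, 384, 256
img_proj = nn.Linear(IMG_DIM, SHARED)   # projection from the image-embedding space
txt_proj = nn.Linear(TXT_DIM, SHARED)   # projection from the text-embedding space
optimizer = torch.optim.AdamW(list(img_proj.parameters()) + list(txt_proj.parameters()), lr=1e-4)

def alignment_step(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> float:
    """img_emb: (batch, IMG_DIM), txt_emb: (batch, TXT_DIM); row i of each describes the same place."""
    a = F.normalize(img_proj(img_emb), dim=-1)
    b = F.normalize(txt_proj(txt_emb), dim=-1)
    logits = a @ b.T / 0.07                       # similarity of every image to every text
    targets = torch.arange(len(a))                # the matching pair is on the diagonal
    loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(alignment_step(torch.randn(16, IMG_DIM), torch.randn(16, TXT_DIM)))
```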
Okay.
But still, in the text you get out of OpenStreetMaps,
I'm guessing that you basically only get things like names of locations and landmarks.
Yeah, you have to filter those out. There are things you will never expect a satellite
image to know, like the name of a hotel, unless it is on the rooftop.
Actually, I would imagine the opposite, that maybe you need those things if you want to
be able to use them, like, I don't know, find how many trees are around this hotel, for
example.
Oh, yes.
That's frontier.
And that's, this is the thing that I find so fascinating, which is that we are at the
frontier of human ingenuity when it comes to understanding
Earth. And the most amazing part is that we are not limited by data, because we have the
data and it's open. And every five days we get a new image of any location on Earth with
Sentinel. And it's not technology either, because transformers are proven; it is just how we tweak this to find the answer.
So it is indeed an exceedingly exciting time when it comes to understanding Earth.
Okay, well, since you mentioned transformers, there's something else specifically about them
that I've been meaning to ask you, which has to do with the limitations of transformers.
It's true that they have certainly powered much of the innovation
having to do with both text and multimodal models.
However, one limitation that I'm pretty sure anyone who's even remotely
acquainted to these models by now is familiar with
is the so-called
hallucination problem. I mean, the fact that these models tend to be creative in ways that sometimes
end up producing things that are not truthful, basically. So, are you using that architecture
as well? Do you have that problem as well? Much less.
Yes?
Okay.
We don't. I say we don't, but I get pushback from people. And the reason I say we do not
is because you get the hallucinations when you use models in text to generate the next
word. That's how these things work. You give it a sequence and then you ask what's the
next word. So it has trained on the data to predict what's the next word.
But we don't use clay to predict the next location of Earth.
We only use clay to understand what's there.
So we never ask, give me just from nothing,
give me an image of that.
The only thing that is kind of close is the masking thing.
Remember when I said in the training, we mask out places and then we ask it to reconstruct. There is some
possibility that if something is very rare, the transformer reconstruction would pay less
attention because it doesn't know what it is. So, that's maybe the degree to which hallucination
might play a role when encoding things that are really rare. But because we don't generate the next word,
the next location, we don't take an image from last year
and ask for an image for next year of a location.
We don't have that problem with transformers.
We have other problems.
The transformers are not perfect.
It's just that the world decided to go with transformers.
We could have chosen something else,
like diffusion techniques or whatever it is.
But we chose transformers, and we're just going with it.
Why?
Because the amount of research, funding, and expertise
that the world has given to that particular technology
makes it the obvious choice to see what else it can do.
If I had to choose from scratch and I had the funding
to drive the world movement of AI,
maybe we wouldn't have chosen Transformers.
But that's not a decision in our hands.
Interesting.
Yeah.
So yeah, I mean, the more we talk about it,
the more the picture gets clearer.
And also at the same time, the more questions I get, I have to tell you,
which is a sign of a good conversation, if I may add.
So you are using transformers,
but only for the embedding part.
You're not using it to generate, to predict anything,
just to match with as good precision as you can? Is that it?
We evolved this. So, in the beginning, the whole concept of a foundational model was to
do that thing, the encoder, embedding, decoder with the images, blah, blah, blah, the
thing we did: train the model, and once it's trained or pre-trained, then you take the
decoder, the second half,
and you replace it with a task.
And the task might be giving an image,
count the number of cars.
So you have the answer of three cars,
and then you put only the decoder,
and then you train only the decoder to count cars,
knowing that the encoder part,
the thing that goes from image to embedding is good,
because it has been proven good
in the foundational part.
So then that world at the time was a world
of foundational models and decoders.
So when you want to do a task, a specific task, whatever that task is, you would take the foundational model
whatever that task is, you would take the foundational model
and fine tune it to make a derivative model
that is good at doing that.
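The fine-tuning pattern he just described, before the shift to embeddings-only workflows that he explains next, looks roughly like the sketch below: freeze the pretrained encoder and train only a small task-specific head. The encoder, shapes, and car-counting task here are stand-ins, not Clay's real pipeline.

```python
# Sketch of the "pretrained encoder + task-specific decoder" pattern: keep the
# encoder learned during pretraining fixed and train only a small head for one task
# (a toy car-count regressor here). The encoder is a stand-in for a real pretrained model.
import torch
import torch.nn as nn

EMBED_DIM = 128
encoder = nn.Linear(4 * 64 * 64, EMBED_DIM)        # stand-in for a pretrained image encoder
for p in encoder.parameters():
    p.requires_grad = False                        # keep the pretrained weights frozen

car_head = nn.Sequential(nn.Linear(EMBED_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.AdamW(car_head.parameters(), lr=1e-3)  # only the head is trained

def train_head(flat_images: torch.Tensor, car_counts: torch.Tensor) -> float:
    with torch.no_grad():
        emb = encoder(flat_images)                 # embeddings come "for free" from pretraining
    pred = car_head(emb).squeeze(-1)
    loss = nn.functional.mse_loss(pred, car_counts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_head(torch.rand(8, 4 * 64 * 64), torch.randint(0, 20, (8,)).float()))
```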
But then we realized that to the degree
that we can use the embeddings to do anything,
we could then create embeddings
that are universally applicable.
So we generate embeddings and then with the embedding,
not with the model, we try to make a decoder only using that embedding,
not using the encoder.
And that is a world that, I think,
has a lot of potential, because it's not anymore
a model that you need to fine-tune,
but it's a world where you only need to use the embeddings,
the vector operations.
And to the degree that that works means that you get answers
in milliseconds, not in weeks.
Does that make sense?
OK, so you're saying that basically you can just
take the vectors that you have produced
and just put them in a vector database
and don't care about the model at all.
That's exactly what we're doing.
Because if we do that, imagine we
have a user that wants to find the solar panels in Greece.
And we have made the embeddings for the whole of Greece.
Or we make them for this person.
Then there you go.
Then we make the embeddings.
And then it's literally milliseconds
to know where they are, to have not a perfect answer,
but to have a good answer of where the solar panels are.
But then someone else comes and say,
hey, I wanna find the boats,
or I wanna find construction,
or I want something else.
Exactly the same embeddings are used for that new operation.
So that means that you only need to create them once.
That is the power of embeddings:
they are a universal pre-compute of most of the way for
most of the answers.
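A minimal sketch of that embeddings-only workflow, with random vectors standing in for precomputed Clay embeddings: embed the region once, then answer each new question with a fast similarity search over the same vectors. Data, sizes, and thresholds are invented.

```python
# Embeddings-only workflow: embeddings for an area are computed once, and each new
# question ("solar panels?", "boats?") becomes a nearest-neighbour search against a
# handful of example chips. Random data stands in for real precomputed embeddings.
import numpy as np

rng = np.random.default_rng(0)
area_embeddings = rng.normal(size=(100_000, 128))      # precomputed chips for a whole region
area_embeddings /= np.linalg.norm(area_embeddings, axis=1, keepdims=True)

def find_similar(query_embeddings: np.ndarray, top_k: int = 20) -> np.ndarray:
    """Return the indices of the chips most similar to the averaged query examples."""
    q = query_embeddings.mean(axis=0)
    q /= np.linalg.norm(q)
    scores = area_embeddings @ q                        # cosine similarity, all chips at once
    return np.argsort(scores)[::-1][:top_k]

# A few chips the user marked as "this is a solar farm"; a different query
# (boats, construction) reuses exactly the same precomputed embeddings.
solar_examples = area_embeddings[[10, 42, 77]]
print(find_similar(solar_examples))
```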
Okay.
And, but of course, all right, a number of questions there.
First, do you have any kind of, I don't know, benchmarks or empirical evidence on the kind
of precision that people get on different types of queries?
We have benchmarks for very few cases and there's a couple of reasons. One is the lazy
reason which is the reality, which is that we had limited funding and we decided to prove
it deeply with a couple of things like land cover, biomass regression, or finding assets.
And that proved good enough for us.
There is a lot of other assets.
There is a wealth of academic benchmarks to do that.
And Clay is not part of those, simply because of the time
it takes to put it there.
And as I speak, there's people who
are putting Clay in those benchmarks to add them.
But the real answer to me is that I don't really care that much about the benchmark
because to me the value is not only the score in the benchmark.
The value is that where before you could only get any answer if you had access to a lot
of compute, a lot
of expertise. Now you might get an answer that might not be perfect, extremely fast,
extremely cheap, and good enough. Even though the scores might not be perfect because the
embeddings will never be perfect, there is a value which is beyond the scores.
Maybe the value is open source.
Maybe the value is offline.
You can take embeddings and work with them.
The value is that you can dig around and tweak them.
And I feel the chase for the last decimal point on the scores drives you to spend your limited bandwidth on things that might not be the ones that,
at the end of the day, have the biggest impact.
Sure. I think I share this sentiment. I was mostly curious, not about the last decimal
point, but just as a kind of rough measure. Is it like, I don't know, 80%, 50%, 99%?
It depends. So, what we find...
And as I said, people are putting the Clay model
in these benchmarks, and I'd love to see where it goes.
I suspect it's not gonna be the top,
but it's not gonna be in the middle.
I suspect it's not gonna be good enough.
But besides that, we also have seen
that the embeddings perform best when the
thing they're looking for is the dominant thing on the image. So you divide the world
in little squares and then you make embeddings for those little squares. So it works best
when the thing you're looking for is dominant. Like right now, in this image of this video conference, my
face is barely maybe one third, one fourth of the image,
so it would be good.
But if I sit back, my face might be small for the embeddings.
I might not get it.
But if I get too close and you only see half the face,
you might even lose the semantics of the face.
So it also depends on the size of the thing.
That's why we are making embeddings of different sizes.
We don't know.
The answer is that we are betting that there is a there there. Maybe there is not. Maybe
we are wrong and maybe embeddings are not it. But if it is, or to the extent that it is, I strongly
believe that it would unlock so much value to so many issues, social, economic, environmental,
and also investment-wise.
There's so many things that make sense for this that I'm going all in.
And I'd rather be wrong, but have even tried it,
than wait to see a technology that gives me 100% assurance that it will work.
So that makes sense.
And we're also going to talk about the investment part and the opportunity part.
But before we get to that, I want to stay a little bit longer with that because I think there's lots of juice there.
So another question that occurred to me as we're talking about it is, All right, fine. Regardless of the actual accuracy that you can get today,
if you want to keep using the model, the embeddings rather, in six months or a year from now,
does that not mean that you should also update embeddings with new data? Because what you have
trained, what you have used to generate the embeddings is not necessarily valid after a
certain amount of time. So you must always regenerate. You need to generate embeddings because you need
the new images. Things happen in the world. There are fires or constructions of things.
So those new images you need to create those embeddings. But I don't think you need to train
the model again. And the reason for that is, even if the things change, the envelope of possibilities
doesn't really. Of course, Spain is getting desertification with climate change.
But even though for Madrid, what you might see has never happened in history, it has happened,
and it is happening in other places of the world.
So if the world, if the model gets to see a desert in the middle of Madrid, it will be unique for
Madrid, but it will not be unique for the world. So to the degree that in Greek or in physics,
there is this concept called ergodicity. So I believe the Earth is ergodic,
which means that all the possible states are present at any time. So any evolution of one time
over time has been present somewhere else. Of course, if you speak with conservation people,
if you speak with climate change, they would say that they don't believe that is the case. Maybe the answer is halfway. But that's also one reason why I think the Earth AI is much faster to converge
and to learn than text because Earth is ergodic. So we can learn very quickly. And anyone has this
intuitive sense. If suddenly, you know, Helsinki becomes
a forest, you have never seen that, but you immediately know it's a forest. Because you
have the concept of a forest from somewhere else.
Yeah, yeah, you're probably right on this one. So I wasn't really thinking about retraining
the model, but just regenerating the embeddings. Yes, you have to. And this takes time.
This takes effort.
If you are a Clay user and you just want it for Athens,
then it's very easy.
You just do it on your laptop.
And if you are a company and you want the whole of Europe
or the whole world, then it's better.
That is one of the business reasons
for the company we created: we take care of that.
We make all the embeddings.
We optimize our factory of embeddings to make them extremely fast and extremely cheap so
that we can have any embedding you can think of.
Okay.
All right.
So I guess I was also going to ask about fine tuning because again, reading the available
documentation, I kind of figured that
this is what you're nudging, let's say, users towards, but I'm not so sure anymore. So have you
given up on on fine-tuning and are you just now directing people to use straight up embeddings?
Not fully given up but to the degree we can get away without fine-tuning.
Yes, because fine-tuning is tweaking the entire parameters of the encoder and
finding new parameters for the decoder. It's lengthy, takes time, it's expensive,
and it only works for one output. And if you can go with embeddings... By the way,
some people argue that using embeddings
and making models of things with embeddings is like inference-time fine-tuning, which
is also what Gemini and other models with very large context windows do.
That's what they do.
When you ask a question, if you put it in ChatGPT, sorry, if you put in any of these models
a question, what the model receives
is not just the question; it is prepended by a lot of context
that is given by the system.
So in some way, you could argue that you're doing
fine tuning on the spot of on inference time.
Okay, well, one of the reasons I wanted
to ask you specifically about fine tuning
and also about the original, let's say, training
that you did for Clay was because to my mind,
at least, there's a little bit of a paradox there.
And it goes into the broader conversation about AI and the cost of producing
AI models and running AI models and running data centers and all of that. So for me, the
paradox is basically like, okay, so you are using AI, which is a kind of environmental,
I don't know, not necessarily negative,
but something that we need to clarify exactly what kind
of impact does it have?
How many resources does it consume?
And all of those things.
So you are using that as a vehicle
to improve, presumably, the whole environmental situation,
or at least the understanding of the environmental situation.
So I'm wondering if, obviously, I
guess that you must have thought about that yourself.
So what is the answer that you give to that?
So first of all, are you able to pinpoint
how many resources you have used to train and run
Clay versus the
kind of impact, positive hopefully, that you can have with that?
Yes, so yes, we can measure the emissions, because we run this thing ourselves to some degree.
So the degree that our provider, which is AWS, is able to give us the scope one, two, and three
of that compute.
Then we also need to add the scopes of this laptop running
or the electricity that I'm using, all of these things.
We can measure those, but those are not as large
as the compute itself, which we can offset,
and we do offset for AWS offsets in their operations.
But to take a step back, yes,
AI is exceedingly resource intensive
in electricity, in water, and in other cases.
When I was at Microsoft, that was the unit I was working on,
and so I am familiar with those things.
And it's kind of ironic that in our case,
we then apply those AI's to hopefully avoid more emissions.
So that is something we obviously think about.
That is one of the reasons it's a nonprofit.
The idea is that we are extremely open,
not only on the outputs, but on the ways we do that
and how we approach it.
And the reason is that hopefully number one,
we can convince people that are thinking similarly
to not have their separate thing but to just join forces.
So instead of training 10 models,
we train one model that is open enough and good enough
so then people can use it with fine-tuning
or without fine-tuning.
But then it means that we don't need to create a ton of them.
Two, if you do your own thing,
you hopefully can learn from our experiences the
thing that works and the thing that doesn't. So you reduce the amount of training or wasted
training or experiments that you do with that. So that's kind of how we are approaching the issue.
It is also a fact that we need to design the transformers themselves, the architecture,
to be as efficient as possible. That's also why we started with smaller models and then bigger
models. Because one of the other limitations you mentioned of transformers is that they
take so much data to learn something. Like, they learn much slower than a human
brain. And not only that, when they learn something,
you have to be careful.
Because if you give them something else,
there is a very high risk that they
forget everything else, or it will collapse.
So when you fine tune for a task, one of the reasons
that then you cannot use that model anymore for anything
else is because a lot of the training
to understand any other thing kind of gets away, gets lost.
So that's another reason why embeddings
are, in my opinion, more environmentally friendly.
And then lastly, if you use embeddings,
as I was saying, it's most of the way for most of the answers.
So if you compare getting the answer traditionally
with computer vision or with other ways,
there is a lot of compute resources that are used to get to the same answer. But if you use
embeddings, a lot of that compute has already been embedded in the embedding, has already been
inserted there. So then all of your answers are much more resource and environmentally
friendly, because the consumption, the emissions, have been done when you created them, and that is useful for everyone.
And yeah, I think it's notoriously complex and hard to come up with a kind of estimate, let's say, that is close, close to reality. I think the latest effort I have seen at that was, and you may have
seen it as well, was a series of articles actually that the MIT technology review published
in which they tried to calculate like the net effect, let's say, of AI model training
and the offsets and all of those things. And they did a fairly good job, I have to say,
as far as I was able to tell,
but I think the final outcome was that
it's still tentative.
We can't really pinpoint because there are so many variables
involved basically.
Yeah, there are infinite variables.
Like for example, I bought a server and it's sitting in my
home,
which cannot do many embeddings, but it can do some. But because here in Denmark, the
electricity in my house comes from wind mostly, that means that whatever emissions I do,
they're free. It's not even net. It's like they are not... it's green electrons. And there's
no need to offset them because they are green electrons. So our emissions are zero.
And because I am using a model that is already trained, you could argue, and that's how the
carbon market works, that I could even release credits of carbon because I'm avoiding emissions,
which is crazy, obviously.
But there are ways to minimize the emissions. One, and two,
it's extremely complicated. So not only to understand emissions, but also to handle them
and how the markets of emissions, offsets work. It is extremely complicated and just to pick on your example on the wind generated energy,
well yes sure it doesn't cost like fossil fuels to generate that but what about the resources it
took to build the turbines or you know the bases or to transport all of those things so it's
the only thing we can just say at this point probably is that it's super complicated.
You know the thing that also I wonder is that maybe the answer is that because the data we get is those images, the satellite images we use for that, they need to be processed too. They need
to be downloaded from the satellite, they need to be read and calibrated, all of those things.
If at that time the providers, NASA, ESA, and others,
and I'm being in conversations to promote this idea,
they pick two, three, four models,
the ones that they pick, and they make the embeddings,
it would be incrementally far better on emissions
and on cost. Because it's already in memory.
It's already in the processor.
It's already there, right?
So why not release it together with the image,
release the embedding?
Yeah, yeah, makes sense.
Okay, so that's a very good segue
to touch on the last part of the conversation
because what you mentioned could be a use case.
And use cases are precisely, I figure, the reason why you chose to go one step beyond
the non-profit that brought Clay to life and found a private company
that you can use as a vehicle to offer services around Clay.
So what kind of services could they be?
What kind of use cases and clients
do you envision or do you potentially have already?
We do. We do already have paying customers, very early on. And so it's much more ad hoc.
For those who have been in startups, you know that in the early days
everything is using the product, but also understanding the customer very well and doing things in between, right?
And then again, our biggest issue is that we could do so many things that is hard to
convey the horizontal nature of embeddings, but at the same time, making it visible what
it can do. We do have one customer that is asking, hey, I get some paper from
certain places in the Amazon, and I want to make sure that the paper comes from those
places and not from protected places. So then what we do is: they give us the locations,
and we go and see in the embeddings. If the places that they say they took it
from, there should be a change.
In the embedding before that, because we have access
to historical data, so we can make embeddings
for historical data.
There should be a change from the forest before that
and a forest after that.
And the places that are protected,
there should be no change at all.
We might not be able to say if the trees were cut and given to someone else or things like that, right?
But we are able to do those monitoring services.
And the key is that the creation of the embedding,
the figuring out the semantic and the semantic change,
all of those things can be abstracted into functions that never even touch the customer. All they want is a column in their sheet,
which is: it comes from this place, at this date, yes or no. And that yes or no encapsulates a lot
of work that can easily be abstracted. Also explained, also audited, but the customer
doesn't really care about transformers or embeddings.
They care about, hey, can you check this?
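A hedged sketch of what such a yes/no change check could look like, assuming you have embeddings of the same location from before and after the claimed harvest date; the cosine threshold and random data are placeholders, not Legend's actual pipeline.

```python
# Hypothetical "did this place change?" check behind that yes/no column: compare a
# location's embedding from before and after the claimed date, and flag a change when
# cosine similarity drops below a threshold. Threshold and inputs are placeholders.
import numpy as np

def changed(before: np.ndarray, after: np.ndarray, threshold: float = 0.85) -> bool:
    """True if the scene looks different enough between the two dates."""
    sim = float(before @ after / (np.linalg.norm(before) * np.linalg.norm(after)))
    return sim < threshold

rng = np.random.default_rng(1)
before_emb, after_emb = rng.normal(size=128), rng.normal(size=128)
# Expected pattern: claimed harvest sites should show a change; protected areas should not.
print("change detected:", changed(before_emb, after_emb))
```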
All right. That's an imaginative, I would say, use case.
I mean, I wouldn't think that people would ask you that, but apparently they are.
Do you have other use cases that you can share or ones
that you may suggest?
Find assets. Find things like solar panels or pools or boats or parking lots with cars
or not. Some of these are real cases, some are typical cases.
And what about the structure, let's say, of the company? What stage are you at
right now? Are you bootstrapping or have you raised some capital?
We did raise a SAFE, as it is called, which is early interest. We are now closing a seed round, which is the typical path for a VC company.
And yeah, we are early on.
We have something like six employees and customers and figuring out.
I think the most important thing is that we have a clear idea what the service is, but we are at the same time healthy in not knowing
what the product is. Because if we want, what we're talking about here is changing the geospatial
industry. It's thinking about it completely differently. We don't want... we are not a geospatial
company. We are an answers company, if you will.
And our biggest risk is becoming a geospatial company, of which there are many,
and of which there are other ways to do things.
Exactly, that was also going to be a question, because obviously you know this space better than myself,
but even with my kind of knowledge, I can think of a couple of big companies
that are into geospatial. So it would be like, alright, so how do you compete with them?
Would you want to compete with them?
No, first of all, we love to partner with as many people as possible.
The beauty of some of the things we do is that they are so different and so abstractible
that you could easily
integrate that into a traditional company that
does things traditionally with computer vision and others,
and then uses embeddings as a way to check the answers
or to have a first filter.
Or for example, I mentioned that the embeddings for now
are sometimes not good, but still could be useful to say,
OK, let's make a first pass with embeddings. Let's see the places that we have high confidence that the thing that you're
looking for is indeed there or high confidence that the thing you're looking for is not there.
So, true positives and true negatives. And then we can be sure about those. And then everything
in between, we use the traditional methods. So, even in those cases, we can find ways to partner.
Okay, so the idea would be you use the Legend service,
let's say as a kind of indicator, then you verify the results by having someone manually,
let's say, check them.
Or with other methods, like traditional ones,
or even a human review, a human to do that,
which is something we want to do. We're always designing
human in the loop AI tools, not just automated AI tools because they make errors. And because they
make errors, we basically focus not into replacing the user or the customer, but to making them
10x or 100x more efficient. Okay. So, is the Clay nonprofit, let's say, still
operational? Yeah, very much. The Clay nonprofit is still there. We just received
another $2 million from AWS to continue doing the work. So it's still working. Clay is the
model, and it will continue to be the model, and we will continue improving it as a non-profit in the open. And then Legend is the service for any model,
including Clay. And we focus on Clay and we are close to Clay, so we can extract the most value
of it. But it doesn't mean that we might not use other models. Okay. All right. So any closing thoughts that you'd like to share or I don't know, I guess since you have
legend going on, the typical thing that founders at this stage say is we're hiring.
Are you hiring?
We are hiring.
We are hiring.
I'm busy to find LinkedIn models so people with those skills totally check it out.
Just thank you for the interest.
It does help me because the biggest challenge I face
or we face in my opinion is not technical,
it's not data as I was explaining.
We understand those things
or we think we understand those things.
It's explaining and figuring out what the user wants. I don't
want to push the technology out. I want to be where the customer or where the user is.
I want to understand the difficulties they have and how I can help them. And to do that,
there is a fine balance between talking very technically, which I know I have done
in this conversation, but also listening to the challenges that people have and then
working to solve them.
Because at the end of the day, it doesn't really matter if it's AI or not AI, are you
solving someone's problem?
Thanks for sticking around.
For more stories like this, check the link in bio and follow Linked Data Orchestration.