Orchestrate all the Things - Building AI for Earth with Clay: The intelligence platform transforming Geospatial data analysis. Featuring Clay Executive Director Bruno Sánchez
Episode Date: June 11, 2025

How a rocket scientist turned entrepreneur created the "ChatGPT for Earth data" using transformers and satellite imagery. Bruno Sánchez is a rocket scientist with a somewhat deviant trajectory. An astrophysicist by training, he used the tools of his trade - mathematics and science - at the broadest possible scale: the universe. At some point, however, his focus switched to using those same tools for more down to earth goals. Sánchez had a stint at the World Bank, where as a member of interdisciplinary teams he helped make sense of geospatial data. Then he realized the core of what he was doing was mapping, which prompted him to launch a company called Mapbox, providing online maps on the web. This experience brought another realization for Sánchez - that we have so much data about Earth that we don't really know how to use it: "We know what are the trees in the world. We know what are the forests in the world. It's just a matter of processing [data] properly", as he put it. So when he got the opportunity to attempt to put all of that together in the same data center and in one workbench, he went for it. That was the Planetary Computer project at Microsoft, and Sánchez loved it. Then, ChatGPT happened. Sánchez noted that the T in ChatGPT - the transformer - was an architecture that seemed to work great for modalities such as text, images, and audio, but no one seemed to be using it for earth data. So he decided to give it a try. He built a team, raised funds, created a non-profit, and built an open source model using open data. And this is how Clay was born. Read the article published on Orchestrate all the Things here: https://linkeddataorchestration.com/2025/06/11/building-ai-for-earth-with-clay-the-intelligence-platform-transforming-geospatial-data-analysis/
Transcript
Welcome to Orchestrate All The Things.
I'm George Anadiotis and we'll be connecting the dots together.
Stories about technology, data, AI and media and how they flow into each other,
shaping our lives.
Bruno Sanchez is a rocket scientist with a somewhat deviant trajectory.
An astrophysicist by training, he used the tools of his trade, mathematics and science,
on the broadest possible scale, the universe. At some point however, his focus switched to using those same tools for more down to
earth goals.
Sanchez had a stint at the World Bank, where as a member of interdisciplinary teams, he
helped make sense of geospatial data.
Then he realized the core of what he was doing was mapping, which prompted him to launch
a company called Mapbox, providing online maps on the web.
This experience brought another realization for Sanchez, that we have so much data about
Earth that we don't really know how to use it.
So when he got the opportunity to attempt to pull all of that together in the same data
center and in one workbench, he went for it.
That was the Planetary Computer Project at Microsoft and Sanchez loved it.
This is what happened next.
And then, ChatGPT happened.
And then, we realized that the T in ChatGPT,
the transformer, was an architecture of AI
that seemed exceedingly amazing in understanding text,
understanding images, understanding audio, transcriptions,
all of those things, but no one was doing Earth.
That question of how many trees are in the world,
how many forests, how many fires,
that modality, people were not working on that.
So we decided to give it a try.
And then the transformer was open source,
so let's put it together.
We raised $4 million to do that from philanthropists.
And we made it, and it's great.
And it's amazing, it works.
It's incredible.
It's orders of magnitude faster, cheaper,
and better than anything else we've ever seen, which
is kind of exactly the same thing that happened with text and images and audio. It proves
again that this T of ChatGPT, the transformer, is an amazing human invention. But the point
that we then realized is that, huh, this is amazing, but you need to be an expert in geospatial
and you need to be an expert
in AI to use that file, to use that AI.
If you truly want to unlock the power of knowing how many trees are being cut or forest or
erosion or construction, all of those things, you cannot only do the AI.
You have to do a service, a product, that makes it exceedingly easy to use.
That is the question.
I don't know.
We're building it now.
We have a demo, and people can just, if they Google Clay,
they will find it.
explore.madewithclay.org.
We can see a demo of that.
It's a map.
It's a map where you click places,
and it allows you to find things.
But then we ask ourselves, is it a map because it deserves to be a map or because I am used
to maps because I'm in this industry and I want a map?
We are thinking maybe the way to maximize the utility of Clay is not to be a map.
Maybe it's also a chat interface.
Maybe it's just a column or a spreadsheet
when people are working on things.
We don't know.
We don't know what that is.
I hope you will enjoy this.
If you like my work on Orchestrate all the Things,
you can subscribe to my podcast,
available on all major platforms.
My self-published newsletter is
also syndicated on Substack, HackerNoon,
Medium and DZone, or follow Orchestrate all the Things on your social media of choice.
If someone asked me what role I feel I have, or what objective, I would say I'm a scientist.
I think curiosity has driven a lot of my professional life. And my background is very technical.
I did a PhD in astrophysics in Germany, at the Max Planck.
And then I went to the US to use satellites,
sorry, to use telescopes on rockets.
So that line of being a rocket scientist:
every single boss I ever had after that,
at some point they said that it doesn't take a rocket scientist, but I was one of those. Even though people probably mean rocket
engineering, not rocket science. I did rocket science, I did astrophysics, and I absolutely
loved being able to understand the complexity of the universe. There's something uncanny about the fact
that the processes in the Earth and in the universe are so beautiful,
and that we get brains and intelligence
enough to understand them.
Or we think we understand it.
To some degree, we understand it.
And to me, it was always a source of deep, not only gratitude, but also deep pleasure,
intellectual pleasure.
And at the same time, I started to feel that all of those tools that I had gained, all
of those mathematical, logical abstractions, like hypotheses, all of those tools I had,
I was applying them in the universe. So why not in issues on Earth, issues on society,
issues on social development or economic development?
So I switched.
And I had this amazing position during this postdoc.
And I wanted to do climate change.
And this could be a very long answer.
But I'll try to keep it short so we can double down.
Basically it was really hard because I was framed as an astrophysicist.
And so why would someone want me to do things on climate change, or to do things on health?
I interviewed at the World Bank and other places.
And it was always this answer: this is a very impressive CV, but we don't have a use for you. We are much more likely to hire an MBA
who has fewer years of training for doing technical things
than an astrophysicist who has more years of training
on things that are more complex.
We can talk about why and the biases and perceptions,
but long story short, I managed to break the barrier.
And I managed to demonstrate that my skills
on mathematics were very horizontal and very applicable. I might not know and I did not know
about climate change, I did not know about economy, I did not know politics, but working together
with those profiles, we were an amazing combo. So I worked together with experts in those fields and
we developed a model on climate change. Then I did go to the World Bank,
and I worked in the office of the president
of the World Bank, Jim Kim.
And my role was always: hey, there are these issues,
there are these technical drivers or data drivers
or satellite images or whatever thing it is.
Come, Bruno, can you help us understand
what's the there there?
What is the thing?
Working with other people, experts in economy or
experts in other things. And we did that. And it was amazing. And then to follow the arc,
I left the World Bank to go to a developing country to try to understand those issues, not from
D.C., Washington, D.C., and not by business class, but on the ground. So it was not long, but I did live for two months in Bhutan,
helping in the remotest places of Bhutan.
That was an amazing experience.
And after that, I went back to work on satellites.
I realized a lot of the things I was doing was mapping.
And I helped launch a company called Mapbox, which
is online maps on the web.
And I loved it.
And it made me realize that we have so much data about Earth,
we don't really know how to use it.
We know what are the trees in the world.
We know what are the forests in the world.
It's just a matter of processing it properly.
So I got the opportunity to build a planetary computer
at Microsoft, which was the attempt
to put all of those things together
into the same data center, into the same workbench.
And it was amazing, it was a great opportunity.
And then, ChatGPT happened.
And then, we realized that the T in ChatGPT,
the transformer, was an architecture of AI that seemed exceedingly amazing in understanding
text, understanding images, understanding audio, transcriptions, all of those things,
but no one was doing Earth.
That question of how many trees are in the world, how many forests, how many fires, that
modality, people were not working on that. So we decided to give
it a try. And because we thought it deserved to be a nonprofit, we created a nonprofit
to make the model completely open source, only using open data so that we can make the
model open license. So there's no questions about what data you use, what are the IPs
or the licenses of the train model
if you only use open data, which we have tons of petabytes on.
I knew that because I had built a planetary computer, right?
I had to put all that data, those more than 50 petabytes.
And then the transformer was open source,
so let's put it all together.
We raised $4 million to do that from philanthropists.
And we made it.
And it's great. And it's amazing.
It works. It's incredible. It's orders of magnitude faster, cheaper, and better than anything else
we've ever seen, which is kind of exactly the same thing that happened with text and images and audio.
It proves again that this T of ChatGPT, the transformer, is an amazing human invention.
And then, just to make the whole story short and go to the end, we realized that
it's not the only one.
There are others.
I'm happy to talk about other models similar to Clay, like Prithvi, the
Terraformer, and others.
But the point that we then realized is that, huh, this is amazing, but you need to be an
expert in geospatial and you need to be an expert in AI
to use that file, to use that AI.
There are maybe, I don't know, 80,000 people in the world
who fulfill those criteria.
If you truly want to unlock the power of knowing
how many trees are being cut or forest or erosion
or construction, all of those things,
you cannot only do the AI.
You have to do a service, a product,
that makes it exceedingly easy to use.
The ChatGPT of GPT, like anyone can work with the chatbot.
So what is the equivalent for that?
And because that is a product, not a model,
it made less sense to us to raise philanthropy money.
Because to me, it makes sense to raise philanthropy money
for something that is get money, do the output.
And if there is no money, well, there's no more output.
But the output is there.
The model is there.
It's there if we disappear.
That file is there, and anyone who knows how to use it
will get amazing results.
We lift the tide for everyone's boat.
We make it easy for everyone who knows how to do it.
But if you provide the service and you are a nonprofit, you have to constantly raise
funds to deliver those services because they cost money.
It's the OPEX, right?
So, we didn't want to do that constant battle
that a lot of nonprofits have to do
that ends up being a little bit like a consulting company
or a project space company.
It was really nice when we had this money
to deliver this model, we made it, we did it, go on.
Let's move on.
So we decided to make it a for-profit.
That's Legend.
So Clay is the nonprofit, Clay is the model,
Legend is the company whose service
is to use Clay or others to make those things easy.
If you know what to do, go ahead, use it.
If you want the easy way to do it,
or the thing at scale, that's Legend.
And that's kind of the reason I'm here
and the north star has always been the same,
which is how to use deep technology
for impact in society using different tools like non-profits, for-profits, startups,
Microsoft, World Bank. I feel I've tried to follow that compass all the time.
Okay, indeed. I have to say it was much longer than the typical answer people give. Yes, sorry.
That's okay. I'm not saying that as a bad thing necessarily, just a fact.
So you already included in your opening statement a lot of things that we were going to touch on a little bit later anyway.
You already mentioned Clay, and that was obviously going to be my second question.
And that's the main reason it's what piqued my attention, basically. I just saw recently that it was released. Perhaps
it's actually been around for longer. I'm not sure, you can tell us.
Roughly.
Okay. And so what is Clay? You can give the long answer. I can give the short answer
based on my understanding after you give yours.
So the monologue before was kind of the context of why we do this thing. So now
we need to go into the details of what it actually is, what it does, right? So at the end of the day, Clay is an architecture, it's a processor, it takes
satellite images or plane images or drone images, it takes any kind of image and then
understands what is there. If you have a plane, if you have crops,
if you have water, you have boats,
you have ways to count them, to calculate with them.
I'm being a bit abstract on purpose
because the power of this tool
is not that it's able to count cars.
It's not that it's able to predict crop yields.
It's that it's able to do all of those things.
That is the power of this transformer thing, the power of the invention we did, or OpenAI
did with research from others: this architecture allows you to do a lot of the compute,
a lot of the thinking, ahead of time and for everyone.
So when you work with clay, you take those
images and if you want to count cars, it gives you the answer. But if you want to predict
crop yields of the same place that also has cars, you can do that equally fast because
the model has understood how to see things, how to see semantics. So that's what Clay is. It's like ChatGPT creates words. We create
semantics, or understand images. Actually, that was going to be, you know,
the three-word explainer that I would give for people. If somebody asked me,
it would be something like that: the ChatGPT equivalent for image data,
with a little bit of a footnote there
because I'm not exactly sure
if it only works for image data.
I believe I did see somewhere in your documentation
mentioning things like sensor data and other types.
So it's a little bit, it's a little bit different.
So it's that, but with a little twist
that makes it, in my opinion, way more interesting. Because you can go to ChatGPT and give it images. You can ask about
deforestation and you can ask questions over there to ChatGPT. But what those things do
is the equivalent of reading the Lonely Planet, the travel book of Colombia, and saying you've
been to Colombia. When you ask the capital of Spain to
ChatGPT and it says Madrid,
it is basically because it has read that that is the thing.
But what Clay does is it takes the images,
the rasters, in a technical way, of any satellite,
visual, infrared, synthetic aperture radar,
all of those things, and it understands them.
When you ask where the fires are,
it's not what it has read about it,
it's that it has seen that fire in these images.
And it can point to where that fire is.
If you are looking for solar panels,
it's not that it has read where they were.
It's not that it has images of solar panels.
It has visited virtually those locations
and it knows what's there.
So it's kind of the librarian of Earth
where you can index and you can ask questions
and it's not just going by the words
of someone talking about it;
it's that it has visited those places.
It's the Marco Polo, if you want.
Yeah.
Of course, there are some similarities, some
parallels, let's say, but there are also very important differences, obviously, this being
the most important. The training, the data set that you used to train, as you mentioned already
in the beginning, you have basically used a gazillion of open data images
to train that data, to train that model.
And as I have been able to figure out just by reading
what is available out there, it seems to me like the key
to make that work is actually embeddings,
because you used embeddings to transcribe, let's say, the meaning,
quote, meaning of every image into text. So what I'm wondering, well, first of all,
again, for the benefit of people who may be listening, you and I know what an embedding is,
but that's not necessarily the case. So if you can make like a one minute introduction to embeddings, and then I'm very, very curious.
What exactly did you use?
What function, let's say, did you use to do those embeddings?
So when we train Clay,
what we do, for those who understand the words,
is a masked autoencoder. What
it means is that you take the image, which
is a certain size with red, green, blue, all of those colors, and you force the system to compress
it into just a bunch of numbers.
You just give it a very narrow space
to store the information, a summary of the image.
It's like if you ask an expert to just tweet
what is the content of an image.
You only have one tweet.
But then there is the other half of the model.
So the first half is doing that from image to embedding, which is that thing.
And then the other half of the model only has access to that tweet, to that vector,
and its job is to reconstruct the image.
And then the function that makes all of this work is that I am going to evaluate the work of all
of those parameters by the difference between the input
that generates the embedding and the output made
with the embedding.
So that in itself already tells you
that the embedding is by necessity a compression.
It's a lossy compression because it's not perfect,
but it's a compression of the contents of the image.
And then we go one step more, which is that we remove,
when we do that embedding, when we give the image,
the input image to the model, we remove part of it.
So we remove half the face or like randomly crop images.
And we made the
vector summary of the image with those holes. But then we ask it to reconstruct the entire
image. So it needs to understand that if I see half a face, there's probably the other
half there. If I see half a glass, there's probably another glass on the other side. So we are forcing the model to understand by compression, but also by context.
By the fact that, yes, I don't see these things, but they might be there.
That's the masked autoencoder, which allows us to not need human labels or to not need to
supervise the models, because we can generate a lot of training data by
taking all of those images, cropping holes in places, and asking it to do that. That was a little bit more than a minute.
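To make the masked autoencoder he describes concrete, here is a minimal, hypothetical PyTorch sketch of the idea: hide part of an image, compress what remains into a small embedding, reconstruct the full image from that embedding alone, and use the reconstruction error as the training signal. It is illustrative only and is not Clay's actual architecture or code; all sizes and data are made up.

```python
# Minimal masked-autoencoder sketch in PyTorch (illustrative only, not Clay's real code).
# An encoder squeezes a partially masked image into a small embedding ("one tweet"),
# a decoder tries to rebuild the full image from that embedding alone, and the
# reconstruction error against the original image is the training loss.
import torch
import torch.nn as nn

IMG = 64          # toy image size (pixels per side)
CHANNELS = 4      # e.g. red, green, blue plus one infrared band
EMBED_DIM = 128   # the "tweet": a small vector summarising the image

encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(CHANNELS * IMG * IMG, 512), nn.ReLU(),
    nn.Linear(512, EMBED_DIM),             # image -> embedding
)
decoder = nn.Sequential(
    nn.Linear(EMBED_DIM, 512), nn.ReLU(),
    nn.Linear(512, CHANNELS * IMG * IMG),  # embedding -> reconstructed image
)
optimizer = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

def training_step(images: torch.Tensor, mask_ratio: float = 0.5) -> float:
    """images: (batch, CHANNELS, IMG, IMG). Randomly hide pixels, reconstruct everything."""
    mask = (torch.rand(images.shape[0], 1, IMG, IMG) > mask_ratio).float()
    masked = images * mask                       # the model only sees the unmasked parts
    embedding = encoder(masked)                  # lossy compression of what it saw
    rebuilt = decoder(embedding).view_as(images)
    loss = ((rebuilt - images) ** 2).mean()      # compared against the *full* image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Random tensors stand in for satellite chips here:
batch = torch.rand(8, CHANNELS, IMG, IMG)
print(training_step(batch))
```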
Right, so how many rounds of training have you done so far? Or, to ask it in
a different way, how many different versions of the model have you released so far?
Three. So we had the core idea early on, before we got funding, and it's basically the model
exists for text, as I was saying, and even exists for images. So to go from images to
Earth basically is a few tweaks of, hey, an Earth image has a location, has a time,
has an instrument, has a few things that are special. So, we did that thing. We did a small
scale kind of to prove if it works. Then we got the funding. We did the version one knowing that
this would work, knowing we had access to these compute GPUs to train. And then the latest version, what we call 1.5,
which is a bigger model, like more parameters,
more numbers and way more data,
like billions of little patches of images
and a lot of training doing the patching
all of this masking thing.
That was, to our knowledge,
the largest training in the open (because if someone does it not in the open, we don't know),
the largest training in the open for
an Earth model.
Okay. And have you potentially seen already how people use it out there?
Well, first of all, before we get to that point, there's probably a better
question to ask: how can people use it? You already mentioned in the beginning
that well, yes, I mean, in theory, if you have the model out there, you could use it,
but it takes certain expertise to do that. So I think you have created different ways
for people to use it as well.
Clay is a technical resource. You need to know how to run a Python notebook,
at least. You need to know some geospatial AI. So we do have tutorials to guide you to do
that. We have done some of those applications. We have demonstrated that Clay is able to do
land cover, which is when you get a map and you need to know what's water, what's forest,
what's grass, all of those classes. We are able to create those maps
extremely quickly with a very high precision.
The regression was 90.92, if I remember correctly.
Also to calculate the amount of biomass in a forest,
the amount of biomass above ground is kind of
how much carbon is there on those trees.
So typically all of these operations
take a lot of computer vision, pixel math,
all of these things; to do it with Clay is just an embedding.
So it's much faster.
We know that can be done.
We also know we can detect ships,
we can detect aquaculture,
these are real cases of people doing it.
But the thing is that because the model is open
without limitation, we don't know what people are doing.
We only know what they tell us.
So we know people who are using the Clay model
for many use cases.
So we know it's useful, but the point is that
it's a technical resource.
So if we really want to deliver the promise
of understanding Earth,
then we need to make it extremely easy to use.
And that's when we decided to wrap it with a service.
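As an aside, the land cover workflow he describes might look roughly like the sketch below: a lightweight classifier fitted on per-chip embeddings instead of heavy per-pixel computer vision. The data here is synthetic and merely stands in for real Clay embeddings and labels; this is not Clay's tutorial code.

```python
# Hypothetical sketch of a downstream task on precomputed embeddings: fit a small
# land-cover classifier on per-chip embedding vectors. Random vectors stand in for
# real embeddings, so the score here is around chance; with real embeddings the
# classes would be separable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 128))       # stand-in for embeddings of 2000 image chips
y = rng.integers(0, 4, size=2000)      # e.g. 0=water, 1=forest, 2=grass, 3=built-up

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # lightweight head, no GPU needed
print("held-out accuracy:", clf.score(X_test, y_test))
```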
Okay, so how do people use the services?
Is there like a text interface?
Is it like an application?
Is it like a web page?
How does it work?
That is the question.
I don't know, we're building it now.
We have a demo and people can just,
if they Google Clay, they will find it.
Explore.madewithclay.org.
We can see a demo of that.
It's a map.
It's a map where you click places
and it allows you to find things.
But then we ask ourselves,
is it a map because it deserves to be a map
or because I am used to maps
because I'm in this industry and I want a map.
We are thinking maybe the way to maximize the utility of Clay
is not to be a map.
Maybe it's also a chat interface.
Maybe it's just a column or a spreadsheet
when people are working on things.
We don't know.
We don't know what that is.
Okay, okay. Fair enough.
Yeah, to be honest with you, just because possibly I was guided that way, let's say, because of the analogy we made with ChatGPT,
somehow in my mind, I felt, well, okay, you're using embeddings, so there must be a way to express what you want to get out of the model in text.
But apparently, that's not necessarily the case.
Yeah. So when we do those embeddings, when we do that compression, at no step are we
asking it to explain it in text, English, Spanish, or whatever language it is. We are
asking the model to do whatever it wants to find those vectors and those relationships. But it's also true that we have a whole other tool
to understand things, which is text,
which is words and relations between words.
We can say forest, but you can also say eucalyptus,
and the concepts are similar because there's some relations.
So it is obvious that if we are able
to combine the power of text models
with the power of earth models,
it would unlock much more.
And that's what we are working on now
to figure out how we can align the embeddings of text
with the embeddings of earth.
And we have some proof of concepts.
It's not easy at all, but yeah, it's obvious
that it's the future.
Okay. Okay. So you're basically looking into getting into the next level, let's say, which
is having a multimodal model.
Yes. Yes. But here's the question that we ask ourselves. When you train a text model, I would argue it takes a lot of effort.
Because text could be of anything.
Not only any language, but you have text of reality and also text of fiction or text of
conspiracy or text of fake stuff like imaginary things.
But when you train on Earth, you only have one Earth.
It's large.
It's marvelously complex.
But there is no fake data.
There is no fiction on Earth data.
So just by that logic, it seems that training
an Earth model is far easier than training a text model
because the possibilities are smaller.
So to me, that means the multimodal model that
includes Earth is not going to pay much attention
to the Earth's modality, because the text is the harder one.
So I don't think that a multimodal that includes Earth
is going to be as good as an Earth-only model that then learns to understand the text model.
I would argue that this depends on the type of dataset
that you choose to train on text.
I mean, sure.
If you take the entire web and things from social media
and whatever, then yes, you're going to get lots of junk. But there are very
well curated and very selective data sets out there that you can possibly use to train
more meaningfully.
We are doing that. We are taking OpenStreetMap, which is like Google Maps, for people who don't
know. It's just a map that you can then query and use. So we are taking the text of OpenStreetMap of the locations of things
as the text training for mapping.
Okay. And that's very interesting, actually. I would not imagine that you would do that,
but when you say it, it makes kind of sense. And the way that you
combine against those things is based
on coordinates or just visual pattern recognition? So, right now it is kind of naive. So, we take the
image and we do the clay thing and we end up with embedding. And then we take the same image and we
ask OpenStreetMap, what's there? What do you have that's there? And OpenStreetMap says, oh, there is
a desert, there's a river, there's a port, there's a
parking, there's a gas station, whatever, all those texts.
And then we use a model of text to make an embedding of that.
So then we take the embedding of the text of that place and the Clay embedding of the
same place, and we figure out how we can align them so that they minimize the losses of the differences
when they try to recreate each other,
or when you find similars:
similar things in the Clay embeddings
should be similar things
in the text embeddings.
Because they are encoding the same thing
just in different ways.
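A rough sketch of what that alignment could look like in practice, assuming you already have image embeddings and text embeddings for the same locations: a learned projection into a shared space, trained with a generic contrastive loss that pulls matching pairs together. The dimensions, temperature, and random data below are invented, and this is not Clay's actual method.

```python
# Generic contrastive alignment of Earth-image embeddings and text embeddings of what
# OpenStreetMap says is at the same place. Matching (image, text) pairs sit on the
# diagonal of the similarity matrix and are pulled together; everything else is
# pushed apart. All sizes and inputs are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

IMG_DIM, TXT_DIM, SHARED = 768, 384, 256
img_proj = nn.Linear(IMG_DIM, SHARED)   # projection from the image-embedding space
txt_proj = nn.Linear(TXT_DIM, SHARED)   # projection from the text-embedding space
optimizer = torch.optim.AdamW(list(img_proj.parameters()) + list(txt_proj.parameters()), lr=1e-4)

def alignment_step(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> float:
    """img_emb: (batch, IMG_DIM), txt_emb: (batch, TXT_DIM); row i of each describes the same place."""
    a = F.normalize(img_proj(img_emb), dim=-1)
    b = F.normalize(txt_proj(txt_emb), dim=-1)
    logits = a @ b.T / 0.07                       # similarity of every image to every text
    targets = torch.arange(len(a))                # the matching pair is on the diagonal
    loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(alignment_step(torch.randn(16, IMG_DIM), torch.randn(16, TXT_DIM)))
```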
Okay.
But still, in the text you get out of OpenStreetMaps,
I'm guessing that you basically only get things like names of locations and landmarks.
Yeah, you have to filter those out. There are things you will never expect a satellite
image to know, like the name of a hotel, unless it is on the rooftop.
Actually, I would imagine the opposite, that maybe you need those things if you want to
be able to use them, like, I don't know, find how many trees are around this hotel, for
example.
Oh, yes.
That's frontier.
And that's, this is the thing that I find so fascinating, which is that we are at the
frontier of human ingenuity when it comes to understanding
Earth. And the most amazing part is that we are not limited by data, because we have the
data and it's open. And every five days we get a new image of any location on Earth with
Sentinel. And it's not technology either, because transformers are proven; it is just how we tweak this to find the answer.
So it is indeed an exceedingly exciting time when it comes to understanding Earth.
Okay, well, since you mentioned transformers, there's something else specifically about them
that I've been meaning to ask you, which has to do with the limitations of transformers.
It's true that they have certainly powered much of the innovation
having to do with both text and multimodal models.
However, one limitation that I'm pretty sure anyone who's even remotely
acquainted to these models by now is familiar with
is the so-called
hallucination problem. I mean, the fact that these models tend to be creative in ways that sometimes
end up producing things that are not truthful, basically. So, are you using that architecture
as well? Do you have that problem as well? Much less.
Yes?
Okay.
We don't. I say we don't, but I get pushback from people. And the reason I say we do not
is because you get the hallucinations when you use models in text to generate the next
word. That's how these things work. You give it a sequence and then you ask what's the
next word. So it has trained on the data to predict what's the next word.
But we don't use clay to predict the next location of Earth.
We only use clay to understand what's there.
So we never ask, give me just from nothing,
give me an image of that.
The only thing that is kind of close is the masking thing.
Remember when I said in the training, we mask out places and then we ask it to reconstruct. There is some
possibility that if something is very rare, the transformer reconstruction would pay less
attention because it doesn't know what it is. So, that's maybe the degree to which hallucination
might play a role when encoding things that are really rare. But because we don't generate the next word,
the next location, we don't take an image from last year
and ask for an image for next year of a location.
We don't have that problem with transformers.
We have other problems.
The transformers are not perfect.
It's just that the world decided to go with transformers.
We could have chosen something else,
like diffusion techniques or whatever it is.
But we chose transformers, and we're just going with it.
Why?
Because the amount of research, funding, and expertise
that the world has given to that particular technology
makes it the obvious choice to see what else it can do.
If I had to choose from scratch and I had the funding
to drive the world movement of AI,
maybe we wouldn't have chosen Transformers.
But that's not a decision in our hands.
Interesting.
Yeah.
So yeah, I mean, the more we talk about it,
the more the picture gets clearer.
And also at the same time, the more questions I get, I have to tell you,
which is a sign of a good conversation, if I may add.
So you are using transformers,
but only for the embedding part.
You're not using it to generate, to predict anything,
just to match with as good precision as you can? Is that it?
We evolved this. So, in the beginning, the whole concept of a foundational model was to
do that thing, the encoder, embedding, decoder with the images, blah, blah, blah, the
thing we did: train the model, and once it's trained or pre-trained, then you take the
decoder, the second half,
and you replace it with a task.
And the task might be giving an image,
count the number of cars.
So you have the answer of three cars,
and then you put only the decoder,
and then you train only the decoder to count cars,
knowing that the encoder part,
the thing that goes from image to embedding is good,
because it has been proven good
in the foundational part.
So then that world at the time was a world
of foundational models and decoders.
So when you want to do a task, a specific task, whatever that task is, you would take the foundational model
whatever that task is, you would take the foundational model
and fine tune it to make a derivative model
that is good at doing that.
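The fine-tuning pattern he just described, before the shift to embeddings-only workflows that he explains next, looks roughly like the sketch below: freeze the pretrained encoder and train only a small task-specific head. The encoder, shapes, and car-counting task here are stand-ins, not Clay's real pipeline.

```python
# Sketch of the "pretrained encoder + task-specific decoder" pattern: keep the
# encoder learned during pretraining fixed and train only a small head for one task
# (a toy car-count regressor here). The encoder is a stand-in for a real pretrained model.
import torch
import torch.nn as nn

EMBED_DIM = 128
encoder = nn.Linear(4 * 64 * 64, EMBED_DIM)        # stand-in for a pretrained image encoder
for p in encoder.parameters():
    p.requires_grad = False                        # keep the pretrained weights frozen

car_head = nn.Sequential(nn.Linear(EMBED_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.AdamW(car_head.parameters(), lr=1e-3)  # only the head is trained

def train_head(flat_images: torch.Tensor, car_counts: torch.Tensor) -> float:
    with torch.no_grad():
        emb = encoder(flat_images)                 # embeddings come "for free" from pretraining
    pred = car_head(emb).squeeze(-1)
    loss = nn.functional.mse_loss(pred, car_counts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_head(torch.rand(8, 4 * 64 * 64), torch.randint(0, 20, (8,)).float()))
```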
But then we realized that to the degree
that we can use the embeddings to do anything,
we could then create embeddings
that are universally applicable.
So we generate embeddings and then with the embedding,
not with the model, we try to make a decoder only using that embedding,
not using the encoder.
And that is a world that, I think,
has a lot of potential, because it's not anymore
a model that you need to fine-tune,
but it's a world where you only need to use the embeddings,
the vector operations.
And to the degree that that works means that you get answers
in milliseconds, not in weeks.
Does that make sense?
OK, so you're saying that basically you can just
take the vectors that you have produced
and just put them in a vector database
and don't care about the model at all.
That's exactly what we're doing.
Because if we do that, imagine we
have a user that wants to find the solar panels in Greece.
And we have made the embeddings for the whole of Greece.
Or we make them for this person.
Then there you go.
Then we make the embeddings.
And then it's literally milliseconds
to know where they are, to have not a perfect answer,
but to have a good answer of where the solar panels are.
But then someone else comes and say,
hey, I wanna find the boats,
or I wanna find construction,
or I want something else.
Exactly the same embeddings are used for that new operation.
So that means that you only need to create them once.
That is the power of embeddings:
they are a universal pre-compute of most of the way for
most of the answers.
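A minimal sketch of that embeddings-only workflow, with random vectors standing in for precomputed Clay embeddings: embed the region once, then answer each new question with a fast similarity search over the same vectors. Data, sizes, and thresholds are invented.

```python
# Embeddings-only workflow: embeddings for an area are computed once, and each new
# question ("solar panels?", "boats?") becomes a nearest-neighbour search against a
# handful of example chips. Random data stands in for real precomputed embeddings.
import numpy as np

rng = np.random.default_rng(0)
area_embeddings = rng.normal(size=(100_000, 128))      # precomputed chips for a whole region
area_embeddings /= np.linalg.norm(area_embeddings, axis=1, keepdims=True)

def find_similar(query_embeddings: np.ndarray, top_k: int = 20) -> np.ndarray:
    """Return the indices of the chips most similar to the averaged query examples."""
    q = query_embeddings.mean(axis=0)
    q /= np.linalg.norm(q)
    scores = area_embeddings @ q                        # cosine similarity, all chips at once
    return np.argsort(scores)[::-1][:top_k]

# A few chips the user marked as "this is a solar farm"; a different query
# (boats, construction) reuses exactly the same precomputed embeddings.
solar_examples = area_embeddings[[10, 42, 77]]
print(find_similar(solar_examples))
```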
Okay.
And, but of course, all right, a number of questions there.
First, do you have any kind of, I don't know, benchmarks or empirical evidence on the kind
of precision that people get on different types of queries?
We have benchmarks for very few cases and there's a couple of reasons. One is the lazy
reason which is the reality, which is that we had limited funding and we decided to prove
it deeply with a couple of things like land cover, biomass regression, or finding assets.
And that proved good enough for us.
There is a lot of other assets.
There is a wealth of academic benchmarks to do that.
And Clay is not part of those, simply because of the time
it takes to put it there.
And as I speak, there's people who
are putting Clay in those benchmarks to add them.
But the real answer to me is that I don't really care that much about the benchmark
because to me the value is not only the score in the benchmark.
The value is that where before you could only get any answer if you had access to a lot
of compute, a lot
of expertise. Now you might get an answer that might not be perfect, extremely fast,
extremely cheap, and good enough. Even though the scores might not be perfect because the
embeddings will never be perfect, there is a value which is beyond the scores.
Maybe the value is open source.
Maybe the value is offline.
You can take embeddings and work with them.
The value is that you can dig around and tweak them.
And I feel the chase for the last decimal point on the scores drives you to spend your limited bandwidth on things that might not be the ones that,
at the end of the day, have the biggest impact.
Sure. I think I share this sentiment. I was mostly curious, not about the last decimal
point, but just as a kind of rough measure. Is it like, I don't know, 80%, 50%, 99%?
It depends. So, what we find...
And as I said, people are putting the Clay model
in these benchmarks, and I'd love to see where it goes.
I suspect it's not gonna be the top,
but it's not gonna be in the middle.
I suspect it's not gonna be good enough.
But besides that, we also have seen
that the embeddings perform best when the
thing they're looking for is the dominant thing on the image. So you divide the world
in little squares and then you make embeddings for those little squares. So it works best
when the thing you're looking for is dominant. Like right now, in this image of this video conference, my
face is barely maybe one third, one fourth of the image,
so it would be good.
But if I sit back, my face might be small for the embeddings.
I might not get it.
But if I get too close and you only see half the face,
you might even lose the semantics of the face.
So it also depends on the size of the thing.
That's why we are making embeddings of different sizes.
We don't know.
The answer is that we are betting that there is a there there. Maybe there is not. Maybe
we are wrong and maybe embeddings are not it. But if it is, or to the extent that it is, I strongly
believe that it would unlock so much value to so many issues, social, economic, environmental,
and also investment-wise.
There's so many things that make sense for this that I'm going all in.
And I'd rather be wrong, but have even tried it,
than wait to see a technology that gives me 100% assurance that it will work.
So that makes sense.
And we're also going to talk about the investment part and the opportunity part.
But before we get to that, I want to stay a little bit longer with that because I think there's lots of juice there.
So another question that occurred to me as we're talking about it is, All right, fine. Regardless of the actual accuracy that you can get today,
if you want to keep using the model, the embeddings rather, in six months or a year from now,
does that not mean that you should also update embeddings with new data? Because what you have
trained, what you have used to generate the embeddings is not necessarily valid after a
certain amount of time. So you must always regenerate. You need to generate embeddings because you need
the new images. Things happen in the world. There are fires or constructions of things.
So those new images you need to create those embeddings. But I don't think you need to train
the model again. And the reason for that is, even if the things change, the envelope of possibilities
doesn't really. Of course, Spain is getting desertification with climate change.
But even though for Madrid, what you might see has never happened in history, it has happened,
and it is happening in other places of the world.
So if the world, if the model gets to see a desert in the middle of Madrid, it will be unique for
Madrid, but it will not be unique for the world. So to the degree that in Greek or in physics,
there is this concept called ergodicity. So I believe the Earth is ergodic,
which means that all the possible states are present at any time. So any evolution of one time
over time has been present somewhere else. Of course, if you speak with conservation people,
if you speak with climate change, they would say that they don't believe that is the case. Maybe the answer is halfway. But that's also one reason why I think the Earth AI is much faster to converge
and to learn than text because Earth is ergodic. So we can learn very quickly. And anyone has this
intuitive sense. If suddenly, you know, Helsinki becomes
a forest, you have never seen that, but you immediately know it's a forest. Because you
have the concept of a forest from somewhere else.
Yeah, yeah, you're probably right on this one. So I wasn't really thinking about retraining
the model, but just regenerating the embeddings. Yes, you have to. And this takes time.
This takes effort.
If you are a Clay user and you just want it for Athens,
then it's very easy.
You just do it on your laptop.
And if you are a company and you want the whole of Europe
or the whole world, then it's better.
That is one of the business reasons
for the company we created: we take care of that.
We make all the embeddings.
We optimize our factory of embeddings to make them extremely fast and extremely cheap so
that we can have any embedding you can think of.
Okay.
All right.
So I guess I was also going to ask about fine tuning because again, reading the available
documentation, I kind of figured that
this is what you're nudging, let's say, users towards, but I'm not so sure anymore. So have you
given up on on fine-tuning and are you just now directing people to use straight up embeddings?
Not fully given up but to the degree we can get away without fine-tuning.
Yes, because fine-tuning is tweaking the entire parameters of the encoder and
finding new parameters for the decoder. It's lengthy, takes time, it's expensive,
and it only works for one output. And if you can go with embeddings... By the way,
some people argue that using embeddings
and making models of things with embeddings is like inference-time fine-tuning, which
is also what Gemini and other models with very large context windows do.
That's what they do.
When you ask a question, if you put it in ChatGPT, sorry, if you put in any of these models
a question, what the model receives
is not just the question; it is prepended by a lot of context
that is given by the system.
So in some way, you could argue that you're doing
fine tuning on the spot of on inference time.
Okay, well, one of the reasons I wanted
to ask you specifically about fine tuning
and also about the original, let's say, training
that you did for Clay was because to my mind,
at least, there's a little bit of a paradox there.
And it goes into the broader conversation about AI and the cost of producing
AI models and running AI models and running data centers and all of that. So for me, the
paradox is basically like, okay, so you are using AI, which is a kind of environmental,
I don't know, not necessarily negative,
but something that we need to clarify exactly what kind
of impact does it have?
How many resources does it consume?
And all of those things.
So you are using that as a vehicle
to improve, presumably, the whole environmental situation,
or at least the understanding of the environmental situation.
So I'm wondering if, obviously, I
guess that you must have thought about that yourself.
So what is the answer that you give to that?
So first of all, are you able to pinpoint
how many resources you have used to train and run
Clay versus the
kind of impact, positive hopefully, that you can have with that?
Yes, so yes, we can measure the emissions, because we run this thing ourselves to some degree.
So the degree that our provider, which is AWS, is able to give us the scope one, two, and three
of that compute.
Then we also need to add the scopes of this laptop running
or the electricity that I'm using, all of these things.
We can measure those, but those are not as large
as the compute itself, which we can offset,
and we do offset for AWS offsets in their operations.
But to take a step back, yes,
AI is exceedingly resource intensive
in electricity, in water, and in other cases.
When I was at Microsoft, that was the unit I was working on,
and so I am familiar with those things.
And it's kind of ironic that in our case,
we then apply those AI's to hopefully avoid more emissions.
So that is something we obviously think about.
That is one of the reasons it's a nonprofit.
The idea is that we are extremely open,
not only on the outputs, but on the ways we do that
and how we approach it.
And the reason is that hopefully number one,
we can convince people that are thinking similarly
to not have their separate thing but to just join forces.
So instead of training 10 models,
we train one model that is open enough and good enough
so then people can use it with fine-tuning
or without fine-tuning.
But then it means that we don't need to create a ton of them.
Two, if you do your own thing,
you hopefully can learn from our experiences the
thing that works and the thing that doesn't. So you reduce the amount of training or wasted
training or experiments that you do with that. So that's kind of how we are approaching the issue.
It is also a fact that we need to design the transformers themselves, the architecture,
to be as efficient as possible. That's also why we started with smaller models and then bigger
models. Because one of the other limitations you mentioned of transformers is that they
take so much data to learn something. Like, they learn much slower than a human
brain. And not only that, when they learn something,
you have to be careful.
Because if you give them something else,
there is a very high risk that they
forget everything else, or it will collapse.
So when you fine tune for a task, one of the reasons
that then you cannot use that model anymore for anything
else is because a lot of the training
to understand any other thing kind of gets away, gets lost.
So that's another reason why embeddings
are, in my opinion, more environmentally friendly.
And then lastly, if you use embeddings,
as I was saying, it's most of the way for most of the answers.
So if you compare getting the answer traditionally
with computer vision or with other ways,
there is a lot of compute resources that are used to get to the same answer. But if you use
embeddings, a lot of that compute has already been embedded in the embedding, has already been
inserted there. So then all of your answers are much more resource and environmentally
friendly, because the consumption, the emissions, have been done when you created them, and that is useful for everyone.
And yeah, I think it's notoriously complex and hard to come up with a kind of estimate, let's say, that is close, close to reality. I think the latest effort I have seen at that was, and you may have
seen it as well, was a series of articles actually that the MIT technology review published
in which they tried to calculate like the net effect, let's say, of AI model training
and the offsets and all of those things. And they did a fairly good job, I have to say,
as far as I was able to tell,
but I think the final outcome was that
it's still tentative.
We can't really pinpoint because there are so many variables
involved basically.
Yeah, there are infinite variables.
Like for example, I bought a server and it's sitting in my
home,
which cannot do many embeddings, but it can do some. But because here in Denmark, the
electricity in my house comes from wind mostly, that means that whatever emissions I do,
they're free. It's not even net. It's like they are not... it's green electrons. And there's
no need to offset them because they are green electrons. So our emissions are zero.
And because I am using a model that is already trained, you could argue, and that's how the
carbon market works, that I could even release credits of carbon because I'm avoiding emissions,
which is crazy, obviously.
But there are ways to minimize the emissions. One, and two,
it's extremely complicated. So not only to understand emissions, but also to handle them
and how the markets of emissions, offsets work. It is extremely complicated and just to pick on your example on the wind generated energy,
well yes sure it doesn't cost like fossil fuels to generate that but what about the resources it
took to build the turbines or you know the bases or to transport all of those things so it's
the only thing we can just say at this point probably is that it's super complicated.
You know the thing that also I wonder is that maybe the answer is that because the data we get is those images, the satellite images we use for that, they need to be processed too. They need
to be downloaded from the satellite, they need to be read and calibrated, all of those things.
If at that time the providers, NASA, ESA, and others,
and I'm being in conversations to promote this idea,
they pick two, three, four models,
the ones that they pick, and they make the embeddings,
it would be incrementally far better on emissions
and on cost. Because it's already in memory.
It's already in the processor.
It's already there, right?
So why not release it together with the image,
release the embedding?
Yeah, yeah, makes sense.
Okay, so that's a very good segue
to touch on the last part of the conversation
because what you mentioned could be a use case.
And use cases are precisely, I figure, the reason why you chose to go one step beyond
the non-profit that brought Clay to life and found a private company
that you can use as a vehicle to offer services around Clay.
So what kind of services could they be?
What kind of use cases and clients
do you envision or do you potentially have already?
We do. We do already have paying customers, very early on. And so it's much more ad hoc.
For those who have been in startups, you know that in the early days
everything is using the product, but also understanding the customer very well and doing things in between, right?
And then again, our biggest issue is that we could do so many things that is hard to
convey the horizontal nature of embeddings, but at the same time, making it visible what
it can do. We do have one customer that is asking, hey, I get some paper from
certain places in the Amazon, and I want to make sure that the paper comes from those
places and not from protected places. So then what we do is: they give us the locations,
and we go and see in the embeddings. If the places that they say they took it
from, there should be a change.
In the embedding before that, because we have access
to historical data, so we can make embeddings
for historical data.
There should be a change from the forest before that
and a forest after that.
And the places that are protected,
there should be no change at all.
We might not be able to say if the trees were cut and given to someone else or things like that, right?
But we are able to do those monitoring services.
And the key is that the creation of the embedding,
the figuring out the semantic and the semantic change,
all of those things can be abstracted into functions that never even touch the customer. All they want is a column in their sheet,
which is: it comes from this place, at this date, yes or no. And that yes or no encapsulates a lot
of work that can easily be abstracted. Also explained, also audited, but the customer
doesn't really care about transformers or embeddings.
They care about, hey, can you check this?
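A hedged sketch of what such a yes/no change check could look like, assuming you have embeddings of the same location from before and after the claimed harvest date; the cosine threshold and random data are placeholders, not Legend's actual pipeline.

```python
# Hypothetical "did this place change?" check behind that yes/no column: compare a
# location's embedding from before and after the claimed date, and flag a change when
# cosine similarity drops below a threshold. Threshold and inputs are placeholders.
import numpy as np

def changed(before: np.ndarray, after: np.ndarray, threshold: float = 0.85) -> bool:
    """True if the scene looks different enough between the two dates."""
    sim = float(before @ after / (np.linalg.norm(before) * np.linalg.norm(after)))
    return sim < threshold

rng = np.random.default_rng(1)
before_emb, after_emb = rng.normal(size=128), rng.normal(size=128)
# Expected pattern: claimed harvest sites should show a change; protected areas should not.
print("change detected:", changed(before_emb, after_emb))
```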
All right. That's an imaginative, I would say, use case.
I mean, I wouldn't think that people would ask you that, but apparently they are.
Do you have other use cases that you can share or ones
that you may suggest?
Find assets. Find things like solar panels or pools or boats or parking lots with cars
or not. Some of these are real cases, some are typical cases.
And what about the structure, let's say, of the company? What stage are you at
right now? Are you bootstrapping or have you raised some capital?
We did raise a SAFE, as it is called, which is early interest. We are now closing a seed round, which is the typical path for a VC company.
And yeah, we are early on.
We have something like six employees and customers and figuring out.
I think the most important thing is that we have a clear idea what the service is, but we are at the same time healthy in not knowing
what the product is. Because if we want, what we're talking about here is changing the geospatial
industry. It's thinking about it completely differently. We don't want... we are not a geospatial
company. We are an answers company, if you will.
And our biggest risk is becoming a geospatial company, of which there are many,
and of which there are other ways to do things.
Exactly, that was also going to be a question, because obviously you know this space better than myself,
but even with my kind of knowledge, I can think of a couple of big companies
that are into geospatial. So it would be like, alright, so how do you compete with them?
Would you want to compete with them?
No, first of all, we love to partner with as many people as possible.
The beauty of some of the things we do is that they are so different and so abstractible
that you could easily
integrate that into a traditional company that
does things traditionally with computer vision and others,
and then uses embeddings as a way to check the answers
or to have a first filter.
Or for example, I mentioned that the embeddings for now
are sometimes not good, but still could be useful to say,
OK, let's make a first pass with embeddings. Let's see the places that we have high confidence that the thing that you're
looking for is indeed there or high confidence that the thing you're looking for is not there.
So, true positives and true negatives. And then we can be sure about those. And then everything
in between, we use the traditional methods. So, even in those cases, we can find ways to partner.
Okay, so the idea would be you use the Legend service,
let's say as a kind of indicator, then you verify the results by having someone manually,
let's say, check them.
Or with other methods, like traditional ones,
or even a human review, a human to do that,
which is something we want to do. We're always designing
human in the loop AI tools, not just automated AI tools because they make errors. And because they
make errors, we basically focus not into replacing the user or the customer, but to making them
10x or 100x more efficient. Okay. So, is the Clay nonprofit, let's say, still
operational? Yeah, very much. The Clay nonprofit is still there. We just received
another $2 million from AWS to continue doing the work. So it's still working. Clay is the
model, and it will continue to be the model, and we will continue improving it as a non-profit in the open. And then Legend is the service for any model,
including Clay. And we focus on Clay and we are close to Clay, so we can extract the most value
of it. But it doesn't mean that we might not use other models. Okay. All right. So any closing thoughts that you'd like to share or I don't know, I guess since you have
legend going on, the typical thing that founders at this stage say is we're hiring.
Are you hiring?
We are hiring.
We are hiring.
I'm busy to find LinkedIn models so people with those skills totally check it out.
Just thank you for the interest.
It does help me because the biggest challenge I face
or we face in my opinion is not technical,
it's not data as I was explaining.
We understand those things
or we think we understand those things.
It's explaining and figuring out what the user wants. I don't
want to push the technology out. I want to be where the customer or where the user is.
I want to understand the difficulties they have and how I can help them. And to do that,
there is a fine balance between talking very technically, which I know I have done
in this conversation, but also listening to the challenges that people have and then
working to solve them.
Because at the end of the day, it doesn't really matter if it's AI or not AI, are you
solving someone's problem?
Thanks for sticking around.
For more stories like this, check the link in bio and follow Linked Data Orchestration.