Orchestrate all the Things - Taking the world by simulation: The rise of synthetic data in AI. Featuring Datagen CTO & Co-founder Gil Elbaz
Episode Date: December 21, 2021. Would you trust AI that has been trained on synthetic data, as opposed to real-world data? You may not know it, but you are probably already doing it, and it's fine, according to the findings of a newly released survey. Piece published on VentureBeat.
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
Would you trust AI that has been trained on synthetic data as opposed to real-world data?
You may not know it, but you are probably already doing it,
and it's fine, according to the findings of a newly released survey.
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.
It's a pleasure to meet you. I'm Gil, the CTO of Datagen and one of the co-founders.
A little bit of my background: I worked in the defense industry in Israel, doing some cool algorithmic things there. I also did my first and second degrees at the Technion, where the second degree was really focused on computer vision on 3D data using deep learning. That was pretty early, 2015-16, before we had a good grasp of how to really manipulate and understand 3D data with deep learning. So there I actually had the opportunity to use synthetic data very, very early on.
And so these were the kind of first sparks
of what we see today.
And I was super surprised that it worked, right?
It seemed like a hack.
It seemed something that shouldn't work, but works anyway.
It was very, very counterintuitive.
And so, you know, looking forward at where graphics was going from 2015 to today,
looking forward at how neural networks were developing going forward, and how the industry
is adopting deep learning at an extremely fast pace, we saw that the manual annotation process,
which today is, you know, very operationally intensive, you go out,
capture pictures of people and things at large scale, and then send it to manual annotation
companies.
This is not scalable, and it doesn't make sense.
And so what we did, and this is something that a lot of Israeli tech companies do, is take a very technological approach instead of an operations-focused approach. We really focused on how to solve this problem with a technological approach that would scale to the needs of this growing industry, and we did it through simulation. By simulating the real world, we were able to create data to train AI to understand the real world, and this is really what we do at Datagen.
We started off, Ofir and I, in garage mode. We had one computer that my mom bought us, which was very nice of her. And we started generating data. Actually, very early on, we started selling data. We had two large companies that we worked with, who were great partners, and we pushed really, really hard to actually get paid for the data. We sold over $100,000 of data before the seed round, before we raised any capital, because we wanted to prove to ourselves that there's a market for this, that this is valuable, that this makes sense. And there was a lot of convincing to do; there were a lot of uphill battles that we went into. And now,
2021, in hindsight, it seems that synthetic data is really becoming very much the standard
in the industry, very much the go-to solution because of the challenges that I mentioned before.
But back then, it was very hard to convince people
to use synthetic data.
There were a lot of comments along the lines of "it will never be as good as real data."
And now we're proving that it is, quantitatively, both internally and externally, with our customers.
And we see industries, complete industries
that are moving towards this.
HoloLens is completely built off of synthetic data.
And so there are some amazing projects
that are just fully focused on synthetic data today.
And we see that this is not just a trend,
it's a shift in the industry and it's going to continue.
OK, thanks. That's quite an elaborate introduction.
And you touched upon a few topics that I wanted to bring up and discuss with you anyway. So, thanks for that.
Let's see how to proceed. Actually, I think the best way would be to touch upon the occasion, let's say, for having this conversation, which is a survey that you have commissioned and which is coming out in a few days. It's precisely about this topic, the use of synthetic data in the industry, specifically in the computer vision industry, because this is what you focus on, I imagine.
One of the things I wanted to ask you was precisely whether you think that these findings, and this way of working with synthetic data, can be generalized beyond computer vision. But let's start from the beginning. Can you say a few words about the survey itself: what it is about, how it was commissioned, who ran the research, who participated, and so on?
Yeah, sure.
So we commissioned a survey
that was completely independent of Datagen.
So we wanted to get an unbiased understanding of the market
and really understand both for ourselves
and also create kind of more transparency
around how synthetic data is being used today
and what is the state of synthetic data in the industry.
We commissioned this with an external company, and we asked a few questions that opened up a lot of insight into the state of synthetic data.
And we were actually very much surprised. We saw that an enormous percentage of people
are actually reporting using synthetic data.
So we have 96% that reported using synthetic data,
which is much higher than what we expected.
And it was very promising.
Over 80% of people said that synthetic data is going to surpass real-world data, and that they're already using an amount of synthetic data equal to or greater than the amount of real data. And this was also
something that we think, and we talk about this with our customers, we talk about this with
the industry folks that we're in good connections with, that it makes sense that most of the data
that's going to be used is synthetic data,
while you're also going to use real data to fine-tune the networks and get through that last step.
But seeing this in practice, having over, I think, 300 independent industry experts answer these and really provide this overarching view
is very, very positive, and supports what we've been promoting for such a long time.
And so this was a great insight for us.
Maybe another one is around edge cases and the motivations for using synthetic data. We saw that edge cases are actually delaying machine learning projects and hurting them in production: almost 60% reported that their training delays could have been mitigated if they had data that covered edge cases.
And so this is something that we also
see with our clients and we help them solve this with synthetic data. Of course, with synthetic
data, you're not as constrained in the data that you're collecting. And so an edge case that you
might not see very often in the real world, like for instance, a kid crossing the street and a car
kind of driving quickly towards the kid.
These are things that you don't want to see and you don't see much in the real world.
But these are things that you can simulate and generate at scale.
And so we do things around, for instance, people falling asleep in the car with the camera in the car,
doing driver monitoring and occupancy monitoring, and we can simulate also these scenarios
that are very hard to find in the real world.
And of course, this helps projects converge much faster.
And so we saw that almost 60% reported this,
and this was a big finding for us as well.
Because when you talk about edge cases, sometimes it sounds like they're less important. But actually, a lot of edge cases are super important, because those are the exact cases where the models fail, and where failures can hurt the actual users the most. So this is a very central theme in what we saw.
There were a few additional points that were extremely interesting as well. But I would say that there are still a lot of people collecting data manually. It makes sense to collect manual data together with synthetic data, but teams now need to focus on their data strategy going forward. A data strategy, at a high level, is: how do I, as a team lead, a director of engineering, a director of computer vision or NLP, or any other kind of machine learning manager, go about collecting and annotating data today and over time? This data strategy is something that I think is still very much in its early phases. People are now making these shifts from real data to synthetic data, but in general, there needs to be more of an overarching focus on this, and I think entire roles will be focused on this.
It comes together with understanding state-of-the-art machine learning, understanding synthetic data, understanding active learning, and putting all of these together into one comprehensive data strategy that's super specific to the actual use case at hand. Whether it's in-cabin vehicle monitoring, robotics, or medical computer vision, these are all very different and all need very different strategies.
Okay, thanks for highlighting some of the findings of the survey. I would also add a little bit to what you mentioned. I think it's in the last part of the survey where you refer to, in a nutshell, the notion that models are already, let's say, sufficiently developed, so the focus should be on data. Instead of having many cycles of developing your model, it's probably better to keep your model relatively stable and keep developing your dataset. It's a relatively new notion: up to now, most people have mostly been paying attention to developing their models. And it kind of fits with the emphasis on datasets, and therefore, as an extension of that, the emphasis on enhancing your datasets with synthetic data.
100 percent. Andrew Ng, early this year, early 2021, started promoting this data-centric approach, and it expanded out to the entire industry very quickly. I think this is something that was understood by many people, but it wasn't formalized and it didn't have a name for a long time. What happened is that it was very quickly adopted, I think, because people all understood that the premise makes a lot of sense, right? The models work great. We have transformers for video. We have great CNNs. We have solutions that work well on edge.
We have solutions that work better on cloud. And so we're in a good situation there. And then
separately, the data has been the pain point. And for a long time, it's been very hard
to try to optimize the data
just because of all of the friction.
But now that the friction is going down
and the models have really converged,
on the one hand there's no real alternative to focusing on the data, and on the other hand, it makes a lot of sense to focus on the data.
We see that it's actually improving performance substantially. So yes, this has all converged in 2021, and it's a big part of why synthetic data is becoming so central right now.
Indeed, indeed. So I would like to take a step back, actually, and examine, let's say, the central premise of generating synthetic data in the first place. You mentioned in your introduction that to many people it seemed counterintuitive, and probably you said so to yourselves as well when you started out; it seemed like a hack. And I have to admit that, not having experience generating or using synthetic data myself, I had the same initial gut reaction, let's say.
So how can this work?
Because it seems like you're generating data artificially, so how can it be representative of the real world? And especially, to connect it to something else that you mentioned that I also wanted to touch upon,
specifically edge cases for use cases
such as autonomous driving, for example.
And these are, as you also mentioned,
notoriously hard to come by.
And this is where models actually fail, because there's not enough data to deal with situations that are not in the 80% of everyday occurrences, let's say.
So what's the premise, what are the guiding principles that you follow when you want to generate synthetic data, to make it representative of the real world?
Yeah. So synthetic data can mean a lot of things. The focus that we have, and the focus that we see working best today, is what's called simulated synthetic data. Simulated synthetic data is a subset of synthetic data that's focused on 3D simulations of the real world, and then capturing virtual images within that 3D simulation to create visual data, data that's fully labeled and can actually be used to train models. In practice, the reason we see this working well is twofold. One is
kind of how we look at networks today. And we see neural networks in a different light.
We see neural networks as algorithms
that take in a lot of data, right?
They can take in hundreds of gigabytes of data.
In practice, let's say we have a neural network to detect a dog in an image, for instance. It takes in 100 gigabytes of dog images, and it outputs a very specific output: a bounding box where the dog is in the image. What the neural network actually does during training is compress and extract the knowledge needed from the domain in order to convert an image into the bounding box. It's like a function that maps the image to a specific bounding box.
And so what we see is the neural networks themselves,
they only weigh a few megabytes
and they're actually compressing hundreds of gigabytes
of visual information and extracting from it
only what's needed.
And so if you look at it like that,
then the neural networks themselves are less of like the interesting part, I guess. They're more just the compression mechanism.
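To put rough numbers on that compression framing, here is a back-of-the-envelope sketch in Python. The figures are illustrative assumptions, not measurements from any real network:

```python
# Back-of-the-envelope: a detector's weights vs. the data it was trained on.
# All numbers here are illustrative assumptions, not measurements.

PARAMS = 5_000_000            # a smallish detection network, ~5M parameters
BYTES_PER_PARAM = 4           # float32 weights
DATASET_BYTES = 100 * 10**9   # ~100 GB of dog images, as in the example above

model_bytes = PARAMS * BYTES_PER_PARAM   # 20 MB of weights
ratio = DATASET_BYTES / model_bytes      # how strongly training "compresses" the set

print(f"model size: {model_bytes / 10**6:.0f} MB")   # model size: 20 MB
print(f"dataset-to-model ratio: {ratio:.0f}x")       # dataset-to-model ratio: 5000x
```

A few megabytes of weights standing in for a hundred gigabytes of pixels is the sense in which the network is "just the compression mechanism" and the data is where the information lives.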
And the interesting part is actually the data. So the data is really the focus here, and the question is: how do I create data that represents the real world in the best way? Going back to simulated synthetic data, there's synthetic data that's based off of GANs, off of generative methods, and this is one way of going about it. But it's very hard to create new information by training an algorithm with a certain dataset and then using it to create more data. It doesn't really work, because you have certain bounds on the information that you're representing.
But we're actually taking a different approach.
And this is what Tesla is doing with their simulation.
And this is what all of the automotive companies are doing.
And this is also what Datagen is doing
with a focus on humans, understanding humans and environments. What we're doing is creating these 3D simulations. What's interesting about the 3D simulations is that instead of going out and collecting video of people doing things, in our case we're collecting information that's disentangled from the real world and is super high quality. This includes collecting super-high-quality scans of people from the real world, collecting high-quality motion capture data of people moving around and doing things, scanning objects, and modeling procedural environments.
And so we're creating these decoupled pieces of information
from the real world.
And the magic is really connecting it together at scale
and providing it in a controllable, simple fashion to the user.
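The pipeline described here, decoupled real-world assets recombined at scale, could be sketched roughly as follows. The asset names and fields are invented for illustration; this is not Datagen's actual API:

```python
# Hypothetical sketch: composing independently captured assets (a person scan,
# a motion clip, an environment, a camera) into one simulated-scene request.
# All names below are made up for illustration.
from dataclasses import dataclass
import random

@dataclass
class SceneRequest:
    identity_scan: str   # high-quality 3D scan of a person
    motion_clip: str     # motion-capture sequence
    environment: str     # procedurally modeled environment
    camera: str          # virtual camera placement

SCANS = ["scan_0001", "scan_0002", "scan_0003"]
MOTIONS = ["reach_for_cup", "turn_head", "fall_asleep"]
ENVIRONMENTS = ["car_interior", "office", "living_room"]
CAMERAS = ["dashboard_cam", "ceiling_cam"]

def sample_scene(rng: random.Random) -> SceneRequest:
    """Independently sample each decoupled asset, then combine them."""
    return SceneRequest(
        identity_scan=rng.choice(SCANS),
        motion_clip=rng.choice(MOTIONS),
        environment=rng.choice(ENVIRONMENTS),
        camera=rng.choice(CAMERAS),
    )

rng = random.Random(0)
batch = [sample_scene(rng) for _ in range(5)]
for scene in batch:
    print(scene)
```

Because each asset is sampled independently, the variance of the combined scenes multiplies, which is the "connecting it together at scale" Gil describes.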
And so at the end of the day, the synthetic data is real data, right?
It's made up of pieces of real data,
but it's constructed together in a way that provides a lot more
variance and a lot more control for the end user. And so this is the real power, the real secret: good synthetic data is made up of a lot of information from the real world. It's just constructed in a way that's much better suited to train our neural networks in practice.
Okay, so I guess I'm imagining that in addition to
having things such as scans of objects and capturing people in motion and the like, you
also need to have a model to connect them and to express
things such as, I don't know, the velocity of throwing an object or moving objects or
how people move or things such as, well, you can't move through walls and all of those
things.
So to me, this kind of hints at using something like game engines, for example.
Do you use those or have you perhaps developed something similar of your own tailored to your specific needs?
Yeah, so definitely.
We use many different base tools, different game engines, and also modeling engines.
And there are many really good ones.
As a startup, you don't want to develop things that already exist and work well and are free for the most part.
And so we definitely leverage these kind of technologies to help us represent our simulations.
On top of them, there are many layers, including machine learning, that are used to create content and expand the content significantly.
So we're not making one clip of 90 seconds like a movie.
We're creating millions of unique humans doing things, with variance, in various environments.
So there are different challenges there. And then we also have
advances in rendering capabilities. The ability to take a 3D scene with a virtual camera and capture an image from it is very GPU intensive, so there's also a lot of innovation around how we capture these images in a super-realistic way at scale. That's also a big challenge. All of these pieces come together, and yes, there are definitely graphics engines in the middle, but they're one part of quite a large stack.
Okay. So my next question would be, so
right, this is how you do it for visual data, which is the kind of data that you deal with.
And in this domain, you kind of have a head start precisely because of the existence of these engines and models of the world, let's say, that have been developed independently and some of which you enhance to work in the way that you need them to work and so on.
Would you say that this approach and this process can be generalized outside your domain?
So to generate data that are not visual, how would you go about it?
And actually, is that something that you consider doing at some point, maybe?
It's a great question. I would divide it into two main buckets of data types. One is unstructured data, and the second is structured data. Unstructured data, like images, audio, or signals, for instance, can be simulated for the most part. There are domains that are harder to simulate and domains that are easier to simulate, but these things can be done.
For audio, let's say, simulating audio is actually relatively easier than simulating visual data. We can place simulated objects within environments and calculate what the audio should be at the end of a pipeline, so in practice that's something very similar to the visual data.
With respect to text, which is semi-structured, and structured data, meaning tabular data, for instance medical records, that's a bit of a different problem. And there,
we actually see a lot of innovation. There are many startups focusing on tabular data,
and this is mostly around privacy. Because tabular data is so sensitive with respect to privacy concerns, there's a lot of focus on creating the ability to simulate data from an existing pool of data, not in order to significantly expand the amount of information, but to create a privacy-compliance layer on top of your data. That layer you can actually send to different data scientists around the world, so that they can start training models and creating insights that you'll then be able to apply to your original real data, or to the new data that's going to be created through your business.
And so this is kind of a separate category of company. We're not really focusing on the tabular side of things, although it's extremely interesting, and I think it's also a big part of the synthetic data story. At the same time, unstructured data like audio is something we could definitely do in the future. People talking might be one of the first ones: we'd have people talking, creating audio waves, but you'd also control the 3D head, the 3D person, the identity, what they're saying, and their ability to talk from a specific place. Then we can create what's called multimodal data, meaning data that has multiple types of modalities, such as visual and audio together, for instance.
I actually think that audio, and dialogue to take a more specific example, is a good one to highlight the challenges. Because, well, I would imagine it's not enough to simply have a model of things such as how sound propagates in a 3D environment and all of that. You actually need to have a model of language: okay, this person is sitting at an office, this person has, I don't know, some task they're doing, so what is the person likely to say? So you probably need to integrate things like language models and semantics and linguistics
and all of that.
Definitely, definitely.
Yeah, and it's actually good that you touch
on language models as well,
because there is also another kind of big shift happening
in the world of AI, which is around foundational models.
Foundational models include these large
language models like GPT-3 and others that have come out around that.
And GPT-3, just for the sake of providing a baseline, is a very powerful language
model that has the ability to pretty much solve
or provide a good baseline
for almost every main language task.
It could be completion, like sentence completion. It can be question answering.
So you can ask it a question
and it will answer you in a reasonably good way.
It could be a chat bot.
All of these things can be powered
by a single super
large language model that was trained on a lot of the internet, and on a lot of curated semi-structured data from around the internet. There are many challenges with it as well: around biases, around privacy, around correctness, and around the rights to the data.
I think that there's going to be a big challenge also in the future with regards to,
is it even legal? Is it ethical for these giant companies to go and scrape the entire internet
and then collect all the data and train their model and then make a business out of it?
These things are going to be big questions going forward
on the legitimacy of foundational models.
But yes, language, or NLP, is really the first domain where we see these foundational models leapfrogging all of the more domain-specific, task-specific models that we see in the other machine learning domains. In practice, almost every machine learning model being used today is domain-specific, task-specific, modality-specific; it's like a computer vision model for detecting dogs in one very specific scenario, even. But looking forward, there's also this potential path, which might be a dangerous path as well, toward foundational models, which is quite interesting.
Yeah.
To come back to another issue that you touched upon: annotation, and related to that, I guess, bias. And when I say related, I mean the fact, well known to people who have done annotation or worked with annotators, that no two annotators necessarily annotate in the same way. So you could say that some kind of bias is inserted there. In contrast to that, in synthetic computer vision data like the data you produce, I imagine that the annotation part, which you briefly touched upon earlier, comes as part and parcel, let's say, of what you get. And this is because it's a simulated environment that you control, and therefore the objects in there are pre-annotated, I suppose. Right?
Definitely. Yes.
So with manual annotation, like you said,
there's a challenge with getting consistent results.
And this actually hinders network performance. In addition, there are biases, and they're a big problem. The biases occur where it's very hard to annotate. For instance, an object in a dark environment will often not get annotated, and it could also be a person in a dark environment, right? So you have these biases that occur where it's hard to annotate the data, and they insert substantial biases into the data. And yes, for everything that's represented in the simulated environment, we can create perfect annotations at runtime, pixel-perfect annotations that don't have any error, because it's all computed.
There's no human in the loop pretty much there.
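This "no human in the loop" labeling can be sketched like so: a renderer knows which object produced every pixel, so labels such as bounding boxes fall out of the instance-ID buffer by pure computation. Here is a toy illustration in plain Python, not Datagen's pipeline:

```python
# Sketch: deriving "pixel-perfect" labels from a simulator's per-pixel
# instance-ID buffer. The buffer and object IDs are made up for illustration.

def bounding_boxes(id_buffer):
    """Return {instance_id: (x_min, y_min, x_max, y_max)} from a 2D grid
    of per-pixel instance IDs, where 0 means background."""
    boxes = {}
    for y, row in enumerate(id_buffer):
        for x, inst in enumerate(row):
            if inst == 0:
                continue  # background pixel, no label
            x0, y0, x1, y1 = boxes.get(inst, (x, y, x, y))
            boxes[inst] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes

# A tiny 4x6 "render" where object 1 is a person and object 2 is a cup.
id_buffer = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]
print(bounding_boxes(id_buffer))  # {1: (1, 0, 2, 2), 2: (4, 1, 5, 2)}
```

Because the IDs come from the simulation itself, the boxes are exact by construction, even for objects a human annotator would miss in a dark scene.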
Okay, I see.
And just to return to those edge cases, how do you actually go about generating those edge cases? You mentioned
previously, for example, scenarios where you have car accidents and the like. I presume, by the way,
that you do actually generate that kind of data and therefore you have clients that use them for
autonomous vehicle training and this sort of thing. So how do you go about generating an edge case
for autonomous vehicles, let's say? Yeah, it's a great question. We focus a lot on the inside
of the vehicle. So let's say someone falling asleep at the wheel, which is a complete edge
case in the real world. To get that kind of data, there are two alternatives. One is you can bring in a thousand different actors and have them fall asleep at the wheel. There are companies that pay actors not to sleep for over 24 hours and then come in, and they have to pay them extra, of course. And there are also companies that bring in the same actor for 10 different sessions; in some of the sessions they don't sleep, in some of them they do, and they try to gather data that way.
These projects are operationally intensive, like you understand, and they cost millions and millions
of dollars. And it also takes a long time. And so the other option, the option that we're
kind of presenting is we actually bring in actors. We capture them falling asleep with
motion capture suits, so high quality motion capture suits. And we also scan many people,
and we have the ability to use latent space representations to generate new unique identities
that don't have any privacy issues. So they're not real people. So they don't look like any
real person that was collected. And so we have the ability to take, let's say, a hundred thousand people and a thousand different motion captures and map them to each other, and in this way you can create millions of data points of various people falling asleep at the wheel in various different ways, wearing different clothing. We can also randomize the motion as they're falling asleep.
There are a lot of computational capabilities that we're adding on top of the data.
But in practice, it's just a much more scalable method of creating this data.
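The combinatorial scaling behind that claim is easy to see in a sketch; the counts below are illustrative, taken from the hundred-thousand-people, thousand-motions example above:

```python
# Sketch of combinatorial data scaling: pairing scanned identities with
# motion-capture clips. Counts and names are illustrative assumptions.
import itertools

n_identities = 100_000   # unique generated identities
n_motions = 1_000        # distinct "falling asleep" motion captures

total_pairs = n_identities * n_motions
print(f"{total_pairs:,} identity-motion combinations")  # 100,000,000 ...

# A tiny concrete version of the same pairing:
identities = ["id_a", "id_b", "id_c"]
motions = ["slump_left", "head_nod"]
samples = list(itertools.product(identities, motions))
print(samples)  # 6 unique (identity, motion) samples
```

A relatively small pool of captured assets multiplies out into millions of distinct training samples, which is what makes the approach more scalable than filming actors.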
And when we think about what edge cases we want to map, many times this is guided by the market. Our customers work on these very challenging problems, and for the most part they know a lot of the edge cases that are hard for them to gather but that they need. So they often come to us and ask for very specific edge cases, and we're happy to create that data at scale for them. The second part is us trying to understand where the industry is going and what might be a challenge.
So for instance, having babies in the car, right? This is a big problem: forgetting children in the car, or forgetting animals in the car. There are a bunch of different problems, and we have a roadmap that's very much aligned with these main problems as well; we run and progress with that independently.
Okay, it sounds like there's even, I would call it, a bit of a directorial aspect to this process. You have to set it up in a way that's realistic.
Definitely, yeah. I think the magic is that we've created an interface internally, so that we can ask for these high-level requirements. We can ask for, let's say, someone falling asleep at the wheel, or two people falling asleep, one in the passenger seat and one in the driver's seat. We can ask for this at a very high level, and then, in a self-serve platform right in front of the users, where they can control all these different parameters, we can create this data at scale. I think this kind of internal capability has allowed us to really scale up efforts and answer the needs of a lot of customers.
Okay, so one of the things
that you mentioned in the beginning, actually, and I share this observation, was that you were surprised by the rate of adoption that synthetic data shows, at least in your domain. So my question on that would be: how much do you think you can extrapolate from this to other industries? We already mentioned the differences that exist between visual synthetic data and other types of synthetic data. So, perhaps let's put it that way: generating synthetic data of other types and for other scenarios may not be at the same stage you are currently at with computer vision synthetic data. But do you think this signifies a trend towards adoption in other domains
and other types of data as well?
And do you expect that to happen soon?
Yeah, I think that anything that has to do with privacy, so using people's information in order to train AI, is going to need to shift to synthetic data. That includes companies whose models are already trained. Take, let's say, Apple's Siri: even if it's already trained, and trained on real data, that means that somewhere inside the company there is a dataset of real people talking, doing things, and all that. And so we see that a lot of companies, the big ones especially, are trying to take the existing data pools that they have and shift them to synthetic, just to get rid of all of the personally identifiable information. I see this not only in computer vision, but in all fields where you have private information that needs to be privacy compliant with today's tools, both in structured data and unstructured data. A lot of what we do is around humans, for instance, which is unstructured data. There's really no reason to keep a significant amount of real visual data. And on the other hand, it makes a lot of sense that this shift will happen within these large companies, across pretty much all domains. It's just too much risk on these big companies to keep the data safe. And, you know, better than any cybersecurity is the ability to just not have problematic data in your databases.

What I'm seeing, let's say, based on your answer, is that maybe this will see adoption for specific requirements, in specific scenarios.
So I wonder how feasible it would actually be to try and generate synthetic data for other use cases, like consumer behavior, let's say around Black Friday or any other of those events. Maybe it can be done, but I'm not sure how reliable that synthetic data would be.
Yeah, I think that what I would say is that's actually a very interesting use case.
I think that, you know, an entire company could be created around that specific use case.
But it is possible, I think.
It's a connection between tabular data and also unstructured, more behavioral data: how they're moving the mouse, what they're doing on the screen, and all of that. But I'm just thinking, if there is an enormous amount of information, and there is, about, let's say, shoppers on Black Friday at Amazon.com, then I'm sure it is possible, in a way, to simulate these interactions, to simulate what is actually happening on the site. And it can also be very intuitive to understand for the product folks that are optimizing the site. And of course, it can be used to train models to then predict things. So actually, I think it should be possible. But again, it's a totally separate company, and there are new challenges that arise from that kind of data.
The challenge I see, to be more specific, is that you may get into a kind of feedback loop situation, where you're training models to predict future behavior based on past behavior that was itself generated. So, in my mind at least, you get back into that hack kind of territory, let's say.

Yeah, exactly. Again, it's different from the simulated approach. We do the simulated synthetic data approach, based on a simulator. This is more like the GAN-generated approach, which, by the way, is also used a lot by the structured synthetic data group. So for me it's much closer to structured synthetic data than unstructured. But I think you're not going to be creating new information there. What you can do is make sure that there's a privacy-compliant version of the Black Friday data, for instance. And that, I think, is possible.
And the goal there would just be for the synthetic data to represent the real-world data in the best way possible, without ruining the privacy of the customers that were on the site. Then they can actually delete the real data at a certain point. So they would have a kind of replacement for the real data, without having to track their customers in a way that is maybe ethically borderline.
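To make that idea concrete, here is a deliberately naive toy sketch, not Datagen's method or any real vendor's pipeline, of producing a privacy-compliant synthetic stand-in for tabular shopper data: sample each column independently from its empirical distribution, so synthetic rows resemble the originals statistically without reproducing any individual customer's record. The column names and values are invented for illustration; real systems would also model correlations between columns, for example with GANs or copulas.

```python
import random

# Hypothetical example rows standing in for real shopper records.
real_rows = [
    {"age_band": "18-25", "basket_value": 42.0, "device": "mobile"},
    {"age_band": "26-35", "basket_value": 130.5, "device": "desktop"},
    {"age_band": "26-35", "basket_value": 77.9, "device": "mobile"},
    {"age_band": "36-50", "basket_value": 210.0, "device": "desktop"},
]

def synthesize(rows, n, seed=0):
    """Draw n synthetic rows by sampling each column independently."""
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    # Pool the observed values per column, then recombine them at random,
    # so no synthetic row is a wholesale copy of a real record.
    pools = {c: [r[c] for r in rows] for c in columns}
    return [{c: rng.choice(pools[c]) for c in columns} for _ in range(n)]

synthetic = synthesize(real_rows, n=10)
```

Once the synthetic set is validated to match the real distributions well enough for the downstream models, the original records can be deleted, which is the "replacement" idea described above.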
Okay. All right, so we can wrap up here, I guess. Do you have any comments to offer on where you think this is going next, both for the industry and for the practice at large? With such high adoption rates already, is that a problem or an opportunity for you as a company? What are your future plans?
Yeah, we definitely see it as an opportunity. We see that this adoption is going to enable the next layers, the next levels of capability in computer vision, and it's also going to bring a lot of computer vision capabilities to production. So we're going to be seeing it in our day-to-day even more: smart stores, smart classrooms, smart offices, all of these things that we want. We see this as an opportunity, something that's going to expand the entire computer vision industry, which, again, is very much in its early stages. We're not yet where the software industry is, for instance. This is very much the first step.
The second thing is, we have to mention the metaverse and everything that's happening there. The hardware of the metaverse is going to be pretty much completely based on synthetic data. We saw it with the HoloLens, Microsoft's AR glasses: hand tracking developed completely based on synthetic data, eye tracking completely based on synthetic data, and now face reconstruction also based on synthetic data. So we see that the hardware enablement is going to be synthetic-data-based, and that later on these capabilities are going to be inserted into the metaverse and become very much part of making the experience, the connection between the real world and the digital world, seamless. So we see a lot of innovation going forward in that direction as well.

And maybe the last part is that, in the future, we don't think PhD students are going to be the only main stakeholders, the only people creating new computer vision capabilities. We're also going to see ways for people with less experience and less specialized knowledge to create amazing computer vision applications, and then connect them into their Android app, into their AR glasses app, into their various devices. So one trend we see going forward is really opening up the market of computer vision to the whole world, to all of the developers in the world, and allowing them to integrate new capabilities pretty much at the click of a button.
I hope you enjoyed the podcast.
If you like my work, you can follow Link Data Orchestration on Twitter, LinkedIn, and Facebook.