Orchestrate all the Things - Taking the world by simulation: The rise of synthetic data in AI. Featuring Datagen CTO & Co-founder Gil Elbaz
Episode Date: December 21, 2021. Would you trust AI that has been trained on synthetic data, as opposed to real-world data? You may not know it, but you are probably already doing it, and it's fine, according to the findings of a newly released survey. Piece published on VentureBeat.
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
Would you trust AI that has been trained on synthetic data as opposed to real-world data?
You may not know it, but you are probably already doing it,
and it's fine, according to the findings of a newly released survey.
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.
It's a pleasure to meet you. I'm Gil, the CTO of Datagen and one of the co-founders.
A little bit of my background: I worked in the defense industry in Israel, doing some cool algorithmic things there. I also did my first and second degrees at the Technion, where the second degree was really focused on computer vision on 3D data using deep learning. That was pretty early, 2015-16, before we had a good grasp of how to really manipulate and understand 3D data with deep learning. So there I actually had the opportunity to use synthetic data very, very early on.
And so these were the kind of first sparks
of what we see today.
And I was super surprised that it worked, right?
It seemed like a hack.
It seemed something that shouldn't work, but works anyway.
It was very, very counterintuitive.
And so, you know, looking forward at where graphics was going from 2015 to today,
looking forward at how neural networks were developing going forward, and how the industry
is adopting deep learning at an extremely fast pace, we saw that the manual annotation process,
which today is, you know, very operationally intensive, you go out,
capture pictures of people and things at large scale, and then send it to manual annotation
companies.
This is not scalable, and it doesn't make sense.
And so what we did, and this is something that a lot of Israeli tech companies do, is take a very technological approach instead of an operations-focused approach. We really focused on how to solve this problem with a technological approach that would scale to the needs of this growing industry, and we did it through simulation. By simulating the real world, we were able to create data to train AI to understand the real world, and this is really what we do at Datagen.
We started off, Ofir and I, in garage mode. We had one computer that my mom bought us, which was very nice of her. And we started generating data. Actually, very early on, we started selling data. We had two large companies that we worked with, who were great partners, and we pushed really, really hard to actually get paid for the data. We sold over $100,000 of data before the seed round, before we raised any capital, because we wanted to prove to ourselves that there's a market for this, that this is valuable, that this makes sense. And there was a lot of convincing to do; there were a lot of uphill battles that we went into. And now,
2021, in hindsight, it seems that synthetic data is really becoming very much the standard
in the industry, very much the go-to solution because of the challenges that I mentioned before.
But back then, it was very hard to convince people
to use synthetic data.
There were a lot of comments along the lines of "it will never be as good as real data."
And now we're proving that it is, quantitatively, both internally and externally, with our customers.
And we see industries, complete industries
that are moving towards this.
HoloLens is completely built off of synthetic data.
And so there are some amazing projects
that are just fully focused on synthetic data today.
And we see that this is not just a trend,
it's a shift in the industry and it's going to continue.
OK, thanks. That's quite an elaborate introduction.
And you touched upon a few topics that I wanted to bring up and discuss with you anyway. So, thanks for that.
Let's see how to proceed. Actually, I think the best way would be to touch upon the occasion, let's say, for having this conversation, which is a survey that you have commissioned and which is coming out in a few days. It's precisely about this topic, the use of synthetic data in the industry, specifically in the computer vision industry, because this is what you focus on, I imagine.
One of the things I wanted to ask you was precisely whether you think that these findings, and this way of working with synthetic data, can be generalized beyond computer vision. But let's start from the beginning. Can you say a few words about the survey itself: what it is about, how it was commissioned, who ran the research, who participated, and so on?
Yeah, sure.
So we commissioned a survey
that was completely independent of Datagen.
So we wanted to get an unbiased understanding of the market
and really understand both for ourselves
and also create kind of more transparency
around how synthetic data is being used today
and what is the state of synthetic data in the industry.
We commissioned this with an external company, and we asked a few questions that opened up a lot of insight into the state of synthetic data.
And we were actually very much surprised. We saw that an enormous percentage of people
are actually reporting using synthetic data.
So we have 96% that reported using synthetic data,
which is much higher than what we expected.
And it was very promising.
Over 80% of people said that synthetic data is going to surpass real-world data, and that they're already using an amount of synthetic data equal to or greater than the amount of real data. And this was also
something that we think, and we talk about this with our customers, we talk about this with
the industry folks that we're in good connections with, that it makes sense that most of the data
that's going to be used is synthetic data,
while you're also going to use real data to fine-tune the networks and get through that last step.
But seeing this in practice, having over, I think, 300 independent industry experts answer these and really provide this overarching view
is very, very positive, and supports what we've been promoting for such a long time.
And so this was a great insight for us.
Maybe another one is around edge cases and the motivations for using synthetic data. We saw that edge cases are actually delaying machine learning projects and hurting them in production: almost 60% reported that their training delays could have been mitigated if they had data that covered edge cases.
And so this is something that we also
see with our clients and we help them solve this with synthetic data. Of course, with synthetic
data, you're not as constrained in the data that you're collecting. And so an edge case that you
might not see very often in the real world, like for instance, a kid crossing the street and a car
kind of driving quickly towards the kid.
These are things that you don't want to see and you don't see much in the real world.
But these are things that you can simulate and generate at scale.
And so we do things around, for instance, people falling asleep in the car with the camera in the car,
doing driver monitoring and occupancy monitoring, and we can simulate also these scenarios
that are very hard to find in the real world.
And of course, this helps projects converge much faster.
And so we saw that almost 60% reported this,
and this was a big finding for us as well.
Because when you talk about edge cases, sometimes it sounds like they're less important. But actually, a lot of edge cases are super important, because those are the exact cases where the models fail, and where failures can hurt the actual users the most. So this is a very central theme in what we saw.
There were a few additional points that were extremely interesting as well. But I would say that there are still a lot of people collecting data manually. It makes sense to collect manual data together with synthetic data, but teams now need to focus on their data strategy going forward. A data strategy, at a high level, is: how do I, as a team lead, a director of engineering, a director of computer vision or NLP, or any other kind of machine learning manager, go about collecting and annotating data today and over time? This data strategy is something that I think is still very much in its early phases. People are now making these shifts from real data to synthetic data, but in general, there needs to be more of an overarching focus on this, and I think entire roles will be focused on this.
It comes together with understanding state-of-the-art machine learning, understanding synthetic data, understanding active learning, and putting all of these together into one comprehensive data strategy that's super specific to the actual use case at hand. Whether it's in-cabin vehicle monitoring, robotics, or medical computer vision, these are all very different and all need very different strategies.
Okay, thanks for highlighting some of the findings of the survey. I would also add a little bit to what you mentioned. I think it's in the last part of the survey where you refer to, in a nutshell, the notion that models are already, let's say, sufficiently developed, so the focus should be on data. Instead of having many cycles of developing your model, it's probably better to keep your model relatively stable and keep developing your dataset. It's a relatively new notion: up to now, most people have mostly been paying attention to developing their models. And it kind of fits with the emphasis on datasets, and therefore, as an extension of that, the emphasis on enhancing your datasets with synthetic data.
100 percent. Andrew Ng, early this year, early 2021, started promoting this data-centric approach, and it expanded out to the entire industry very quickly. I think this is something that was understood by many people, but it wasn't formalized and it didn't have a name for a long time. What happened is that it was very quickly adopted, I think, because people all understood that the premise makes a lot of sense, right? The models work great. We have transformers for video. We have great CNNs. We have solutions that work well on edge.
We have solutions that work better on cloud. And so we're in a good situation there. And then
separately, the data has been the pain point. And for a long time, it's been very hard
to try to optimize the data
just because of all of the friction.
But now that the friction is going down
and the models have really converged,
on the one hand there's no real alternative to focusing on the data, and on the other hand, it makes a lot of sense to focus on the data.
We see that it's actually improving performance substantially. So yes, this has all converged in 2021, and it's a big part of why synthetic data is becoming so central right now.
Indeed, indeed. So I would like to take a step back, actually, and examine, let's say, the central premise of generating synthetic data in the first place. You mentioned in your introduction that to many people it seemed counterintuitive, and probably you said so to yourselves as well when you started out; it seemed like a hack. And I have to admit that, not having experience generating or using synthetic data myself, I had the same initial gut reaction, let's say.
So how can this work?
Because it seems like you're generating data artificially, so how can it be representative of the real world? And especially, to connect it to something else that you mentioned that I also wanted to touch upon,
specifically edge cases for use cases
such as autonomous driving, for example.
And these are, as you also mentioned,
notoriously hard to come by.
And this is where models actually fail, because there's not enough data to deal with situations that are not in the 80% of everyday occurrences, let's say.
So what's the premise, what are the guiding principles that you follow when you want to generate synthetic data, to make it representative of the real world?
Yeah. So synthetic data can mean a lot of things. The focus that we have, and the focus that we see working best today, is what's called simulated synthetic data. Simulated synthetic data is a subset of synthetic data that's focused on 3D simulations of the real world, and then capturing virtual images within that 3D simulation to create visual data, data that's fully labeled and can actually be used to train models. In practice, the reason we see this working well is twofold. One is
kind of how we look at networks today. And we see neural networks in a different light.
We see neural networks as algorithms
that take in a lot of data, right?
They can take in hundreds of gigabytes of data.
In practice, let's say we have a neural network to detect a dog in an image, for instance. It takes in 100 gigabytes of dog images, and it outputs a very specific output: a bounding box where the dog is in the image. What the neural network actually does during training is compress and extract the knowledge needed from the domain in order to convert an image into the bounding box. It's like a function that maps the image to a specific bounding box.
And so what we see is the neural networks themselves,
they only weigh a few megabytes
and they're actually compressing hundreds of gigabytes
of visual information and extracting from it
only what's needed.
And so if you look at it like that,
then the neural networks themselves are less of like the interesting part, I guess. They're more just the compression mechanism.
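To put rough numbers on that compression framing, here is a back-of-the-envelope sketch in Python. The figures are illustrative assumptions, not measurements from any real network:

```python
# Back-of-the-envelope: a detector's weights vs. the data it was trained on.
# All numbers here are illustrative assumptions, not measurements.

PARAMS = 5_000_000            # a smallish detection network, ~5M parameters
BYTES_PER_PARAM = 4           # float32 weights
DATASET_BYTES = 100 * 10**9   # ~100 GB of dog images, as in the example above

model_bytes = PARAMS * BYTES_PER_PARAM   # 20 MB of weights
ratio = DATASET_BYTES / model_bytes      # how strongly training "compresses" the set

print(f"model size: {model_bytes / 10**6:.0f} MB")   # model size: 20 MB
print(f"dataset-to-model ratio: {ratio:.0f}x")       # dataset-to-model ratio: 5000x
```

A few megabytes of weights standing in for a hundred gigabytes of pixels is the sense in which the network is "just the compression mechanism" and the data is where the information lives.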
And the interesting part is actually the data. So the data is really the focus here, and the question is: how do I create data that represents the real world in the best way? Going back to simulated synthetic data, there's synthetic data that's based off of GANs, off of generative methods, and this is one way of going about it. But it's very hard to create new information by training an algorithm with a certain dataset and then using it to create more data. It doesn't really work, because you have certain bounds on the information that you're representing.
But we're actually taking a different approach.
And this is what Tesla is doing with their simulation.
And this is what all of the automotive companies are doing.
And this is also what Datagen is doing
with a focus on humans, understanding humans and environments. What we're doing is creating these 3D simulations. What's interesting about the 3D simulations is that instead of going out and collecting video of people doing things, in our case we're collecting information that's disentangled from the real world and is super high quality. This includes collecting super-high-quality scans of people from the real world, collecting high-quality motion capture data of people moving around and doing things, scanning objects, and modeling procedural environments.
And so we're creating these decoupled pieces of information
from the real world.
And the magic is really connecting it together at scale
and providing it in a controllable, simple fashion to the user.
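The pipeline described here, decoupled real-world assets recombined at scale, could be sketched roughly as follows. The asset names and fields are invented for illustration; this is not Datagen's actual API:

```python
# Hypothetical sketch: composing independently captured assets (a person scan,
# a motion clip, an environment, a camera) into one simulated-scene request.
# All names below are made up for illustration.
from dataclasses import dataclass
import random

@dataclass
class SceneRequest:
    identity_scan: str   # high-quality 3D scan of a person
    motion_clip: str     # motion-capture sequence
    environment: str     # procedurally modeled environment
    camera: str          # virtual camera placement

SCANS = ["scan_0001", "scan_0002", "scan_0003"]
MOTIONS = ["reach_for_cup", "turn_head", "fall_asleep"]
ENVIRONMENTS = ["car_interior", "office", "living_room"]
CAMERAS = ["dashboard_cam", "ceiling_cam"]

def sample_scene(rng: random.Random) -> SceneRequest:
    """Independently sample each decoupled asset, then combine them."""
    return SceneRequest(
        identity_scan=rng.choice(SCANS),
        motion_clip=rng.choice(MOTIONS),
        environment=rng.choice(ENVIRONMENTS),
        camera=rng.choice(CAMERAS),
    )

rng = random.Random(0)
batch = [sample_scene(rng) for _ in range(5)]
for scene in batch:
    print(scene)
```

Because each asset is sampled independently, the variance of the combined scenes multiplies, which is the "connecting it together at scale" Gil describes.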
And so at the end of the day, the synthetic data is real data, right?
It's made up of pieces of real data,
but it's constructed together in a way that provides a lot more
variance and a lot more control for the end user. And so this is the real power, the real secret: good synthetic data is made up of a lot of information from the real world. It's just constructed in a way that's much better suited to train our neural networks in practice.
Okay, so I guess I'm imagining that in addition to
having things such as scans of objects and capturing people in motion and the like, you
also need to have a model to connect them and to express
things such as, I don't know, the velocity of throwing an object or moving objects or
how people move or things such as, well, you can't move through walls and all of those
things.
So to me, this kind of hints at using something like game engines, for example.
Do you use those or have you perhaps developed something similar of your own tailored to your specific needs?
Yeah, so definitely.
We use many different base tools, different game engines, and also modeling engines.
And there are many really good ones.
As a startup, you don't want to develop things that already exist and work well and are free for the most part.
And so we definitely leverage these kind of technologies to help us represent our simulations.
On top of them, there are many layers, including machine learning, that are used to create content and expand the content significantly.
So we're not making one clip of 90 seconds like a movie.
We're creating millions of unique humans doing things, with variance, in various environments.
So there are different challenges there. And then we also have
advances in rendering capabilities. The ability to take a 3D scene with a virtual camera and capture an image from it is very GPU intensive, so there's also a lot of innovation around how we capture these images in a super-realistic way at scale. That's also a big challenge. All of these pieces come together, and yes, there are definitely graphics engines in the middle, but they're one part of quite a large stack.
Okay. So my next question would be, so
right, this is how you do it for visual data, which is the kind of data that you deal with.
And in this domain, you kind of have a head start precisely because of the existence of these engines and models of the world, let's say, that have been developed independently and some of which you enhance to work in the way that you need them to work and so on.
Would you say that this approach and this process can be generalized outside your domain?
So to generate data that are not visual, how would you go about it?
And actually, is that something that you consider doing at some point, maybe?
It's a great question. I would divide it into two main buckets of data types. One is unstructured data, and the second is structured data. Unstructured data, like images, audio, or signals, for instance, can be simulated for the most part. There are domains that are harder to simulate and domains that are easier to simulate, but these things can be done.
For audio, let's say, simulating audio is actually relatively easier than simulating visual data. We can place simulated objects within environments and calculate what the audio should be at the end of a pipeline, so in practice that's something very similar to the visual data.
With respect to text, which is semi-structured, and structured data, meaning tabular data, for instance medical records, that's a bit of a different problem. And there,
we actually see a lot of innovation. There are many startups focusing on tabular data,
and this is mostly around privacy. Because tabular data is so sensitive with respect to privacy concerns, there's a lot of focus on creating the ability to simulate data from an existing pool of data, not in order to significantly expand the amount of information, but to create a privacy-compliance layer on top of your data. That layer you can actually send to different data scientists around the world, so that they can start training models and creating insights that you'll then be able to apply to your original real data, or to the new data that's going to be created through your business.
And so this is kind of a separate category of company. We're not really focusing on the tabular side of things, although it's extremely interesting, and I think it's also a big part of the synthetic data story. At the same time, unstructured data like audio is something we could definitely do in the future. People talking might be one of the first ones: we'd have people talking, creating audio waves, but you'd also control the 3D head, the 3D person, the identity, what they're saying, and their ability to talk from a specific place. Then we can create what's called multimodal data, meaning data that has multiple types of modalities, such as visual and audio together, for instance.
I actually think that audio, and dialogue to take a more specific example, is a good one to highlight the challenges. Because, well, I would imagine it's not enough to simply have a model of things such as how sound propagates in a 3D environment and all of that. You actually need to have a model of language: okay, this person is sitting at an office, this person has, I don't know, some task they're doing, so what is the person likely to say? So you probably need to integrate things like language models and semantics and linguistics
and all of that.
Definitely, definitely.
Yeah, and it's actually good that you touch
on language models as well,
because there is also another kind of big shift happening
in the world of AI, which is around foundational models.
Foundational models include these large
language models like GPT-3 and others that have come out around that.
And GPT-3, just for the sake of providing a baseline, is a very powerful language
model that has the ability to pretty much solve
or provide a good baseline
for almost every main language task.
It could be completion, like sentence completion. It can be question answering.
So you can ask it a question
and it will answer you in a reasonably good way.
It could be a chat bot.
All of these things can be powered
by a single super
large language model that was trained on a lot of the internet, and on a lot of curated semi-structured data from around the internet. There are many challenges with it as well: around biases, around privacy, around correctness, and around the rights to the data.
I think that there's going to be a big challenge also in the future with regards to,
is it even legal? Is it ethical for these giant companies to go and scrape the entire internet
and then collect all the data and train their model and then make a business out of it?
These things are going to be big questions going forward
on the legitimacy of foundational models.
But yes, language, or NLP, is really the first domain where we see these foundational models leapfrogging all of the more domain-specific, task-specific models that we see in the other machine learning domains. In practice, almost every machine learning model being used today is domain-specific, task-specific, modality-specific; it's like a computer vision model for detecting dogs in one very specific scenario, even. But looking forward, there's also this potential path, which might be a dangerous path as well, toward foundational models, which is quite interesting.
Yeah.
To come back to another issue that you touched upon: annotation, and related to that, I guess, bias. And when I say related, I mean the fact, well known to people who have done annotation or worked with annotators, that no two annotators necessarily annotate in the same way. So you could say that some kind of bias is inserted there. In contrast to that, in synthetic computer vision data like the data you produce, I imagine that the annotation part, which you briefly touched upon earlier, comes as part and parcel, let's say, of what you get. And this is because it's a simulated environment that you control, and therefore the objects in there are pre-annotated, I suppose. Right?
Definitely. Yes.
So with manual annotation, like you said,
there's a challenge with getting consistent results.
And this actually hinders network performance. In addition, there are biases, and they're a big problem. The biases occur where it's very hard to annotate. For instance, an object in a dark environment will often not get annotated, and it could also be a person in a dark environment, right? So you have these biases that occur where it's hard to annotate the data, and they insert substantial biases into the data. And yes, for everything that's represented in the simulated environment, we can create perfect annotations at runtime, pixel-perfect annotations that don't have any error, because it's all computed.
There's no human in the loop pretty much there.
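This "no human in the loop" labeling can be sketched like so: a renderer knows which object produced every pixel, so labels such as bounding boxes fall out of the instance-ID buffer by pure computation. Here is a toy illustration in plain Python, not Datagen's pipeline:

```python
# Sketch: deriving "pixel-perfect" labels from a simulator's per-pixel
# instance-ID buffer. The buffer and object IDs are made up for illustration.

def bounding_boxes(id_buffer):
    """Return {instance_id: (x_min, y_min, x_max, y_max)} from a 2D grid
    of per-pixel instance IDs, where 0 means background."""
    boxes = {}
    for y, row in enumerate(id_buffer):
        for x, inst in enumerate(row):
            if inst == 0:
                continue  # background pixel, no label
            x0, y0, x1, y1 = boxes.get(inst, (x, y, x, y))
            boxes[inst] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes

# A tiny 4x6 "render" where object 1 is a person and object 2 is a cup.
id_buffer = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]
print(bounding_boxes(id_buffer))  # {1: (1, 0, 2, 2), 2: (4, 1, 5, 2)}
```

Because the IDs come from the simulation itself, the boxes are exact by construction, even for objects a human annotator would miss in a dark scene.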
Okay, I see.
And just to return to those edge cases, how do you actually go about generating those edge cases? You mentioned
previously, for example, scenarios where you have car accidents and the like. I presume, by the way,
that you do actually generate that kind of data and therefore you have clients that use them for
autonomous vehicle training and this sort of thing. So how do you go about generating an edge case
for autonomous vehicles, let's say? Yeah, it's a great question. We focus a lot on the inside
of the vehicle. So let's say someone falling asleep at the wheel, which is a complete edge
case in the real world. To get that kind of data, there are two alternatives. One is you can bring in a thousand different actors and have them fall asleep at the wheel. There are companies that pay actors not to sleep for over 24 hours and then come in, and they have to pay them extra, of course. And there are also companies that bring in the same actor for 10 different sessions; in some of the sessions they don't sleep, in some of them they do, and they try to gather data that way.
These projects are operationally intensive, like you understand, and they cost millions and millions
of dollars. And it also takes a long time. And so the other option, the option that we're
kind of presenting is we actually bring in actors. We capture them falling asleep with
motion capture suits, so high quality motion capture suits. And we also scan many people,
and we have the ability to use latent space representations to generate new unique identities
that don't have any privacy issues. So they're not real people. So they don't look like any
real person that was collected. And so we have the ability to take, let's say, a hundred thousand people and a thousand different motion captures and map them to each other, and in this way you can create millions of data points of various people falling asleep at the wheel in various different ways, wearing different clothing. We can also randomize the motion as they're falling asleep.
There are a lot of computational capabilities that we're adding on top of the data.
But in practice, it's just a much more scalable method of creating this data.
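The combinatorial scaling behind that claim is easy to see in a sketch; the counts below are illustrative, taken from the hundred-thousand-people, thousand-motions example above:

```python
# Sketch of combinatorial data scaling: pairing scanned identities with
# motion-capture clips. Counts and names are illustrative assumptions.
import itertools

n_identities = 100_000   # unique generated identities
n_motions = 1_000        # distinct "falling asleep" motion captures

total_pairs = n_identities * n_motions
print(f"{total_pairs:,} identity-motion combinations")  # 100,000,000 ...

# A tiny concrete version of the same pairing:
identities = ["id_a", "id_b", "id_c"]
motions = ["slump_left", "head_nod"]
samples = list(itertools.product(identities, motions))
print(samples)  # 6 unique (identity, motion) samples
```

A relatively small pool of captured assets multiplies out into millions of distinct training samples, which is what makes the approach more scalable than filming actors.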
And when we think about what edge cases we want to map, many times this is guided by the market. Our customers work on these very challenging problems, and for the most part they know a lot of the edge cases that are hard for them to gather but that they need. So they often come to us and ask for very specific edge cases, and we're happy to create that data at scale for them. The second part is us trying to understand where the industry is going and what might be a challenge.
So for instance, having babies in the car, right? This is a big problem: forgetting children in the car, or forgetting animals in the car. There are a bunch of different problems, and we have a roadmap that's very much aligned with these main problems as well; we run and progress with that independently.
Okay, it sounds like there's even, I would call it, a bit of a directorial aspect to this process. You have to set it up in a way that's realistic.
Definitely, yeah. I think the magic is that we've created an interface internally, so that we can ask for these high-level requirements. We can ask for, let's say, someone falling asleep at the wheel, or two people falling asleep, one in the passenger seat and one in the driver's seat. We can ask for this at a very high level, and then, in a self-serve platform right in front of the users, where they can control all these different parameters, we can create this data at scale. I think this kind of internal capability has allowed us to really scale up efforts and answer the needs of a lot of customers.
Okay, so one of the things
that you mentioned in the beginning, actually, and I share this observation, was that you were surprised by the rate of adoption that synthetic data shows, at least in your domain. So my question on that would be: how much do you think you can extrapolate from this to other industries? We already mentioned the differences that exist between visual synthetic data and other types of synthetic data. So, perhaps let's put it that way: generating synthetic data of other types and for other scenarios may not be at the same stage you are currently at with computer vision synthetic data. But do you think this signifies a trend towards adoption in other domains
and other types of data as well?
And do you expect that to happen soon?
Yeah, I think that anything that has to do with privacy, so using people's information in order to train AI, is going to need to shift to synthetic data. That includes companies whose models are already trained. Take, let's say, Apple's Siri: even if it's already trained, and trained on real data, that means that somewhere inside the company there is a dataset of real people talking, doing things, and all that. And so we see that a lot of companies, the big ones especially, are trying to take the existing data pools that they have and shift them to synthetic, just to get rid of all of the personally identifiable information. I see this not only in computer vision, but in all fields where you have private information that needs to be privacy compliant with today's tools, both in structured data and unstructured data. A lot of what we do is around humans, for instance, which is unstructured data. There's really no reason to keep a significant amount of real visual data. And on the other hand, it makes a lot of sense that this shift will happen within these large companies, across pretty much all domains. It's just too much risk on these big companies to keep the data safe. And, you know, better than any cybersecurity is the ability to just not have problematic data in your databases.

What I'm seeing, let's say, based on your answer, is that maybe this will see adoption for specific requirements, in specific scenarios.
So I wonder how feasible it would actually be to try and generate synthetic data for other use cases, like consumer behavior, let's say around Black Friday or any other of those events. Maybe it can be done, but I'm not sure how reliable that synthetic data would be.
Yeah, I think that what I would say is that's actually a very interesting use case.
I think that, you know, an entire company could be created around that specific use case.
But it is possible, I think.
It's a connection between tabular data and also unstructured, more behavioral data: how they're moving the mouse, what they're doing on the screen, and all of that. But I'm just thinking, if there is an enormous amount of information, and there is, about, let's say, shoppers on Black Friday at Amazon.com, then I'm sure it is possible, in a way, to simulate these interactions, to simulate what is actually happening on the site. And it can also be very intuitive to understand for the product folks that are optimizing the site. And of course, it can be used to train models to then predict things. So actually, I think it should be possible. But again, it's a totally separate company, and there are new challenges that arise from that kind of data.
The challenge I see, to be more specific, is that you may get into a kind of feedback loop situation, where you're training models to predict future behavior based on past behavior that was itself generated. So, in my mind at least, you get back into that hack kind of territory, let's say.

Yeah, exactly. Again, it's different from the simulated approach. We do the simulated synthetic data approach, based on a simulator. This is more like the GAN-generated approach, which, by the way, is also used a lot by the structured synthetic data group. So for me it's much closer to structured synthetic data than unstructured. But I think you're not going to be creating new information there. What you can do is make sure that there's a privacy-compliant version of the Black Friday data, for instance. And that, I think, is possible.
And the goal there would just be for the synthetic data to represent the real-world data in the best way possible, without ruining the privacy of the customers that were on the site. Then they can actually delete the real data at a certain point. So they would have a kind of replacement for the real data, without having to track their customers in a way that is maybe ethically borderline.
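To make that idea concrete, here is a deliberately naive toy sketch, not Datagen's method or any real vendor's pipeline, of producing a privacy-compliant synthetic stand-in for tabular shopper data: sample each column independently from its empirical distribution, so synthetic rows resemble the originals statistically without reproducing any individual customer's record. The column names and values are invented for illustration; real systems would also model correlations between columns, for example with GANs or copulas.

```python
import random

# Hypothetical example rows standing in for real shopper records.
real_rows = [
    {"age_band": "18-25", "basket_value": 42.0, "device": "mobile"},
    {"age_band": "26-35", "basket_value": 130.5, "device": "desktop"},
    {"age_band": "26-35", "basket_value": 77.9, "device": "mobile"},
    {"age_band": "36-50", "basket_value": 210.0, "device": "desktop"},
]

def synthesize(rows, n, seed=0):
    """Draw n synthetic rows by sampling each column independently."""
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    # Pool the observed values per column, then recombine them at random,
    # so no synthetic row is a wholesale copy of a real record.
    pools = {c: [r[c] for r in rows] for c in columns}
    return [{c: rng.choice(pools[c]) for c in columns} for _ in range(n)]

synthetic = synthesize(real_rows, n=10)
```

Once the synthetic set is validated to match the real distributions well enough for the downstream models, the original records can be deleted, which is the "replacement" idea described above.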
Okay. All right, so we can wrap up here, I guess. Do you have any comments to offer on where you think this is going next, both for the industry and for the practice at large? With such high adoption rates already, is that a problem or an opportunity for you as a company? What are your future plans?
Yeah, we definitely see it as an opportunity. We see that this adoption is going to enable the next layers, the next levels of capability in computer vision, and it's also going to bring a lot of computer vision capabilities to production. So we're going to be seeing it in our day-to-day even more: smart stores, smart classrooms, smart offices, all of these things that we want. We see this as an opportunity, something that's going to expand the entire computer vision industry, which, again, is very much in its early stages. We're not yet where the software industry is, for instance. This is very much the first step.
The second thing is, we have to mention the metaverse and everything that's happening there. The hardware of the metaverse is going to be pretty much completely based on synthetic data. We saw it with the HoloLens, Microsoft's AR glasses: hand tracking developed completely based on synthetic data, eye tracking completely based on synthetic data, and now face reconstruction also based on synthetic data. So we see that the hardware enablement is going to be synthetic-data-based, and that later on these capabilities are going to be inserted into the metaverse and become very much part of making the experience, the connection between the real world and the digital world, seamless. So we see a lot of innovation going forward in that direction as well.

And maybe the last part is that, in the future, we don't think PhD students are going to be the only main stakeholders, the only people creating new computer vision capabilities. We're also going to see ways for people with less experience and less specialized knowledge to create amazing computer vision applications, and then connect them into their Android app, into their AR glasses app, into their various devices. So one trend we see going forward is really opening up the market of computer vision to the whole world, to all of the developers in the world, and allowing them to integrate new capabilities pretty much at the click of a button.
I hope you enjoyed the podcast.
If you like my work, you can follow Link Data Orchestration on Twitter, LinkedIn, and Facebook.