The Data Stack Show - 145: What is Synthetic Data? Featuring Omar Maher of Parallel Domain
Episode Date: July 5, 2023
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack.
They've been helping us put on the show for years and they just launched an awesome new product called Profiles.
It makes it easy to build an identity graph and complete customer profiles right in your warehouse or data lake.
You should go check it out at rudderstack.com today. Welcome back to the Data Stack Show.
Costas, we have a really exciting subject, synthetic data, but in an even more exciting context, which is imagery, video, and self-driving
cars. So Omar from Parallel Domain is going to, I hope, I'm confident he's going to teach us so
much about synthetic data. And I think we're going to just learn a ton about self-driving cars and what it takes to even get training data
and go through that entire process.
I'm really interested in synthetic data in general.
I think we had one other guest on the show, actually; we've had very few episodes
on synthetic data.
And parallel domain is pretty specific, right? They're pretty opinionated on the type of data that they work with. And it's
like the most extreme type, right? You're talking about imagery, you're talking about, you know,
labeling that is geometric. I mean, it's crazy. So I guess on a personal level, I want to know
what attracted Omar to that sort of difficult
problem.
So, that's what I'm going to ask him about.
How about you?
Yeah.
You're right.
I think it's the second show that we have about synthetic data.
And the first one, if I remember correctly, it was more about kind of tabular data, right?
So in here, it's going to be more about visual data.
I mean, I have so many questions, to be honest.
I want to understand what the difference is
between this data.
What does it mean to have synthetic data for that
compared to have synthetic data for tabular data?
Understand what it means to simulate this, what are the types of the labels that you use,
and why, and all that stuff. And I want to see what's the relationship also between what a company like Parallel Domain is doing and things like
3D graphics and computer games. Why? You know, like all these things that, I mean, since we had
the first computer, we were trying to simulate reality in a way, right? So there's a lot of
overlap between so many different domains. And I'd love to hear from him
about how much of an overlap we have there and what that overlap means in terms of
moving knowledge from one domain to the other, right? And what does this mean for the future
of like all these very interesting technologies. So let's dive in and let's have this conversation with him.
Let's do it.
Omar, welcome to the Data Stack Show.
So great to have you.
Eric, pleasure to be here.
Thank you for hosting me.
All right, you have a rich background
in all sorts of machine learning
and really sort of interesting AI use cases.
So give us your journey.
How did you get into data?
And then what led you to parallel domain?
Awesome.
So I've been playing with data since I was in college, when I was interning at a major
software company in Egypt, where I'm from originally.
And I was applying for a web dev
kind of, you know, role for the internship. And then I met one of my best friends now,
who inspired me about, you know, business intelligence, data warehousing, you know,
data engineering, stuff like that. And my journey began then. That was, what, 13 or 14 years ago, something like that. So I started in business intelligence and data warehousing, I moved to data mining once I graduated, and I co-founded two technical startups that used machine learning heavily for personalized recommendations. We were building some sort of Yelp of Egypt, you know, social reviews and stuff like that. And then from there, I started assuming director slash VP roles in multiple companies for advanced analytics and machine learning, right?
So I worked for some time in Dubai, and then I moved to the United States to build a machine
learning team in a company called Esri, the world leader in geospatial analytics. And it was super
fun. We put together a team of, you know, AI experts. So it's been data throughout my life.
So yeah, about 13, 14 years of playing with data, mostly machine learning related,
working with customers in different places and building products around that.
Awesome. Do you remember, now this is rewinding in history a bit, but do you remember that friend
who sort of got you into business intelligence, using databases to do analytics? Do you remember the first thing he showed you
that made you think like, oh, wow, this is like, I have to dig into this?
He showed me a dashboard. And yes, actually, it's fun, because actually, it's the first time I think
about that. So thank you for reminding me of this beautiful moment.
I literally remember that moment when I was sitting, you know, at his desk and he was
like sharing the news with me that, hey, Omar, we don't really have openings for like web
dev, but we have a super interesting thing on the side that I'm doing that I'd love to
have you work on: BI, data warehousing.
And he started showing me, you know, this beautiful dashboard.
I think it was some sort of like Microsoft technology or something, you know, and then
he started talking about, you know, the process behind that dashboard, what they're doing
to clean the data and store it in data warehouses, etc.
So yeah, that was the moment.
Interesting.
Maybe Power BI or something.
So you were interested in web dev. Did it actually feel a little bit contiguous, because you write code on the back end and then you're displaying it or giving some sort of visual experience?
It was an interesting moment, because I was
like, I spent almost a year preparing myself to become an expert web dev. You know, I took a course, I read a lot of things, I built different systems, for my gym, for example, you know, the one I was going to, etc. But the twist of having a look at the world of data and analytics and, you know, data warehouses at that point was interesting, because I started seeing the kind of business value that this can drive.
Because you can build web applications that people can use and this is going to generate data, but what are you going to do with this after that?
Unless you do something to make that data accessible, fresh, clean,
useful, it's not going to be of much use. And that's
the art and science of data analytics, which got me
eventually into data mining and stuff. So I think the pivot was nice.
Yeah. Describe for us a project that you were working on where you needed to train a model, and it was
just so painful because you didn't have the raw material that you needed.
Yeah, that's easy, because I can reference literally like 30 or 40 projects that didn't start.
Every project?
No, like literally. If I just take the last five or six years of my life here in the States, working as a director of AI at Esri, I think at least 50% of the computer vision projects didn't start.
So we were using satellite imagery and drone imagery, for example, to help our customers extract intelligence out of that, right? Everything from detecting building footprints, to assessing damage after hurricanes, to working with insurance to quantify the impact of storms on houses, to working with agriculture on understanding crop health and assessing crop growth and stuff like that. All of these projects were computer vision related, and we needed high-quality labeled data to train the deep learning models to make these detections. And unfortunately, at least half of those projects didn't start because we didn't have that good-quality labeled data. So customers, for example, would have very little labeled data. And by labeled data, as you guys obviously know, they have the imagery and they have the labels, which is like the polygons around the houses, or the labels for the crops, whether they're healthy, etc. So either they would have zero labeled data, or they'd have some that wasn't enough. And even when we showed them how to label data themselves, like the tools, they didn't have the time, the expertise, or the workflow to support it. So unfortunately, at least 50% of these projects didn't start, in agriculture, in national government, in insurance, even in retail, because of that reason. And that's why I was so excited to join a company that's doing synthetic data, because I've been living in this pain for at least half of my career.
Yeah.
Okay.
I have a dumb business question though.
So you're working with satellite imagery at a multi-billion dollar company.
Is there not a revenue line to launch some satellites and start taking those images and
create, almost vertically
integrate the data that you need? I don't know anything about the regulations of launching a
satellite into space, but it's actually interesting to me, can you not just take more pictures from a
satellite? I mean, I'm sure you can't, but that's interesting to me.
Well, actually, here's the thing. So the good news is that we are in no shortage of pictures, right? Like, we have a ton of satellite imagery already from the existing providers. It's just the fact that we need to label those images, right?
So it's a labeling issue, not an image issue.
Exactly. I mean, yeah, there is definitely a need to have like, you know,
high quality and recent images and stuff like that. But the good thing is that there are a lot
of companies out there already doing that. And, you know, you can contract them, you can work
with them, you can buy data from them, etc. The thing is, do you have the needed labeled data on top of this imagery to do the workflows? That's usually the bottom line, right? And there are a lot of companies out there providing labeling services, etc. It's just, when you have a customer interested in doing something and they don't have that data right now, that's usually the problem, right?
Sure, yeah, that makes total sense. Okay, so tell us about synthetic data. When did you first experience it, actually?
I'm just going to ask you about all these pivotal moments throughout the last decade and a half.
When did you first experience synthetic data?
And what was the light bulb that went off where you said, wow, I think this is actually the solution?
Totally.
So I used to hear about synthetic data early on, like, I think, what, four or five years ago. There were different attempts to use GANs, you know, generative adversarial networks, to come up with synthetic images and use those as input training data for, let's say, geospatial work, right? So there are a bunch of companies out there doing these kinds of experiments. You know, hey, we're trying to detect this kind of vehicle, for example, from satellite imagery, but we don't have a lot of data. So let's go generate some synthetic data using GANs. The quality wasn't great, to be honest. The results weren't great. So yeah, you know, I heard about it, but didn't read much into it.
I think the pivotal moment, and I remember that moment, was when I was doing my daily exercise on the treadmill, and I was watching the Tesla AI Day, the first one, where they showed for the first time their simulation and synthetic data engines. And it was amazing, because they showcased really impressive technology that can simulate, with high realism, different locations, different weather conditions, and different edge cases, long-tail scenarios that you want to train those self-driving cars on, you know, like pedestrians jaywalking, for example, or cars or objects partially occluded, etc.
And it was pretty realistic.
And they spoke about the different components that go into this kind of synthetic data engine,
the simulation, the rendering, the graphics, the weather parameters, etc.
So that was the first time that I saw that, and it was pretty impressive.
And I thought, wow, this is much needed in many different domains, not just self-driving
cars.
Yeah, absolutely.
Well, let's step back just a minute.
Do you have, so synthetic data makes sense, right?
You're sort of creating something that isn't naturally occurring, right?
Like we hear, you know,
synthetic drugs or other things like that, where it's like, okay, you're combining
different things to make something that doesn't necessarily naturally occur in the real world.
It makes sense that synthetic data follows along those same lines, but
how would you define synthetic data? Do you have sort of a top level definition of like
what demarcates synthetic data or makes it categorically synthetic data as opposed to,
you know, a training data set or something else?
Absolutely. So I think at its heart, it's any data that is generated artificially by computers
instead of being generated or captured
in real life, right? So that's the bottom line, right? And there are different ways to generate
this data. You can use classical ways like procedural generation, for example, game engines,
et cetera, or you can use AI-based methods like GANs or stable diffusion or self-supervised learning,
any kind of techniques that we hear about today. So there are different ways to generate this data,
but the bottom line is, it's data that's artificially generated by computers, trying to be as realistic and as similar as possible to the real-world data, instead of being explicitly captured in real life, right? So a car having a camera collecting data out there, that's real-life data. A form collecting input data that users are entering, that's real-world data. Now, you start using statistics to generate tabular data that looks similar to that? Synthetic. You start using graphics and procedural generation and other AI-based methods to generate good-looking images that could be used for training machine learning models? That's synthetic. Now, one thing
to mention, though, usually when synthetic
data is mentioned, it's usually mentioned in a machine
learning context as data
suitable to train machine learning models.
So, theoretically,
you can generate synthetic data to do different things with it, right?
But usually when it's mentioned these days,
it's like for machine learning purposes
to make machine learning models better,
whether those are computer vision or not.
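To make that definition concrete, here is a toy sketch in Python of procedural generation in the spirit Omar describes: because the program places every object in the scene, it knows every label with perfect accuracy, no human annotator needed. The scene, object class, and sizes are all invented for illustration; this is not Parallel Domain's pipeline.

```python
# A toy illustration of procedural generation with "free" labels:
# render random rectangles and emit bounding-box labels automatically,
# because the generator already knows where every object is.
import numpy as np

rng = np.random.default_rng(seed=0)

def make_synthetic_sample(height=128, width=128, n_objects=3):
    """Render a trivial scene and return (image, labels)."""
    image = np.zeros((height, width, 3), dtype=np.uint8)
    labels = []
    for _ in range(n_objects):
        # Sample an object's position and size -- the "simulation" step.
        h, w = rng.integers(10, 40, size=2)
        y, x = rng.integers(0, height - h), rng.integers(0, width - w)
        color = rng.integers(0, 256, size=3, dtype=np.uint8)
        image[y:y + h, x:x + w] = color   # the "rendering" step
        # The label costs nothing: we placed the object, so we know its box.
        labels.append({"bbox": (int(x), int(y), int(x + w), int(y + h)),
                       "class": "rectangle"})
    return image, labels

image, labels = make_synthetic_sample()
print(labels)  # perfect ground truth, no human annotator involved
```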
Yeah, totally.
Now, that's what Parallel Domain does, right?
So can you give us...
So Parallel Domain is entirely in the world of synthetic data.
What does parallel domain do?
And what specific type of synthetic data do you specialize in?
What do you create?
How do your customers use it?
Absolutely.
So I want to expand on this, but maybe just a little before that, I want to share something
about the inspiration or the motivation of why this company was built and why this whole journey started.
And I think we touched partially on that, right? I mean, here's the thing, getting back to the
example that I was saying about geospatial imagery: not having labels, not having good labels, is one problem with real-world data. In fact, there are a lot of other problems or challenges. One, it's way time-consuming, like the time you spend on collecting real-world data. Let's take an example. Let's
say you're trying to get some data for a self-driving car or a smart car to figure out its way and stuff like that, right? You go out and you start collecting a ton of data. And then you need to process this data. And then you need an army of people to label this data, because you need different kinds of labels, right? Like, you need bounding boxes, you need semantic segmentation, you need depth. It needs a lot of things. And while you're doing the labels, it usually needs a lot of quality assurance, right? So you have people labeling, you have people revising those labels, and then you have the need to understand the label specs, how exactly you are going to label this, etc. So labeling is a problem. It's error-prone, like, a lot of faults and mistakes. I've experienced this. I worked with labeling teams, for example, and the definition of how we need to label something could vary widely from one person to the other, not to mention from one company or team to the other, right? Like, when you say, for example, let's label bicycles: do you have a separate label for the bicycle itself and the rider, or are they all in the same bounding box? That's a label spec that you need to align on, right?
High impact.
Exactly. So you spend a lot of time collecting data, you spend a lot of time labeling data, you usually have a lot of errors in these labels. And sometimes some labels aren't that, you know,
feasible to collect in the first place. Like, let's say, for example, estimating depth for cars, right? So any self-driving car needs to be able to accurately estimate depth from images, right? Some of them would use LiDAR to estimate the depth. But in addition to LiDAR, you also need to use images. So how are you going to label depth? It's very hard for a person to actually label depth. But it's easy to do in synthetic data, because you have total control over that. How are you going to label for optical flow, which is understanding the motion of objects? It's very hard. You've got to keep track of the different frames of the images and say if this moved or not. So some labels are almost virtually impossible to label.
And finally, the iteration speed. Let's say you come up with a new idea; you want to have another label, another object. So it's slow, getting labels requires a lot of time, it's error-prone, some labels are almost impossible to get in real life, and it's not so efficient for iteration.
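A minimal sketch of why a label like depth comes for free in simulation: with a known pinhole camera and known object geometry, per-object depth and a tight 2D bounding box follow directly from projection. The intrinsics and the cube below are invented for illustration, not anyone's real sensor.

```python
# Depth and box labels "for free" from known 3D geometry: project the
# object's corners through an assumed pinhole camera and read off labels.
import numpy as np

# Assumed intrinsics: focal length 500 px, principal point at (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project(points_cam):
    """Project Nx3 camera-frame points to pixels; return pixels and depths."""
    depths = points_cam[:, 2]                  # z in the camera frame
    pixels = (K @ points_cam.T).T
    pixels = pixels[:, :2] / depths[:, None]   # perspective divide
    return pixels, depths

# Eight corners of a 1 m cube sitting about 10 m in front of the camera.
corners = np.array([[x, y, z] for x in (-0.5, 0.5)
                              for y in (-0.5, 0.5)
                              for z in (9.5, 10.5)])
px, depth = project(corners)
bbox = (*px.min(axis=0), *px.max(axis=0))      # tight 2D box label
print("2D bbox:", np.round(bbox, 1), "| depth label:", depth.min(), "m")
```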
This is not saying that we should not be using real-life data. I think it's very important. But complementing it with something better that can solve all of these problems makes sense. And that's why Parallel Domain started, right? So the company started about five years ago. The founder, Kevin McNamara, actually had a background in graphics and gaming, you know, built a lot of cool projects in major companies. He led simulation-related programs at Apple, for example, etc.
And the idea is, hey, let's build something that can simulate real worlds and use this
in different contexts.
And it happened that one of the early customers wanted to use these virtual worlds for machine learning purposes, right? To train better perception models for,
you know, self-driving cars. And that actually was the moment where the company actually focused
big time on creating these virtual worlds, synthetic data to empower perception machine
learning teams. So what Parallel Domain does is we specialize in building and generating synthetic data
to help AI perception teams develop better machine learning models.
And by develop, I mean both training and testing, right?
So we generate highly realistic, diverse, and labeled synthetic data for camera, LiDAR, and radar. And we're mostly focusing on outdoor scenarios, empowering perception teams working with self-driving cars, you know, ADAS systems, smart features in cars, helping them with better parking and stuff like that. Delivery drones, outdoor robotics, you name it. We're
working with the top names in the industry. We're working with like, you know, Google, Toyota Research Institute, Continental,
Woven Planet, etc. And what we do is we work closely with those machine learning teams
on helping them improve the performance of their machine learning models by providing synthetic
data, especially for the cases where their models aren't performing greatly, right?
So in many cases, this would be like, you know, the edge cases or the long tail of scenarios.
Like think, for example, helping those cars better detect partially occluded objects.
Let's say you have a kid, you know, partially occluded by a bus.
You definitely don't want the car to miss that detection, right?
And here's the question.
How many instances of that kid, in that case, can you have in real life as training data to better train your model, right? Like, you can barely have, what, tens of these scenarios? You need millions, or hundreds of thousands at least, right? And that's what synthetic data excels at. So we provide highly realistic and labeled data for these edge cases and long-tail scenarios. You know, jaywalking, vulnerable road users standing on the side, partially occluded, parking scenarios, debris detection.
We simulate a lot of debris scenarios, for example, etc.
And we help customers get that data, train their models, and the result, most of our customers will get better performance detecting these edge cases or improving on these edge cases.
And that happens actually across the board. And yeah, so we've been five years in the business working with these customers. And throughout that journey, we have built a team of about 70-something people, in both San Francisco and, mostly, Vancouver, Canada.
But we are a global company.
We have people in Germany and different other parts of the world.
So yeah, we're pretty excited about that.
Amazing.
Wow, I have so many questions, but I know Costas probably has many of them.
I'll keep it to two, just a couple more questions for me.
The first thing that sticks out that's really interesting is, you know, there's that old
phrase, you know, time doesn't slow down for any man.
You know, like everyone gets the same number of hours on the day.
But it's really interesting to think about, you know, a partially exposed kid, you know,
in front of a bus, right?
Time doesn't speed up to train an AI model either.
And it doesn't happen that often, but it changes a family, a community, a city when it does happen. You know, and that's really interesting to think about. Let me put on my hat as a parent, you know, and I know you're a parent too.
Yep.
And if you and I didn't work in the data industry, AI already seems a little bit mysterious.
And then if you told me, well, we're training this model that makes the car brake, but we're using a lot of fake data to do this. Fake may be unfair, right? It's synthetic data. But if you told me that and I didn't work in the data industry, that might be a little worrisome to me, right? Where it's like, I don't quite understand AI. And then I find out that you want to make this vehicle brake in front of my child, but you're using a bunch of synthetic data or generated data to do that.
What would you say to that person who doesn't work in the data space to demystify this process
of using synthetic data to make things safer? I think my immediate answer would be,
hey, does it really lead to a safer self-driving
in that case or not?
If the answer is yes,
then I would personally care less
about what might have led to that, right?
Like whether it's synthetic or real.
And that's the real benchmark here.
The kind of tests that these companies are doing is just like tremendous.
They do a ton of tests on real world data to measure the accuracy and safety of these
cars, right?
So for example, like most of them would run like daily regression tests on multiple scenarios
to see if the accuracy of the car detecting different objects and pedestrians and kids
and vehicles is increasing or decreasing.
And that's the real benchmark that I think we should focus on.
If using synthetic data or fake data or whatever it may be is leading to performance improvement on detection in general, and specifically on edge cases that we know for a fact are so limited to collect in real life, then I would be happy here. Because the bottom line, the end result, is this car is becoming safer, or this robot or this drone is becoming smarter, etc. And I think this is one of the moments where the attention should mostly be on the outcome versus the mechanism. So for example, there is a parallel, adjacent conversation happening about deep learning being a black box, for example, right?
Like we don't understand a ton
of what's happening
inside deep neural networks,
but in some applications
like computer vision,
the end result that this network
can detect with high accuracy,
like, you know, faces or objects
or detect possible, for example,
diseases in like MRI scans
or stuff like that,
that is what matters, right?
So I think in summary, I think if it's really pushing the edge and improving the performance
and making driving safer, I think there would be less concern over whether this data is real or
fake. And the good thing is that it's easy to benchmark that. It's easy to test that when we test it on real-life data and see the performance.
Yeah, for sure.
That's how you manage.
It sounds like that's how you manage divergence, right?
Like you control by actually doing, you know, you may speed up the process of learning,
but that gives you actually a faster testing cycle to understand how it happens in the real world in
testing. And so really, it's almost an accelerant. You're not removing the real world element.
You're actually just shifting it to the testing component to understand how it's going to play
out in the real world. Would you say that's fair? Exactly. And we still use it for training too. Like, you know, most of the successful experiments slash work that we do with customers,
it's usually a combination between synthetic and real data, right? So usually it's much easier to
get larger amounts of good data in the synthetic world. So you would train your models, let's say,
on synthetic data for a certain use case, and then you would fine-tune those models with real data that you have collected already, right? Let's say, hypothetically, you're using a million synthetic images, and you're fine-tuning with, you know, 30,000 or 40,000 real-world images. That usually yields the best performance so far. Sometimes synthetic-only yields great performance, but so far, it's a combination of both. And as you mentioned, at the end of the day, when we test, it's mostly on real data, because that's where the cars or the drones or the robots are going to be interacting at the end of the day. Sometimes we use synthetic data for testing too. I can share with you some context on that later, but it's mostly testing on real data for now.
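As a rough sketch of the train-on-synthetic, fine-tune-on-real recipe Omar describes, here is the two-phase loop in PyTorch. The model, data shapes, and hyperparameters are toy stand-ins invented for illustration, not what any perception team actually runs.

```python
# Phase 1: pretrain on a large synthetic pool; Phase 2: fine-tune on a
# smaller real set at a lower learning rate so the model adapts to the
# real domain. Toy tensors stand in for images and detection targets.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

synthetic_set = TensorDataset(torch.randn(1000, 3 * 32 * 32), torch.randn(1000, 4))
real_set      = TensorDataset(torch.randn(100, 3 * 32 * 32), torch.randn(100, 4))

model = nn.Sequential(nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 4))
loss_fn = nn.MSELoss()  # stand-in for a real detection loss

def run_epochs(loader, lr, epochs):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Phase 1: learn general structure from plentiful, perfectly labeled synthetic data.
run_epochs(DataLoader(synthetic_set, batch_size=32, shuffle=True), lr=1e-4, epochs=3)
# Phase 2: fine-tune on the scarcer real data at a lower learning rate.
run_epochs(DataLoader(real_set, batch_size=32, shuffle=True), lr=1e-5, epochs=3)
```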
Fascinating. Okay, Costas, I am enamored here, but I have to stop myself, and I'm actually
disappointed that you didn't stop me before, because I've been going for so long. So please jump in.
It's okay, you can continue if you want. It's fine. First of all, you are having an amazing conversation, so it was a great pleasure for me to listen to all that stuff.
But I have a question based on this conversation actually.
And Omar, I'd like to ask you, you mentioned like labels many times, right?
So we have like the data and then the data needs to be labeled so we can go and like do the training and all that stuff.
Would you say that synthetic data, in the domain you are primarily working with at least, with, let's say, visual information, radar data, and all that stuff, would you say that synthetic data is the process of starting from the labels and generating the images, for example?
Is this an accurate way
of defining and describing
let's say
synthetic data?
Defining the
labels and working with the labels is definitely
a critical piece in the pipeline.
It's not necessarily the starting
point though, right? The conversation
can definitely start from labels, right?
Like in many cases, we have conversations with customers,
hey, I have a problem with that model
that is trying to detect, let's say, the depth of objects.
So we immediately understand
that we're going to need to provide labels for depth,
which is easy to do with synthetic data, right?
But in reality, when you start generating synthetic data, there is a whole pipeline slash process that you go through, ranging from accurately mimicking that location, so whether it's urban, suburban, or highway, for example. So in Parallel Domain, for example, we support a lot of maps, a lot of locations. And then there is a piece for procedural generation, where you start using procedural generation to generate these buildings and structures and all that, and generate the agents themselves, the pedestrians, the vehicles. So there is a lot of technology in place already that has been used throughout the previous years to generate this. There is another piece for rendering, where you start visually rendering those pieces in the puzzle.
So you render the buildings, the grass, the objects, the pedestrians, the vehicles.
And that's the graphical piece, where you want to make sure that these look as realistic as possible.
And on top of those, there is the label piece.
How do you want to label this data?
Do you want to go with only one type of label, like 2D object, 2D bounding boxes, for example? Or do you want to do more? For example,
depth, optical flow, 3D bounding boxes, motion segmentation, or things that relate to motion, like optical flow. Sometimes it's instance segmentation, where you want to label only the boundaries of the object. So I think labels are pretty critical.
They usually come on top or as part of the whole pipeline,
which includes location, procedural generation, simulation.
Simulation actually is very important.
So we have whole teams working on how the reality should be simulated,
like how those agents should be interacting in real life.
Because you can have some pretty images
showing pedestrians, et cetera,
but if you want to do sequential
or you want to have like realistic images,
you want to make sure that the pedestrians,
for example, in this scenario,
are behaving as pedestrians in real life.
The cars are behaving.
You don't want to find cars like, you know,
driving on buildings, for example, right?
So having accurate simulations,
both for the agents in the world
and for the sensors,
because we have spent so much time accurately mimicking a lot of the sensors that our customers are using.
So how do you simulate, you know,
different types of cameras,
different types of LiDAR sensors,
different types of radar sensors.
And on the top of that,
there is the labels.
So as you can see,
different pieces working together to generate highly realistic data, because that's important if you want to use it in conjunction with real-world data.
Yeah, that makes total sense. So, just to understand this a little bit better. I mean, I assume, okay, you have a modern car, even if it's not, let's say, a car with an autopilot or anything like that, it has a couple of different sensors, right?
Things that can measure distances.
Some of them have cameras.
Some of them are coming with lidars.
So when we are talking about generating synthetic data, because the first thing that someone
can visualize about that is that we recreate a video, right?
Or an image that has a scene inside of, I don't know, let's say, a car violating a red or something like that, right?
But it's interesting what you say, because it's not just, let's say, that you are trying to simulate the environment. You are also trying to simulate the perception that the machine has, right? So tell us a little bit more about that. Where are the boundaries, and where does the generation actually start? Is it, let's say, do you do something like you create the scene without considering the sensor, and then you also put in the sensor and the output of the sensor? How does this work when we're talking about this type of data?
That's a great question. Actually, I want to answer this question, but I wanted to use something first
that would lead to the answer to that question, which is like the domain gap. Okay. So the domain gap, it's a pretty common thing
that almost every perception engineer
who has been working with synthetic data knows about.
This is the kind of the gap between the real data
that the models are going to be operating against
at the end of the day
and the synthetic data that we could be using.
This gap actually could happen for multiple reasons.
It could be like on the visual aspect, like the data would look different.
And it could look different for multiple reasons.
Let's say the graphics, for example, engine of the synthetic data isn't of high quality.
So it would look like graphics versus real-world data.
The lights might be different.
The weather conditions might be different.
The agent distribution.
In real life, for example,
in the area that you would like to launch these models,
you might have high density of pedestrians and vehicles.
So if you are training with synthetic data
with low densities,
it's not going to be of much help, right?
Because you would like to mimic the real world.
So I guess what I'm trying
to say, the gap between synthetic and real is a big topic when it comes to synthetic data. And
closing that gap or bridging that gap or shrinking it is usually like a major thing. You would try to
as much as possible to make it minimal because that does impact the model performance. So in summary, we simulate the
world itself. And then we simulate the sensor: where is it placed, what kind of sensor is it, the angle of view, and all of those things. And we simulate where it is placed on the vehicle. And the output of that is that you have a realistically simulated world, you have a realistically simulated vehicle and placement of that sensor. And we simulate the intrinsics of
the sensor itself too, like, you know, the specifics of what goes inside that sensor.
So you are simulating all these elements together. So the output is highly realistic images or scenes captured from that very sensor in that very location
within that world, right? So let's go into some detail. So first piece, like you simulate the
actual world. So this goes back to the pipeline thing that I was discussing, right? So first of
all, we start with the location. Is it urban, suburban, or highway? So we have different maps
for different locations. Like for example, a lot of our customers, like, you know, they test these vehicles in the Bay Area, for example. So we have maps for
San Francisco for all the streets and stuff like that. So we can mix and match some urban and
highway scenes, for example, there. And then you start adding the different pieces of the pipeline
on top, you know, the simulation to nicely simulate the agents, the rendering, so visually render those agents and structures, etc, etc. So that's the kind of the world build,
if you think about it. So you build actual simulated world. And within those worlds,
you have the vehicles. One of those vehicles is the ego vehicle, which is the vehicle that
the model is going to be deployed in. And think of it as the kind of the ego view of that vehicle driving right and then
we have different models of vehicles depending on the dimensions of the vehicle etc and then one
thing our customers can configure is where they would like to place the sensor. Like, what is the actual sensor placement? Is it at the front, at the back, at the rear? Where exactly? What's the location? So that's usually something that we do. And once you do that, we also
simulate the intrinsics of the sensor, like what model exactly,
how does it work from inside, etc. And we start simulating the kind of
scenes that would be captured from that sensor. So in that case, it would be
as realistic as possible to the actual data captured by the real sensors in the real world.
Because we simulated almost everything in that pipeline, the world, the sensor, the placement,
the location, the vehicle, and that would result into like data that looks similar to how the real
world data would be captured from that sensor. So for example, sometimes our customers have a fisheye sensor, which makes the image warp in some way. It has some advantages for machine learning applications. You can capture
wider angles and detect more things.
Some others have
normal camera sensors
that are not that wide.
We take that into consideration to
generate realistic images. It's a long answer.
Sorry for that.
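To illustrate the sensor-intrinsics point, here is a small sketch contrasting a pinhole projection with an equidistant fisheye model, one common approximation of fisheye lenses (real calibrated lenses add distortion terms on top). The focal length and principal point below are invented for illustration.

```python
# The same 3D point lands on different pixels under a pinhole model
# versus an equidistant fisheye model (r = f * theta), which is why the
# simulator has to match each customer's actual sensor.
import numpy as np

f, cx, cy = 400.0, 320.0, 240.0   # assumed focal length / principal point

def pinhole(p):
    x, y, z = p
    return np.array([f * x / z + cx, f * y / z + cy])

def fisheye_equidistant(p):
    x, y, z = p
    theta = np.arctan2(np.hypot(x, y), z)   # angle off the optical axis
    phi = np.arctan2(y, x)
    r = f * theta                            # equidistant projection
    return np.array([r * np.cos(phi) + cx, r * np.sin(phi) + cy])

point = np.array([3.0, 0.5, 5.0])   # a point well off-axis, 5 m ahead
print("pinhole :", pinhole(point))
print("fisheye :", fisheye_equidistant(point))  # pulled toward the center
```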
No worries. It's a very fascinating topic, to be honest, because it's one of these things that are easy to visualize, because it is visual, right? Like, we walk outside and we take it for granted that we recognize everything, that we can understand depth, you know, that we can do all these things. And okay, we generate all this synthetic data to go and train the models to be able to approximate what a human does.
But there's so much information in these scenes, right?
And it is kind of amazing to think how complicated it is, how much information has to be managed there, and how you do that in order to approximate reality at the end, right? And something related to that: you mentioned, let's take San Francisco, right? I'm sure it's probably one of the most, I don't know, well-documented cities in the world in terms of data, for these kinds of scenarios. So this simulation that you are creating, how realistic is it? Like, if I as a human were watching a video of this on my laptop, how realistic would this look to me?
We can play the game, actually. I can show you some images, real and synthetic, from Parallel Domain, and you would judge. My guess is that at least 50% of those you will not be able to tell apart.
Okay. Because this is not me trying to brag or anything, but we have a lot of dedicated teams and people just trying to, you know, perfectly simulate lighting and optics, for example. You know, how would this visually look with different kinds of lights, whether it's natural light or artificial light.
People, artists, for example, working on how different agents and animals and cars would look, et cetera. And then there is the behavior of these agents, too. So I think, in summary, most of the images that we generate in general would look as highly realistic as possible. Sometimes, when you look super close, especially with pedestrians and something, you can guess that this might be similar to games, synthetic, not real. But I would say in many cases you would not be able to differentiate between both. And that is generally required, to be as highly realistic as possible. And what goes inside that is, you know, behavior simulation, accurate visual rendering, and trying to bring in the impact of things like light and weather, and nicely simulating rain and stuff like that.
Okay, that's super
interesting. So, do you see, first of all, there are other domains out there that are trying to do, for different reasons, similar things, right? Like, you have computer games, for example, that's one thing, right?
Then you have CGI in movies, right?
More on the entertainment side of things.
Or like VR, right? Okay, even if you could generate such a detailed representation of reality, probably the problem with VR
is the fidelity of the hardware right now.
But outside of this,
what's the overlap
between the things that you are doing
and these domains?
How many, let's say,
best practices or techniques or whatever are coming from there?
And how what you are doing can give back to them, right?
Yep, that's a great question.
Actually, I want to start by saying that I think about 20% or something, if not more, of the engineers working at Parallel Domain, or the teams in general, are coming from a gaming background.
So they've been working in building games and simulation engines, etc.
And I think there is a huge benefit that we get from these established industries because
they have already been building technologies, whether it's for simulation or rendering or
visualization or graphics, CGI, procedural generation, etc.,
that we can leverage to build these simulated worlds.
Because if you think about it, we're trying to do something similar to what they're doing,
maybe for a different purpose, which is enhancing machine learning models, in our case,
versus entertainment.
So I think we are standing on the shoulders of giants,
because we're pretty much using all the advancements in that capacity.
On the other side, we do two things.
One, we bring in the advancements in content generation using generative AI,
which is, as you know, like exploding these days.
So we are using, you know, different generative AI techniques to scale content generation.
So instead of like manually building these assets and
crafting these assets, et cetera, there is a way that you can, for example, ask generative AI to
come up with a hundred different variations of road debris, with the right prompts, that you don't need to create manually. So techniques like Stable Diffusion, for example, are empowering a lot of that, right? And we are using a lot of that.
We're using this to create different pieces of content inside our virtually simulated world
at a scale that could not be possible with traditional means.
So bringing both worlds, the advancement in the gaming industry and simulation and CGI, etc.,
along with the advancements in AI, making both work together, in my opinion,
would lead to the best results.
And that's what we're seeing inside.
We are actually generating different content using generative AI techniques like stable
diffusion.
And that generated content has resulted in improving a lot of the models' performance that we're seeing, right?
So detecting some objects, actually, the detection accuracy increased
after it was trained with content generated from both the existing engines that we have
and, you know, generative AI techniques.
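As a hedged illustration of that prompt-driven asset-variation workflow, here is what it might look like with the open-source Hugging Face diffusers library. This shows the general idea only, not Parallel Domain's internal tooling; the checkpoint name and prompt are just examples.

```python
# Generate many variations of one asset class from prompts instead of
# hand-modeling each one. Requires the diffusers library and a GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for i in range(100):
    image = pipe(
        f"photorealistic road debris on asphalt, variation {i}, "
        "scattered tire fragments and wood, overcast daylight"
    ).images[0]
    image.save(f"debris_{i:03d}.png")
```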
Regarding your question of, like, could this go back to gaming industry?
I think absolutely, yes.
Maybe we are not actively doing this as a company today
because our major focus is working with machine learning teams.
But I cannot see a reason why the advancements that we're building
would not feed into the gaming industry.
I think it can go both ways.
Okay, that's awesome.
All right, one last question from me,
and then I'll give the microphone back to Eric.
So, okay, we talked primarily about synthetic data
in what you called the...
It has to do with perception,
so it's more like visual, let's say, in a way.
But data can be many different things, right?
We can have textual data, obviously.
We can have, let's say, tabular data.
We can have structured, unstructured, blah, blah, blah.
So many different types of data, right?
What do you see happening when it comes to all this technology
about synthetic data to other data domains?
Do you see opportunities there?
Do you see a need there?
Tell us a little bit more about that.
How can we generalize what you are doing to other types of data, too?
Absolutely. I think
synthetic data is pretty much
required across the board, to be honest.
I think any domain slash vertical where it's hard to get high-quality labeled data to train machine learning models is pretty much going to require it. So let me give you
some examples.
In other careers, I worked with health, and getting access to health data on the individual level, for example, is super hard, right? If not impossible, due to privacy
concerns that are very well understood. In that case, I think synthetic tabular data would make
perfect sense for health-related scenarios where you're trying to build machine learning models that would detect fraud, waste, and abuse, for example. So fraud, waste, and abuse
happens and costs the US alone, if I remember the number correctly, more than 100 billion dollars annually, right? And there is a lot of demand to apply analytics and machine learning to detect fraud, waste, and abuse.
And using that with real-life data is not so feasible because you have a ton of researchers across the country who can do innovative work,
but they might not be able to access this data due to different reasons.
And giving them highly realistic,
synthetic data that mimics the real world would enable them to build these
kinds of fraud detection algorithms. Same thing happens with financial data and transactions, right? If you want to build fraud detection systems, if you want to enable researchers to do that, you can't just give them private, personal data for financial transactions. So same thing. I can think of many other cases, but also with unstructured data. It's not just computer vision. Let's say with speech, right?
So I think if you want to build models, that would be great at transcribing audio to text.
You need a lot of labeled data.
And how can you get that?
Not a lot of companies have that.
So having highly realistic and labeled audio data would help with that.
I can also think of like other domains like, you know, text.
I mean, we are seeing the explosion that's happening with GPT, et cetera.
And it's amazing because it's using self-supervised techniques, right?
So it's trying to predict the next word. And the label is already included somehow in the data itself.
So it does not need explicit labels necessarily to do these tasks.
But I think for like tabular data, things like financial transactions, health, insurance,
same thing with audio data.
I think these are all examples of domains that would benefit from synthetic data.
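For the tabular case, here is a deliberately simple sketch of the statistics-based approach Omar mentioned earlier in the episode: fit aggregate statistics of real numeric columns and sample new rows that preserve them without copying any real record. The columns are invented stand-ins, and production tools use richer generative models plus formal privacy guarantees.

```python
# Statistics-based tabular synthesis: fit the mean and covariance of
# real numeric columns, then sample synthetic rows that preserve those
# aggregate relationships without reproducing any individual record.
import numpy as np

rng = np.random.default_rng(seed=1)

# Stand-in "real" transactions: amount, hour of day, merchant risk score.
real = np.column_stack([
    rng.lognormal(3.0, 1.0, size=5000),   # transaction amount
    rng.integers(0, 24, size=5000),       # hour of day
    rng.beta(2, 5, size=5000),            # merchant risk score
])

mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

synthetic = rng.multivariate_normal(mean, cov, size=5000)
print("real mean:     ", np.round(mean, 2))
print("synthetic mean:", np.round(synthetic.mean(axis=0), 2))
```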
That's awesome.
Thank you so much.
Eric, it's all yours again.
All right.
Well, we are at the buzzer here.
But Omar, question for you.
And I hope you're okay with this, but I want to just dip our toe into the water of the
ethical question as it relates to AI.
And I think specifically, you deal with a very interesting component of this in that you have a lot of experience with sort of labeling.
And traditionally, labeling has been handled by a large human workforce.
But companies like Parallel Domain can create synthetic data and automatically label that, right?
And so there are lots of
different opinions about this. And we're not a show that has expertise on economics or anthropology
or politics, but maybe you were employing people who labeled a bunch of data, and now Parallel Domain is doing that. How do you see that playing out?
Because there are a lot of people who think this will create new career opportunities. Some people
are worried. Can you help our listeners and us think through the impact of that? Because you're
really kind of on the bleeding edge of this and probably have already seen some of it, you know,
within the realm of labeling even.
Absolutely. In short, we're still definitely going to require labelers to help with improving machine learning models and making them safer. It's just the nature of the work could differ a bit, right? So instead of spending a ton of time, for example, drawing bounding boxes around objects, or doing sophisticated semantic or instance masks on objects, which takes a lot of time, these efforts are going to be redirected into, let's say, higher-level forms of quality assurance, for example, right? So for example, when you start training those models and running them in the real world, you come up with different performance degrees, right? So having someone to understand where these detections are missed, and where we need to improve those models, this is still going to require some level of supervision and quality, right? So that's one thing. Another thing is providing
some sort of quality assurance and monitoring for the synthetic data itself.
So even with like highly stable synthetic data engines, when you generate the data, sometimes there are like problems.
Sometimes there are things that you don't like to be there or that you would like to modify.
So this is still going to require some level of supervision.
So you see where I'm going?
I think we still have a lot of tasks,
a lot of need to employ this expertise.
Maybe in a place that will lead to better results
or faster results or safer, you know, autonomy.
So it's just, and this is similar to any kind of technological advancement, you start to have some jobs shifting to different shapes and structures.
Sure.
So that's one thing. The other thing: I think we're still going to require real-world labels anyway. Like, I don't see, in the near future at least, that synthetic data will totally replace real-world data, like 100%. I think we're still going to need some level of real data, whether this is like 10% or 30%. We can debate that, or even better, we can find out.
So I think we're still going to require it.
So yeah, that's my answer.
Yeah, I love it.
Well, I mean, the real world changes, right?
I mean, we don't live in a static world.
And so managing the dynamic nature of reality, I agree, certainly requires
a human interface. Well, Omar, this has been absolutely fascinating. The time flew by.
We have many more questions. So come back and join us. And thank you for the time that you've already given us.
Thank you so much for hosting me.
I really enjoyed the discussion.
And thank you for reminding me
of the beautiful moments
of my early data career, actually.
That's a nice starting point.
Thank you for sharing those with us.
That was special.
Absolutely.
Thanks, Eric.
Thanks, Costas.
Very nice to meet you.
Wow, Costas, what a conversation
with Omar from Parallel Domain. I mean,
it's almost impossible not to have fun if you're talking about synthetic data to train AI models for self-driving cars and other imagery use cases. I mean, just unbelievable.
So I loved it. I think one of the biggest things that stuck out to me was, you know, when you
scan Hacker News, when you see news headlines, especially related to ChatGPT and all these other AI technologies,
they tend to be extreme. AI is taking away jobs or we need to stop AI. We need to slow down.
We have all these people who signed a letter. Omar was so balanced. It almost felt personal to him where
he felt like he could stop a self-driving vehicle from hitting someone in the street
because he can provide better data. It felt very personal. It felt very practical. There was no mystery.
And he had a level of confidence that I think, you know, just really, to me, invalidated just
a lot of the headlines that we see. I mean, he's really doing work on the front lines with,
you know, companies who are building cars that are trying to drive themselves, and that was encouraging to me. I think that's my big takeaway, is that he's balanced.
He's brilliant, obviously.
But he's very confident that this is a way to make things safer,
even using synthetic data.
We shouldn't be scared.
We actually should be excited.
And that was a big takeaway for me.
Yeah, 100%.
I think a big difference between what Omar is doing and what is happening with things like ChatGPT is that in Omar's case, the use case is very explicit, right? We understand exactly what's the output of improving these models, right? It's more safety. We are not going to have so many accidents.
We are not going to be worrying
if a child is going to be hit by a car
and all that stuff.
And by the way, it's not just the self-driving cars.
As he said, many of these models are used today
as part of other sensors that cars have, you know, the regular cars that we all drive, to make sure that if an accident is going to happen, the car can help you react faster and stuff like that, right? So it's not just the extreme of having a fully autonomous car, right? And on the other hand, you have ChatGPT,
which is like a very impressive
and very accessible technology, right?
Everyone can go and type on this thing.
You can't easily go and get a fully autonomous car and see it moving, right? But people, at the same time,
they see something big, something great there,
but they don't necessarily understand what the impact is going to be.
We are still trying to figure out this, right?
And that generates fear.
So I think that's one of the reasons that you see such a diverse,
let's say, reaction between the different technologies.
And I mean, at the same time, I agree with Omar.
At the end, we are going to see something similar happening with all the ChatGPT-like kinds of technologies.
That's one thing.
The other thing that I'm going to keep from this conversation is that there's so much overlap between the state of the art in Omar's domain and things like virtual reality, CGI, games, all these things. And I can't wait to see how these industries are going to be using whatever is happening in synthetic data today, and generative AI, to create even more innovation. So at least we are going to have fun. That's the feeling I get.
At the minimum, we're going to have fun.
Yeah.
And I think people like Omar
are the right people to be working on this
because of their disposition and value system.
So thank you for joining us again
on the Data Stack Show.
Many great episodes coming up. If you liked this one, just wait, we have so many more good ones
coming out soon. If you haven't subscribed, subscribe on Spotify, Apple Podcasts,
whatever your favorite network is, and tell a friend if it's valuable to you.
And we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.