The Data Stack Show - 145: What is Synthetic Data? Featuring Omar Maher of Parallel Domain

Episode Date: July 5, 2023

...

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack. They've been helping us put on the show for years and they just launched an awesome new product called Profiles. It makes it easy to build an identity graph and complete customer profiles right in your warehouse or data lake. You should go check it out at rudderstack.com today. Welcome back to the Data Stack Show. Costas, we have a really exciting subject, synthetic data, but in an even more exciting context, which is imagery, video, and self-driving
Starting point is 00:00:48 cars. So Omar from Parallel Domain is going to, I hope, I'm confident he's going to teach us so much about synthetic data. And I think we're going to just learn a ton about self-driving cars and what it takes to even get training data and go through that entire process. I'm really interested in synthetic data in general. I think we've had one other guest on the show covering synthetic data, actually, so very few episodes on it. And Parallel Domain is pretty specific, right? They're pretty opinionated on the type of data that they work with. And it's like the most extreme type, right? You're talking about imagery, you're talking about, you know,
Starting point is 00:01:36 labeling that is geometric. I mean, it's crazy. So I guess on a personal level, I want to know what attracted Omar to that sort of difficult problem. So, that's what I'm going to ask him about. How about you? Yeah. You're right. I think it's the second show that we have about synthetic data.
Starting point is 00:02:00 And the first one, if I remember correctly, was more about tabular data, right? So here, it's going to be more about visual data. I mean, I have so many questions, to be honest. I want to understand what the difference is between these kinds of data. What does it mean to have synthetic data for that, compared to having synthetic data for tabular data? Understand what it means to simulate this, what the types of labels are that you use,
Starting point is 00:02:33 and why, and all that stuff. And I also want to see what the relationship is between what a company like Parallel Domain is doing and things like 3D graphics and computer games. Why? You know, all these things, I mean, since we had the first computer, we were trying to simulate reality in a way, right? So there's a lot of overlap between so many different domains. And I'd love to hear from him how much of an overlap we have there, and what that overlap means in terms of moving knowledge from one domain to the other, right? And what does this mean for the future of all these very interesting technologies? So let's dive in and let's have this conversation with him. Let's do it.
Starting point is 00:03:29 Omar, welcome to the Data Stack Show. So great to have you. Eric, pleasure to be here. Thank you for hosting me. All right, you have a rich background in all sorts of machine learning and really sort of interesting AI use cases. So give us your journey.
Starting point is 00:03:48 How did you get into data? And then what led you to Parallel Domain? Awesome. So I started playing with data when I was in college, when I was interning at a major software company in Egypt, where I'm from originally. And I was applying for a web dev kind of, you know, role for the internship. And then I met one of my best friends now, who inspired me about, you know, business intelligence, data warehousing, you know,
Starting point is 00:04:16 data engineering, stuff like that. And my journey began then. That was like, you know, what, 13 years ago, 14 years, something like that. So I started with business intelligence and data warehousing, then I moved to data mining. Once I graduated, I co-founded two technical startups that used machine learning heavily for personalized recommendations. We were building some sort of, like, the Yelp of Egypt, you know, social reviews and stuff like that. And then from there I started assuming, you know, director slash VP roles in multiple companies for advanced analytics and machine learning, right? So I worked for some time in Dubai, and then I moved to the United States to build a machine learning team in a company called Esri, the world leader in geospatial analytics. And it was super
Starting point is 00:05:01 fun. We put together a team of, you know, AI experts and data scientists, and I've been playing with data throughout my life. So yeah, about 13, 14 years of playing with data, mostly machine learning related, working with customers in different places and building products around that. Awesome. Do you remember, now this is rewinding in history a bit, but do you remember that friend who sort of got you into business intelligence, using databases to do analytics? Do you remember the first thing he showed you that made you think like, oh, wow, I have to dig into this? He showed me a dashboard. And yes, actually, it's funny, because it's the first time I've thought about that. So thank you for reminding me of this beautiful moment.
Starting point is 00:06:06 I literally remember that moment when I was sitting, you know, at his desk and he was sharing the news with me: hey, Omar, we don't really have openings for, like, web dev, but we have a super interesting thing on the side that I'm doing that I'd love to have you work on, BI and data warehousing. And he started showing me, you know, this beautiful dashboard. I think it was some sort of Microsoft technology or something, you know, and then he started talking about, you know, the process behind that dashboard, what they were doing to clean the data and store it in data warehouses, etc.
Starting point is 00:06:34 So yeah, that was the moment. Interesting. Maybe Power BI or something. Did it feel like you were interested in web dev? Did it actually feel a little bit contiguous, because you write code on the back end and then you're displaying it or giving some sort of visual experience? It was an interesting moment, because I had spent almost a year preparing myself to become an expert web dev, you know. I took a course, I read a lot of things, I built different systems for my
Starting point is 00:07:06 gym, for example, you know, the one I was going to, etc. But the twist of having a look at the world of data and analytics and, you know, data warehouses at that point was interesting, because I started seeing the kind of business value that this can drive. Because you can build web applications that people can use, and this is going to generate data, but what are you going to do with it after that? Unless you do something to make that data accessible, fresh, clean, useful, it's not going to be of much use. And that's the art and science of data analytics, which got me eventually into data mining and stuff. So I think the pivot was nice. Yeah. Describe for us a project that you were working on where you needed to train a model, and it was
Starting point is 00:08:07 just so painful because you didn't have the raw material that you needed. Yeah, that's easy, because I can reference literally, like, 30, 40 projects that didn't start. Every project? No, but literally, if I just take the last five or six years of my life here in the States, working as a director of AI at Esri, I think at least 50% of the computer vision projects. So we were using satellite imagery and drone imagery, for example, to help our customers extract intelligence out of that, right? Everything from detecting building footprints, to assessing damage after hurricanes, to working with insurance to quantify the impact of storms on houses, to working with agriculture on understanding crop health and assessing crop growth and stuff like that. All of these projects were computer vision related, and we needed high-quality labeled data to train the deep learning
Starting point is 00:09:06 models to make these detections. And unfortunately, at least half of those projects didn't start because we didn't have that good-quality labeled data. So customers, for example, would have very little labeled data. And by labeled data, as you guys obviously know, I mean they have the imagery and they have the labels, which are like the polygons around the houses, or the labels for the crops, whether they're healthy, etc. So either they would have zero labeled data, or they would have some that's not enough. And even when we showed them how to label data themselves, like the tools, they didn't have the time, nor the expertise, nor the workflow to support it. Yeah. So unfortunately, at least 50% of these projects didn't start, in agriculture,
Starting point is 00:09:51 in national government, in insurance, you know, even in retail, because of that reason. And that's why I was so excited to, you know, join a company that's doing synthetic data, because I've been living in this pain for at least half of my career. Yeah. Okay. I have a dumb business question, though. So you're working with satellite imagery at a multi-billion dollar company. Is there not a revenue line to launch some satellites and start taking those images and create, almost vertically
Starting point is 00:10:26 integrate the data that you need? I don't know anything about the regulations of launching a satellite into space, but it's actually interesting to me: can you not just take more pictures from a satellite? I mean, I'm sure you can't, but that's interesting to me. Well, actually, here's the thing. So the good news is that we are in no shortage of satellite pictures, right? Like we have a ton of satellite imagery already from the existing providers. It's just the fact that we need to label those images, right? So it's a labeling issue, not an image issue. Exactly. I mean, yeah, there is definitely a need to have, you know, high-quality and recent images and stuff like that. But the good thing is that there are a lot
Starting point is 00:11:10 of companies out there already doing that. And, you know, you can contract them, you can work with them, you can buy data from them, etc. The thing is, do you have the needed labeled data on top of this imagery to do the workflows that you want? That's usually the bottom line, right? And there are a lot of companies out there, you know, providing labeling services, etc. It's just, when you have a customer interested in doing something and they don't have that data right now, that's usually the problem, right? Sure. Yeah, that makes total sense. Okay, so tell us about synthetic data. When did you first experience it, actually?
Starting point is 00:11:46 I'm just going to ask you about all these pivotal moments throughout the last decade and a half. When did you first experience synthetic data? And what was the light bulb that went off where you said, wow, I think this is actually the solution? Totally. So I used to hear about synthetic data early on, like, I think, what, four or five years ago. There were different attempts to use GANs, you know, generative adversarial networks, to come up with synthetic images and use those as input training data for, let's say, geospatial work, right? So there were a bunch of companies out there doing these kinds of experiments:
Starting point is 00:12:22 you know, hey, we're trying to detect this kind of vehicle, for example, from satellite imagery, but we don't have a lot of data, so let's go generate some synthetic data using GANs. The quality wasn't great, to be honest. The results weren't great. So, yeah, you know, I heard about it, but didn't read much into it. I think the pivotal moment, and I remember that moment, was when I was, you know, doing my daily exercise on the treadmill, and I was watching the Tesla AI Day, the first one, where they showed for the first time their simulation and synthetic data engines. And it was amazing, because they showcased really impressive, you know, technology that can simulate with high realism different locations, different weather conditions, and different edge cases, long-tail scenarios, where you want to
Starting point is 00:13:17 train those self-driving cars on, you know, like, you know, pedestrians jaywalking, for example, or cars or objects partially occluded, etc. And it was pretty realistic. And they spoke about the different components that go into this kind of synthetic data engine, the simulation, the rendering, the graphics, the weather parameters, etc. So that was the first time that I saw that, and it was pretty impressive. And I thought, wow, this is much needed in many different domains, not just self-driving cars. Yeah, absolutely.
Starting point is 00:13:52 Well, let's step back just a minute. Do you have, so synthetic data makes sense, right? You're sort of creating something that isn't naturally occurring, right? Like we hear, you know, synthetic drugs or other things like that, where it's like, okay, you're combining different things to make something that doesn't necessarily naturally occur in the real world. It makes sense that synthetic data follows along those same lines, but how would you define synthetic data? Do you have sort of a top level definition of like
Starting point is 00:14:26 what demarcates synthetic data or makes it categorically synthetic data as opposed to, you know, a training data set or something else? Absolutely. So I think at its heart, it's any data that is generated artificially by computers instead of being generated or captured in real life, right? So that's the bottom line, right? And there are different ways to generate this data. You can use classical ways like procedural generation, for example, game engines, et cetera, or you can use AI-based methods like GANs or stable diffusion or self-supervised learning, any kind of techniques that we hear about today. So there are different ways to generate this data,
Starting point is 00:15:11 but the bottom line is it's data that's artificially generated by computers, trying to be as realistic and as similar as possible to the real-world data, instead of being explicitly captured in real life, right? So a car with a camera collecting data out there, that is real-life data. A form collecting, you know, input data that users are entering, that is real-world data. Now, you start using statistics to generate tabular data that looks similar to that? Synthetic. You start using graphics and procedural generation and other AI-based methods to generate good-looking images
Starting point is 00:15:49 that could be used for training machine learning models, that's synthetic. Now, one thing to mention, though, usually when synthetic data is mentioned, it's usually mentioned in a machine learning context as data suitable to train machine learning models. So, theoretically, you can generate synthetic data to do different things with it, right?
Starting point is 00:16:06 Yeah, totally. Now, that's what Parallel Domain does, right? So can you give us... So Parallel Domain is entirely in the world of synthetic data.
Starting point is 00:16:29 What does Parallel Domain do? And what specific type of synthetic data do you specialize in? What do you create? How do your customers use it? Absolutely. So I want to expand on this, but maybe just a little before that, I want to share something about the inspiration or the motivation for why this company was built and why this whole journey started. And I think we touched partially on that, right? I mean, here's the thing, getting back to the
Starting point is 00:16:53 example that I was saying about geospatial imagery not having labels: not having good labels is one problem with real-world data. In fact, there are a lot of other problems or challenges. One, it's way time-consuming, like the time you spend on collecting real-world data. Let's take an example. Say you're trying to get some data for a self-driving car, or a smart car trying to figure out its way and stuff like that, right? You go out and you start collecting a ton of data. And then you need to process this data. And then you need an army of people to label this data, because you need different kinds of labels, right? Like, you need bounding boxes, you need semantic
Starting point is 00:17:27 segmentation, you need depth, you know, it needs a lot of things. And while you're doing the labels, it usually needs a lot of quality assurance, right? So you have people labeling, you have people revising those labels, and then you have, you know, the need to understand the label specs: how exactly are you going to label this, etc.? So labeling is a problem. It's error prone, with a lot of faults and mistakes. I've experienced this. I worked with labeling teams, for example, and the definition of how we need to label something could vary widely from one person to another, not to mention from one company or team to another, right? Like when you say, for example, let's label bicycles: do you have a separate label for the bicycle itself and the
Starting point is 00:18:10 rider, or are they all in the same bounding box? So that's a label spec that you need to align on, right? And then, like, high impact. Exactly. So you spend a lot of time collecting data, you spend a lot of time labeling data, and you usually have a lot of errors in these labels. And sometimes some labels aren't that, you know, feasible to collect in the first place. Like, let's say, for example, estimating depth for cars, right? So any self-driving car needs to be able to accurately estimate depth from images, right? Some of them would use LiDAR to estimate the depth, but in addition to LiDAR, you also need to use images.
Starting point is 00:18:54 So how are you going to label depth? It's very hard for a person to actually label depth, but it's easy to do in synthetic data, because you have total control over that. How are you going to label for optical flow, which is understanding the motion of objects? It's very hard. You've got to keep track of the different frames of the images and say whether this moved or not. So some labels are almost virtually impossible to produce by hand.
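To make the depth point concrete: in a synthetic pipeline, the renderer already knows every object's 3D position, so a dense depth label is read out of the geometry rather than drawn by an annotator. Here is a minimal sketch, assuming a simple pinhole camera model; the camera parameters and points are illustrative, not any particular vendor's setup.

```python
import numpy as np

def project_points(points_cam, fx, fy, cx, cy):
    """Pinhole projection of 3D points (camera frame, meters) to pixels.

    Returns (N, 2) pixel coordinates plus the exact per-point depth,
    which in a synthetic world doubles as a ground-truth depth label.
    """
    z = points_cam[:, 2]
    u = fx * points_cam[:, 0] / z + cx
    v = fy * points_cam[:, 1] / z + cy
    return np.stack([u, v], axis=1), z

# Toy scene: three points 5-20 meters in front of a 1920x1080 camera.
pts = np.array([[0.0, 0.0, 5.0],
                [2.0, -1.0, 10.0],
                [-3.0, 0.5, 20.0]])
pixels, depth = project_points(pts, fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
print(pixels)
print(depth)  # exact depth labels, no human annotation involved
```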
Starting point is 00:19:16 And finally, the iteration speed. Let's say you come up with a new idea, you want to have another label, another object. So: it's slow, getting labels requires a lot of time, it's error prone, some labels are almost impossible to get in real life, and it's not so efficient for iteration. This is not to say that we should not be using real-life data. I think it's very important. But complementing it with something better that can solve all of these problems makes sense, and that's why Parallel Domain started, right? So the company started about five years ago. The founder, Kevin McNamara, actually had a background in, you know, graphics
Starting point is 00:20:03 and gaming you know built a lot of cool projects in major companies. He led simulation-related programs in Apple, for example, etc. And the idea is, hey, let's build something that can simulate real worlds and use this in different contexts. And it happened that one of the early customers wanted to use these virtual worlds for machine learning purposes, right? To train better perception models for, you know, self-driving cars. And that actually was the moment where the company actually focused big time on creating these virtual worlds, synthetic data to empower perception machine
Starting point is 00:20:42 learning teams. So what Parallel Domain does is we specialize in building and generating synthetic data to help AI perception teams develop better machine learning models. And by develop, I mean both training and testing, right? So we generate highly realistic, diverse, and labeled synthetic data for camera, LiDAR, and radar. And we're mostly focusing on outdoor scenarios, empowering, you know, perception teams working with self-driving cars, ADAS systems, you know, for smart features in cars, helping them with, you know, better parking and stuff like that. Delivery drones, outdoor robotics, you name it. We're working with the top names in the industry. We're working with, you know, Google, Toyota Research Institute, Continental,
Starting point is 00:21:28 Woven Planet, etc. And what we do is we work closely with those machine learning teams on helping them improve the performance of their machine learning models by providing synthetic data, especially for the cases where their models aren't performing greatly, right? So in many cases, this would be like, you know, the edge cases or the long tail of scenarios. Like think, for example, helping those cars better detect partially occluded objects. Let's say you have a kid, you know, partially occluded by a bus. You definitely don't want the car to miss that detection, right? And here's the question.
Starting point is 00:22:05 How many examples of that kid, in that case, can you have in real life as training data to better train your model, right? Like, you can barely have, you know, what, tens of these scenarios? You need millions, or hundreds of thousands at least, right? And that's what synthetic data excels at. So we provide highly realistic and labeled data for these edge cases and long-tail scenarios. You know, jaywalking, vulnerable road users standing on the side, partially occluded, parking scenarios, debris detection. We simulate a lot of debris scenarios, for example, etc. And we help customers get that data, train their models, and as a result, most of our customers get better performance detecting these edge cases, or improving on these edge cases.
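As an illustration of how one rare event gets multiplied into training data, here is a hypothetical scenario generator for the occluded-child case. The structure, map names, and parameter ranges are invented for this sketch; they are not Parallel Domain's actual API.

```python
import random

def occlusion_scenarios(n, seed=7):
    """Produce n randomized variations of a partially occluded pedestrian."""
    rng = random.Random(seed)
    return [{
        "map": rng.choice(["sf_mission", "suburban_loop", "highway_101"]),
        "weather": rng.choice(["clear", "rain", "fog"]),
        "time_of_day": rng.choice(["noon", "dusk", "night"]),
        "occluder": "bus",
        "target": {
            "class": "child_pedestrian",
            "occluded_fraction": rng.uniform(0.4, 0.9),
            "distance_m": rng.uniform(5.0, 40.0),
        },
    } for _ in range(n)]

# One rare real-world event, varied across 10,000 renders, instead of
# waiting years to observe a handful of examples on real roads.
batch = occlusion_scenarios(10_000)
print(batch[0])
```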
Starting point is 00:22:51 And that happens actually across the board. And yeah, so it's been five years in the business working with these customers. And throughout that journey, we have built a team of about 70-something people, in San Francisco and mostly in Vancouver, Canada. But we are a global company. We have people in Germany and different other parts of the world. So yeah, we're pretty excited about that. Amazing. Wow, I have so many questions, but I know Costas probably has many of them.
Starting point is 00:23:22 I said two, but just a couple more questions for me. The first thing that sticks out that's really interesting is, you know, there's that old phrase, you know, time doesn't slow down for any man. You know, like everyone gets the same number of hours in the day. But it's really interesting to think about, you know, a partially exposed kid, you know, in front of a bus, right? Time doesn't speed up to train an AI model either.
Starting point is 00:23:57 And it doesn't happen that often, but it changes a family, a community, a city when it does happen. You know, and that's really interesting to think about. Let me put on my hat as a parent, you know, and I know you're a parent too. Yep. And if you and I didn't work in the data industry, AI already seems a little bit mysterious. And then if you told me, well, we're training this model that makes the car brake, but we're using a lot of fake data to do this... Fake, maybe, is unfair, right? It's synthetic data. But if you told me that and I didn't work in the data industry, that might be a little worrisome to me, right? Where it's like, I don't quite understand AI. And then I find out that you want to make this vehicle brake in front of my child, but
Starting point is 00:24:46 you're using a bunch of synthetic data or generated data to do that. What would you say to that person who doesn't work in the data space to demystify this process of using synthetic data to make things safer? I think my immediate answer would be, hey, does it really lead to a safer self-driving in that case or not? If the answer is yes, then I would personally care less about what might have led to that, right?
Starting point is 00:25:18 Like whether it's synthetic or real. And that's the real benchmark here. The kind of tests that these companies are doing is just like tremendous. They do a ton of tests on real world data to measure the accuracy and safety of these cars, right? So for example, like most of them would run like daily regression tests on multiple scenarios to see if the accuracy of the car detecting different objects and pedestrians and kids and vehicles is increasing or decreasing.
Starting point is 00:25:48 and specifically edge cases that we know for a fact are so limited to collect in real life, then I would be happy here, because the bottom line, the end result, is that this car is becoming safer, or this robot or this drone is becoming smarter, etc. And I think this is one of the moments where, you know, the attention should mostly be on the outcome versus the mechanism. So, for example, there is a parallel, adjacent conversation happening about deep learning being a black box, right? Like we don't understand a ton
Starting point is 00:26:28 of what's happening inside deep neural networks, but in some applications like computer vision, the end result that this network can detect with high accuracy, like, you know, faces or objects or detect possible, for example,
Starting point is 00:26:41 diseases in, like, MRI scans or stuff like that, that is what matters, right? So, in summary, I think if it's really pushing the edge and improving the performance and making driving safer, there would be less concern over whether this data is real or fake. And the good thing is that it's easy to benchmark. It's easy to test when we test it on real-life data and see the performance. Yeah, for sure.
Starting point is 00:27:07 That's how you manage. It sounds like that's how you manage divergence, right? Like you control by actually doing, you know, you may speed up the process of learning, but that gives you actually a faster testing cycle to understand how it happens in the real world in testing. And so really, it's almost an accelerant. You're not removing the real world element. You're actually just shifting it to the testing component to understand how it's going to play out in the real world. Would you say that's fair? Exactly. And we still use it for training too. Like, you know, most of the successful experiments slash work that we do with customers, it's usually a combination between synthetic and real data, right? So usually it's much easier to
Starting point is 00:27:54 get larger amounts of good data in the synthetic world. So you would train your models, let's say, on synthetic data for a certain use case, and then you would fine-tune those models with real data that you have collected already, right? Let's say, hypothetically, you're using a million synthetic images, and you're fine-tuning with, like, you know, 30,000, 40,000 real-world images. That usually yields the best performance. So far, sometimes synthetic-only yields great performance, but mostly it's a combination of both. But as you mentioned, at the end of the day, when we test, it's mostly on real data, because that's where the cars or the drones or the robots are going to be interacting at the end of the day. Sometimes we use synthetic data for testing too. I can share with you some context on that later, but it's mostly testing on real data for now.
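Here is a minimal sketch of that recipe: pretrain on a large synthetic set, then fine-tune on a smaller real set at a lower learning rate. The tiny random tensors stand in for real image datasets, and the model is a toy classifier rather than a production perception network.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins: a "large" synthetic dataset and a "small" real one.
synthetic_ds = TensorDataset(torch.randn(256, 3, 64, 64),
                             torch.randint(0, 10, (256,)))
real_ds = TensorDataset(torch.randn(64, 3, 64, 64),
                        torch.randint(0, 10, (64,)))

# Toy perception model: conv feature extractor plus a linear head.
model = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(16, 10))

def run_epochs(loader, epochs, lr):
    opt = optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Stage 1: train on plentiful synthetic data.
run_epochs(DataLoader(synthetic_ds, batch_size=32, shuffle=True),
           epochs=2, lr=1e-3)
# Stage 2: fine-tune on scarcer real data with a smaller step size.
run_epochs(DataLoader(real_ds, batch_size=32, shuffle=True),
           epochs=1, lr=1e-4)
```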
Starting point is 00:28:38 Fascinating. Okay, Costas, I am enamored here, but I have to stop myself, and I'm actually disappointed that you didn't stop me before, because I've been going for so long. So please, jump in. It's okay, you can continue if you want. It's fine, it's fine. First of all, you were having an amazing conversation, so it was a great pleasure for me to listen to all that stuff. But I have a question based on this conversation, actually. And Omar, I'd like to ask you: you mentioned labels many times, right? So we have the data, and then the data needs to be labeled so we can go and do the training and all that stuff. Would you say that synthetic data, in the domain you are primarily
Starting point is 00:29:28 working with, at least, right? Like with, let's say, more visual information, like radar data and all that stuff. Would you say that synthetic data is the process of starting from the labels and generating the images, for example? Is this an accurate way of defining and describing, let's say,
Starting point is 00:29:51 synthetic data? Defining the labels and working with the labels is definitely a critical piece in the pipeline. It's not necessarily the starting point though, right? The conversation can definitely start from labels, right? Like in many cases, we have conversations with customers,
Starting point is 00:30:08 hey, I have a problem with that model that is trying to detect, let's say, the depth of objects. So we immediately understand that we're going to need to provide labels for depth, which is easy to do with synthetic data, right? But in reality, when you start generating synthetic data, there is a whole pipeline slash process that you go through, ranging from, you know, accurately mimicking the location, whether it's urban, suburban, or highway, for example. So
Starting point is 00:30:35 in Parallel Domain, for example, we support a lot of maps, a lot of locations. And then there is a piece for procedural generation, where you start using procedural generation to generate, you know, these buildings and structures and all that, and to generate the agents themselves, the pedestrians, the vehicles. So there is a lot of technology in place already that has been used throughout the previous years to generate this. There is another piece for rendering, where you start visually rendering those pieces of the puzzle. So you render the buildings, the grass, the objects, the pedestrians, the vehicles. And that's the graphical piece, where you want to make sure that these look as realistic as possible. And on top of those, there is the label piece.
Starting point is 00:31:20 How do you want to label this data? Do you want to go with only one type of label, like 2D objects, 2D bounding boxes, for example? Or do you want to do more? For example, depth, 3D bounding boxes, motion segmentation, or things that relate to motion, like optical flow. Sometimes it's instance segmentation, where you want to label only the boundaries of the object. So I think labels are pretty critical. They usually come on top of, or as part of, the whole pipeline, which includes location, procedural generation, and simulation.
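To make that label menu concrete, here is a hypothetical schema for a single synthetic frame; because the generator knows the full scene state, one image can ship with several label modalities at once. The field names are illustrative, not Parallel Domain's actual output format.

```python
from dataclasses import dataclass, field

@dataclass
class BBox2D:
    cls: str  # e.g. "pedestrian"; the label spec decides details such as
              # whether a bicycle and its rider share one box
    x: float
    y: float
    w: float
    h: float

@dataclass
class SyntheticFrame:
    image_path: str
    boxes_2d: list[BBox2D] = field(default_factory=list)
    boxes_3d: list[dict] = field(default_factory=list)  # 3D cuboids
    depth_map_path: str | None = None         # dense per-pixel depth
    optical_flow_path: str | None = None      # per-pixel motion vectors
    instance_mask_path: str | None = None     # instance segmentation

frame = SyntheticFrame(
    image_path="scene_0001/cam_front/000042.png",
    boxes_2d=[BBox2D("pedestrian", 812.0, 455.0, 60.0, 140.0)],
    depth_map_path="scene_0001/cam_front/depth/000042.npz",
)
print(frame.boxes_2d[0].cls)
```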
Starting point is 00:31:58 Simulation itself, actually, is very important. So we have whole teams working on how reality should be simulated, like how those agents should be interacting in real life. Because you can have some pretty images showing pedestrians, et cetera, but if you want to do sequential data, or you want to have realistic images, you want to make sure that the pedestrians, for example, in this scenario,
Starting point is 00:32:14 are behaving as pedestrians do in real life, the cars are behaving. You don't want to find cars, you know, driving on buildings, for example, right? So having accurate simulations, both for the agents in the world and for the sensors, matters. We have spent so much time
Starting point is 00:32:30 accurately mimicking a lot of the sensors that our customers are using: how do you simulate, you know, different types of cameras, different types of LiDAR sensors, different types of radar sensors? And on top of that, there are the labels. So as you can see, different pieces working together to
Starting point is 00:32:42 generate, you know, highly realistic data, because that's important if you want to use it in conjunction with real-world data. Yeah, that makes total sense. So let me dig into that a little bit more. I mean, I assume, okay, you have a modern car, even if it's not, let's say, a car with an autopilot or anything like that; it has a couple of different sensors, right? Things that can measure distances.
Starting point is 00:33:11 Some of them have cameras. Some of them are coming with LiDARs. So when we are talking about generating synthetic data, the first thing that someone can visualize is that we recreate a video, right? Or an image that has a scene inside, I don't know, let's say a car running a red light or something like that, right? But it's interesting what you say, because it's not just, let's say, that you are trying to simulate the environment. You are also trying to simulate the perception that the machine has, right? So tell us a little bit more
Starting point is 00:33:50 about that. Where are the boundaries, and where does the generation actually start? Is it, let's say, that you create the scene without considering the sensor, and then you also put in the sensor and the output of the sensor? How does this work when we're talking about this type of data? That's a great question. Actually, I want to answer this question, but I want to use something first that will lead to the answer, which is the domain gap. Okay. So the domain gap is a pretty common thing that almost every perception engineer who has been working with synthetic data knows about. This is the kind of gap between the real data
Starting point is 00:34:37 that the models are going to be operating against at the end of the day and the synthetic data that we could be using. This gap actually could happen for multiple reasons. It could be like on the visual aspect, like the data would look different. And it could look different for multiple reasons. Let's say the graphics, for example, engine of the synthetic data isn't of high quality. So it would look like graphics versus real-world data.
Starting point is 00:35:02 The lights might be different. The weather conditions might be different. The agent distribution. In real life, for example, in the area that you would like to launch these models, you might have high density of pedestrians and vehicles. So if you are training with synthetic data with low densities,
Starting point is 00:35:18 it's not going to be of much help, right? Because you would like to mimic the real world. So I guess what I'm trying to say is, the gap between synthetic and real is a big topic when it comes to synthetic data. And closing that gap, bridging that gap, shrinking it, is usually a major thing. You try as much as possible to make it minimal, because it does impact the model performance.
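One toy way to put a number on that gap: fit Gaussians to feature embeddings of real and synthetic images and compute the Fréchet distance between them, the idea behind the FID metric. The random features below stand in for embeddings from a pretrained vision backbone; real evaluations would also track per-class detection metrics.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussian fits of two feature sets."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # trace of (cov_a @ cov_b)^(1/2) via the eigenvalues of the product
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    covmean_trace = np.sqrt(np.abs(eigvals)).sum()
    return float(((mu_a - mu_b) ** 2).sum()
                 + np.trace(cov_a) + np.trace(cov_b) - 2 * covmean_trace)

real_feats = np.random.default_rng(0).normal(0.0, 1.0, (500, 64))
synth_feats = np.random.default_rng(1).normal(0.3, 1.1, (500, 64))
print(frechet_distance(real_feats, synth_feats))  # smaller = narrower gap
```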
Starting point is 00:35:58 So, in summary: we simulate the world itself, and we then simulate the sensor, like where it is placed, what kind of sensor it is, the angle of view, and all of those things, including where it is placed on the vehicle. And the output of that is that you have a realistically simulated world, a realistically simulated vehicle, and the placement of that sensor. And we simulate the intrinsics of the sensor itself too, like, you know, the specifics of what goes on inside that sensor. So you are simulating all these elements together, and the output is highly realistic images or scenes captured from that very sensor, in that very location, within that world, right? So let's go into some detail. First piece: you simulate the actual world. This goes back to the pipeline thing that I was discussing, right? So first of all, we start with the location. Is it urban, suburban, or highway? We have different maps for different locations. For example, a lot of our customers test these vehicles in the Bay Area. So we have maps for San Francisco, for all the streets and stuff like that. So we can mix and match some urban and
Starting point is 00:36:54 highway scenes, for example, there. And then you start adding the different pieces of the pipeline on top: you know, the simulation, to nicely simulate the agents; the rendering, to visually render those agents and structures, etc. So that's the world build, if you think about it. You build an actual simulated world. And within those worlds, you have the vehicles. One of those vehicles is the ego vehicle, which is the vehicle that the model is going to be deployed in. Think of it as the ego view of that vehicle driving, right? And then we have different models of vehicles, depending on the dimensions of the vehicle, etc. And then one thing our customers can configure is where they would like to place the sensor. What is the actual sensor placement? Is it at the front, at the back, at the rear? Where exactly,
Starting point is 00:37:43 what's the location? So that's usually something that we do. And once you do that, we also simulate the intrinsics of the sensor, like what model exactly, how it works from the inside, etc. And we start simulating the kinds of scenes that would be captured from that sensor. So in that case, it would be as close as possible to the actual data captured by the real sensors in the real world. Because we simulated almost everything in that pipeline, the world, the sensor, the placement, the location, the vehicle, and that results in data that looks similar to how the real-world data would be captured from that sensor. So, for example, sometimes our customers have a
Starting point is 00:38:18 fisheye sensor, which makes the image warp in some way. It has some advantages for machine learning applications: you can capture wider angles and detect more things. Some others have normal camera sensors that are not that wide. We take that into consideration to generate realistic images. It's a long answer. Sorry for that.
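A rough sketch of why those intrinsics matter: the same viewing angle lands at very different image positions under a pinhole model versus an equidistant fisheye model. The focal length and the choice of distortion model here are illustrative assumptions.

```python
import numpy as np

def pinhole_radius(theta, f):
    # Pinhole: r = f * tan(theta); diverges as theta approaches 90 degrees.
    return f * np.tan(theta)

def fisheye_radius(theta, f):
    # Equidistant fisheye: r = f * theta; keeps wide angles on the sensor.
    return f * theta

f = 700.0  # assumed focal length, in pixels
for deg in (10, 45, 80):
    t = np.radians(deg)
    print(f"{deg:>3} deg: pinhole r = {pinhole_radius(t, f):8.1f} px, "
          f"fisheye r = {fisheye_radius(t, f):7.1f} px")
# At 80 degrees the pinhole radius (~3970 px) falls off a typical sensor,
# while the fisheye (~977 px) still captures it: the wider view mentioned
# above, at the cost of the characteristic warp.
```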
Starting point is 00:38:38 No worries. It's a very fascinating topic, to be honest, because, I don't know, I think it's one of these things that is easy to visualize, because it is visual, right? Like, we walk outside and we take it for granted
Starting point is 00:38:53 that we recognize everything, like we can understand depth, you know, we can do all these things. And okay, we generate all this synthetic data to go and train the models to be able to approximate what a human does. But there's so much information in these scenes, right?
Starting point is 00:39:16 And it is kind of amazing to think how complicated it is, how much information has to be managed there, and how you do that in order to approximate reality at the end, right? And something related to that: you mentioned, let's take San Francisco, right? It is, I'm sure, probably one of the most, I don't know, well-documented cities in the world in terms of data for these kinds of scenarios. So this simulation that you are creating, how realistic is it? Like, if I as a human were watching a video of this, right? Like, on my laptop, how realistic would this look to me?
Starting point is 00:40:06 We can play the game, actually. I can show you some images, real and synthetic from Parallel Domain, and you would judge. My guess is that at least 50% of those you will not be able to tell apart. Okay. Because, and this is not trying to brag or anything, but we have a lot of dedicated teams and people just trying to, you know, perfectly simulate, you know, lighting and optics,
Starting point is 00:40:30 for example. You know, how this would visually look with different kinds of lights, whether it's natural light or artificial light, you know.
Starting point is 00:40:39 People, artists, for example, working on how different agents and animals and cars would look, et cetera. And then there is the behavior of these agents too. So I think, in summary, most of the images that we generate would look as highly realistic as possible. Sometimes, when you
Starting point is 00:41:00 look super close, especially with pedestrians and such, you might guess that this is similar to games, synthetic, not real. But I would say in many cases you would not be able to differentiate between the two. And the requirement is generally to be as highly realistic as possible. And what goes into that is, you know, behavior simulation and accurate visual rendering, and trying to bring in the impact of things like light and weather, and nicely simulating rain and stuff like that. Okay, that's super interesting. So, first of all, there are other domains out there that are trying to do similar things for different reasons, right?
Starting point is 00:41:46 Like you have computer games, for example, that's one thing, right? Then you have CGI in movies, more on the entertainment side of things. Or, like, VR, right? Okay, let's say even if you could generate such a detailed representation of reality, probably the problem with VR is the fidelity of the hardware right now. But outside of this,
Starting point is 00:42:14 what's the overlap between the things that you are doing and these domains? How many, let's say, best practices or techniques or whatever are coming from there? And how can what you are doing give back to them, right? Yep, that's a great question. Actually, I want to start by saying that I think about 20% or something, if not more, of the engineers working at Parallel Domain, or the teams in general, are coming from a gaming background.
Starting point is 00:42:46 So they've been working in building games and simulation engines, etc. And I think there is a huge benefit that we get from these established industries because they have already been building technologies, whether it's for simulation or rendering or visualization or graphics, CGI, procedural generation, etc., that we can leverage to build these simulated worlds. Because if you think about it, we're trying to do something similar to what they're doing, maybe for a different purpose, which is enhancing machine learning models, in our case, versus entertainment.
Starting point is 00:43:17 So I think we are standing on the shoulders of giants, because we're pretty much using all the advancements in that capacity. On the other side, we do two things. One, we bring in the advancements in content generation using generative AI, which is, as you know, exploding these days. So we are using, you know, different generative AI techniques to scale content generation. So instead of manually building and crafting these assets, et cetera, there is a way that you can, for example, ask generative AI to
Starting point is 00:43:51 come up with a hundred different variations of road debris, with the right prompts, that you don't need to create manually. So techniques like Stable Diffusion, for example, are empowering a lot of that, and we are using a lot of that. We're using this to create different pieces of content inside our virtually simulated worlds at a scale that would not be possible with traditional means. So bringing both worlds together, the advancements in the gaming industry and simulation and CGI, etc., along with the advancements in AI, making both work together, in my opinion,
Starting point is 00:44:27 would lead to the best results. And that's what we're seeing inside. We are actually generating different content using generative AI techniques like Stable Diffusion. And that generated content has resulted in improving a lot of the models' performance, right? So for detecting some objects, actually, the detection accuracy increased after the model was trained with content generated from both the existing engines that we have
Starting point is 00:44:55 and, you know, generative AI techniques. Regarding your question of whether this could go back to the gaming industry: I think absolutely, yes. Maybe we are not actively doing this as a company today, because our major focus is working with machine learning teams. But I cannot see a reason why the advancements that we're building would not feed back into the gaming industry. I think it can go both ways.
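As a hedged sketch of that kind of asset generation, here is how one might drive an off-the-shelf diffusion model through the Hugging Face diffusers library to produce debris variations. The model ID and prompts are common public choices, not necessarily what Parallel Domain uses, and this assumes a CUDA GPU with diffusers and torch installed.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

prompts = [
    f"photo of {item} lying on an asphalt road, overcast light, high detail"
    for item in ("a shredded truck tire", "a wooden pallet", "a rusty muffler")
]
for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"debris_{i:03}.png")  # candidate assets for the simulated scene
```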
Starting point is 00:45:16 Okay, that's awesome. All right, one last question from me, and then I'll give the microphone back to Eric. So, okay, we talked primarily about synthetic data in what you called the... It has to do with perception, so it's more like visual, let's say, in a way. But data can be many different things, right?
Starting point is 00:45:40 We can have textual data, obviously. We can have, let's say, tabular data. We can have structured, unstructured, blah, blah, blah. So many different types of data, right? What do you see happening when it comes to bringing all this synthetic data technology to other data domains? Do you see opportunities there? Do you see a need there?
Starting point is 00:46:03 Tell us a little bit more about that. How can we generalize what you are doing to other types of data, too? Absolutely. I think synthetic data is pretty much required across the board, to be honest. I think any domain slash vertical where it's hard to get
Starting point is 00:46:20 high-quality labeled data to train machine learning models is pretty much going to require it. So let me give you some examples. In other careers, I worked in health, and getting access to health data on the individual level, for example, is super hard, right? If not impossible, due to privacy concerns that are very well understood. In that case, I think synthetic tabular data would make perfect sense for health-related scenarios where you're trying to build machine learning models that would detect fraud, waste, and abuse, for example. So fraud, waste, and abuse happens and costs the US alone, if I remember the number correctly, more than 100 billion dollars annually, right? And there is a lot of demand to apply analytics and machine
Starting point is 00:47:02 learning to detect fraud, waste, and abuse. And doing that with real-life data is not so feasible, because you have a ton of researchers across the country who can do innovative work, but they might not be able to access this data for different reasons. And giving them highly realistic synthetic data that mimics the real world would enable them to build these kinds of fraud detection algorithms. The same thing happens with financial data and transactions, right? If you want to build fraud detection systems, if you want to enable researchers to do that, you can't give them access to, like, private,
Starting point is 00:47:36 personal data for financial transactions. So, same thing. I can think of many other cases, but also with unstructured data; it's not just computer vision. Let's say with speech, right? So I think if you want to build models that would be great at transcribing audio to text, you need a lot of labeled data. And how can you get that? Not a lot of companies have that. So having highly realistic and labeled audio data would help with that. I can also think of other domains like, you know, text.
Starting point is 00:48:07 I mean, we are seeing the explosion that's happening with GPT, et cetera. And it's amazing because it's using self-supervised techniques, right? So it's trying to predict the next word, and the label is already included somehow. So it does not need explicit labels necessarily to do these tasks. But I think for tabular data, things like financial transactions, health, insurance, same thing with audio data, I think these are all examples of domains that would benefit from synthetic data.
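A toy illustration of tabular synthetic data in that spirit: fit simple statistics to a sensitive column, then sample new rows that mimic its distribution without exposing any real record. The "real" data here is itself fabricated for the example, and production systems use far stronger methods (copulas, GANs, differential-privacy guarantees).

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for a private column of transaction amounts.
real_amounts = rng.lognormal(mean=3.5, sigma=1.0, size=10_000)

# "Fit" the generator: log-normal parameters estimated from the column.
log_a = np.log(real_amounts)
mu_hat, sigma_hat = log_a.mean(), log_a.std()

# Sample synthetic rows from the fitted distribution.
synthetic_amounts = rng.lognormal(mean=mu_hat, sigma=sigma_hat, size=10_000)

# Similar aggregates, but no synthetic row maps to any individual's record.
print(real_amounts.mean(), synthetic_amounts.mean())
```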
Starting point is 00:48:37 That's awesome. Thank you so much. Eric, it's all yours again. All right. Well, we are at the buzzer here. But Omar, question for you. And I hope you're okay with this, but I want to just dip our toe into the water of the ethical question as it relates to AI.
Starting point is 00:49:00 And I think specifically, you deal with a very interesting component of this in that you have a lot of experience with, sort of, labeling. And traditionally, labeling has been handled by a large human workforce. But companies like Parallel Domain can create synthetic data and automatically label it, right? And so there are lots of different opinions about this. And we're not a show that has expertise in economics or anthropology or politics, but maybe you were employing people who labeled a bunch of data, and now Parallel Domain is doing that. How do you see that playing out? Because there are a lot of people who think this will create new career opportunities. Some people
Starting point is 00:49:48 are worried. Can you help our listeners and us think through the impact of that? Because you're really kind of on the bleeding edge of this and probably have already seen some of it, you know, within the realm of labeling even. Absolutely. In short, we're still definitely going to require labelers to help with improving machine learning models and making them safer. It's just that the nature of the work could differ a bit, right? So instead of spending a ton of time, for example, drawing bounding boxes around objects, or doing sophisticated semantic or instance masks on objects, which takes a lot of time, these efforts are going to be redirected into, let's say, higher-level forms of quality assurance, for example, right? So, for example, when you start
Starting point is 00:50:40 training those models and running them in the real world, you come up with, you know, different degrees of performance, right? So having someone to understand where these detections are missed, and where we need to improve those models, is still going to require some level of supervision and quality, right? So that's one thing. Another thing is, you know, providing some sort of quality assurance and monitoring for the synthetic data itself. So even with highly stable synthetic data engines, when you generate the data, sometimes there are problems. Sometimes there are things that you don't want to be there, or that you would like to modify. So this is still going to require some level of supervision. So you see where I'm going?
Starting point is 00:51:23 I think we still have a lot of tasks, a lot of need to employ this expertise, maybe in a place that will lead to better results, or faster results, or safer, you know, autonomy. And this is similar to any kind of technological advancement: you start to have some jobs, you know, shifting to take on
Starting point is 00:51:46 different shapes and structures. Sure. So that's one thing. The other thing: I think we're still going to require real-world labels anyway. Like, I don't see, in the near future at least, synthetic data totally replacing real-world data, like 100%. I think we're still going to need some level of real data, whether this is like 10% or 30%. We can debate that, or even better, we can find out. So I think we're still going to require it. So yeah, that's my answer. Yeah, I love it. Well, I mean, the real world changes, right?
Starting point is 00:52:17 I mean, we don't live in a static world. And so managing the dynamic nature of reality, I agree, certainly requires a human interface. Well, Omar, this has been absolutely fascinating. The time flew by. We have many more questions. So come back and join us. And thank you for the time that you've already given us. Thank you so much for hosting me. I really enjoyed the discussion. And thank you for reminding me of the beautiful moments
Starting point is 00:52:50 of my early data career, actually. That's a nice starting point. Thank you for sharing those with us. That was special. Absolutely. Thanks, Eric. Thanks, Costas. Very nice to meet you.
Starting point is 00:53:01 Wow, Costas, what a conversation with Omar from Parallel Domain. I mean, it's almost impossible not to have fun if you're talking about synthetic data to train the AI models that do the learning for self-driving cars and other imagery use cases. I mean, just unbelievable. So I loved it. I think one of the biggest things that stuck out to me was, you know, when you scan Hacker News, when you see news headlines, especially related to ChatGPT and all these other AI technologies, they tend to be extreme. AI is taking away jobs, or we need to stop AI, we need to slow down, we have all these people who signed a letter. Omar was so balanced. It almost felt personal to him, where
Starting point is 00:54:10 he felt like he could stop a self-driving vehicle from hitting someone in the street because he can provide better data. It felt very personal. It felt very practical. There was no mystery. And he had a level of confidence that, I think, really, to me, invalidated a lot of the headlines that we see. I mean, he's really doing work on the front lines with, you know, companies who are building cars that are trying to drive themselves, you know, and wow, was that encouraging to me. I think my big takeaway is that he's balanced. He's brilliant, obviously. But he's very confident that this is a way to make things safer,
Starting point is 00:54:53 even using synthetic data. We shouldn't be scared. We actually should be excited. And that was a big takeaway for me. Yeah, 100%. I think a big difference between what Omar is doing and what is happening with things like ChatGPT is that in Omar's case the use case is very explicit, right? We understand exactly what the output of improving
Starting point is 00:55:20 these models is, right? It's more safety. We are not going to have so many accidents. We are not going to be worrying about whether a child is going to be hit by a car, and all that stuff. And by the way, it's not just the self-driving cars. As he said, many of these models are used today as part of other sensors that cars have, you know, the regular cars that we all drive, to make sure that if an accident is going to happen,
Starting point is 00:55:53 the car can help you react faster and stuff like that, right? So it's not just the extreme of having a fully autonomous car, right? And on the other hand, you have ChatGPT, which is a very impressive and very accessible technology, right? Everyone can go and type on this thing. You don't easily go and get a fully autonomous car and see it moving, right? But people, at the same time,
Starting point is 00:56:21 they see something big, something great there, but they don't necessarily understand what the impact is going to be. We are still trying to figure this out, right? And that generates fear. So I think that's one of the reasons you see such a diverse, let's say, reaction to the different technologies. And I mean, at the same time, I agree with Omar: at the end, we are going to see something similar happening with all the ChatGPT-like kinds of technologies.
Starting point is 00:56:45 That's one thing. The other thing that I'm going to keep from this conversation is that there's so much overlap between the state of the art in Omar's domain and things like virtual reality, CGI, games, all these things. And I can't wait to see how these industries are going to be using whatever is happening in synthetic data today, and generative AI, to create even more innovation. So at least we are going to have fun. That's the feeling I get. At the minimum, we're going to have fun.
Starting point is 00:57:30 Yeah. And I think people like Omar are the right people to be working on this because of their disposition and value system. So thank you for joining us again on the Data Stack Show. Many great episodes coming up. If you liked this one, just wait, we have so many more good ones coming out soon. If you haven't subscribed, subscribe on Spotify, Apple Podcasts,
Starting point is 00:58:02 whatever your favorite network is, and tell a friend if it's valuable to you. And we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
Starting point is 00:58:28 The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
