The Data Stack Show - 128: The Possibilities Are Endless for Synthetic Data with Alex Watson of Gretel.ai
Episode Date: March 1, 2023

Highlights from this week’s conversation include:
Alex’s background working for NSA and starting a company (1:51)
The Gretel.ai journey (9:30)
Defining synthetic data (13:26)
The evolution of AI in deep learning data and language learning (16:28)
The properties of synthetic data (21:31)
Boundaries between synthetic data and prediction models (25:52)
The developer experience in Gretel.ai (36:44)
Stewardship and expansion of deep learning models in the future (45:36)
Final thoughts and takeaways (52:17)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
The Data Stack Show is brought to you by Rudderstack, who's doing something pretty cool.
March is Data Transformation Month at Rudderstack, and they're running a competition with a $1,000
cash prize for data engineers. Tune into the show next week and follow Rudderstack on Twitter to get
deets as soon as they drop. Welcome back to the Data Stack Show. Today,
we are going to talk with Alex, the Chief Product Officer at Gretel, gretel.ai. And we actually have
been talking about them for a while, Costas. It's a really interesting company. They do a number of
things, but the primary thing they talk about on their website is synthetic data.
And today we're going to talk all about machine learning models and training synthetic data
on real data and all the interesting use cases.
So this sounds basic, but I want to ask Alex what their definition of synthetic data is.
I mean, you can create synthetic data, you know, in a spreadsheet, in Excel, right?
But their flavor of synthetic data is, you know, pretty specific and I think really powerful.
So what I'm going to ask is for him to define it in Gretel terms, if you will. Yeah, and I want to get a little bit deeper into what it means to generate
synthetic data from a real data set.
Like, how we can reason about the accuracy.
Like, what kind of, let's say, characteristics
of the original data sets we want to recreate.
So that's definitely something that I'm super curious to learn more about.
And I think we'd have the right person to do that today.
So let's go and do it.
Let's do it.
Alex, welcome to the Data Stack Show.
Thanks, Eric.
We've wanted to talk about, actually, we've talked about Gretel, Costas, for some time, and synthetic data.
So just super excited to actually cover that topic on the show.
Been on the list for a while.
Let's start where we always do.
Give us your background and what led you to Gretel.
Yeah, sure.
I'll give the two minute version of it here.
I started my academic career
in computer science, moved out to the East Coast right after September 11. And I actually joined
the NSA. I was working there for about seven years. Awesome experience. I got to dive in on
early applications of machine learning and also security, which has influenced my career quite a bit over
the years since then.
2013-ish, I moved out to San Diego where I am now.
I started my first company.
It's a company called Harvest AI.
We were helping large companies that were starting at the time to transition to using
SaaS applications like Google Suite, Office 365, Salesforce, AWS, things like that,
helping them to identify where important data was inside their environment and protect it.
So really cool experience there.
We built that for about two years.
At that point, we went out for a Series A raise.
Had some interest around acquisition and actually got acquired by AWS. And I went on to spend the next four years of my career at AWS as a general manager, launching the first security service for AWS, which was our product at Harvest called Macie.
It's a service that customers use today within the AWS world to identify and protect important data in the cloud.
And, happy to kind of dive in on, you know, that process here, but I think it was through
both the incredible access that we had inside the walls to data at AWS, and then also
talking to customers and realizing how difficult it was for them to enable access to the sometimes
really incredible data sets that they had, to enable decisions inside of their
business, that led to the initial pieces that we have
with Gretel and synthetic data today.
Very cool.
So much to dive into there.
Can I ask one question about working for the NSA?
Because, you know, like working in sort of
intelligence type stuff for the government,
I think a lot of times, probably because of Hollywood,
you have like two
views of it. It's either, like, extremely advanced and very scary Big Brother, or it's like, well,
it's the government and they move slow, and so maybe the technology isn't quite as good.
Where on that spectrum was the actual experience of working with the NSA, if you can tell us?
The answer there is both. My first job
was programming Cray supercomputers, actually, when I started. So I got a chance to work with
cutting-edge, multi-million dollar machines, which was really cool. So, you know, the scale at
which they were working, and also the caliber of the people there, are almost unlike any other
place I've ever worked.
Wow.
Also really incredible.
Also, it's the government.
So things don't move quite as quickly as you would hope, but incredible experience there.
Yeah, yeah.
Very cool.
Thanks for indulging me.
Okay, let's talk about Harvest. So you were going out for a Series A and then you get, you know, sort of ingested into, you know, a company that, you know, provides, you know, maybe more data infrastructure than any other company in the world, right?
What was that like? Because at Macie you worked with data at a huge scale.
So can you just talk about that experience a little bit and maybe, you know, especially as Macie sort of grew to be a, you know, a very
widely used product. What were some of the lessons that you learned working at AWS scale providing
a data service via AWS? Yeah. Yeah. You know, with scale, I think one of the things you learn
really fast is that the details matter. So that's one thing that really stands out. Those things that
are okay to let slide when you only have a couple of customers, even, you know,
large-scale customers, become really big issues when
you have thousands of customers. And that's really what we needed to prepare for. I think on
kind of cool experiences or things that I learned, even during my time there, I think
dealing with the scale, like how do we do natural language processing NLP
at the scale of terabytes or petabytes of data
that customers have in the cloud was really fascinating.
I also think the experience of taking, you know,
at the time, a single tenant software that we'd written
that would run inside a VPC per customer
to a multi-tenant that needed to support, you know, thousands to tens of thousands of customers in the first month was quite the
experience and had some, we had some, some pretty cool learnings during that process.
One of the stories, maybe just to cover it really quickly that like really stood out to me
and just kind of helped shape how I think about building software today was, as most people know, at AWS everything revolves around re:Invent and the New York Summit, those
two launches, right? And those are the two times that you launch a service. And we were hitting the
ground running and having really good traction, I think, with a couple customers, and we
were getting ready to launch Macie to the world, the fully multi-tenant version of Macie.
And one of the kind of challenges that we ran into
was we had not enough time to finish,
completely finish multi-tenancy before we launched.
So our choice was either delay six months
and launch it at reInvent
or launch at New York Summit.
We really wanted to launch, you know,
and how could we get there?
What could we do?
And one of our product managers had a really kind of ingenious idea and said,
what if we launched the whole backend as multi-tenant? We launched the front end as
single tenant. So what that meant is that each customer would have their own unique
box in the cloud that would be running our complete user interface stack.
And since it's AWS, it's never just one box.
You have three regions per zone.
Oh, yeah.
Sorry, you have three zones per region.
So you have high availability within there.
So for each region, we need to have three boxes per customer.
We forecasted we would have about 6,000 customers at launch, in that ballpark.
You know, somewhere around there.
So that meant that for us to launch on time, we needed to run 18,000 virtual machines.
To power the user interface for these customers that might sign up.
So incredible experience.
It was wild.
We almost broke CloudFormation doing a deployment.
I'm sure it can handle it quite easily now,
but at the time that was pretty new. And we forecasted that if we could finish the multi-tenant version of the UI within 45 days and shut it down, that we would actually have a pretty conservative
amount of cost for running all those user interfaces. So that was one of my more wild
experiences was exactly 45 days into the launch,
we were able to turn 18,000 machines into nine machines.
I'm sure a data center kind of collectively
cooled down at that point,
but that was a neat experience and it went without a hitch.
So it's one of those things like just taking a step back
and asking how you can do something
when you're trying to hit a deadline
or do something like that.
And being data-driven in the decisions you made,
we felt like we could get there and we did.
That was a really cool experience.
Wow, what a story.
What a story.
That is so great.
There's a lot of stress in there.
It sounds like it's a problem starting now.
I can only imagine.
It's the pendulum swinging between like,
this is going to be awesome, we can pull it off,
and are we completely crazy?
Totally.
Well, tell us about Gretel.
When did you decide to start it?
And then give us an overview of the problems that you saw.
So we started Gretel with this thesis.
And the thesis was that it was really difficult, as we saw,
and as I saw running Macie and talking to our big customers that are trying to figure out where all their important data is in the cloud, and protect it, and figure out if it's exposed to the world, or answer those questions,
like how difficult of a problem it is to enable access
to data inside of a business.
And usually that revolves around privacy, right?
So like a contract that you have with your customers for your brand or sometimes legally
enforced with things like GDPR.
And a feeling that we had that kind of the existing methods that are like, oh, build a wall around your data, build a perimeter,
build a better perimeter or VPC.
Like those things are effective tools, but they don't work.
And at some point they're going to break.
And that's what kind of leads to breaches happening.
And our initial thesis for Gretel was saying,
what if we could train a generative AI model?
So, you know, very similar technology under the hood to
what you see at like OpenAI with the
GPT model
on data instead of natural language text.
And what if we could get that model to recreate
another data set that looks just like your
sensitive data set, except it's not based on
actual people, objects, things.
And what effect
would that have on privacy? And in theory,
if you could pull it off, it wouldn't matter
if someone's, you know, computer got left at a Starbucks and it got picked up and it had, you
know, a lot of sensitive information on it. And with things like that possible, maybe we could unlock new ways to
share data. It's evolved quite a bit since then. We've got a couple use cases, but I think that's
still one of the primary ones that we see today: how to address privacy and how to essentially use these generative models to anonymize data.
Yeah, super interesting.
You mentioned a few tools that companies, you know, turn to in order to mitigate concerns around, you know, privacy and security.
You know, you mentioned VPC, for example.
Those seem to be pretty pervasive.
I mean, those are sort of like the default set.
Would you agree with that?
Is that the most common pattern that you see for sort of... I think perimeter was a great term, right?
I mean, doing nothing is obviously not an option for many companies.
But to your point, there's data breaches in the news every single week.
Yeah. At various levels, you see customers, you know, some customers keeping data within,
like, within their own kind of perimeter, within their walls, their private cloud.
You see other customers using the cloud that will, you know, embrace technologies, which are
awesome in my opinion, like, you know, VPCs, and using role-based access to things instead of passwords
and things like that. So really good patterns all around. But, you know, access control still leads
to the chance and the risk of raw data finding its way out. So it's one of those things, just to,
you know, I would say I applaud the effort that a lot of companies put in; making that work is really difficult. Like, you start seeing permissions issues when you're trying
to set up a VPC or an S3 bucket, and often a developer just makes a change and they say,
I'm going to do this real fast and see if it works and I'll fix it later, and they forget.
You start seeing issues like that. So there's, you know, a whole new class of tools that are being
built to address problems like that.
But I've been in the security world long enough that you start to see the repeated patterns and that repeated pattern of a better way to build a perimeter around data is one that sounds good.
And it works in some cases, but it's not an answer, a long-term answer.
Sure. Yeah. I mean, all best practices for sure, right?
That don't necessarily get to the root of the problem.
One thing that'd be helpful, I think,
especially to give context to the rest of the conversation
as we get more technical here,
could you define synthetic data as Gretel sees it?
Because, you know,
creating synthetic data is a concept that's been around for a very long time.
I guess we could argue how far back in history,
but especially as it relates to technology,
people have been creating data sets
synthetically for decades and decades.
So could you help orient us around the term as it relates to specifically what Gretel does?
Yeah.
So maybe I'll start with a really broad term.
If we were describing the 1970s, someone sitting at a DOS terminal or
a Unix terminal writing up their own CSV file
of data that they would
use to test their program, that's synthetic
data. So broadly speaking,
I would define synthetic data
as
a computer simulation or algorithm
that can simulate real
world events, objects,
activities, things like that.
So it could be a spreadsheet, it could be
a mathematical formula, it could be a computer program that just spits out random temperatures,
ages, things like that for people. So it can be that simple. You also hear the term a lot of times
like fake data or things like that, where you have kind of mock data that might make sense for
testing a user interface or something like that, but you wouldn't want to ever query that data or ask it questions. In the Gretel context,
we use synthetic data to define data generated by a set of deep learning algorithms. So similar,
once again, to use that analogy, to OpenAI's GPT models or ChatGPT, or Stable Diffusion for images.
Essentially, we have models that learn to recreate data like what they've been trained on.
And you can either create another data set that looks just like it, once again, with
artificial people, places, things like that. Or you can prompt the model to create a new class
or to boost the representation of a class in your data set where you want to see more examples. And we see a lot of that too, to power data science use cases inside your business, or information exchanges, or things
like that. We see a lot of this in the life sciences world where you've got companies that
are trying to share broadly, you know, research about COVID or about genetic diseases or things
like that at scale while preserving privacy. So that is when we talk about, you know,
synthetic data in the Gretel context, it's data that can be used that has the same quality and accuracy as the original data it was based on.
Fascinating. Okay, can we talk about similarities to ChatGPT? It's been making its way across Hacker News and Twitter and all over our internal Slack channels with people doing interesting stuff with it.
And you have a lot of experience with natural language processing and algorithms that run on
language. Could you explain the differences in that flavor of deep learning as compared with
running deep learning on data itself, right? I mean,
that's an interesting concept to consider in general, right? Can you just explain the difference
in sort of the, even the ergonomics of like how you would approach deep learning on data versus
natural language, right? Because there's, it's just a really, it seems like a very different
paradigm, but it sounds like they're actually pretty close. Yeah, they are. I think the underlying technology, maybe to talk about that first,
and then talk about the interface for how people interact with it. The underlying technology
is using a class of machine learning models called language models or large language models
for both OpenAI and what we're doing at Gretel. And that came out of a realization really on our part that a dataset is a language in its own kind of right
that makes sense to computers
that is harder for humans to assimilate.
But under the hood, the technology is very similar.
We're using language models.
We use a recent class of language models called Transformers
that have a great ability to learn data
from wide collections of data sets and be
able to apply it to whatever context you're asking about. So you can essentially augment your data
with better examples. So I think the OpenAI GPT example is very close to what we do at Gretel.
ChatGPT is a layer on top of GPT-3. And it is a slightly different mechanism that's used
for training. And this is, you know, kind of wild to think about, and you really have to dive in
here. But under the hood, it's hard to believe that it works at this scale. But under the hood,
GPT-3, or, you know, Gretel at this time, is really predicting the next field. If I have a user that's from,
if you've got a movie review dataset and you've got people that have consistently rated
a movie at this level,
and you have a new user being generated,
probably it's going to generate a rating
within a certain range.
So really it's just saying,
okay, if I have all the information,
what's the next most logical thing for me to do?
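To make that "predict the next field" idea concrete, here is a tiny, hedged Python sketch of how a tabular record can be flattened into a text-like sequence that a language model completes one field at a time. The column names, the serialization format, and the example values are invented for illustration; this is not Gretel's actual encoding.

```python
# Conceptual sketch: a row of a table is treated as a short "sentence" in a
# structured language, and the model's job is to predict the next field given
# the fields that came before it.

def serialize(row: dict) -> str:
    """Flatten a record into a field-by-field string a language model could consume."""
    return " | ".join(f"{column}={value}" for column, value in row.items())

# Hypothetical movie-review record with the last field left for the model to fill in.
partial_row = {"user_id": 1042, "movie": "The Matrix", "genre": "sci-fi"}
prompt = serialize(partial_row) + " | rating="

# A model trained on many reviews would complete the sequence; having seen that
# Keanu Reeves movies tend to rate highly, it would most likely emit a high rating.
print(prompt)  # user_id=1042 | movie=The Matrix | genre=sci-fi | rating=
```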
ChatGPT put a layer on top of that and did two things that I think were really significant.
One, essentially uses this concept called like human-based reinforcement learning,
where you have humans that are kind of, instead of just getting an algorithm that is the single
best thing at predicting what the next token in a sequence is going to be, it takes a look at the
whole result and says,
is this the result that I want as a human or not?
So there's human labelers that are looking at it
and they're saying,
I ask it to create me a list of to-do items for today.
Like, do these make sense to me?
And a human reviewer at OpenAI
will look at it and they'll say,
that is the best answer.
And then they'll use that to feed back into the algorithm
and come up with better results.
So two things that I think that are really significant are kind of happening right now.
One, we're orienting machine learning algorithms to have responses more like what humans want to see.
Which is good, right?
The robots won't be rising up against us if we're teaching them to do the things we want them to do.
At least we have a say in the uprising.
We get a say in the uprising.
Right.
All it takes is one person to train in a different way,
but fortunately our training was going the right direction.
So that part is really neat.
I think the other part that I love about it
that I'm really excited to see in the synthetic data world
is this natural language interface
that you use to talk to models, right?
So instead, with GPT-2, if we were to rewind back one
major GPT version, right, and look at it, you would give it a couple examples. You would give it
examples of tweets or blogs, and it would create new tweets or blogs like what it was trained on.
With ChatGPT, and increasingly with the other GPT models, you can just say, like, brainstorm a list of
to-do topics
for me to look at today, or something like that.
Yeah.
So you have this natural language interface
similar to stable diffusion with images, right?
Where you can say,
Yep.
I want a unicorn on a surfboard on Mars, right?
And it'll generate that.
So I'm really excited to see this way
that we interact with data becoming more
based on natural language than SQL queries
or, you know, data engineering
that we all have to do to get that kind of level of answer right now.
Yeah, absolutely fascinating.
All right.
Well, I'm going to stop myself because I have a tidal wave of questions backed up.
Costas, please jump in here because I know you have a ton of questions as well.
Yeah, thank you, Eric. So
Alex, let's talk a
little bit more about
synthetic data. And
you mentioned
because Eric asked
what synthetic data is.
And I'd like to get a little
bit more detailed on what it means
from a data set to generate more data that is synthetic, artificial, right?
And they're similar.
They share the same properties.
What are these properties?
How should we think about that?
A synthetic data set, the way we use the term, should behave like the original if you were to query it, right?
And so if you were to build a dashboard off of this data set or to send it a SQL query,
you would get a very similar response for an aggregate statistic. So what is the average age of a person that likes to buy this product?
Or when did I have the spike in activity? Things like that will be very similar between synthetic
data set and the real world data set. There's a couple ways we measure this. You know, at first,
you create your first synthetic data set, you look at it, you're like, that looks great. I don't know
how it's going to work for me. And that's the first question that, you know, we always hear
from our users, the data set looks awesome.
You know, like when I look at it, it looks fine, but I don't know how accurate it is.
How do I measure that?
And so what we try to do, we have both like opinionated ways to measure the quality or
the accuracy of synthetic data and unopinionated ways.
The unopinionated ways make the most sense when you're just trying to create an artificial version of a dataset. You don't know how it's going to be used. So we don't know what
type of machine learning tasks it's going to be used for. So we can't measure that.
So what we do is we look at, I'll give you a couple of examples. We look at
the correlations that exist between each pair of fields in the original dataset and the
correlations that exist in the synthetic dataset. So if I knew, to go back to the movie review data set, right, that movies
with Keanu Reeves usually have really high ratings, I would expect another movie
in the synthetic data set with Keanu Reeves to have high ratings. And it goes much
deeper than just that two-level correlation, but that's the first thing,
you know, one of the things that we look at.
Another really helpful tool we have is called field distribution stability.
So we're looking at,
and this is just pretty common data science tactic.
A lot of people will do this.
We just automate it
where you might have a data set
that's got a hundred rows in it,
a hundred columns.
Essentially, we plot that.
It's something called PCA,
principal component analysis.
Now we plot it in a 2D plane, and we
look at the difference between the plots of
the synthetic and the real-world dataset, and we say,
when we map this 56
or 100-column dataset down to
two dimensions, how similar do these
distributions look? And that gives you a good insight as to
whether the model's overfitting, and
it's just repeating a couple things inside of there or if it's capturing a whole distribution.
And the third thing is probably the most intuitive way to think about it. And it's just looking at
the per-field distributions, right? If you've got admission times for people in an EHR data set,
or you're looking at a financial data set and you have open, high, low, close, you know, type data,
do the distributions of each one of those match what you're expecting to see?
And that's something where we try to automate the whole process and give you a single score that helps you reason about how well the model's working.
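For the field distribution stability idea, a hedged sketch of the PCA comparison might look like the following: fit the projection on the real data, project both datasets into two dimensions, and compare where the points land. This mirrors the technique as described but not the exact scoring Gretel uses, and it assumes numeric columns.

```python
import pandas as pd
from sklearn.decomposition import PCA

def pca_stability(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Project both datasets onto the real data's top two principal components
    and compare the resulting 2-D distributions by their means and spreads."""
    pca = PCA(n_components=2).fit(real)  # learn the projection on the real data only
    real_2d = pca.transform(real)
    synth_2d = pca.transform(synthetic)
    return pd.DataFrame(
        {
            "real_mean": real_2d.mean(axis=0),
            "synth_mean": synth_2d.mean(axis=0),
            "real_std": real_2d.std(axis=0),
            "synth_std": synth_2d.std(axis=0),
        },
        index=["PC1", "PC2"],
    )

# Large gaps between the real and synthetic columns suggest the model is
# overfitting (repeating a few patterns) rather than capturing the whole distribution.
```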
All right. And this is the unopinionated ways.
What are the opinionated ways?
The opinionated way is when you know how you're going to use a dataset.
If you're going to use it for downstream classification
training, regression, you're using it
for forecasting. We see this quite a bit in the financial space, right?
Where you want to use time-sensitive data to forecast
what a stock price is going to be. Things like that.
When you know how you're going to do that,
you can actually simulate running the
synthetic data on the same downstream forecasting use case as the real world data and compare the
two. There are some really great tools out there that make this easy. So there's a framework in
Python called PyCaret. A lot of our customers like to use that quite a bit. Essentially, it simplifies this process of testing how your synthetic data works on classification tasks or QA, question answering tasks and stuff like that versus the real world data is based on.
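One hedged way to run that opinionated check is the train-synthetic, test-real pattern: train the same downstream model once on real data and once on synthetic data, then score both against a held-out slice of real data. PyCaret automates comparisons like this; the sketch below uses plain scikit-learn to stay self-contained, and the model choice and split are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def downstream_gap(real_X, real_y, synth_X, synth_y, seed: int = 0):
    """Return (accuracy trained on real, accuracy trained on synthetic),
    both evaluated on the same held-out real data."""
    X_train, X_test, y_train, y_test = train_test_split(
        real_X, real_y, test_size=0.3, random_state=seed
    )
    trained_on_real = RandomForestClassifier(random_state=seed).fit(X_train, y_train)
    trained_on_synth = RandomForestClassifier(random_state=seed).fit(synth_X, synth_y)
    real_acc = accuracy_score(y_test, trained_on_real.predict(X_test))
    synth_acc = accuracy_score(y_test, trained_on_synth.predict(X_test))
    return real_acc, synth_acc  # a small gap suggests the synthetic data is usable
```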
Okay.
That's super interesting.
I can't help myself.
I have to ask you that.
Where is the boundary between synthetic data and prediction of the future at the end, right?
Because, and I am asking that because you are mentioning like, okay, like
financial models, for example, where you're using like synthetic data to go
and like run some models and do the, whatever they want to do there.
And also, we had the conversation earlier about all these things about trying to predict what should be the next
part of the text. So prediction is part of
the whole thing that we are doing here.
So what's the boundary there between trying to
predict what is going to happen at some point in the future, for example, or in some kind of
dataset and actually just creating, let's say, data that they share some common characteristics,
but at the same time they don't represent reality, right?
Great question.
I think where machine learning models in general
have a really hard time is dealing with data
that they've never seen before.
When there is a market event,
to go back to the financial world,
that never happened before,
it's unlikely that your machine learning model
is going to be proficient at detecting it.
But that said, history repeats itself.
So one of the really popular use cases
that we see in the financial space
is when you have rare events.
For example, the GameStop events
that happened where significant market changes happened due to something that has happened for the first time.
Crypto market crashes, things like that.
When you want to train your machine learning models to be good at detecting this, and you can only pass it a single example, once again, it's not going to do well.
So that is an area where I think synthetic data can really help today.
The way synthetic data can really help is that you can give it an example, saying,
like, hey, look what happened with GameStop.
I want you to create another 50 or 100 examples of something like that happening.
So it can be better detecting that if it happens in the future.
So those are artificial.
They're based on real world data and they're based off, you know, kind of learning off what happened in that
one example, but they're not perfect. But in many cases, we see that actually really helps.
And that's kind of one of the neat kind of patterns we're starting to see with
our journey, you know, kind of building synthetic data. I think that, you know,
we're in year three now at Gretel, and the first year was like, does it work on my data set, right? And the next year was like, okay, but how
does it work against my real data? And then now we're starting to see this kind of tipping point
where people are realizing that machine learning models are data hungry. There will always be
classes that your model isn't good at. So this idea of augmenting your real-world dataset with additional synthetic examples
that are perhaps trained off public data.
So has the world seen this before?
And can I incorporate some of that knowledge
into my own dataset?
That helps you build a better dataset
that can have better accuracy
than you would have all by itself.
That makes sense.
All right, so talking about data,
what are the types of data we are
talking about here, because we can have like a synthetic picture and we
can have a synthetic audio file, we can have a synthetic row in the table,
right, in the database.
So what are like the most common use cases that you see out there where like
synthetic data is like important today.
And by the way, I'm asking because you mentioned many times the natural
language processing part of this, so it's probably more textual, but we have other things there.
We have like time series data, we have structured versus unstructured data.
So yeah, I'm going to come across as, you know, probably a little biased here, because
I would say Gretel would be one of the leading companies, if not the leading company,
in working with tabular formats of data.
That's really where we got our start.
That's, you know, where we built from.
That said, our vision,
and I think the vision you described
for synthetic data,
is much bigger than one type of data.
Maybe to talk about the types of data
that we see being used for synthetics quite often,
I can even give some examples for different types,
but you have tabular data to start out.
The stuff that you have inside a data warehouse,
a database, things like that.
Time series data, which sounds at first like a
niche category of tabular data,
until you realize that for like 50% of the world's data sets,
time is such an important component
that it's one we actually treat differently.
Text, so natural language text for different languages.
And image synthetics are really big,
right? So people are using images quite often to train models for self-driving cars or to recognize problems in a manufacturing line or things like that. So a lot of use cases around that.
And I'd say increasingly getting into video and audio. So some of the new technologies recently, like stable diffusion, have really shown the
ability to create new variations or artificial versions of images and videos. For the companies
that are trying to build, say you're an insurance company and you're trying to build something
to give somebody a better insurance quote for their house, and you want
to look at the quality of the materials that they have, and does it look like they have
fire extinguishers and things like that around the house, just from a set of pictures,
you never have enough data to start with. So this idea of augmenting images: you might have a
room and you want to see a room with really fancy furniture, or with more like something that a college student might have, things like that.
So we're seeing a lot of use cases there.
And maybe the last place to touch on would be the simulation space.
So we're talking about today, we've talked a lot about generative models, machine learning model, neural network that creates new examples of things.
But there is, you know, in parallel to that, there is a simulation
space where you might use something like a computer game
engine. So Unity
would be a good example of this.
NVIDIA has
a neat product called the Omniverse
as well, where essentially they
have created a 3D world using a
game engine that you can use to create
and test these different kind of
simulation-based outcomes. Wow, that's super interesting. Okay, let's focus on tabular data. What do we mean by
tabular data? Tabular data is any type of, and I'll use the term here, and I'd love to hear what
you think on it too, like any type of structured or semi-structured data format. So it could be
anything from a CSV file, to looser formats
like JSON, where
you don't necessarily have the same level of structure,
but you can have arbitrary levels of nesting.
More advanced data formats
like Parquet that are really efficient
at encoding large amounts of data
or just data that's inside a database
or a data warehouse.
Okay.
When we are talking about creating synthetic data here, what is the most
common approach that you see out there? Is it like, okay, I have, let's say, a users table,
right?
With like 1 million users.
And I'd like to see like 2 million of these users, with the distribution of the users similar,
let's say in terms of the age or the geography or whatever kind of information we capture
already on this table.
Is this something that's the most common use case that you see out there?
Or people are actually coming and they're like, okay, that's my database here, right?
Like I have users and the users have, I don't know, products that they procured at some point.
And I also have, let's say, my inventory.
And I also have, like, let's say, we represent the whole domain of like what the company is dealing with or the user is dealing with, which can be quite complex, right?
And they would like to synthesize the whole database that is out there.
The relational component, so not just capturing the relationships that you have inside a single table, like your users table,
but capturing the relationships between the users and the inventory table, is a really
cool challenge in the synthetic data space.
Yeah.
To answer your question, when we have
users come in and use our platform
often,
especially if you're doing pre-production
testing. A really big use case we haven't talked
about yet is that
you are trying to build
a version of your production
environment that you might use inside a development or a staging environment.
You don't want to have real world data, but you want to have it reflect what's happening
in your production system.
So this allows any of your developers to use it, to hammer away, to like investigate different
records without worrying about privacy or things getting compromised or
anything like that. So in this use case, we have customers often that will create a twin version,
a dev or staging test version of a production database. Essentially, they'll queue it up,
depending on how recent they need to keep it, once an hour or once a day. We'll run the job,
we'll bring in all the new records, the records that have
changed, train a synthetic model on that
data and create another, essentially
create another
database that sits inside of your
near test or your staging environment.
The really neat thing is not just the
database you're getting here, you're getting a model.
And this model can be used to either
subset that data. So if you
have, you know, 2 billion records inside your production database
and you can't run that in your dev or staging environment
without having insane DynamoDB costs,
you can create a smaller data set that captures
as many of the variations as possible.
So it's much more efficient than just taking a slice of that data set.
It's more representative of it.
Or, and I think it's another really neat
use case for scale testing.
Yeah.
You want to test
the ability of your
application to handle
10 or 100 times
the amount of data
you might encounter
on a typical day
without just repeating
the same records
over and over again.
You can use that same model
to generate
new variations
of the data
that you can use to test.
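As a hedged sketch of those two uses of the trained model itself, the interface below is a made-up stand-in, not a real Gretel API: the same model either samples a small but varied subset for a cheap staging copy, or over-generates fresh variations for scale testing.

```python
class SyntheticModel:
    """Hypothetical stand-in for a generative model trained on a production table."""

    def sample(self, n_rows: int):
        raise NotImplementedError("replace with a real trained model")

PRODUCTION_ROWS = 2_000_000_000  # e.g. the 2 billion records mentioned above

def staging_subset(model: SyntheticModel, fraction: float = 0.001):
    # A small synthetic dataset that still covers the variation in production,
    # rather than a naive slice of the first N rows.
    return model.sample(int(PRODUCTION_ROWS * fraction))

def scale_test_data(model: SyntheticModel, multiplier: int = 10):
    # Fresh variations at 10x the usual volume, instead of repeating records.
    return model.sample(PRODUCTION_ROWS * multiplier)
```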
That's super interesting.
And how efficient is this process?
Can you take us through the developer experience?
Let's say I'm a developer, I have my database today,
and I want to go and try the limits of my production environment, right?
What am I going to be doing?
How am I going to be using Gretel to do that?
Yeah.
So this process with Gretel is two stages.
You've got your production database that's sitting there.
Let's say it's a Postgres database
or it's an Atlas database,
hosted or not hosted,
it really doesn't matter.
You want to create a version of it
for your lower production version.
There's two steps you need to do.
One, you don't want the synthetic data model
to memorize important customer information names,
customer IDs, and things like that.
So we have two steps.
These are both powered by cloud APIs,
so you can either run in the cloud.
Sometimes customers have really sensitive data requirements, so they need to run inside their own cloud. So you
can deploy essentially these workers as containers to your own cloud. But the two steps are one,
scan and use NLP to identify, for example, sensitive data, customer IDs, names, things like
that. From there, you have a policy that says, whenever I see this,
I'm going to redact it. I'm going to replace it with a fake version of it. I'm going to encrypt
it in place or whatever your company feels is appropriate. Often we see people using fake data
because it just, so the name, for example, my username, I might replace with another artificial
name to make sure the model doesn't learn it.
You have a risk there that even when you do that traditional de-identification,
the other attributes of your data set that by themselves aren't identifying, for example,
like my age, my location, a lot of times with advertising data, the precise location will put you right at somebody's house, right? It becomes very identifying when you put those
together. And that's the real power of these synthetic models is that they will create the artificial
versions of those things.
So you remove or replace the names inside your data set.
You create new artificial locations, shopping cart activity, like whatever you have inside
of your data set.
So the second stage is the data synthesis where you train a model and then you tell
that model, I want to generate 10 times as much data or I want to generate one
fifth as much data and you take the outputs and essentially put that right
back into your database so you create a twin database that you can use for testing.
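A very simplified sketch of that first, de-identification stage is below. The column names, the policy, and the use of the Faker library (assumed to be installed) are all stand-ins; Gretel's actual transform pipelines are configured rather than hand-written like this, and the second stage, training the synthetic model, is not shown.

```python
import re
import pandas as pd
from faker import Faker  # assumed available; only used to fabricate replacement values

fake = Faker()
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

# Policy: whenever one of these columns appears, replace its values with fake ones
# so the downstream synthetic model never sees (or memorizes) the real identifiers.
POLICY = {
    "name": lambda _: fake.name(),
    "email": lambda _: fake.email(),
    "customer_id": lambda _: fake.uuid4(),
}

def apply_policy(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for column, replacement in POLICY.items():
        if column in out.columns:
            out[column] = out[column].map(replacement)
    # Catch free-text emails that slipped into other string columns.
    for column in out.select_dtypes("object").columns:
        out[column] = out[column].astype(str).str.replace(EMAIL_PATTERN, "<redacted>", regex=True)
    return out
```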
Okay.
And so how long does that process take?
How long does it take to train this model?
That varies, and it varies based on what your use case is.
And so we have this kind of belief that there's no one machine learning model to rule them all,
and each one has different advantages. So if yours is a machine learning use case and you care about accuracy,
you'd want to use a deep learning generative model,
which gives you
the best performance of anything. Alternatively, you have GANs, so generative adversarial networks,
which don't offer quite the performance of our language models, but they're pretty fast,
faster at training. And we have built, and really working with customers when they have
tremendous scale they need to run at,
we've built statistical models.
So these are based on copulas.
So instead of using a deep learning technique, they use a mathematical, really neat kind of technique
to learn and recreate distributions and data.
So essentially based on your use case,
like do I care about accuracy?
Am I training a machine learning model on this?
Use a generative algorithm that might take an hour
to five or six hours to train on a dataset,
depending on the size of the dataset.
If you want speed and you want to generate data
at 100 megs per second so I can create 40 billion records
to test my dataset, that's where I really suggest using,
we call it our Amplify model, but the statistical
model.
And on a 32-core machine, we've clocked it at about 100 megs per second it can generate.
So if you're generating billions of records, it's entirely possible to do that within a
day instead of having to wait a month to have a model to do it.
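For a feel of what the statistical approach looks like, here is a bare-bones Gaussian copula sketch: learn each numeric column's marginal distribution plus the correlation between columns in a Gaussian space, then sample new rows. Real implementations such as the Amplify model described above are far more careful about categorical columns, heavy tails, and scale; this is only an illustration of the idea.

```python
import numpy as np
import pandas as pd
from scipy import stats

def copula_sample(real: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Fit a Gaussian copula to a numeric DataFrame and sample synthetic rows."""
    rng = np.random.default_rng(seed)
    cols = list(real.columns)
    # 1. Map each column to a standard normal via its empirical CDF (rank transform).
    ranks = real.rank(method="average") / (len(real) + 1)
    gaussian = stats.norm.ppf(ranks)
    # 2. Capture how the columns move together in the Gaussian space.
    corr = np.corrcoef(gaussian, rowvar=False)
    # 3. Sample correlated normals, then map back through each column's quantiles.
    samples = rng.multivariate_normal(np.zeros(len(cols)), corr, size=n_rows)
    uniforms = stats.norm.cdf(samples)
    return pd.DataFrame(
        {col: np.quantile(real[col], uniforms[:, i]) for i, col in enumerate(cols)}
    )
```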
Yeah, that makes a lot of sense. And like how, that's very interesting, actually.
Like there is like this trade-off between like, let's say, fidelity and time that you
have to spend like training the model, right?
Again, going back to what it means to represent with accuracy,
like the characteristics of the data that you have initially,
what does this mean from the user perspective?
How can I reason as a user about that stuff?
Because on a high level, it's easy to understand.
But I think that when you start working with a real example
and you have your own data out there, things are much harder to figure out.
So how you can reason about these things?
Often, it's somewhat, at the end of the day, dependent on the domain and the use case you're going after. And we see a lot of repeated domains that we talk to.
We've got a Discord chat channel
where we talk about things from life sciences
to things that we see in the advertising space,
another area that's really kind of picking up on data.
So we can reason about those.
I'd say between our top and most capable language models
and GANs versus our statistical models, you'll see about
a 10% decrease in accuracy. So were you to train that downstream data set as a classifier, the
model trained, you know, using the statistical method would be about 10% less accurate on average
than the one trained with the language model. And one of the things I can link to, that I'll link to you guys after the show here,
we run all of our models
against about 50 different data sets
and then compare the results
and the accuracy of each one.
And you can kind of see
how the state-of-the-art language model performs
versus a state-of-the-art GAN
versus the statistical model
and kind of make your own decision there.
We also are realizing that
so many of our users don't have time to make this decision. So, you know, we're introducing these things
called auto params that are on by default with many of our systems now. It just looks at the
size of the data set and it says, are you trying to generate, you know, to use that example again, 40
billion records? If you are, and you use auto, it will just pick the right algorithm to do this for you.
So increasingly,
I think our vision is that six months
from now people don't have to worry about
what model do I choose for this use case;
we just pick it based on what we've observed
for the data.
Alright
and one last question from me
because we are getting closer
to the
buzzer here, as Eric usually says, and I want to give him some time to
ask any questions that he has.
If I'm new to the synthetic data world, where should I look to learn more and play around with technologies or
tools or anything else out there that exists right now?
Yeah.
So in this world, I would, I mean, of course, would recommend starting with Gretel.
So just a quick thing on that.
And then I'll mention a couple other platforms
to check out as well.
Our underlying models and code are all open source.
So Gretel Synthetics on GitHub.
So you can see how they work.
You can introspect how we do privacy,
things like that.
Our service has a free tier.
So all you need is a Gmail or GitHub to sign up.
And we have example data sets.
So we have these low-code interfaces where you can just say, I'm trying to balance a
data set.
I'm trying to classify a data set, or create a synthetic version of the CSV that I have.
You don't have to write a single line of code.
You can do it yourself.
And that's where I always recommend starting because it just makes so much more sense after
you've tried it.
So that part is free.
I would also definitely recommend OpenAI has a really great playground
for ChatGPT and the OpenAI GPT models.
Just trying some prompts
or trying to send some data in
and tell it to summarize something for you
or create a list for something,
I think that kind of gives you a feel
for where models are today
and where they're going.
So that'd be another thing to try as well.
Awesome. Thank you so much.
Eric, the microphone is yours again.
Oh, wow.
So much power.
I feel so empowered. Yeah.
This has been such a fascinating conversation.
Alex, I want to pick your brain here
in the last couple of minutes
on your thoughts on sort of the impacts that these technologies will have.
You know, I think as we think about Gretel, you know, one example we talked about as we were prepping for the show is, you know, hospitals being able to share records around, you know, a particular disease, right?
In order to help researchers and medical professionals, you know, solve a problem,
you know, and help treat that disease or even cure the disease, which is really incredible.
And then you have, you know, sort of, I would say, things that are in a little bit more of a gray area with like Stable Diffusion, right? Or even ChatGPT, where the uses can vary, you know, widely, right? And can even be used for things that, you know, people would consider unethical, you know, sort of depending on what you're talking about. And you have really deep experience. I mean, I was thinking about this as Costas was talking,
and you were explaining a lot of the stuff. I mean, you have experience with, you know, intelligence in the government and building AI technologies, solving privacy problems.
Maybe a good way to frame my question would be, do you think about stewardship of these deep learning models and artificial intelligence in general? And if so, what are the things that are top of mind for you as we break new ground with deep learning and the ability to produce all these novel outputs?
Yeah.
Great question.
Maybe I have two parts to that question. Where could this be
transformational or what are we going to see across these
different technologies both for
sharing data or creating
data and then what are the ethical implications
we need to think about around that?
For where this
is going and the potential of it,
there is a very good chance, and I'll kind of back this up in a
second, to start with something a little bit more bold, that this will be the biggest
innovation to happen since cloud computing. And the reason I believe so is because these models
give you the ability to distill
and disseminate information
or intelligence in a way
that has never been possible before, right?
A natural language interface
you can query.
So I think that's huge
and we're just starting to see
the use cases for it.
Speaking of the data sharing use case,
for example, like with life sciences institutions,
things like that, right? Like, so data-driven healthcare and medicine is like, you know,
anyone in the space would say that is like the biggest potential for helping health, you know,
that they can see. The biggest limitation that they have is that often that data is siloed within a particular region. If you're trying to create a cure
that's going to work for people across the world,
but you only have access to one demographic,
for example, the UK Biobank,
how do you know that the signal you created
or found in that one population
is one that'll work everywhere?
So the power of this,
and we did a really cool study with Illumina
working on genomic data,
was showing that we could, in fact, synthesize one of the most complex data sets that's ever been created.
Sure.
We started with mice, which was kind of funny.
So, you know, even with the mice data, we were able to recreate the results of a popular research paper that had been created using that data, which was cool.
So a lot of work left to be done there, both on the sheer scale of human genomic data and then also the privacy. But the potential there is that a researcher anywhere in the world that has an idea on
how you can cure a rare disease could test that against every hospital in the world, which
would just be incredible.
So really exciting example there. On the ChatGPT and OpenAI approach,
I hear a lot, particularly about Stable Diffusion, that it's just for creative use cases.
It's just for kind of like messing around.
And I would challenge that and say, that's just where it is today. It's not going to be
there for long. Yeah, and what I think is missing right now is the confidence that you have
that the model is going to output what you're looking for, right? So you could say, like, you know, generate
a picture of me standing on a mountain drinking a coffee or something like that, right?
And maybe the first time it'll do it, and the second time, third time it won't. And in the data world, in the
conversations we have with our users, right, there are tons of applications for machine learning,
training machine learning models, based on being able to generate new images. But you have to have
confidence that what the model is outputting meets your expectations. So I think that's going to be
the next big thing there. But I do think that these models are going to, you know, in one way or
another, they're going to be everywhere, right? So whether it's creating more training data
for models, or summarizing a meeting that you had automatically for you at the end,
or things like that, you're going to see these models quite a bit. And the last part on ethics,
and, you know,
kind of stewardship of where this goes, it's an interesting question, particularly how you kind
of phrased it with the background around, you know, intelligence and things like that. And
when technologies exist, when they get created, they will inevitably, at some level, be abused.
So that will happen. And so I would personally vector a lot more towards openness, and relying on society to
solve these problems together, than having the risk of trying to control it, but then
essentially just creating a small set of governments and rich companies that have access to this
technology.
So I really kind of applaud the open source movement here and open source publishing and
research and things like that.
That approach, I think, is working well.
Things start to, you know, I think historically have gotten a little more problematic when
that gets closed off or limited.
And then you don't have the kind of the ability for a community to look at something and give
you an opinion on whether it's ethically correct or we should do something about it.
Sure.
Such insightful answers.
I will be considering those things
definitely for the rest of this week
and probably
long after. Alex, this has been
an unbelievably thought-provoking
show and we've learned a ton.
So thank you so much for giving us some of your time.
I appreciate it.
I think my big takeaway from this show, Kostas,
is that
Alex is number one, so approachable as a person, but number two, has such a variety of deep experience in the space, you know, from government intelligence to startups to, you know, delivering things at scale on a crazy timeline within AWS.
And so I just grew more and more to respect his opinion throughout the show,
which made his final thoughts on where these types of deep learning technologies are going,
I think, even more poignant for me. And I really
agree with him. I think it was a really fresh, honest take not to say, well, you shouldn't use
it for this, or you should use it for this. I mean, he acknowledged outright that these new
technologies are always used in ways that humanity probably shouldn't use them.
And doing things in the open is a really healthy antidote to that.
And so I really appreciated his perspective on that.
It sounded simple, but I think was very powerful
and something that I'll definitely keep from the show.
Yeah, 100%. I totally agree with that.
I mean, I think at the end,
especially when you're talking about technologies
or knowledge in general that can, like, I don't know,
like, change in a very foundational way,
like the way that we operate as humans.
Yeah, it might be scary.
Obviously, we can make mistakes and use the technology in the wrong way.
But at the end, that's how, I don't know,
humanity manages to make progress, right?
I don't think that we can change that.
And I don't think that there's that much value at the end
in not taking the risk of having access to these new tools or like this
new knowledge. And again, the best way to protect humanity is to make these things available to
everyone. So I totally agree. I think we are going to hear more about these technologies.
And okay, there's a lot of, let's say, also like kind of like hype right now.
And we are still like just scratching the surface of what can be done with these
technologies, but I have a feeling like in the next couple of months, we will see
like much more like practical and interesting uses of these technologies.
And we'll have more people on the show also to talk about that stuff.
Absolutely.
It was a great episode and we want to have them back on.
But like we said earlier, when we were wrapping up the year,
we want to talk more about some of these emerging technologies like ChatGPT
and Gretel that are forging new ground. So thank you for joining us. Subscribe if you haven't,
tell a friend, and we'll catch you on the next one. We hope you enjoyed this episode of the
Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new
episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.