The Data Stack Show - 128: The Possibilities Are Endless for Synthetic Data with Alex Watson of Gretel.ai
Episode Date: March 1, 2023

Highlights from this week’s conversation include:
Alex’s background working for NSA and starting a company (1:51)
The Gretel.ai journey (9:30)
Defining synthetic data (13:26)
The evolution of AI in deep learning data and language learning (16:28)
The properties of synthetic data (21:31)
Boundaries between synthetic data and prediction models (25:52)
The developer experience in Gretel.ai (36:44)
Stewardship and expansion of deep learning models in the future (45:36)
Final thoughts and takeaways (52:17)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
The Data Stack Show is brought to you by Rudderstack, who's doing something pretty cool.
March is Data Transformation Month at Rudderstack, and they're running a competition with a $1,000
cash prize for data engineers. Tune into the show next week and follow Rudderstack on Twitter to get
deets as soon as they drop. Welcome back to the Data Stack Show. Today,
we are going to talk with Alex, the Chief Product Officer at Gretel, gretel.ai. And we actually have
been talking about them for a while, Costas. It's a really interesting company. They do a number of
things, but the primary thing they talk about on their website is synthetic data.
And today we're going to talk all about machine learning models and training synthetic data
on real data and all the interesting use cases.
So this sounds basic, but I want to ask Alex what their definition of synthetic data is.
I mean, you can create synthetic data, you know, in a spreadsheet, in Excel, right?
But their flavor of synthetic data is, you know, pretty specific and I think really powerful.
So what I'm going to ask is for him to define it in Gretel terms, if you will. Yeah, and I want to get a little bit deeper into what it means to generate
synthetic data from a real data set.
Like, how we can reason about the accuracy.
Like, what kind of, let's say, characteristics
of the original data sets we want to recreate.
So that's definitely something that I'm super curious to learn more about.
And I think we'd have the right person to do that today.
So let's go and do it.
Let's do it.
Alex, welcome to the Data Stack Show.
Thanks, Eric.
We've wanted to talk about, actually, we've talked about Gretel, Costas, for some time, and synthetic data.
So just super excited to actually cover that topic on the show.
Been on the list for a while.
Let's start where we always do.
Give us your background and what led you to Gretel.
Yeah, sure.
I'll give the two minute version of it here.
I started my academic career
in computer science, moved out to the East Coast right after September 11. And I actually joined
the NSA. I was working there for about seven years. Awesome experience. I got to dive in on
early applications of machine learning and also security, which has influenced my career quite a bit over
the years since then.
2013-ish, I moved out to San Diego where I am now.
I started my first company.
It's a company called Harvest AI.
We were helping large companies that were starting at the time to transition to using
SaaS applications like Google Suite, Office 365, Salesforce, AWS, things like that,
helping them to identify where important data was inside their environment and protect it.
So really cool experience there.
We built that for about two years.
At that point, we went out for a Series A raise.
Had some interest around acquisition and actually got acquired by AWS. And I went on to spend the next four years of my career at AWS as a general manager, launching the first security service for AWS, which was our product at Harvest called Macie.
It's a service that customers use today within the AWS world to identify and protect important data in the cloud.
And, happy to kind of dive in on, you know, that process here, but I think it was through
both the incredible access that we had inside the walls to data at AWS, and then also
talking to customers and realizing how difficult it was for them to enable access to the sometimes
really incredible data sets that they had, to enable decisions inside of their
business, that led to the initial pieces that we have
with Gretel and synthetic data today.
Very cool.
So much to dive into there.
Can I ask one question about working for the NSA?
Because, you know, like working in sort of
intelligence type stuff for the government,
I think a lot of times, probably because of Hollywood,
you have like two
views of it. It's either, like, extremely advanced and very scary Big Brother, or it's like, well,
it's the government and they move slow, and so maybe the technology isn't quite as good.
Where on that spectrum was the actual experience of working with the NSA, if you can tell us?
The answer there is both. My first job
was programming Cray supercomputers, actually, when I started. So I got a chance to work with
cutting-edge, multi-million dollar machines, which was really cool. So, you know, the scale at
which they were working, and also the caliber of the people there, are almost unlike any other
place I've ever worked.
Wow.
Also really incredible.
Also, it's the government.
So things don't move quite as quickly as you would hope, but incredible experience there.
Yeah, yeah.
Very cool.
Thanks for indulging me.
Okay, let's talk about Harvest. So you were going out for a Series A and then you get, you know, sort of ingested into, you know, a company that, you know, provides, you know, maybe more data infrastructure than any other company in the world, right?
What was that like? Because at Macie you worked with data at a huge scale.
So can you just talk about that experience a little bit and maybe, you know, especially as Macie sort of grew to be a, you know, a very
widely used product. What were some of the lessons that you learned working at AWS scale providing
a data service via AWS? Yeah. Yeah. You know, with scale, I think one of the things you learn
really fast is that the details matter. So that's one thing that really stands out. Those things that
are okay to let slide when you only have a couple of customers, even, you know,
large-scale customers, become really big issues when
you have thousands of customers. And that's really what we needed to prepare for. I think on
kind of cool experiences or things that I learned, even during my time there, I think
dealing with the scale, like how do we do natural language processing NLP
at the scale of terabytes or petabytes of data
that customers have in the cloud was really fascinating.
I also think the experience of taking, you know,
at the time, a single tenant software that we'd written
that would run inside a VPC per customer
to a multi-tenant that needed to support, you know, thousands to tens of thousands of customers in the first month was quite the
experience and had some, we had some, some pretty cool learnings during that process.
One of the stories, maybe just to cover it really quickly that like really stood out to me
and just kind of helped shape how I think about building software today was, as most people know, at AWS everything revolves around re:Invent and the New York Summit, those
two launches, right? And those are the two times that you launch a service. And we were hitting the
ground running and having really good traction, I think, with a couple customers, and we
were getting ready to launch Macie to the world, the fully multi-tenant version of Macie.
And one of the kind of challenges that we ran into
was we had not enough time to finish,
completely finish multi-tenancy before we launched.
So our choice was either delay six months
and launch it at reInvent
or launch at New York Summit.
We really wanted to launch, you know,
and how could we get there?
What could we do?
And one of our product managers had a really kind of ingenious idea and said,
what if we launched the whole backend as multi-tenant? We launched the front end as
single tenant. So what that meant is that each customer would have their own unique
box in the cloud that would be running our complete user interface stack.
And since it's AWS, it's never just one box.
You have three regions per zone.
Oh, yeah.
Sorry, you have three zones per region.
So you have high availability within there.
So for each region, we need to have three boxes per customer.
We forecasted we would have about 6,000 customers at launch, in that ballpark.
You know, somewhere around there.
So that meant that for us to launch on time, we needed to run 18,000 virtual machines.
To power the user interface for these customers that might sign up.
So incredible experience.
It was wild.
We almost broke CloudFormation doing a deployment.
I'm sure it can handle it quite easily now,
but at the time that was pretty new. And we forecasted that if we could finish the multi-tenant version of the UI within 45 days and shut it down, that we would actually have a pretty conservative
amount of cost for running all those user interfaces. So that was one of my more wild
experiences was exactly 45 days into the launch,
we were able to turn 18,000 machines into nine machines.
I'm sure a data center kind of collectively
cooled down at that point,
but that was a neat experience and it went without a hitch.
So it's one of those things like just taking a step back
and asking how you can do something
when you're trying to hit a deadline
or do something like that.
And being data-driven in the decisions you made,
we felt like we could get there and we did.
That was a really cool experience.
Wow, what a story.
What a story.
That is so great.
There's a lot of stress in there.
It sounds like it's a problem starting now.
I can only imagine.
It's the pendulum swinging between like,
this is going to be awesome, we can pull it off,
and are we completely crazy?
Totally.
Well, tell us about Gretel.
When did you decide to start it?
And then give us an overview of the problems that you saw.
So we started Gretel with this thesis.
And the thesis was that it was really difficult, as we saw,
and as I saw running Macie and talking to our big customers that are trying to figure out where all their important data is in the cloud, and protect it, and figure out if it's exposed to the world, or answer those questions,
like how difficult of a problem it is to enable access
to data inside of a business.
And usually that revolves around privacy, right?
So like a contract that you have with your customers for your brand or sometimes legally
enforced with things like GDPR.
And a feeling that we had that kind of the existing methods that are like, oh, build a wall around your data, build a perimeter,
build a better perimeter or VPC.
Like those things are effective tools, but they don't work.
And at some point they're going to break.
And that's what kind of leads to breaches happening.
And our initial thesis for Gretel was saying,
what if we could train a generative AI model?
So, you know, very similar technology under the hood to
what you see at like OpenAI with the
GPT model
on data instead of natural language text.
And what if we could get that model to recreate
another data set that looks just like your
sensitive data set, except it's not based on
actual people, objects, things.
And what effect
would that have on privacy? And in theory,
if you could pull it off, it wouldn't matter
if someone's, you know, computer got left at a Starbucks and it got picked up and it had, you
know, a lot of sensitive information on it. And with things like that possible, maybe we could unlock new ways to
share data. It's evolved quite a bit since then. We've got a couple use cases, but I think that's
still one of the primary ones that we see today: how to address privacy and how to essentially use these generative models to anonymize data.
Yeah, super interesting.
You mentioned a few tools that companies, you know, turn to in order to mitigate concerns around, you know, privacy and security.
You know, you mentioned VPC, for example.
Those seem to be pretty pervasive.
I mean, those are sort of like the default set.
Would you agree with that?
Is that the most common pattern that you see for sort of... I think perimeter was a great term, right?
I mean, doing nothing is obviously not an option for many companies.
But to your point, there's data breaches in the news every single week.
Yeah. At various levels, you see customers, you know, some customers keeping data within,
like, within their own kind of perimeter, within their walls, their private cloud.
You see other customers using the cloud that will, you know, embrace technologies, which are
awesome in my opinion, like, you know, VPCs, and using role-based access to things instead of passwords
and things like that. So really good patterns all around. But, you know, access control still leads
to the chance and the risk of raw data finding its way out. So it's one of those things, just to,
you know, I would say I applaud the effort that a lot of companies put in; making that work is really difficult. Like, you start seeing permissions issues when you're trying
to set up a VPC or an S3 bucket, and often a developer just makes a change and they say,
I'm going to do this real fast and see if it works and I'll fix it later, and they forget.
You start seeing issues like that. So there's, you know, a whole new class of tools that are being
built to address problems like that.
But I've been in the security world long enough that you start to see the repeated patterns and that repeated pattern of a better way to build a perimeter around data is one that sounds good.
And it works in some cases, but it's not an answer, a long-term answer.
Sure. Yeah. I mean, all best practices for sure, right?
That don't necessarily get to the root of the problem.
One thing that'd be helpful, I think,
especially to give context to the rest of the conversation
as we get more technical here,
could you define synthetic data as Gretel sees it?
Because, you know,
creating synthetic data is a concept that's been around for a very long time.
I guess we could argue how far back in history,
but especially as it relates to technology,
people have been creating data sets
synthetically for decades and decades.
So could you help orient us around the term as it relates to specifically what Gretel does?
Yeah.
So maybe I'll start with a really broad term.
If we were describing the 1970s, someone sitting at a DOS terminal or
a Unix terminal writing up their own CSV file
of data that they would
use to test their program, that's synthetic
data. So broadly speaking,
I would define synthetic data
as
a computer simulation or algorithm
that can simulate real
world events, objects,
activities, things like that.
So it could be a spreadsheet, it could be
a mathematical formula, it could be a computer program that just spits out random temperatures,
ages, things like that for people. So it can be that simple. You also hear the term a lot of times
like fake data or things like that, where you have kind of mock data that might make sense for
testing a user interface or something like that, but you wouldn't want to ever query that data or ask it questions. In the Gretel context,
we use synthetic data to define data generated by a set of deep learning algorithms. So similar,
once again, to use that analogy, to OpenAI's GPT models or ChatGPT, or Stable Diffusion for images.
Essentially, we have models that learn to recreate data like what they've been trained on.
And you can either create another data set that looks just like it, once again, with
artificial people, places, things like that. Or you can prompt the model to create a new class
or to boost the representation of a class in your data set where you want to see more examples. And we see a lot of that too, to power data science use cases inside your business, or information exchanges, or things
like that. We see a lot of this in the life sciences world where you've got companies that
are trying to share broadly, you know, research about COVID or about genetic diseases or things
like that at scale while preserving privacy. So that is when we talk about, you know,
synthetic data in the Gretel context, it's data that can be used that has the same quality and accuracy as the original data it was based on.
Fascinating. Okay, can we talk about similarities to ChatGPT? It's been making its way across Hacker News and Twitter and all over our internal Slack channels with people doing interesting stuff with it.
And you have a lot of experience with natural language processing and algorithms that run on
language. Could you explain the differences in that flavor of deep learning as compared with
running deep learning on data itself, right? I mean,
that's an interesting concept to consider in general, right? Can you just explain the difference
in sort of the, even the ergonomics of like how you would approach deep learning on data versus
natural language, right? Because there's, it's just a really, it seems like a very different
paradigm, but it sounds like they're actually pretty close. Yeah, they are. I think the underlying technology, maybe to talk about that first,
and then talk about the interface for how people interact with it. The underlying technology
is using a class of machine learning models called language models or large language models
for both OpenAI and what we're doing at Gretel. And that came out of a realization really on our part that a dataset is a language in its own kind of right
that makes sense to computers
that is harder for humans to assimilate.
But under the hood, the technology is very similar.
We're using language models.
We use a recent class of language models called Transformers
that have a great ability to learn data
from wide collections of data sets and be
able to apply it to whatever context you're asking about. So you can essentially augment your data
with better examples. So I think the OpenAI GPT example is very close to what we do at Gretel.
ChatGPT is a layer on top of GPT-3. And it is a slightly different mechanism that's used
for training. And this is, you know, kind of wild to think about, and you really have to dive in
here. But under the hood, it's hard to believe that it works at this scale. But under the hood,
GPT-3, or, you know, Gretel at this time, is really predicting the next field. If I have a user that's from,
if you've got a movie review dataset and you've got people that have consistently rated
a movie at this level,
and you have a new user being generated,
probably it's going to generate a rating
within a certain range.
So really it's just saying,
okay, if I have all the information,
what's the next most logical thing for me to do?
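To make that "predict the next field" idea concrete, here is a tiny, hedged Python sketch of how a tabular record can be flattened into a text-like sequence that a language model completes one field at a time. The column names, the serialization format, and the example values are invented for illustration; this is not Gretel's actual encoding.

```python
# Conceptual sketch: a row of a table is treated as a short "sentence" in a
# structured language, and the model's job is to predict the next field given
# the fields that came before it.

def serialize(row: dict) -> str:
    """Flatten a record into a field-by-field string a language model could consume."""
    return " | ".join(f"{column}={value}" for column, value in row.items())

# Hypothetical movie-review record with the last field left for the model to fill in.
partial_row = {"user_id": 1042, "movie": "The Matrix", "genre": "sci-fi"}
prompt = serialize(partial_row) + " | rating="

# A model trained on many reviews would complete the sequence; having seen that
# Keanu Reeves movies tend to rate highly, it would most likely emit a high rating.
print(prompt)  # user_id=1042 | movie=The Matrix | genre=sci-fi | rating=
```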
ChatGPT put a layer on top of that and did two things that I think were really significant.
One, essentially uses this concept called like human-based reinforcement learning,
where you have humans that are kind of, instead of just getting an algorithm that is the single
best thing at predicting what the next token in a sequence is going to be, it takes a look at the
whole result and says,
is this the result that I want as a human or not?
So there's human labelers that are looking at it
and they're saying,
I ask it to create me a list of to-do items for today.
Like, do these make sense to me?
And a human reviewer at OpenAI
will look at it and they'll say,
that is the best answer.
And then they'll use that to feed back into the algorithm
and come up with better results.
So two things that I think that are really significant are kind of happening right now.
One, we're orienting machine learning algorithms to have responses more like what humans want to see.
Which is good, right?
The robots won't be rising up against us if we're teaching them to do the things we want them to do.
At least we have a say in the uprising.
We get a say in the uprising.
Right.
All it takes is one person to train in a different way,
but fortunately our training was going the right direction.
So that part is really neat.
I think the other part that I love about it
that I'm really excited to see in the synthetic data world
is this natural language interface
that you use to talk to models, right?
So instead, with GPT-2, if we were to rewind back one
major GPT version, right, and look at it, you would give it a couple examples. You would give it
examples of tweets or blogs, and it would create new tweets or blogs like what it was trained on.
With ChatGPT, and increasingly with the other GPT models, you can just say, like, brainstorm a list of
to-do topics
for me to look at today, or something like that.
Yeah.
So you have this natural language interface
similar to stable diffusion with images, right?
Where you can say,
Yep.
I want a unicorn on a surfboard on Mars, right?
And it'll generate that.
So I'm really excited to see this way
that we interact with data becoming more
based on natural language than SQL queries
or, you know, data engineering
that we all have to do to get that kind of level of answer right now.
Yeah, absolutely fascinating.
All right.
Well, I'm going to stop myself because I have a tidal wave of questions backed up.
Costas, please jump in here because I know you have a ton of questions as well.
Yeah, thank you, Eric. So
Alex, let's talk a
little bit more about
synthetic data. And
you mentioned
because Eric asked
what synthetic data is.
And I'd like to get a little
bit more detailed on what it means
from a data set to generate more data that is synthetic, artificial, right?
And they're similar.
They share the same properties.
What are these properties?
How should we think about that?
A synthetic data set, the way we use the term, should behave like the original if you were to query it, right?
And so if you were to build a dashboard off of this data set or to send it a SQL query,
you would get a very similar response for an aggregate statistic. So what is the average age of a person that likes to buy this product?
Or when did I have the spike in activity? Things like that will be very similar between synthetic
data set and the real world data set. There's a couple ways we measure this. You know, at first,
you create your first synthetic data set, you look at it, you're like, that looks great. I don't know
how it's going to work for me. And that's the first question that, you know, we always hear
from our users, the data set looks awesome.
You know, like when I look at it, it looks fine, but I don't know how accurate it is.
How do I measure that?
And so what we try to do, we have both like opinionated ways to measure the quality or
the accuracy of synthetic data and unopinionated ways.
The unopinionated ways make the most sense when you're just trying to create an artificial version of a dataset. You don't know how it's going to be used. So we don't know what
type of machine learning tasks it's going to be used for. So we can't measure that.
So what we do is we look at, I'll give you a couple of examples. We look at
the correlations that exist between each pair of fields in the original dataset and the
correlations that exist in the synthetic dataset. So if I knew, to go back to the movie review data set, right, that movies
with Keanu Reeves usually have really high ratings, I would expect another movie
in the synthetic data set with Keanu Reeves to have high ratings. And it goes much
deeper than just that two-level correlation, but that's the first thing,
you know, one of the things that we look at.
Another really helpful tool we have is called field distribution stability.
So we're looking at,
and this is just pretty common data science tactic.
A lot of people will do this.
We just automate it
where you might have a data set
that's got a hundred rows in it,
a hundred columns.
Essentially, we plot that.
It's something called PCA,
principal component analysis.
Now we plot it in a 2D plane, and we
look at the difference between the plots of
the synthetic and the real-world dataset, and we say,
when we map this 56
or 100-column dataset down to
two dimensions, how similar do these
distributions look? And that gives you a good insight as to
whether the model's overfitting, and
it's just repeating a couple things inside of there or if it's capturing a whole distribution.
And the third thing is probably the most intuitive way to think about it. And it's just looking at
the per-field distributions, right? If you've got admission times for people in an EHR data set,
or you're looking at a financial data set and you have open, high, low, close, you know, type data,
do the distributions of each one of those match what you're expecting to see?
And that's something where we try to automate the whole process and give you a single score that helps you reason about how well the model's working.
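For the field distribution stability idea, a hedged sketch of the PCA comparison might look like the following: fit the projection on the real data, project both datasets into two dimensions, and compare where the points land. This mirrors the technique as described but not the exact scoring Gretel uses, and it assumes numeric columns.

```python
import pandas as pd
from sklearn.decomposition import PCA

def pca_stability(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Project both datasets onto the real data's top two principal components
    and compare the resulting 2-D distributions by their means and spreads."""
    pca = PCA(n_components=2).fit(real)  # learn the projection on the real data only
    real_2d = pca.transform(real)
    synth_2d = pca.transform(synthetic)
    return pd.DataFrame(
        {
            "real_mean": real_2d.mean(axis=0),
            "synth_mean": synth_2d.mean(axis=0),
            "real_std": real_2d.std(axis=0),
            "synth_std": synth_2d.std(axis=0),
        },
        index=["PC1", "PC2"],
    )

# Large gaps between the real and synthetic columns suggest the model is
# overfitting (repeating a few patterns) rather than capturing the whole distribution.
```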
All right. And this is the unopinionated ways.
What are the opinionated ways?
The opinionated way is when you know how you're going to use a dataset.
If you're going to use it for downstream classification
training, regression, you're using it
for forecasting. We see this quite a bit in the financial space, right?
Where you want to use time-sensitive data to forecast
what a stock price is going to be. Things like that.
When you know how you're going to do that,
you can actually simulate running the
synthetic data on the same downstream forecasting use case as the real world data and compare the
two. There are some really great tools out there that make this easy. So there's a framework in
Python called PyCaret. A lot of our customers like to use that quite a bit. Essentially, it simplifies this process of testing how your synthetic data works on classification tasks or QA, question answering tasks and stuff like that versus the real world data is based on.
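One hedged way to run that opinionated check is the train-synthetic, test-real pattern: train the same downstream model once on real data and once on synthetic data, then score both against a held-out slice of real data. PyCaret automates comparisons like this; the sketch below uses plain scikit-learn to stay self-contained, and the model choice and split are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def downstream_gap(real_X, real_y, synth_X, synth_y, seed: int = 0):
    """Return (accuracy trained on real, accuracy trained on synthetic),
    both evaluated on the same held-out real data."""
    X_train, X_test, y_train, y_test = train_test_split(
        real_X, real_y, test_size=0.3, random_state=seed
    )
    trained_on_real = RandomForestClassifier(random_state=seed).fit(X_train, y_train)
    trained_on_synth = RandomForestClassifier(random_state=seed).fit(synth_X, synth_y)
    real_acc = accuracy_score(y_test, trained_on_real.predict(X_test))
    synth_acc = accuracy_score(y_test, trained_on_synth.predict(X_test))
    return real_acc, synth_acc  # a small gap suggests the synthetic data is usable
```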
Okay.
That's super interesting.
I can't help myself.
I have to ask you that.
Where is the boundary between synthetic data and prediction of the future at the end, right?
Because, and I am asking that because you are mentioning like, okay, like
financial models, for example, where you're using like synthetic data to go
and like run some models and do the, whatever they want to do there.
And also, we had the conversation earlier about all these things about trying to predict what should be the next
part of the text. So prediction is part of
the whole thing that we are doing here.
So what's the boundary there between trying to
predict what is going to happen at some point in the future, for example, or in some kind of
dataset and actually just creating, let's say, data that they share some common characteristics,
but at the same time they don't represent reality, right?
Great question.
I think where machine learning models in general
have a really hard time is dealing with data
that they've never seen before.
When there is a market event,
to go back to the financial world,
that never happened before,
it's unlikely that your machine learning model
is going to be proficient at detecting it.
But that said, history repeats itself.
So one of the really popular use cases
that we see in the financial space
is when you have rare events.
For example, the GameStop events
that happened where significant market changes happened due to something that has happened for the first time.
Crypto market crashes, things like that.
When you want to train your machine learning models to be good at detecting this, and you can only pass it a single example, once again, it's not going to do well.
So that is an area where I think synthetic data can really help today.
The way synthetic data can really help is that you can give it an example, saying,
like, hey, look what happened with GameStop.
I want you to create another 50 or 100 examples of something like that happening.
So it can be better detecting that if it happens in the future.
So those are artificial.
They're based on real world data and they're based off, you know, kind of learning off what happened in that
one example, but they're not perfect. But in many cases, we see that actually really helps.
And that's kind of one of the neat kind of patterns we're starting to see with
our journey, you know, kind of building synthetic data. I think that, you know,
we're in year three now at Gretel, and the first year was like, does it work on my data set, right? And the next year was like, okay, but how
does it work against my real data? And then now we're starting to see this kind of tipping point
where people are realizing that machine learning models are data hungry. There will always be
classes that your model isn't good at. So this idea of augmenting your real-world dataset with additional synthetic examples
that are perhaps trained off public data.
So has the world seen this before?
And can I incorporate some of that knowledge
into my own dataset?
That helps you build a better dataset
that can have better accuracy
than you would have all by itself.
That makes sense.
All right, so talking about data,
what are the types of data we are
talking about here, because we can have like a synthetic picture and we
can have a synthetic audio file, we can have a synthetic row in the table,
right, in the database.
So what are like the most common use cases that you see out there where like
synthetic data is like important today.
And by the way, I'm asking because you mentioned many times the natural
language processing part of this, so it's probably more textual, but we have other things there.
We have like time series data, we have structured versus unstructured data.
So yeah, I'm going to come across as, you know, probably a little biased here, because
I would say Gretel would be one of the leading companies, if not the leading company,
in working with tabular formats of data.
That's really where we got our start.
That's, you know, where we built from.
That said, our vision,
and I think the vision you described
for synthetic data,
is much bigger than one type of data.
Maybe to talk about the types of data
that we see being used for synthetics quite often,
I can even give some examples for different types,
but you have tabular data to start out.
The stuff that you have inside a data warehouse,
a database, things like that.
Time series data, which sounds at first like a
niche category of tabular data,
until you realize that for like 50% of the world's data sets,
time is such an important component
that it's one we actually treat differently.
Text, so natural language text for different languages.
And image synthetics are really big,
right? So people are using images quite often to train models for self-driving cars or to recognize problems in a manufacturing line or things like that. So a lot of use cases around that.
And I'd say increasingly getting into video and audio. So some of the new technologies recently, like stable diffusion, have really shown the
ability to create new variations or artificial versions of images and videos. For the companies
that are trying to build, say you're an insurance company and you're trying to build something
to give somebody a better insurance quote for their house, and you want
to look at the quality of the materials that they have, and does it look like they have
fire extinguishers and things like that around the house, just from a set of pictures,
you never have enough data to start with. So this idea of augmenting images: you might have a
room and you want to see a room with really fancy furniture, or with more like something that a college student might have, things like that.
So we're seeing a lot of use cases there.
And maybe the last place to touch on would be the simulation space.
So we're talking about today, we've talked a lot about generative models, machine learning model, neural network that creates new examples of things.
But there is, you know, in parallel to that, there is a simulation
space where you might use something like a computer game
engine. So Unity
would be a good example of this.
NVIDIA has
a neat product called the Omniverse
as well, where essentially they
have created a 3D world using a
game engine that you can use to create
and test these different kind of
simulation-based outcomes. Wow, that's super interesting. Okay, let's focus on tabular data. What do we mean by
tabular data? Tabular data is any type of, and I'll use the term here, and I'd love to hear what
you think on it too, like any type of structured or semi-structured data format. So it could be
anything from a CSV file, to looser formats
like JSON, where
you don't necessarily have the same level of structure,
but you can have arbitrary levels of nesting.
More advanced data formats
like Parquet that are really efficient
at encoding large amounts of data
or just data that's inside a database
or a data warehouse.
Okay.
When we are talking about creating synthetic data here, what is the most
common approach that you see out there? Is it like, okay, I have, let's say, a users table,
right?
With like 1 million users.
And I'd like to see like 2 million of these users, with the distribution of the users similar,
let's say in terms of the age or the geography or whatever kind of information we capture
already on this table.
Is this something that's the most common use case that you see out there?
Or people are actually coming and they're like, okay, that's my database here, right?
Like I have users and the users have, I don't know, products that they procured at some point.
And I also have, let's say, my inventory.
And I also have, like, let's say, we represent the whole domain of like what the company is dealing with or the user is dealing with, which can be quite complex, right?
And they would like to synthesize the whole database that is out there.
The relational component, so not just capturing the relationships that you have inside a single table, like your users table,
but capturing the relationships between the users and the inventory table, is a really
cool challenge in the synthetic data space.
Yeah.
To answer your question, when we have
users come in and use our platform
often,
especially if you're doing pre-production
testing. A really big use case we haven't talked
about yet is that
you are trying to build
a version of your production
environment that you might use inside a development or a staging environment.
You don't want to have real world data, but you want to have it reflect what's happening
in your production system.
So this allows any of your developers to use it, to hammer away, to like investigate different
records without worrying about privacy or things getting compromised or
anything like that. So in this use case, we have customers often that will create a twin version,
a dev or staging test version of a production database. Essentially, they'll queue it up,
depending on how recent they need to keep it, once an hour or once a day. We'll run the job,
we'll bring in all the new records, the records that have
changed, train a synthetic model on that
data and create another, essentially
create another
database that sits inside of your
near test or your staging environment.
The really neat thing is not just the
database you're getting here, you're getting a model.
And this model can be used to either
subset that data. So if you
have, you know, 2 billion records inside your production database
and you can't run that in your dev or staging environment
without having insane DynamoDB costs,
you can create a smaller data set that captures
as many of the variations as possible.
So it's much more efficient than just taking a slice of that data set.
It's more representative of it.
Or, and I think it's another really neat
use case for scale testing.
Yeah.
You want to test
the ability of your
application to handle
10 or 100 times
the amount of data
you might encounter
on a typical day
without just repeating
the same records
over and over again.
You can use that same model
to generate
new variations
of the data
that you can use to test.
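As a hedged sketch of those two uses of the trained model itself, the interface below is a made-up stand-in, not a real Gretel API: the same model either samples a small but varied subset for a cheap staging copy, or over-generates fresh variations for scale testing.

```python
class SyntheticModel:
    """Hypothetical stand-in for a generative model trained on a production table."""

    def sample(self, n_rows: int):
        raise NotImplementedError("replace with a real trained model")

PRODUCTION_ROWS = 2_000_000_000  # e.g. the 2 billion records mentioned above

def staging_subset(model: SyntheticModel, fraction: float = 0.001):
    # A small synthetic dataset that still covers the variation in production,
    # rather than a naive slice of the first N rows.
    return model.sample(int(PRODUCTION_ROWS * fraction))

def scale_test_data(model: SyntheticModel, multiplier: int = 10):
    # Fresh variations at 10x the usual volume, instead of repeating records.
    return model.sample(PRODUCTION_ROWS * multiplier)
```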
That's super interesting.
And how efficient is this process?
Can you take us through the developer experience?
Let's say I'm a developer, I have my database today,
and I want to go and try the limits of my production environment, right?
What am I going to be doing?
How am I going to be using Gretel to do that?
Yeah.
So this process with Gretel is two stages.
You've got your production database that's sitting there.
Let's say it's a Postgres database
or it's an Atlas database,
hosted or not hosted,
it really doesn't matter.
You want to create a version of it
for your lower production version.
There's two steps you need to do.
One, you don't want the synthetic data model
to memorize important customer information names,
customer IDs, and things like that.
So we have two steps.
These are both powered by cloud APIs,
so you can either run in the cloud.
Sometimes customers have really sensitive data requirements, so they need to run inside their own cloud. So you
can deploy essentially these workers as containers to your own cloud. But the two steps are one,
scan and use NLP to identify, for example, sensitive data, customer IDs, names, things like
that. From there, you have a policy that says, whenever I see this,
I'm going to redact it. I'm going to replace it with a fake version of it. I'm going to encrypt
it in place or whatever your company feels is appropriate. Often we see people using fake data
because it just, so the name, for example, my username, I might replace with another artificial
name to make sure the model doesn't learn it.
You have a risk there that even when you do that traditional de-identification,
the other attributes of your data set that by themselves aren't identifying, for example,
like my age, my location, a lot of times with advertising data, the precise location will put you right at somebody's house, right? It becomes very identifying when you put those
together. And that's the real power of these synthetic models is that they will create the artificial
versions of those things.
So you remove or replace the names inside your data set.
You create new artificial locations, shopping cart activity, like whatever you have inside
of your data set.
So the second stage is the data synthesis where you train a model and then you tell
that model, I want to generate 10 times as much data or I want to generate one
fifth as much data and you take the outputs and essentially put that right
back into your database so you create a twin database that you can use for testing.
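A very simplified sketch of that first, de-identification stage is below. The column names, the policy, and the use of the Faker library (assumed to be installed) are all stand-ins; Gretel's actual transform pipelines are configured rather than hand-written like this, and the second stage, training the synthetic model, is not shown.

```python
import re
import pandas as pd
from faker import Faker  # assumed available; only used to fabricate replacement values

fake = Faker()
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

# Policy: whenever one of these columns appears, replace its values with fake ones
# so the downstream synthetic model never sees (or memorizes) the real identifiers.
POLICY = {
    "name": lambda _: fake.name(),
    "email": lambda _: fake.email(),
    "customer_id": lambda _: fake.uuid4(),
}

def apply_policy(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for column, replacement in POLICY.items():
        if column in out.columns:
            out[column] = out[column].map(replacement)
    # Catch free-text emails that slipped into other string columns.
    for column in out.select_dtypes("object").columns:
        out[column] = out[column].astype(str).str.replace(EMAIL_PATTERN, "<redacted>", regex=True)
    return out
```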
Okay.
And so how long does that process take?
How long does it take to train this model?
That varies, and it varies based on what your use case is.
And so we have this kind of belief that there's no one machine learning model to rule them all,
and each one has different advantages. So if yours is a machine learning use case and you care about accuracy,
you'd want to use a deep learning generative model,
which gives you
the best performance of anything. Alternatively, you have GANs, so generative adversarial networks,
which don't offer quite the performance of our language models, but they're pretty fast,
faster at training. And we have built, and really working with customers when they have
tremendous scale they need to run at,
we've built statistical models.
So these are based on copulas.
So instead of using a deep learning technique, they use a mathematical, really neat kind of technique
to learn and recreate distributions and data.
So essentially based on your use case,
like do I care about accuracy?
Am I training a machine learning model on this?
Use a generative algorithm that might take an hour
to five or six hours to train on a dataset,
depending on the size of the dataset.
If you want speed and you want to generate data
at 100 megs per second so I can create 40 billion records
to test my dataset, that's where I really suggest using,
we call it our Amplify model, but the statistical
model.
And on a 32-core machine, we've clocked it at about 100 megs per second it can generate.
So if you're generating billions of records, it's entirely possible to do that within a
day instead of having to wait a month to have a model to do it.
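For a feel of what the statistical approach looks like, here is a bare-bones Gaussian copula sketch: learn each numeric column's marginal distribution plus the correlation between columns in a Gaussian space, then sample new rows. Real implementations such as the Amplify model described above are far more careful about categorical columns, heavy tails, and scale; this is only an illustration of the idea.

```python
import numpy as np
import pandas as pd
from scipy import stats

def copula_sample(real: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Fit a Gaussian copula to a numeric DataFrame and sample synthetic rows."""
    rng = np.random.default_rng(seed)
    cols = list(real.columns)
    # 1. Map each column to a standard normal via its empirical CDF (rank transform).
    ranks = real.rank(method="average") / (len(real) + 1)
    gaussian = stats.norm.ppf(ranks)
    # 2. Capture how the columns move together in the Gaussian space.
    corr = np.corrcoef(gaussian, rowvar=False)
    # 3. Sample correlated normals, then map back through each column's quantiles.
    samples = rng.multivariate_normal(np.zeros(len(cols)), corr, size=n_rows)
    uniforms = stats.norm.cdf(samples)
    return pd.DataFrame(
        {col: np.quantile(real[col], uniforms[:, i]) for i, col in enumerate(cols)}
    )
```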
Yeah, that makes a lot of sense. And like how, that's very interesting, actually.
Like there is like this trade-off between like, let's say, fidelity and time that you
have to spend like training the model, right?
Again, going back to what it means to represent with accuracy,
like the characteristics of the data that you have initially,
what does this mean from the user perspective?
How can I reason as a user about that stuff?
Because on a high level, it's easy to understand.
But I think that when you start working with a real example
and you have your own data out there, things are much harder to figure out.
So how you can reason about these things?
Often, it's somewhat, at the end of the day, dependent on the domain and the use case you're going after. And we see a lot of repeated domains that we talk to.
We've got a Discord chat channel
where we talk about things from life sciences
to things that we see in the advertising space,
another area that's really kind of picking up on data.
So we can reason about those.
I'd say between our top and most capable language models
and GANs versus our statistical models, you'll see about
a 10% decrease in accuracy. So were you to train that downstream data set as a classifier, the
model trained, you know, using the statistical method would be about 10% less accurate on average
than the one trained with the language model. And one of the things I can link to, that I'll link to you guys after the show here,
we run all of our models
against about 50 different data sets
and then compare the results
and the accuracy of each one.
And you can kind of see
how the state-of-the-art language model performs
versus a state-of-the-art GAN
versus the statistical model
and kind of make your own decision there.
We also are realizing that
so many of our users don't have time to make this decision. So, you know, we're introducing these things
called auto params that are on by default with many of our systems now. It just looks at the
size of the data set and it says, are you trying to generate, you know, to use that example again, 40
billion records? If you are, and you use auto, it will just pick the right algorithm to do this for you.
So increasingly,
I think our vision is that six months
from now people don't have to worry about
what model do I choose for this use case;
we just pick it based on what we've observed
for the data.
Alright
and one last question from me
because we are getting closer
to the
buzzer here, as Eric usually says, and I want to give him some time to
ask any questions that he has.
If I'm new to the synthetic data world, where should I look to learn more and play around with technologies or
tools or anything else out there that exists right now?
Yeah.
So in this world, I would, I mean, of course, would recommend starting with Gretel.
So just a quick thing on that.
And then I'll mention a couple other platforms
to check out as well.
Our underlying models and code are all open source.
So Gretel Synthetics on GitHub.
So you can see how they work.
You can introspect how we do privacy,
things like that.
Our service has a free tier.
So all you need is a Gmail or GitHub to sign up.
And we have example data sets.
So we have these low-code interfaces where you can just say, I'm trying to balance a
data set.
I'm trying to classify a data set, or create a synthetic version of the CSV that I have.
You don't have to write a single line of code.
You can do it yourself.
And that's where I always recommend starting because it just makes so much more sense after
you've tried it.
So that part is free.
I would also definitely recommend OpenAI has a really great playground
for ChatGPT and the OpenAI GPT models.
Just trying some prompts
or trying to send some data in
and tell it to summarize something for you
or create a list for something,
I think that kind of gives you a feel
for where models are today
and where they're going.
So that'd be another thing to try as well.
Awesome. Thank you so much.
Eric, the microphone is yours again.
Oh, wow.
So much power.
I feel so empowered. Yeah.
This has been such a fascinating conversation.
Alex, I want to pick your brain here
in the last couple of minutes
on your thoughts on sort of the impacts that these technologies will have.
You know, I think as we think about Gretel, you know, one example we talked about as we were prepping for the show is, you know, hospitals being able to share records around, you know, a particular disease, right?
In order to help researchers and medical professionals, you know, solve a problem,
you know, and help treat that disease or even cure the disease, which is really incredible.
And then you have, you know, sort of, I would say, things that are in a little bit more of a gray area with like Stable Diffusion, right? Or even ChatGPT, where the uses can vary, you know, widely, right? And can even be used for things that, you know, people would consider unethical, you know, sort of depending on what you're talking about. And you have really deep experience. I mean, I was thinking about this as Costas was talking,
and you were explaining a lot of the stuff. I mean, you have experience with, you know, intelligence in the government and building AI technologies, solving privacy problems.
Maybe a good way to frame my question would be, do you think about stewardship of these deep learning models and artificial intelligence in general? And if so, what are the things that are top of mind for you as we break new ground with deep learning and the ability to produce all these novel outputs?
Yeah.
Great question.
Maybe I have two parts to that question. Where could this be
transformational or what are we going to see across these
different technologies both for
sharing data or creating
data and then what are the ethical implications
we need to think about around that?
For where this
is going and the potential of it,
there is a very good chance, and I'll kind of back this up in a
second, to start with something a little bit more bold, that this will be the biggest
innovation to happen since cloud computing. And the reason I believe so is because these models
give you the ability to distill
and disseminate information
or intelligence in a way
that has never been possible before, right?
A natural language interface
you can query.
So I think that's huge
and we're just starting to see
the use cases for it.
Speaking of the data sharing use case,
for example, like with life sciences institutions,
things like that, right? Like, so data-driven healthcare and medicine is like, you know,
anyone in the space would say that is like the biggest potential for helping health, you know,
that they can see. The biggest limitation that they have is that often that data is siloed within a particular region. If you're trying to create a cure
that's going to work for people across the world,
but you only have access to one demographic,
for example, the UK Biobank,
how do you know that the signal you created
or found in that one population
is one that'll work everywhere?
So the power of this,
and we did a really cool study with Illumina
working on genomic data,
was showing that we could, in fact, synthesize one of the most complex data sets that's ever been created.
Sure.
We started with mice, which was kind of funny.
So, you know, even with the mice data, we were able to recreate the results of a popular research paper that had been created using that data, which was cool.
So a lot of work left to be done there, both on the sheer scale of human genomic data and then also the privacy. But the potential there is that a researcher anywhere in the world that has an idea on
how you can cure a rare disease could test that against every hospital in the world, which
would just be incredible.
So really exciting example there. On the ChatGPT and OpenAI approach,
I hear a lot, particularly about Stable Diffusion, that it's just for creative use cases.
It's just for kind of like messing around.
And I would challenge that and say, that's just where it is today. It's not going to be
there for long. Yeah, and what I think is missing right now is the confidence that you have
that the model is going to output what you're looking for, right? So you could say, like, you know, generate
a picture of me standing on a mountain drinking a coffee or something like that, right?
And maybe the first time it'll do it, and the second time, third time it won't. And in the data world, in the
conversations we have with our users, right, there are tons of applications for machine learning,
training machine learning models, based on being able to generate new images. But you have to have
confidence that what the model is outputting meets your expectations. So I think that's going to be
the next big thing there. But I do think that these models are going to, you know, in one way or
another, they're going to be everywhere, right? So whether it's creating more training data
for models, or summarizing a meeting that you had automatically for you at the end,
or things like that, you're going to see these models quite a bit. And the last part on ethics,
and, you know,
kind of stewardship of where this goes, it's an interesting question, particularly how you kind
of phrased it with the background around, you know, intelligence and things like that. And
when technologies exist, when they get created, they will inevitably, at some level, be abused.
So that will happen. And so I would personally vector a lot more towards openness, and relying on society to
solve these problems together, than having the risk of trying to control it, but then
essentially just creating a small set of governments and rich companies that have access to this
technology.
So I really kind of applaud the open source movement here and open source publishing and
research and things like that.
That approach, I think, is working well.
Things start to, you know, I think historically have gotten a little more problematic when
that gets closed off or limited.
And then you don't have the kind of the ability for a community to look at something and give
you an opinion on whether it's ethically correct or we should do something about it.
Sure.
Such insightful answers.
I will be considering those things
definitely for the rest of this week
and probably
long after. Alex, this has been
an unbelievably thought-provoking
show and we've learned a ton.
So thank you so much for giving us some of your time.
I appreciate it.
I think my big takeaway from this show, Kostas,
is that
Alex is number one, so approachable as a person, but number two, has such a variety of deep experience in the space, you know, from government intelligence to startups to, you know, delivering things at scale on a crazy timeline within AWS.
And so I just grew more and more to respect his opinion throughout the show,
which made his final thoughts on where these types of deep learning technologies are going,
I think, even more poignant for me. And I really
agree with him. I think it was a really fresh, honest take not to say, well, you shouldn't use
it for this, or you should use it for this. I mean, he acknowledged outright that these new
technologies are always used in ways that humanity probably shouldn't use them.
And doing things in the open is a really healthy antidote to that.
And so I really appreciated his perspective on that.
It sounded simple, but I think was very powerful
and something that I'll definitely keep from the show.
Yeah, 100%. I totally agree with that.
I mean, I think at the end,
especially when you're talking about technologies
or knowledge in general that can, like, I don't know,
like, change in a very foundational way,
like the way that we operate as humans.
Yeah, it might be scary.
Obviously, we can make mistakes and use the technology in the wrong way.
But at the end, that's how, I don't know,
humanity manages to make progress, right?
I don't think that we can change that.
And I don't think that there's that much value at the end
in not taking the risk of having access to these new tools or like this
new knowledge. And again, the best way to protect humanity is to make these things available to
everyone. So I totally agree. I think we are going to hear more about these technologies.
And okay, there's a lot of, let's say, also like kind of like hype right now.
And we are still like just scratching the surface of what can be done with these
technologies, but I have a feeling like in the next couple of months, we will see
like much more like practical and interesting uses of these technologies.
And we'll have more people on the show also to talk about that stuff.
Absolutely.
It was a great episode and we want to have them back on.
But like we said earlier, when we were wrapping up the year,
we want to talk more about some of these emerging technologies like ChatGPT
and Gretel that are forging new ground. So thank you for joining us. Subscribe if you haven't,
tell a friend, and we'll catch you on the next one. We hope you enjoyed this episode of the
Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new
episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.