Everyday AI Podcast – An AI and ChatGPT Podcast - EP 326: How Data Will Be AI’s Bottleneck
Episode Date: July 31, 2024

Win a free year of ChatGPT or other prizes! Find out how.

Yeah, AI is cool. But have you tried AI WITH good data?! If you're running into AI implementation bottlenecks, it could be your data to blame. Matthijs de Vries, Founder & CEO of Nuklai, joins us to tackle AI and data.

Newsletter: Sign up for our free daily newsletter
More on this Episode: Episode Page
Join the discussion: Ask Jordan and Matthijs questions on AI and data
Related Episodes:
Ep 268: AI’s Data-Driven Decision Paradox
Ep 145: NVIDIA Leader Talks GenAI + Data: Unlocking new ways to interact with our world
Upcoming Episodes: Check out the upcoming Everyday AI Livestream lineup
Website: YourEverydayAI.com
Email The Show: info@youreverydayai.com
Connect with Jordan on LinkedIn

Topics Covered in This Episode:
1. Data and Large Language Models (LLMs)
2. Practical Data Strategies
3. Data Quality Issues

Timestamps:
01:35 Daily AI news
05:00 About Matthijs and Nuklai
06:48 Data bottleneck hinders implementation of generative AI.
10:26 Start with a goal, leverage data effectively.
13:20 Collaborating on data is costly, causing limitations.
15:46 Standardize data access to improve overall efficiency.
18:46 Discussion on the use of synthetic data.
23:13 Challenges for small AI projects due to funding.
27:33 Crowdsourcing data important for future developments.
28:38 Data used to improve bread quality. Multiple purposes.

Keywords:
Everyday AI, Jordan Wilson, generative AI, data bottleneck, OpenAI, GPT 4, SAM 2, video segmentation, Meta, AI Studio, chatbot creation, Llama 3.1 model, Matthijs de Vries, Nuklai, structured data, unstructured data, Large Language Models (LLMs), AI implementation, data in silos, data consortiums, data pipelines, data collection, memory efficiency, synthetic data, crowdsourcing data, data quality, human-generated data, collaboration, data science, philosophy in data.

Send Everyday AI and Jordan a text message. (We can't reply back unless you leave contact info)

Start Here ▶️
Not sure where to start when it comes to AI? Start with our Start Here Series. You can listen to the first drop -- Episode 691 -- or get free access to our Inner Circle community and all episodes: StartHereSeries.com
Also, here's a link to the entire series on a Spotify playlist.
Transcript
This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips.
Listen daily for practical advice to boost your career, business, and everyday life.
Meet Firefly AI Assistant, now live in Adobe Firefly, the all-in-one creative AI studio.
Just describe what you want to create and the assistant handles the rest,
orchestrating multi-step workflows across Photoshop, Premiere Express, and more in one conversational interface.
You direct the outcome.
The assistant accelerates execution.
Is data the next big bottleneck when it comes to generative AI?
I mean, we've heard about a lot of these things that slow AI down when it comes to companies
actually using it, right?
First, it was should we use generative AI and large language models?
And then it was, how do we use it?
Well, now that most companies, especially here in the U.S., understand that they need to be using
generative AI, is data that next big bottleneck?
We're going to be talking about that today and answering those big questions and your questions.
So thank you for tuning in, and this is Everyday AI.
What's going on, y'all?
My name's Jordan, and I'm the host of Everyday AI, and this is for you.
This is your daily live stream podcast, free daily newsletter, helping us all learn and leverage
generative AI to grow our companies and to grow our careers.
So I'm excited today to talk about the elephant in the room
and how data will be AI's next big bottleneck.
Before we get started, as a reminder, if you're joining us in the podcast, thank you, as always,
check out your show notes.
You can always read our recap for today's show.
We always share more insights as well as what's going on in the world of AI news.
Before we get to the AI news, as a reminder, also in the newsletter, check out our Thanks a Million
giveaway to celebrate Everyday AI hitting a million downloads.
You don't want to miss that giveaway.
All right, let's get into the AI news
for today. So OpenAI has launched Advanced Voice Mode for a select few ChatGPT Plus users.
So OpenAI has finally begun rolling out their Advanced Voice Mode, initially available to only a small group of ChatGPT Plus users.
So this new feature, it's under the GPT-4o kind of umbrella, delivers hyper-realistic audio responses and is expected to be available to all
Plus users by fall 2024.
When first demonstrated in May, GPT-4o's voice named Sky caused quite a stir due to its striking
resemblance to actress Scarlett Johansson, who subsequently took legal action.
OpenAI has since removed the Sky voice and delayed the release to enhance safety features.
Advanced Voice Mode allows ChatGPT to talk and listen using a single multimodal model,
reducing latency and improving conversation quality.
This alpha version that's rolling out will not include, though,
the video and screen sharing capabilities shown in OpenAI's Spring Update,
which will be released later.
OpenAI claims that this new Advanced Voice Mode can detect,
you know, emotion in your voice, such as sadness or excitement.
And it is wild, right?
If you've seen the demos, it is, I think, a whole next step in looking at generative AI
more as an assistant than just a large language model.
Speaking of models,
Meta has released a new multimodal AI
that can segment video, SAM 2.
So Segment Anything Model 2,
SAM 2 for short,
has been released by Meta,
extending the capabilities of the original SAM to video.
So the new SAM 2 can segment any object
in an image or video and track it consistently in real time
across all video frames.
Pretty, again, pretty wild breakthrough technology.
So for that, we'll have more in the newsletter.
Speaking of Meta, they also revealed AI Studio for personalized chatbot creation.
So Meta, the parent company of Facebook, Instagram, and WhatsApp, has just released a new suite or a new tool called AI Studio,
aimed at enabling users to create and share personalized AI chatbots.
So Meta introduced this tool that really just allows anyone to do that.
You can go on their platform and we'll have the link, you know, in our newsletter, drag and drop.
You don't have to know how to code.
And also, you can use these in Instagram.
So Instagram creators can use these AI characters to handle common direct message questions
and story replies acting as an extension of themselves.
And obviously this new AI Studio is powered by Meta's Llama 3.1, their latest and pretty
impressive open source model.
So pretty big news there from Meta and ChatGPT.
There's always going to be more: Midjourney 6.1 dropped, AMD's big stock jump.
We'll have all that in the newsletter.
All right.
So enough about AI news.
Let's go ahead and get to the big topic for today,
which is how is data going to be slowing companies down
and what we can ultimately do about it.
So please help me welcome to the show today,
I'm excited to have,
there we go,
Matthijs de Vries, who is the founder of Nuklai.
Matt, thank you so much for joining the Everyday AI show.
Super excited to be here.
Thank you so much for having me.
And I'm really looking forward to just have fun talking about data today.
All right.
Same. And hey, for our live stream audience, appreciate y'all tuning in, from Tara to Rolando and Fred and Douglas and Daniel and everyone in between.
If you have a question about data and AI, please get it in. Now, before we dive into the topic, Matt, tell us a little bit about what you all do at Nuklai.
Yeah, so we are building infrastructure for everything around data. That means that we identified a couple of problems in many different industries where you want to innovate on
data. Imagine you want to do something with AI, even if it's generative AI, you need data for that.
And that data is usually locked in silos. It's in many different data sources. You have to write a lot of
adapters. So we provide a layer to have generic access to all the data without having to deploy
new infrastructure, having to deploy another ETL or whatever. And additionally, we make it very
easy to add context to the data so that that data is understood at the same level across all
data sources or data sets so that you can innovate very easily on top of that, create many
different kind of data pipelines that can eventually lead to the creation of novel AI models
or to train existing models, including LLMs.
And you know, you kind of brought up a very good point there and that's kind of the crux of
this show is data is so important for large language models. Why do you think, right? Let's just kind of
skip to the end. Why has data become such a big bottleneck for AI implementation? Because, you know,
before this big generative AI wave, right, we all heard, oh, data is the next oil or whatever it is.
So, you know, we've known about the importance of data for decades. So why is it still this big
bottleneck for companies to implement generative AI around their data?
Yeah, I think, you know, first of all, I think we already passed the point of calling data
the new oil.
I think really data is already our oxygen because, frankly, with many of the things in our life,
we can't live without data anymore.
Like, we can't function as a society normally anymore without data.
Everything that we do, we're using our phone, taking public transport, we use data, right?
So it's really the new oxygen.
And I think that introduces a new problem because the data landscape is so extremely fragmented that there is not a lot of access to it.
Like it serves the purpose and that's it.
And that leads to a lot of siloed data, and that siloed data, that boring structured data, that's actually very necessary for the next generation of generative AI to hallucinate less.
Because if you look at the generative AI landscape today, you know, it tries to come up with really
proper answers based on a conversation that makes sense, right, based on the next word prediction
or word part prediction, without taking the really factual data into account. And having more
access to structured data sets will definitely help improve on that level.
And we talk about this a lot
here on the show, Matt, but maybe for people who are tuning in for the first time or don't know
the difference, can you explain simply what's the difference between structured data and unstructured
data?
Absolutely. So structured data is best compared with what you have in Excel. So if you have
a boring Excel where in the first row you have just column names and below that you have data,
numbers, yes and no, dates, you know, it's predictable. The next row is the same as the row
before that. It's structured. Unstructured data is typically what you would find on an internet
forum, a blog article. Like it's text, but it's not formatted in a specific
structure. So you have to understand the context or you have to take the whole text as a text,
which makes LLMs perfect conversationalists, but the lack of structured data makes them very lacking in
proper fact-based conversation. So that's the difference.
Yeah. And yeah, thank you for that
explanation. And I think that companies that, you know, have been using AI, because, you know,
AI is not new. Machine learning's not new. It's been used for many decades. But this concept of using
unstructured data, I would say is newer, right, with generative AI and large language models.
That's where they truly shine being able to work with this unstructured data.
You know, with that in mind, you know, a great question to start off the bat here from Douglas.
So thanks for this. So he's asking, how do you recommend people who do not know where to start
leveraging their data? That's great. Yeah, companies, if they're maybe smaller and they haven't had a huge
data team for a long time, where do they start?
I mean, you have to start with a goal
in mind, right? If you want to leverage your data, what do you want to leverage it for? So let's assume
that in most cases, the data will have to be used for a large language model integration. I think
there are a lot of use cases for support chatbots and stuff like that is where companies start
nowadays. And having predictable access to the data across multiple data sets or data sources is
a good start, and adding context to that data. Like going back to an example of the spreadsheet,
like the Excel spreadsheet where you have just a bunch of columns and rows, that first column name,
that's your data point, the column name. And that tells you something about that data, but not
everything. And this is why we think that metadata is extremely important that further contextualizes the
data. In fact, we did a proof of concept with integrating an LLM on top of our data infrastructure so that
you could prompt the LLM with a question and it would see what kind of data source or data set
would it need to answer this question or a combination of different data sets that will give back that
fact-based answer. But it wouldn't do that properly without having that additional
context on top of that data. So back to that Excel spreadsheet, you have the column names.
Now imagine that you add descriptions to those column names. You can explain what kind of
limitations that data has. Like if it's just a column full of numbers, what do these numbers mean?
Is there a limit? Is there a range or an exception? So if you start deeply explaining that,
that kind of context will lead to better answers. So it's having generic access to the different
data sets that you already have, and adding
context to that data.
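The idea of attaching descriptions and limits to column names can be sketched in a few lines of Python. This is a minimal illustration, not Nuklai's implementation: the `ColumnInfo` class, the `describe_columns` helper, and the sample columns are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnInfo:
    """Hypothetical metadata record for one spreadsheet column."""
    name: str
    description: str
    unit: str = ""
    valid_range: Optional[tuple] = None  # (low, high), if a range is known


def describe_columns(columns: list) -> str:
    """Render column metadata as plain text that could be placed in an LLM prompt."""
    lines = []
    for col in columns:
        line = f"- {col.name}: {col.description}"
        if col.unit:
            line += f" (unit: {col.unit})"
        if col.valid_range is not None:
            low, high = col.valid_range
            line += f" (expected range: {low} to {high})"
        lines.append(line)
    return "\n".join(lines)


# Made-up example columns: a bare name like "temp" says little on its own.
columns = [
    ColumnInfo("temp", "Oven temperature at the time of reading", "celsius", (150, 260)),
    ColumnInfo("ok", "Whether the batch passed inspection (yes/no)"),
]
context = describe_columns(columns)
print(context)
```

The resulting text can be passed along with a question so that a model knows what each column means before it chooses which data set to query.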
Yeah, that's a great point because all data, it does have a story.
It has a meaning.
It has a reason why it can help or hurt your business.
So I love what you said there, Matt, about, you know, essentially start explaining your
data, which, you know, oddly enough is something large language models can help with, right?
But even before we get into that, I want to, you know, hit reverse here and tackle this problem,
even a little bit more, right?
So, you know, I love what you said about data isn't oil.
It is oxygen, right?
Businesses need data in order to breathe and survive.
But why are companies not able to, you know, kind of grab their data out of silos, right?
Because, yeah, companies will, you know, we consult with companies all the time.
They're like, oh, here's our Salesforce data or, you know, here's data from this source.
But they just live there.
So why is that a problem, data being in silos, and how can companies start to kind of fix that issue?
Yeah, actually, so companies can already collaborate and innovate on data.
A lot of bigger companies, they already leverage the data and smaller companies do too.
I don't think that there are a lot of technical limitations.
I think the biggest limitation is cost.
It's very expensive to do something with data.
and very expensive to collaborate with others,
especially enterprises that have really complicated data infrastructure
in very big data ecosystems with a lot of rules and exceptions
and standards and frameworks,
and it's very hard to be interoperable with these standards
as a smaller business.
So that's where the cost comes in.
And even for the bigger enterprises,
like I've talked to a lot of enterprises about data consortiums,
where you have a group of enterprises,
or even also small and medium
businesses, even cities, governments,
inside of this data consortium,
and they want to work together on a use case
or platform creation on top of all
the collective data that they put in.
The biggest problem, why
most of these data consortiums fail,
is funding. They
stop being funded at some point
because it's getting too expensive. And why
is that? Because they have to bring
in one of the biggest consultancy
firms like Capgemini or
Cognizant inside of this data consortium
to build all the adapters
and build all the expensive data pipelines.
So most of this money is going into the hours
that are written by the consultancy firms.
And we are speaking about hundreds of millions
before something tangible is coming off the ground.
So we need something that is really low effort,
a very low barrier for both enterprises and
small and medium businesses to start leveraging the data
without having a big investment.
That also allows trying new things
without investing a lot of money,
so that you can get started and test out things early.
How should, you know, let's say someone here works at a, you know, medium-sized company, you know, here in the U.S.
So, you know, they're not working with a Capgemini and maybe, you know, they have a decent hold on their data,
but it is still a little bit siloed.
What questions should they be asking themselves or what steps should they be taking, specifically
when it comes to better preparing and better organizing their data for large language models.
Yeah, so I suggest to find a way to first have the same type of data across all the data sources and datasets.
So what I mean is that if you have CSV files there, you have a MongoDB there, and you have a MySQL DB there,
you have to access all that data in a different way.
So first, you know, try to get a layer,
and this is what we at Nuklai
specialize in and focus on,
is have this layer where all this data is onboarded very easily,
and you can just have one SQL query to access, you know,
a CSV file, join it with data from a MongoDB,
without discriminating what that context of the data engine is.
And that way you have, you know,
that first, if that first step is done and it doesn't cost you a lot, you're already halfway
for the innovation because you have, you can tap into all that data and start playing around with
it. A good data scientist is crucial to have on board.
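The "one SQL query over several sources" idea can be illustrated with a small sketch. It assumes a much simpler approach than a real access layer: it copies a CSV source and a list of documents (standing in for what a MongoDB collection might return) into one in-memory SQLite database and joins them there. The sample data, table names, and `build_unified_db` helper are hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file on disk (made-up sample data).
CSV_TEXT = "customer_id,name\n1,Ada\n2,Grace\n"

# Stand-in for documents that would come from a document store such as MongoDB.
ORDER_DOCS = [
    {"customer_id": 1, "total": 30.0},
    {"customer_id": 1, "total": 12.5},
    {"customer_id": 2, "total": 99.0},
]


def build_unified_db() -> sqlite3.Connection:
    """Load both sources into one in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")

    rows = csv.DictReader(io.StringIO(CSV_TEXT))
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(int(r["customer_id"]), r["name"]) for r in rows],
    )
    # Named placeholders let each document (a dict) be inserted directly.
    conn.executemany("INSERT INTO orders VALUES (:customer_id, :total)", ORDER_DOCS)
    return conn


conn = build_unified_db()
# One query joins data that originally lived in two different kinds of sources.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(totals)
```

Copying everything into one engine is only practical for small data; a production layer would more likely query each source in place, but the single-query experience for the analyst is the same.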
You know, a good question here from Tara.
This one's a little specific, Matt, but I think it begs a good question.
So she's asking what strategies can be employed to improve memory efficiency when working with
datasets containing millions of rows?
Yeah, that's a great question, especially when these data sets become so big.
What steps can companies be taking, if any, to kind of counteract this issue?
So to be honest, that's not really my specialty.
So I don't know if I would be answering this question properly.
but there is a lot of free, free or very cheap technology out there that can work with this data very efficiently.
I don't think that data sets containing millions of rows should be any limitation anymore.
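One commonly used way to keep memory flat with large row counts, offered here as an illustration rather than something the guest described, is to stream rows and keep only running totals. The sketch below uses Python's standard `csv` module; the `running_mean` and `generate_csv_lines` helpers and the generated data are hypothetical.

```python
import csv
from typing import Iterable, Iterator


def running_mean(rows: Iterable[dict], column: str) -> float:
    """Compute a column mean while holding only one row in memory at a time."""
    count = 0
    total = 0.0
    for row in rows:
        total += float(row[column])
        count += 1
    return total / count if count else 0.0


def generate_csv_lines(n: int) -> Iterator[str]:
    """Yield CSV text line by line, standing in for a very large file."""
    yield "id,value\n"
    for i in range(n):
        yield f"{i},{i % 10}\n"


# csv.DictReader consumes any iterable of lines, so nothing is loaded up front.
reader = csv.DictReader(generate_csv_lines(100_000))
mean_value = running_mean(reader, "value")
print(mean_value)
```

With a real file, passing the open file object to `csv.DictReader` gives the same row-at-a-time behavior.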
That's a good point.
All right.
Let's get to something a little bit more of your specialty, which, you know, I want to get a little bit more into the data that makes up models themselves, right?
If you follow large language models or read our newsletter, you saw that there's been this kind of divide recently about the quality of the data in the models themselves, right?
Because when companies are trying to really implement generative AI across different sectors of their organization, they're not only trying to, you know, maybe fine-tune or bring in their own data with RAG, but ultimately they are relying heavily on the actual data set of these large language
models. But we've heard recently with synthetic data, we thought it was bad, but, you know,
Meta and others have come out and said, no, you know, when you use synthetic data, it
can actually help your models improve. What are your thoughts on this, Matt, and what should business
owners and decision makers be aware of when it comes to synthetic data in large language models?
So actually, in fact, I've always been a believer in the value of synthetic data, but I also have to
honestly admit that it kind of shocked me when I read last week that the quality of the
LLM actually improved by bringing in more synthetic data than without.
It was actually, for me, it was counterintuitive.
I would not guess this.
So for me, you know, the question immediately came, like this is largely about unstructured data.
And I'm very curious because we actually work together with a partner that converts existing
structured data into synthetic structured data sets in order to scramble the data so that you cannot
you know, get personally identifiable information out of there, and the statistical qualities remain the same
in that synthetic data. So that was always the goal. That was a very clear use case for synthetic data.
But now I'm very curious, and we haven't gotten to testing it yet, but now I'm very
curious if that structured synthetic data, bringing more of that
into AI model training, if that will also yield the same results.
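A toy version of "synthetic data that keeps the statistical qualities" can be sketched as follows. It assumes a single numeric column and fits a normal distribution to it, which is far simpler than what a real synthetic data tool does; the `synthesize` function and the sample ages are made up for illustration.

```python
import random
import statistics


def synthesize(values: list, n: int, seed: int = 0) -> list:
    """Draw n synthetic values from a normal distribution fitted to the originals."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Made-up "real" column; no synthetic value corresponds to a specific person.
real_ages = [23, 35, 41, 29, 52, 47, 38, 31, 44, 36]
synthetic_ages = synthesize(real_ages, n=10_000)

print(statistics.mean(real_ages), statistics.mean(synthetic_ages))
print(statistics.stdev(real_ages), statistics.stdev(synthetic_ages))
```

This only preserves the mean and spread of one column and makes no privacy guarantee by itself; preserving relationships between columns requires more careful modeling.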
Yeah, same, right?
I've always had this thought in my head, not just with synthetic data.
And, you know, if you are new here, that's just essentially, you know, artificially generated
data that mimics real world data that is then used in models, right?
So you can make models a little bit, you know, technically cheaper, bring down inference costs,
et cetera.
But I've always said it's also going to be a problem, right,
if so much of the data and the content in these large language models is ultimately just
kind of regurgitated, right?
Studies say that more than 90% of new information posted online by 2026 is going to be
coming from large language models.
Even on that end, right, just the data quality of what goes in outside of, you know,
synthetic data, Matt, are there problems there?
Are there problems with, you know, hey, yeah, all these large language
models scrape the internet, and is the internet just getting a little worse because people are
over-reliant on large language models?
So my expectation would be that, because that's also kind
of a type of synthetic data, all this generative AI created data, right, having more and more
of this data might lead in the short term to better models at first, but we will hit the ceiling for sure.
And by then, we will need more high-quality data.
And I think that's also immediately the elephant in the room that we want to discuss today.
Like the bottleneck of next generation AI is going to be around data.
Because it's going to be more difficult to obtain high-quality and true data that has been generated by human, right?
And it's also going to get very philosophical in a moment,
because when we have all of this structured,
sorry, unstructured data out there that's human written,
it starts to become more valuable.
And a problem that we've seen so far is that the data that has been scraped
hasn't been paid for.
So you saw The New York Times suing OpenAI for using the data without paying for it.
So it's going to be increasingly difficult to create
next-generation budget generative AI models that are going to be able to go beyond
that ceiling that I just introduced. So it's going to be a problem that, I mean, I find it a problem
that only the big companies will be able to buy their way out of there. So you'll have the large
organizations like Meta, like OpenAI, that will have the funding in order to pay for that
data, but the smaller ones, the open initiatives, community-driven initiatives, will hit that
ceiling and it will be very hard for them to pass. And now, for a little bit of philosophy,
you could argue, and this is about synthetic data again, you could argue that everything that
we have ever come up with as humans and that we will ever come up with as humans is in one
way or another already out there.
And I say that because a lot of innovation is just bringing existing things together
to come up with something new.
So in that sense, it makes sense to say, like, it doesn't matter if there is too
much LLM-generated data or too much synthetic data out there.
It's just how you're going to mix that together in order to find new creative ideas, new
innovations.
So this is a little bit of philosophy from...
Hey, let's go down this philosophy route. I like that. You know, everything that you can come up with as humans has already been out there. You know, that's a good point. But I think maybe, you know, especially if we're talking about smaller companies, smaller to medium-sized companies, when they look at data, right, they think data science, they think business intelligence. And, you know, oftentimes, you know, smaller companies don't have big resources to devote to that piece.
When they think of data, they think of big data and they say, oh, you know, that's not necessarily
for us.
So maybe however we use large language models is going to not be as impactful as those
companies that have their own, you know, first party data.
How can even smaller or medium-sized companies still kind of, you know, break down this elephant
in the room and break through this bottleneck and still create valuable, you know, first
party or first company data.
How can they do that if, you know, going back to your philosophy question there is, you know,
essentially saying, hey, all this data that, you know, that big companies have can kind of
already exist in a way, shape, or form.
Exactly.
Exactly.
And I think, I think the crucial key component here is collaboration.
Like, I don't think, because I really believe in that ceiling and that the smaller initiatives
will hit that ceiling much quicker than the companies like Meta and OpenAI and stuff like that.
But through collaboration, it will be easier to break through that ceiling.
And also at Nuklai, we chose collaboration over competition.
We have around 40, 45 partners that we collaborate with, each, you know,
playing a unique role in what we try to solve and what we offer.
You know, I have a hot take, Matt.
You know, normally I don't do this, but, you know, we're talking data and we're talking about
how important it is for, you know, companies and for large language models.
One of my hot takes is this, is that we're going to start looking for data in places that
we generally wouldn't look for, especially when it comes for our company or, you know,
large language models.
That could be, I think, large-scale university studies.
It could be, I think, talking to employees, right, and capturing
their knowledge and turning that into unique first party or first company data.
Is that a crazy thought, to say that that could be not a future, but the future of data
when it comes to large language models and generative AI?
But is that a path forward that's worth exploring, just bringing in this more human,
human level data?
Actually, I would argue that this isn't a hot take, it's just simply consensus, man.
I mean, look, I think crowdsourcing data, which is kind of, you know, really broadly describes what you describe.
When you ask something from your employee, it's like kind of crowdsourcing data.
I think crowdsourcing data is going to be much more important in the short-term future and is going to be a much bigger thing, whether it's about polling or stuff like that.
And when it comes to trying to find data out there,
looking into places that you normally wouldn't,
I think there are some cool examples out there where,
you know,
once I talked to an organization,
they had a factory where they were baking bread.
And there was data generated by the machines.
And that data was used in order to monitor the machines
and see when they had to have maintenance.
And then some software company was hired
that was really specialized in innovation,
typically innovation on data.
And they said, let's look at that data
and see if we can find patterns
that would indicate a better quality of bread
that is being baked.
So then they used that data for the purpose
of improving the quality of the food
instead of the initial purpose,
which was already served,
after which the data was thrown away.
And that's really cool.
I mean, that's how you look at data
where you normally wouldn't look.
And I think that kind of multi-purpose
data, it's out there everywhere already, and we will start scrambling more and more to get
access to that data. And the funny thing is that it's not necessarily super valuable data.
It's not like you will have to sell it for big bucks. So access to it should hopefully
be very easy.
So Matt, we've covered a lot in today's episode. We've talked about how data's
not oil, it's oxygen, how companies should be giving a story to their data, pros and cons of
synthetic data, and even, you know, this concept of crowdsourcing data. So, you know, as we wrap up
today's show, how, what's kind of your one biggest takeaway for business leaders, decision
makers out there in order for data to not become a bottleneck in their AI strategy?
Again, don't be scared to put your data out there and focus on getting this data as early on as possible in a generic way and immediately putting context to that.
If you're a small business, you're just starting to work with data, start doing this already, like today.
even if you don't have an AI or data strategy yet, like start doing this.
And collaboration.
Collaboration is key.
I love it.
So many great insights in there.
And yeah, like, I love this.
Data is not oil.
It is oxygen.
So if you want your business to grow, you got to know your data.
And you also have to keep tuning in to great episodes like this.
So thank you so much, Matt, for joining the Everyday AI show.
We appreciate your time.
The pleasure is all mine.
It was super cool, man.
Thank you so much.
And everyone who's watching right now, have an amazing rest of the day.
All right.
And hey, as a reminder, a lot of great information in there.
We're going to be recapping it all in our newsletter.
So make sure you go to YourEverydayAI.com.
Sign up for that free daily newsletter.
And also, while you're there, make sure, if you haven't already, to join our Thanks
a Million giveaway.
Appreciate y'all tuning in.
So we'll see you back tomorrow and every day for more everyday AI.
Thanks, y'all.
And that's a wrap for today's edition of Everyday AI.
Thanks for joining us.
If you enjoyed this episode, please subscribe and leave us a rating.
It helps keep us going.
For a little more AI magic, visit YourEverydayAI.com and sign up to our daily newsletter so you don't get left behind.
Go break some barriers and we'll see you next time.
