Everyday AI Podcast – An AI and ChatGPT Podcast - EP 326: How Data Will Be AI’s Bottleneck

Starting point is 00:00:00 This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life. Is data the next big bottleneck when it comes to generative AI?

Starting point is 00:00:50 I mean, we've heard about a lot of these things that slow AI down when it comes to companies actually using it, right? First, it was, should we use generative AI and large language models? And then it was, how do we use it? Well, now that most companies, especially here in the U.S., understand that they need to be using generative AI, is data that next big bottleneck? We're going to be talking about that today and answering those big questions and your questions. So thank you for tuning in, and this is Everyday AI.

Starting point is 00:01:23 What's going on, y'all? My name's Jordan, and I'm the host of Everyday AI, and this is for you. This is your daily live stream podcast, free daily newsletter, helping us all learn and leverage generative AI to grow our companies and to grow our careers. So I'm excited today to talk about the elephant in the room and how data will be AI's next big bottleneck. Before we get started, as a reminder, if you're joining us on the podcast, thank you, as always, check out your show notes.

Starting point is 00:01:48 You can always read our recap for today's show. We always share more insights as well as what's going on in the world of AI news. Before we get to the AI news, as a reminder, also in the newsletter, check out our Thanks a Million giveaway to celebrate Everyday AI hitting a million downloads. You don't want to miss that giveaway. All right, let's get into the AI news for today. So OpenAI has launched advanced voice mode for a select few ChatGPT Plus users. So OpenAI has finally begun rolling out their advanced voice mode, initially available to only a small group of ChatGPT Plus users.

Starting point is 00:02:27 So this new feature, it's under the GPT-4o kind of umbrella, it delivers hyper-realistic audio responses and is expected to be available to all Plus users by fall 2024. When first demonstrated in May, GPT-4o's voice named Sky caused quite a stir due to its striking resemblance to actress Scarlett Johansson, who subsequently took legal action. OpenAI has since removed the Sky voice and delayed the release to enhance safety features. Advanced voice mode allows ChatGPT to talk and listen using a single multimodal model, reducing latency and improving conversation quality. This alpha version that's rolling out will not include, though,

Starting point is 00:03:14 the video and screen sharing capabilities shown in OpenAI's spring update, which will be released later. OpenAI claims that this new advanced voice mode can detect, you know, emotion in your voice, such as sadness or excitement. And it is wild, right? If you've seen the demos, it is, I think, a whole next step in looking at generative AI more as an assistant than just a large language model. Speaking of models,

Starting point is 00:03:43 Meta has released a new multimodal AI that can segment video, and it's SAM 2. So Segment Anything Model 2, SAM 2 for short, has been released by Meta, extending the capabilities of the original SAM to video. So the new SAM 2 can segment any object in an image or video and track it consistently in real time

Starting point is 00:04:05 across all video frames. Pretty, again, pretty wild breakthrough technology. So for that, we'll have more in the newsletter. Speaking of Meta, they also revealed AI Studio for personalized chatbot creation. So Meta, the parent company of Facebook, Instagram, and WhatsApp, has just released a new tool called AI Studio, aimed at enabling users to create and share personalized AI chatbots. So Meta introduced this tool that really just allows anyone to do it. You can go on their platform, and we'll have the link, you know, in our newsletter, and drag and drop.

Starting point is 00:04:46 You don't have to know how to code. And also, you can use these in Instagram. So Instagram creators can use these AI characters to handle common direct message questions and story replies, acting as an extension of themselves. And obviously this new AI Studio is powered by Meta's Llama 3.1, their latest and pretty impressive open source model. So pretty big news there from Meta and ChatGPT. There's always going to be more: Midjourney 6.1 dropped, AMD's big stock jump.

Starting point is 00:05:23 We'll have all that in the newsletter. All right. So enough about AI news. Let's go ahead and get to the big topic for today, which is how is data going to be slowing companies down and what we can ultimately do about it. So please help me welcome on the show today. I'm excited to have.

Starting point is 00:05:43 There we go. Matt de Vries, who is the founder of Nuklai. Matt, thank you so much for joining the Everyday AI show. Super excited to be here. Thank you so much for having me. And I'm really looking forward to just having fun talking about data today. All right. Same. And hey, for our live stream audience, appreciate y'all tuning in from Tara to Rolando and Fred and Douglas and Daniel and everyone in between.

Starting point is 00:06:11 If you have a question about data and AI, please get it in. Now, before we dive into the topic, Matt, tell us a little bit about what you all do at Nuklai. Yeah, so we are building infrastructure for everything around data. That means that we identified a couple of problems in many different industries where you want to innovate on data. Imagine you want to do something with AI, even if it's generative AI, you need data for that. And that data is usually locked in silos. It's in many different data sources. You have to write a lot of adapters. So we provide a layer to have generic access to all the data without having to deploy new infrastructure, having to deploy another ETL or whatever. And additionally, we make it very easy to add context to the data so that the data is understood at the same level across all data sources or data sets, so that you can innovate very easily on top of that, create many

Starting point is 00:07:06 different kinds of data pipelines that can eventually lead to the creation of novel AI models or to train existing models, including LLMs. And you know, you kind of brought up a very good point there, and that's kind of the crux of this show: data is so important for large language models. Why do you think, right? Let's just kind of skip to the end. Why has data become such a big bottleneck for AI implementation? Because, you know, before this big generative AI wave, right, we all heard, oh, data is the next oil or whatever it is. So, you know, we've known about the importance of data for decades. So why is it still this big bottleneck for companies to implement generative AI around their data?

Starting point is 00:07:58 Yeah, I think, you know, first of all, I think we already passed the point of calling data the new oil. I think really data is already our oxygen because, frankly, with many of the things in our life, we can't live without data anymore. Like, we can't function as a society normally anymore without data. Everything that we do, we're using our phone, taking public transport, we use data, right? So it's really the new oxygen. And I think that introduces a new problem, because the data landscape is so extremely fragmented that there is not a lot of access to it.

Starting point is 00:08:34 Like it serves the purpose and that's it. And that leads to a lot of siloed data, and that siloed data, that boring structured data, that's actually very necessary for the next generation of generative AI to hallucinate less. Because if you look at the generative AI landscape today, you know, it tries to come up with really proper answers based on a conversation that makes sense, right, based on the next word prediction or word part prediction, without taking the really factual data into account. And having more access to structured data sets will definitely help improve on that level. And we talk about this a lot here on the show, Matt, but maybe for people who are tuning in for the first time or don't know the difference, can you explain simply what's the difference between structured data and unstructured

Starting point is 00:09:23 data. Absolutely. So structured data is best compared with what you have in Excel. So if you have a boring Excel where in the first row you have just column names and below that you have data, numbers, yes and no, dates, you know, it's predictable. The next row is the same as the row before that. It's structured. Unstructured data is typically what you would find on an internet forum, a blog article. Like it's text, but it's not formatted in a specific structure. So you have to understand the context or you have to take the whole text as a text, which makes LLMs perfect conversationalists, but the lack of structured data makes them lack a proper fact-based conversation. So that's the difference. Yeah. And yeah, thank you for that

Starting point is 00:10:16 explanation. And I think that companies that, you know, have been using AI, because, you know, AI is not new. Machine learning's not new. It's been used for many decades. But this concept of using unstructured data, I would say is newer, right, with generative AI and large language models. That's where they truly shine being able to work with this unstructured data. You know, with that in mind, you know, a great question to start off the bat here from Douglas. So thanks for this. So he's asking, how do you recommend people who do not know where to start leveraging their data? That's great. Yeah, companies, if they're maybe smaller and they haven't had a huge data team for a long time. Where do they start? I mean, it's, I mean, you have to start with a goal

Starting point is 00:11:00 in mind, right? If you want to leverage your data, what do you want to leverage it for? So let's assume that in most cases, the data will have to be used for a large language model integration. I think there are a lot of use cases for support chatbots and stuff like that, which is where companies start nowadays. And having predictable access to the data across multiple data sets or data sources is a good start, and adding context to that data. Like going back to the example of the spreadsheet, like the Excel spreadsheet where you have just a bunch of columns and rows, that first column name, that's your data point, the column name. And that tells you something about that data, but not everything. And this is why we think that metadata is extremely important that further contextualizes the

Starting point is 00:11:51 data. In fact, we did a proof of concept with integrating an LLM on top of our data infrastructure so that you could prompt the LLM with a question and it would see what kind of data source or data set it would need to answer this question, or a combination of different data sets that will give back that fact-based answer. But it wouldn't do that properly without having that additional context on top of that data. So back to that Excel spreadsheet, you have the column names. Now imagine that you add descriptions to those column names. You can explain what kind of limitations that data has. Like if it's just a column full of numbers, what do these numbers mean? Is there a limit? Is there a range or an exception? So if you start deeply explaining that,

Starting point is 00:12:34 that kind of context will lead to better answers. So it's having generic access to the different data sets that you already have. And adding context to that data. Yeah, that's a great point because all data, it does have a story. It has a meaning. It has a reason why it can help or hurt your business. So I love what you said there, Matt, about, you know, essentially start explaining your data, which, you know, oddly enough is something large language models can help with, right?
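The idea of adding descriptions, units, and ranges to column names can be sketched in a few lines of Python. This is only an illustration, not the format used by the guest's company: the `ColumnContext` class, the `describe_table` helper, and the `orders` example are all hypothetical names made up for this sketch. It shows one way column metadata could be rendered as plain text that a language model might receive alongside a question.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnContext:
    """Hypothetical metadata record for one spreadsheet column."""
    name: str
    description: str
    unit: Optional[str] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def describe_table(table_name: str, columns: list[ColumnContext]) -> str:
    """Render column metadata as plain text that could be placed in an LLM prompt."""
    lines = [f"Table: {table_name}"]
    for col in columns:
        parts = [f"- {col.name}: {col.description}"]
        if col.unit is not None:
            parts.append(f"unit={col.unit}")
        if col.min_value is not None and col.max_value is not None:
            parts.append(f"range={col.min_value}..{col.max_value}")
        lines.append("; ".join(parts))
    return "\n".join(lines)


# Hypothetical example table with two described columns.
orders = [
    ColumnContext("order_id", "Unique identifier of the order"),
    ColumnContext("amount", "Order total before tax", unit="USD", min_value=0, max_value=10000),
]
context_text = describe_table("orders", orders)
print(context_text)
```

A bare column called `amount` says little on its own; with the unit and range attached, a reader (human or model) can tell what the numbers mean and which values would be out of bounds.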

Starting point is 00:13:07 But even before we get into that, I want to, you know, hit reverse here and tackle this problem, even a little bit more, right? So, you know, I love what you said about data isn't oil. It is oxygen, right? Businesses need data in order to breathe and survive. But why are companies not able to, you know, kind of grab their data out of silos, right? Because, yeah, companies will, you know, we consult with companies all the time. They're like, oh, here's our Salesforce data or, you know, here's data from this source.

Starting point is 00:13:39 But they just live there. So why is that a problem, data being in silos, and how can companies start to kind of fix that issue? Yeah, actually, so companies can already collaborate and innovate on data. A lot of bigger companies, they already leverage the data and smaller companies do too. I don't think that there are a lot of technical limitations. I think the biggest limitation is cost. It's very expensive to do something with data and very expensive to collaborate with others,

Starting point is 00:14:13 especially enterprises that have really complicated data infrastructure in very big data ecosystems with a lot of rules and exceptions and standards and frameworks, and it's very hard to be interoperable with these standards as a smaller business. So that's where the cost is. And even for the bigger enterprises, like I've talked to a lot of enterprises about data consortiums,

Starting point is 00:14:36 where you have a group of enterprises or even also small and medium businesses, even cities, governments. They're inside of this data consortium, and they want to work together on a use case or platform creation on top of all the collective data that they put in. The biggest problem why

Starting point is 00:14:52 most of these data consortiums fail is funding. They stop being funded at some point because it's getting too expensive. And why is that? Because they have to bring in one of the biggest consultancy firms like Capgemini or Cognizant inside of this data consortium

Starting point is 00:15:08 to build all the adapters and build all the expensive data pipelines. So most of this money is going into the hours that are written by the consultancy firms. And we are speaking about hundreds of millions before something tangible is coming off the ground. So we need something that is really low effort, a very low barrier for both enterprises,

Starting point is 00:15:29 and small and medium businesses to start leveraging the data without having a big investment. That also allows trying new things without investing a lot of money, so that you can get started and test out things early. How should, you know, let's say someone here works at a, you know, medium-sized company, you know, here in the U.S. So, you know, they're not working with a Capgemini and maybe, you know, they have a decent hold on their data, but it is still a little bit siloed.

Starting point is 00:16:02 What questions should they be asking themselves or what steps should they be taking, specifically when it comes to better preparing and better organizing their data for large language models? Yeah, so I suggest finding a way to first have the same type of data across all the data sources and data sets. So what I mean is that if you have CSV files there, you have a MongoDB there, and you have a MySQL DB there, you have to access all that data in a different way. So first, you know, try to get a layer, and this is what we at Nuklai specialize in and focus on,

Starting point is 00:16:44 is to have this layer where all this data is onboarded very easily, and you can just have one SQL query to access, you know, a CSV file, join it with data from a MongoDB, without discriminating what the context of the data engine is. And that way, you know, if that first step is done and it doesn't cost you a lot, you're already halfway to the innovation because you can tap into all that data and start playing around with it. A good data scientist is crucial to have on board.
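The "one SQL query across a CSV file and a MongoDB" idea can be illustrated with a small stand-in. The sketch below is not how the guest's product works, which the episode does not describe; it simply assumes the simplest possible approach, which is to load both sources into an in-memory SQLite database and join them there. The sample data, table names, and column names are all hypothetical, and the list of dictionaries only imitates the shape of documents from a document store.

```python
import csv
import io
import sqlite3

# Hypothetical sample data: a CSV export and records shaped like
# documents from a document store such as MongoDB.
CSV_TEXT = "customer_id,name\n1,Ada\n2,Grace\n"
DOCUMENTS = [
    {"customer_id": 1, "total": 120.0},
    {"customer_id": 1, "total": 30.0},
    {"customer_id": 2, "total": 75.5},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")

# Onboard the CSV rows (each row is a dict keyed by the header names).
reader = csv.DictReader(io.StringIO(CSV_TEXT))
conn.executemany(
    "INSERT INTO customers VALUES (:customer_id, :name)",
    list(reader),
)

# Onboard the document-style records.
conn.executemany(
    "INSERT INTO orders VALUES (:customer_id, :total)",
    DOCUMENTS,
)

# One SQL query that joins data originating from both sources.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.total) AS spent
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)
```

Copying everything into one database is the crude version of the idea. A real access layer would more likely query each source in place, but the end result for the analyst is the same: a single query language regardless of where the data lives.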

Starting point is 00:18:03 You know, a good question here from Tara. This one's a little specific, Matt, but I think it begs a good question. So she's asking what strategies can be employed to improve memory efficiency when working with datasets containing millions of rows?

Starting point is 00:18:51 Yeah, that's a great question, especially when these data sets become so big. What steps can companies be taking, if any, to kind of counteract this issue? So to be honest, that's not really my specialty. So I don't know if I would be answering this question properly, but there is a lot of free or very cheap technology out there that can work with this data very efficiently. I don't think that data sets containing millions of rows should be any limitation anymore. That's a good point. All right.
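Tara's question about memory efficiency is not answered in detail in the episode, so the following is only one common technique offered as a sketch: streaming rows one at a time instead of loading the whole file. The function names and the `amount` column are hypothetical, and the tiny in-memory sample stands in for a large file.

```python
import csv
import io
from typing import Iterable, Iterator


def read_amounts(lines: Iterable[str]) -> Iterator[float]:
    """Yield one parsed value at a time instead of loading every row."""
    for row in csv.DictReader(lines):
        yield float(row["amount"])


def running_mean(values: Iterable[float]) -> float:
    """Compute a mean in constant memory by keeping only a count and a sum."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
    return total / count if count else 0.0


# Hypothetical stand-in for a large file; a file object from open() works the same way.
sample = io.StringIO("amount\n10\n20\n30\n")
mean_amount = running_mean(read_amounts(sample))
print(mean_amount)
```

Because the generator hands over one row at a time, memory use stays roughly flat whether the file has three rows or three hundred million. Other options, such as columnar file formats or chunked reads in a data frame library, follow the same principle.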

Starting point is 00:19:26 Let's get to something a little bit more of your specialty, which, you know, I want to get a little bit more into the data that makes up models themselves, right? If you follow large language models or read our newsletter, you saw that there's been this kind of divide recently about the quality of the data in the models themselves, right? Because when companies are trying to really implement generative AI across different sectors of their organization, they're not only trying to, you know, maybe fine-tune or bring in their own data with RAG, but ultimately they are relying heavily on the actual data set of these large language models. But we've heard recently with synthetic data, we thought it was bad, but, you know, Meta and others have come out and said, no, you know, when you use synthetic data, it actually can help your models improve. What are your thoughts on this, Matt, and what should business owners and decision makers be aware of when it comes to synthetic data in large language models? So actually, in fact, I've always been a believer in the value of synthetic data, but I also have to

Starting point is 00:20:36 honestly admit that it kind of shocked me when I read last week that the quality of the LLM actually improved by bringing in more synthetic data than without. It was actually, for me, it was counterintuitive. I would not have guessed this. So for me, you know, the question immediately came, like this is largely about unstructured data. And I'm very curious because we actually work together with a partner that converts existing structured data into synthetic structured data sets in order to scramble the data so that you cannot, you know, get personally identifiable information out of there, and the statistical qualities remain the same

Starting point is 00:21:21 in that synthetic data. So that was always the goal, that was a very clear use case for synthetic data. But now I'm very curious, and we haven't gotten there, we aren't testing it yet, but now I'm very curious if that structured synthetic data, bringing more of that into AI model training, if that will also yield the same results. Yeah, same, right? I've always had this thought in my head, not just with synthetic data. And, you know, if you are new here, that's just essentially, you know, artificially generated data that mimics real world data that is then used in models, right?
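The notion of synthetic structured data whose "statistical qualities remain the same" can be shown with a deliberately simple sketch. This is not the method used by the partner mentioned in the conversation, which is not described; it assumes the most basic approach, fitting a normal distribution to a single numeric column and sampling new values from it. The `synthesize` function and the `real_ages` column are hypothetical.

```python
import random
import statistics


def synthesize(values: list[float], n: int, seed: int = 0) -> list[float]:
    """Draw n synthetic values from a normal distribution fitted to the real ones."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" column, for example customer ages.
real_ages = [23.0, 35.0, 41.0, 29.0, 52.0, 38.0, 47.0, 31.0]
synthetic_ages = synthesize(real_ages, n=10_000)

# The synthetic column contains none of the original records,
# yet its mean and spread stay close to those of the real column.
print(round(statistics.mean(real_ages), 1), round(statistics.mean(synthetic_ages), 1))
print(round(statistics.stdev(real_ages), 1), round(statistics.stdev(synthetic_ages), 1))
```

Real synthetic-data tools have to do much more than this, such as preserving relationships between columns and handling categorical values, and sampling alone is not a privacy guarantee. The sketch only shows the core trade: individual records are replaced while aggregate statistics are kept.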

Starting point is 00:21:55 So you can make models a little bit, you know, technically cheaper, bring down inference costs, et cetera. But I've always said it's also going to be a problem, right, if so much of the data and the content in these large language models is ultimately just kind of regurgitated, right? Studies say that more than 90% of new information posted online by 2026 is going to be coming from large language models. Even on that end, right, just the data quality of what goes in outside of, you know,

Starting point is 00:22:28 synthetic data, Matt, are there problems there? Are there problems with, you know, hey, yeah, all these large language models scrape the internet, and is the internet just getting a little worse because people are over-reliant on large language models? So my expectation would be that, because that's also kind of a type of synthetic data, all this generative-AI-created data, right? So having more and more of this data might lead in the short term to better models at first, but we will hit the ceiling for sure. And by then, we will need more high-quality data. And I think that's also immediately the elephant in the room that we want to discuss today.

Starting point is 00:23:07 Like the bottleneck of next generation AI is going to be around data. Because it's going to be more difficult to obtain high-quality and true data that has been generated by humans, right? And it's also going to be very philosophical in a moment, because when we have all of this structured, sorry, unstructured data out there that's human written, it starts to become more valuable. And a problem that we've seen so far is that the data that has been scraped hasn't been paid for.

Starting point is 00:23:47 So you saw the New York Times suing OpenAI for using the data without paying for that. So it's going to be increasingly difficult to create next generation budget generative AI models that are going to be able to go beyond that ceiling that I just introduced. So it's going to be a problem that, I mean, I find it a problem that only the big companies will be able to buy their way out of there. So you'll have the large organizations like Meta, like OpenAI, that will have the funding in order to pay for that data, but the smaller ones, the open initiatives, community-driven initiatives, will hit that ceiling and it will be very hard for them to pass. And now, for a little bit of philosophy,

Starting point is 00:24:38 you could argue, and this is about synthetic data again, you could argue that everything that we have ever come up with as humans and that we will ever come up with as humans is in one way or another already out there. Because a lot of innovation is just bringing existing things together to come up with something new. So in that sense, it makes sense to say, like, it doesn't matter if there is too much LLM-generated data or too much synthetic data out there. It's just how you're going to mix that together in order to find new creative ideas, new

Starting point is 00:25:17 innovations. So this is a little bit of philosophy from... Hey, let's go down this philosophy road. I like that. You know, everything that you can come up with as humans has already been out there. You know, that's a good point. But I think maybe, you know, especially if we're talking about smaller companies, smaller to medium sized companies. When they look at data, right, they think data science, they think business intelligence. And, you know, oftentimes, you know, smaller companies don't have big resources to devote to that piece. When they think of data, they think of big data and they say, oh, you know, that's not necessarily for us. So maybe however we use large language models is going to not be as impactful as those companies that have their own, you know, first party data. How can even smaller or medium-sized companies still kind of, you know, break down this elephant

Starting point is 00:26:13 in the room and break through this bottleneck and still create valuable, you know, first party or first company data. How can they do that if, you know, going back to your philosophy question there is, you know, essentially saying, hey, all this data that, you know, that big companies have can kind of already exist in a way, shape, or form. Exactly. Exactly. And I think, I think the crucial key component here is collaboration.

Starting point is 00:26:41 Like, I really believe in that ceiling and that the smaller initiatives will hit that ceiling much quicker than the companies like Meta and OpenAI and stuff like that. But through collaboration, it will be easier to break through that ceiling. And also at Nuklai, we chose collaboration over competition. We have around 40, 45 partners that we collaborate with, each, you know, playing a unique role in what we try to solve and what we offer. You know, I have a hot take, Matt. You know, normally I don't do this, but, you know, we're talking data and we're talking about

Starting point is 00:27:21 how important it is for, you know, companies and for large language models. One of my hot takes is this, is that we're going to start looking for data in places that we generally wouldn't look, especially when it comes to our company or, you know, large language models. That could be, I think, large-scale university studies. It could be, I think, talking to employees, right, and capturing their knowledge and turning that into unique first party or first company data. Is that a crazy thought to say that that could be, not a future, but the future of data

Starting point is 00:28:02 when it comes to large language models and generative AI? Is that a path forward that's worth exploring, just bringing in this more human-level data? Actually, I would argue that this isn't a hot take, it's just simply consensus, man. I mean, look, I think crowdsourcing data kind of, you know, really broadly describes what you describe. When you ask something from your employee, it's like kind of crowdsourcing data. I think crowdsourcing data is going to be much more important in the short-term future and is going to be a much bigger thing, whether it's about polling or stuff like that. And when it comes to trying to find data out there,

Starting point is 00:28:49 looking into places that you normally wouldn't, I think there are some cool examples out there. You know, once I talked to an organization, they had a factory where they were baking bread. And there was data generated by the machines. And that data was used in order to monitor the machines and see when they had to have maintenance.

Starting point is 00:29:09 And then some software company was hired that was really specialized in innovation, typically innovation on data. And they said, let's look at that data and see if we can find patterns that would indicate a better quality of bread that is being baked. So then they used that data for the purpose

Starting point is 00:29:27 of improving the quality of the food instead of only the initial purpose, which it had always served before the data was thrown away. And that's really cool. I mean, that's how you look at data where you normally wouldn't look. And I think that kind of multi-purpose

Starting point is 00:29:43 data, it's out there everywhere already, and we will start scrambling more and more to get access to that data. And the funny thing is that it's not necessarily super valuable data. It's not like you would have to sell it for big bucks. So access to it should hopefully be very easy. So Matt, we've covered a lot in today's episode. We've talked about how data's not oil, it's oxygen, how companies should be giving a story to their data, pros and cons of synthetic data, and even, you know, this concept of crowdsourcing data. So, you know, as we wrap up today's show, what's kind of your one biggest takeaway for business leaders, decision makers out there in order for data to not become a bottleneck in their AI strategy?

Starting point is 00:30:35 Again, don't be scared to put your data out there and focus on getting this data as early on as possible in a generic way and immediately putting context to that. If you're a small business, you're just starting to work with data, start doing this already, like today. even if you don't have an AI or data strategy yet, like start doing this. And collaboration. Collaboration is key. I love it. So many great insights in there. And yeah, like, I love this.

Starting point is 00:31:11 Data is not oil. It is oxygen. So if you want your business to grow, you got to know your data. And you also have to keep tuning in to great episodes like this. So thank you so much, Matt, for joining the Everyday AI show. We appreciate your time. The pleasure is all mine. It was super cool, man.

Starting point is 00:31:28 Thank you so much. And everyone who's watching right now, have an amazing rest of the day. All right. And hey, as a reminder, a lot of great information in there. We're going to be recapping it all in our newsletter. So make sure you go to youreverydayai.com. Sign up for that free daily newsletter. And also, while you're there, make sure, if you haven't already, to join our Thanks

Starting point is 00:31:48 a Million giveaway. Appreciate y'all tuning in. So we'll see you back tomorrow and every day for more Everyday AI. Thanks, y'all.

Starting point is 00:32:12 And that's a wrap for today's edition of Everyday AI. Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating.

Starting point is 00:32:45 It helps keep us going. For a little more AI magic, visit YourEverydayAI.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.
