The AI Daily Brief: Artificial Intelligence News and Analysis - The Ethical Gray Area of AI Labs and Data

Episode Date: April 8, 2024

The New York Times alleges AI companies, including OpenAI and Google, may bypass their own policies on data use, potentially scraping YouTube and other platforms for training AI systems. This has raised questions about copyright compliance and the ethical implications of sourcing massive datasets required for AI development. The urgency stems from a possible data shortage by 2026, pushing these companies into moral gray areas to maintain competitiveness. ** Be the first to learn about our new AI education platform: https://besuper.ai/ ** ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/

Transcript
Starting point is 00:00:01 Today on the AI Breakdown, we're talking about a report from the New York Times on how big AI labs are not necessarily following even their own procedures when it comes to data. Before that on the brief, more information about the Jony Ive and Sam Altman device. The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Go to breakdown.network for more information about our YouTube, our Discord, and our newsletter. Welcome back to the AI Breakdown Brief, all the AI headline news you need in around five minutes. There is a ton going on in the AI hardware space. We've seen a bunch of consumer devices announced, the Humane Pin, the Rabbit R1,
Starting point is 00:00:39 and one that has not yet been figured out, but still gets a ton of attention whenever we learn anything more about it, is a potential AI device startup that appears to be a collaboration between former Apple head designer and designer of the iPhone, Jony Ive, and OpenAI's Sam Altman. At the end of last week, The Information reported that they're in funding discussions with Thrive Capital, as well as the Emerson Collective, which is the vehicle of Steve Jobs's widow, Laurene Powell Jobs.
Starting point is 00:01:02 According to The Information's sources, Ive was looking to raise up to a billion dollars in funding. The Information says it isn't clear whether OpenAI would own a piece of the deal, but that scenario seems likely. Now, we had previously heard that SoftBank was in discussions as a potential funder, but no new information has come about from that. Emerson would be an interesting funder because they're not just a traditional venture firm or Silicon Valley fund.
Starting point is 00:01:24 They have an extreme focus on social impact. There's an associated philanthropic organization, and so I'll be really curious to see what the angle is there, or whether it's just strictly on the business side of the balance sheet. One other interesting piece of information is that people say that the device will not look like a phone. It's going to be interesting to see how this all shakes out. Next up, Canada announces a $1.8 billion investment in the country's AI sector. Bloomberg writes,
Starting point is 00:01:51 The government unveiled a $1.8 billion package of measures related to artificial intelligence on Sunday. The centerpiece is 2 billion Canadian dollars for, quote, computing capabilities and technological infrastructure that can accelerate the work of AI researchers, startups, and other firms. Prime Minister Justin Trudeau said the funds will help harness the full potential of AI so Canadians, and especially young Canadians, can get good-paying jobs while raising our productivity and growing our economy. Now, right now we're fairly short on details. For example, Benjamin Bergen, the head of the Council of Canadian Innovators,
Starting point is 00:02:20 said that his group was trying to get more information on how Canadian companies would be able to access the computing power and infrastructure. He said, if this gives Canadian companies the resources to compete globally, today's announcement is a step in the right direction. I saw another post on LinkedIn from entrepreneur Andy Mauro, who wrote, feels premature to celebrate this new $2.4 billion government investment in AI without more info. Let me tell you a story. In my last conversational AI company, Automat, we raised 15 million of venture capital and sold in 2021. When Canada's AI Global Innovation Cluster announced a few years ago that they were dispensing 200 million of taxpayer money to fund AI innovation, I was excited and
Starting point is 00:02:55 reached out to them and had a meeting right away. We were precisely the kind of company that made sense for a program like this. We had secured financing and early large customers like L'Oreal, and our team were folks who had worked in AI since before it was called AI. But as soon as I heard about the program requirements, my heart dropped. Startups needed to have, one, a large company partner, two, a research lab partner, three, a project structure with deliverables and milestones. You know what a bad startup would do? It would partner with slow-moving and risk-averse big companies, base its technology on lab research that is decoupled from customer and user needs, and define milestones ahead of time instead of shipping fast, learning, and iterating.
Starting point is 00:03:30 I actually spoke with one of the government officials who administered the program, and they outright told me that these programs existed to de-risk the government giving cash directly to startups, because that was deemed too risky after a major direct investment failed publicly. Like I said, we don't really have information yet on how this new program will differ, but I think it's a worthwhile concern. I thought it was interesting that on the same day, CNBC published a piece about Oracle Chairman Larry Ellison, saying, quote, every government, pretty much every government,
Starting point is 00:03:55 is going to want a sovereign cloud. He said, we talk about winning business with companies. For the first time, we're beginning to win business for countries. We have a number of countries where we're negotiating sovereign relations with the national government. Given how frequently we're seeing news about some AI initiative in this country or that, I don't think he's that far off. Lastly today, a new study about AI's impact on jobs.
Starting point is 00:04:16 A poll of 2,000 executives conducted by Swiss staffing firm Adecco found that 41% of C-suite executives expect to employ fewer people because of AI. If this is a subject that interests you, watch out for Wednesday's episode. We are going to get deep on some of these questions, alongside a bit of a special announcement. For now, though, that is going to do it for today's AI Breakdown Brief. Up next, the main AI Breakdown. The tools that help us manage our projects and endeavors become way more significant feeling than most other normal tools we interact with.
Starting point is 00:04:47 They are the things that help us breathe life into ideas, and that's why I am so pleased that today's episode is brought to you by Notion. Notion is the power center for all of my projects from the AI breakdown to the new education initiative that so many of you are participating in. Notion combines your notes, documents, and projects into one space that's simple and beautifully designed. They've also now brought the power of AI directly into the platform so that when it comes to writing, brainstorming, getting creative with new ideas, strategic planning, all of these things where AI can be an incredible companion and co-pilot, all of that is available to you
Starting point is 00:05:20 without leaving Notion. Now, Notion is a tool for you, but it's also for your teams. In fact, the more your team uses it, the more useful it becomes for everyone. Notion is used by over half of Fortune 500 companies. And teams that use Notion send less email, cancel more meetings, save time searching for their work, and reduce spending on tools, which helps keep everyone on the same page. Try Notion for free when you go to notion.com slash AI breakdown. That's all lowercase letters, notion.com slash AI breakdown, to try the powerful, easy-to-use Notion AI today. And when you use our link, you're supporting the show.
Starting point is 00:05:53 Notion.com slash AI breakdown. Hello, friends, quick note before we get into the main part of the episode. If you've been listening to the show for the last few months, you know we have been running an education beta. This is a new approach to AI learning that is hyper-practical, focused on getting you actually using AI tools in minutes, not hours, and certainly not days, based around video tutorials and companion challenges and projects
Starting point is 00:06:18 that have step-by-step instructions that make it really easy to try out new AI platforms, and all of this is now culminating in the launch of a new platform, which we're hoping is the most practical and useful way to learn AI that anyone has yet created. If you want to be the first to know when that launches, go to besuper.ai, that's besuper.ai, and sign up for the waitlist. I cannot wait to tell you more about it. Welcome back to the AI Breakdown. We begin with the report from the New York Times that came out over the weekend that basically accuses the big AI labs of not even following their own policies
Starting point is 00:06:56 when it comes to copyright law, to say nothing of what some people think the right approach should have been. The piece begins: In late 2021, OpenAI faced a supply problem. The artificial intelligence lab had exhausted every reservoir of reputable English language text on the internet as it developed its latest AI system. It needed more data to train the next version of its technology, lots more. So OpenAI researchers created a speech recognition tool called Whisper. It could transcribe the audio from YouTube videos, yielding new conversational text that would make an AI system smarter. Some OpenAI employees discussed how such a move might go against YouTube's rules. YouTube prohibits use of its videos for applications that are independent of the video platform.
Starting point is 00:07:32 Ultimately, an OpenAI team transcribed more than one million hours of YouTube videos. The team included Greg Brockman, OpenAI's president, who personally helped collect the videos. This became core to the training data for GPT-4. Now, before we go into the rest of the piece, the Times also did a companion where they showed some of the other sources where data had come from for earlier OpenAI models. There was Common Crawl, which is text from web pages that has been collected since 2007 and represents around 410 billion tokens. There's Wikipedia, Books1 and Books2, which are widely believed to contain text
Starting point is 00:08:03 from millions of published books, and WebText2, which is described as web pages linked from Reddit that received three or more upvotes, which represented another 19 billion tokens. So the argument that the Times is making here is that when OpenAI hit the limits of its ability to find sources that they had perhaps legitimate access to, they dove headfirst into a gray area of scraping data from other websites, potentially against the terms of those websites. What's more, they're clearly trying to point out that this was a top-down decision, given that they say that Greg Brockman was involved. However, OpenAI, they point out, is not the only company going through something similar. The Times writes, like OpenAI, Google transcribed YouTube videos to harvest text for its
Starting point is 00:08:40 AI models. That potentially violated the copyrights to the videos, which belong to their creators. Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company's privacy team and an internal message viewed by the Times, was to allow Google to be able to tap publicly available Google Docs, restaurant reviews on Google Maps, and other online material for more of its AI products. Now, the Times piece portrays the situation as dramatic,
Starting point is 00:09:03 and the pressure to capture this information as immense. They write, their situation is urgent. Tech companies could run through the high-quality data on the internet as soon as 2026, according to Epoch, a research institute. The companies are using the data faster than it is being produced. They quote a lawyer, Sy Damle, who represents a16z, and in a public discussion last year around copyright law, he said, the only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data. The data needed is so massive that even collective licensing really can't work.
Starting point is 00:09:34 Now, one thing that's important to point out is that while the piece is clearly meant to intimate wrongdoing on the part of these companies, they also do acknowledge that, frankly, right now, this is a legal gray area. For example, in discussing Google's terms of service, IP lawyer Geoffrey Lottenberg said whether the data could be used for a new commercial service is open to interpretation and could be litigated. After the YouTube discussion, they also spend a bunch of time on the expansion of terms in Google's privacy policy. The Times writes, the privacy team wrote new terms so Google could tap the data for its
Starting point is 00:10:04 AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities. Asked one member of the privacy team in an internal message, what is the end goal here? How broad are we going? Then the piece turns its attention to Meta. They write, by early last year, Meta had hit the same hurdle as its rivals, not enough data. In March and April 2023, some of the company's business development leaders, engineers and lawyers met nearly daily to tackle the problem. Some debated paying $10 a book for the full licensing rights to new titles. They discussed buying Simon & Schuster. They also talked about
Starting point is 00:10:34 how they had summarized books, essays and other works from the internet without permission and discussed sucking up even more, even if that meant facing lawsuits. One lawyer warned of ethical concerns around taking intellectual property from artists but was met with silence. Now lastly, the piece discusses the possibility that synthetic data could be an answer to all of this, but ultimately the picture that it paints is one in which the competitive pressure of trying to compete in this space has led to a significant focus on data and a willingness to barge ahead to just get it and see where the chips fall. So what are we supposed to make of a piece like this? On the one hand, whatever you think of the New York Times, whatever you think of their particular slant,
Starting point is 00:11:10 it's extremely well-sourced and seems to represent a reality going on inside these companies. But what I find myself thinking about as I read this and I watch the commentary is whether it's likely to move the needle at all on public perception around these issues. It has been years and years of data and privacy advocates screaming about why people should have access to their data and why big tech is an enemy of that, without a lot of shift in public opinion or at least not enough of a shift for anything really to change. And it seems pretty inevitable to me that these questions are only going to be resolved in court. I think all of the big labs have determined that the right strategy is do what it takes to compete now
Starting point is 00:11:45 and then fight it out in court later and let the chips fall where they may. I have to say I think this is going to get more and not less complicated the deeper into this transformation we get. For now, though, that is going to do it for the AI Breakdown. Until next time, peace.
