The AI Daily Brief: Artificial Intelligence News and Analysis - Duolingo Replaces 10% of Contractors With AI
Episode Date: January 10, 2024
Plus OpenAI claps back at the New York Times copyright lawsuit. ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/
Transcript
Today on the AI Breakdown, we're looking at OpenAI's response to the New York Times lawsuit.
Before that on the brief, Duolingo lays off 10% of contractors because of AI.
The AI Breakdown is a daily podcast and video about the most important news and discussions in AI.
Go to Breakdown.network for more information about our YouTube, our newsletter, and our Discord.
Welcome back to the AI Breakdown brief, all the AI headline news you need in around five minutes.
One of the things that people will be watching extremely closely in 2024
is the extent to which artificial intelligence actually starts to displace people in their jobs,
be they blue-collar jobs or, more likely, it seems, white-collar knowledge worker jobs.
Well, the first little evidence of that sort of impact has arrived in the form of cuts at Duolingo.
Duolingo is, of course, one of the best-known, if not the best-known, apps for language learning
and has, according to reports, let go of around 10% of its contractors.
Now, the company has been in serious damage control around this.
First of all, they said that no full-time employees were impacted and that these were not layoffs.
They said that the contractors had been, quote, off-boarded after finishing their projects at the end of
2023.
At the same time, the company did acknowledge that AI-driven gains in productivity were part of the
reason that they didn't need as many people to work on these particular issues.
A spokesperson told Bloomberg, we just no longer need as many people to do the type of work some
of these contractors were doing. Part of that could be attributed to artificial intelligence.
Now, so far, there hasn't been a ton of information about exactly what these people were working on.
We got one report from someone affected on the r/duolingo subreddit who wrote,
I worked there for five years. Our team had four core members and two of us just got the boot.
The two who remained will just review AI content to make sure it's acceptable.
Now, Duolingo's AI plans were well telegraphed. Back in November, the CEO told shareholders
in a letter that the company was using AI to
more quickly create text, speech, and images and to produce, quote, new content dramatically
faster. The company said that they were also using AI to generate voices within the app and had
introduced a new premium tier that had AI-generated feedback in conversation in additional languages.
Now, something we talk about on this show is the fact that there's going to be a large-scale
conversation happening this year and beyond probably around what lines we want to draw around
AI and how it benefits us or doesn't. One of the interesting things about the subreddit discussion
was that the framing of the discussion was this. In December
2023, Duolingo offboarded a huge percentage of their contractors who did translations. Of course,
this is because they figured out that AI can do these translations in a fraction of the time.
Plus, it saves them money. I'm just curious, as a user, how do you feel knowing that sentences
and translations are coming from AI instead of human beings? Does it matter? A lot of the
answers were pretty nuanced. For example, say no to pudding says, I like and value the human
aspect of language exchange and learning, and I think that there's nuances in language that AI can't
fully replicate, at least as of now. Even if these nuances might not necessarily be reflected in
Duolingo's content, I still can't help but feel a little bit sad. Kit and Laser Fist writes,
their whole sales pitch was having native speakers cultivate content. It definitely undercuts that
message. The flip side, of course, is that one of the big impacts of AI could be a total
transformation in how people interact across language barriers. The question the world faces is,
if language becomes no longer a barrier, is it worth the cost of translators' jobs to do so?
That question is going to be played out over and over and over again a million times in the coming
years, which is why I think it's so valuable to actually talk about.
Now, moving on, we have a follow-up from yesterday's main story.
You'll remember that we talked about G42, which is an Emirati company that has been
at the very center of U.S.-China tensions when it comes to artificial intelligence.
The company has been doing its level best to play both sides and try to stay cool with both
the U.S. and China, but was coming under increasing pressure at the end of last year and actually
started withdrawing from its Chinese relationships, favoring instead its U.S. partnerships.
Well, now, the bipartisan House Select Committee on the Chinese Communist Party has identified G42
as a company that works extensively with China's military, intelligence services, and state-owned
entities, and has asked the Commerce Department to look into whether they should be put under
trade restrictions because of those ties. Basically, this committee has asked the Commerce Department
to consider imposing export restrictions on not only G42, but 13 companies that are either
owned by or linked to it. In other words, whatever scrambling G42
is doing to try to get out ahead of these restrictions, it may not be moving fast enough.
Adding a little bit of intrigue to the situation, of course, is the fact that back in October,
OpenAI and G42 had announced a partnership. While it wasn't exactly clear what that partnership
entailed, it shows just how densely connected this world really is. Next up, a couple of pieces
of fundraising news. Luma is a company that you might have seen in relationship to NeRFs. Basically,
they're creating models that allow you to capture 3D images and models with your smartphone. The company
has just raised $43 million at a valuation between $200 and $300 million.
Now, of course, the creation of 3D models is going to open up entire new vectors of content
and creativity.
It's relevant for gaming, for next generation video and content creation.
And this is a space that people are anticipating being contested hotly.
Another big fundraise is that of Parag Agrawal, who was the CEO of Twitter before Elon took
over.
According to The Information, Parag's new company, which doesn't have a name that they could figure out,
is building software for LLM developers and has raised $30 million from
backers including Khosla Ventures, Index, and First Round Capital.
Finally, one announcement that I'm watching closely, it's slated to go off, probably around the time
this video comes out at around 1 p.m. Eastern time today, January 9th.
Rabbit appears to be a new hardware device in the personal assistant AI space along the lines
of the Rewind Pendant or the Humane AI Pin, or of course the Tab.
And the question is, with all of these companies competing in this sort of wearable hardware
space, is there really a there there?
It's not just a question of which of these companies can compete, but whether any of them actually
become a form factor that matters to the future usage of humanity.
Consider me skeptical but intrigued at the same time.
That's going to do it for today's AI Breakdown brief.
Up next, the main AI Breakdown.
Welcome back to the AI Breakdown.
Today we are looking at OpenAI's response to a recent lawsuit from the New York Times that
many are considering the most significant threat to the LLM training approach that we've yet seen.
To understand OpenAI's response, let's go back to the New York Times' own announcement of their lawsuit back at the
end of December. There were a couple notable things about the New York Times suit. First of all,
apparently it was something that the New York Times was trying to resolve with Microsoft and OpenAI
back earlier in 2023. They had approached Microsoft and OpenAI, effectively trying to
license their intellectual property, as well as create, quote, technological guardrails around their
products, but didn't come to any agreement. Now, OpenAI for their part said that those conversations had
been going well and that they were somewhat blindsided by this lawsuit. The complaint says,
OpenAI seeks to free ride on the Times' massive investment in its journalism. Importantly, it accuses
OpenAI and their partners at Microsoft of, quote, using the Times content without payment to create
products that substitute for the Times and steal audiences away from it. In other words, they're not
just alleging that OpenAI is training their LLMs on their copyrighted material, but that they are
reproducing that material in such a way that someone would plausibly use ChatGPT instead of paying for a
subscription to the New York Times. This will be a key part of the case. So let's look at OpenAI's blog
post to get a little bit of further color. They actually break this into four sections. The first
section is, we collaborate with news organizations and are creating new opportunities. This isn't
really so much a legal argument. It's more just trying to establish their bona fides that it is
important to them to actually be partners with media organizations rather than just non-contributors
or thieves. They write, our goals are to support a healthy news ecosystem, be a good partner
and create mutually beneficial opportunities.
With this in mind, we have pursued partnerships with news organizations to achieve these objectives.
Deploying our products to benefit and support readers and editors,
teach our AI models about the world by training on additional historic non-publicly available content,
display real-time content with attribution in ChatGPT, providing new ways for news publishers to connect with readers.
They point to partnerships with AP, Axel Springer, the American Journalism Project, and NYU
as examples of how they're approaching that.
Now, where their legal arguments start is in Section 2.
They write, training is fair use, but we provide an opt-out because it's the right thing to do.
They argue training AI models using publicly available internet materials is fair use as supported
by longstanding and widely accepted precedents.
We view this principle as fair to creators, necessary for innovators and critical for U.S. competitiveness.
Now, basically, this is just a list of links to their arguments or to precedential arguments
for why they believe this.
But obviously, this is going to be the very crux of this case and any other case that
makes it to eventually the Supreme Court.
The key overarching question of all of this AI training is whether training is actually fair use.
Now, if they believe that training is fair use, why would they allow for an opt-out,
which you'll remember they started doing last year? Their argument is, quote,
legal right is less important to us than being good citizens.
What about the idea that ChatGPT reproduced something from Wirecutter in almost exact detail?
Well, they write, regurgitation is a rare bug that we are working to drive to zero.
They say, our models were designed and trained to learn concepts in order to apply them to new problems.
Memorization is a rare failure of the learning process that we are continually making progress on,
but it's more common when particular content appears more than once in training data,
like if pieces of it appear on lots of different public websites.
So we have measures in place to limit inadvertent memorization and prevent regurgitation in model outputs.
We also expect our users to act responsibly;
intentionally manipulating our models to regurgitate is not an appropriate use of our technology
and is against our terms of use.
Because models learn from the enormous aggregate of human knowledge, any one sector including news is a tiny slice of overall training data.
And any single data source, including the New York Times, is not significant for the model's intended learning.
Now, this is obviously where they're starting to get a little bit more forceful.
They're hinting here that there has been some amount of intentional manipulation to regurgitate.
And that's what brings them to bullet four.
The New York Times is not telling the full story.
Basically, OpenAI says here, the conversations had been going well all the way up to their last interaction, which had been December
19th. But then on December 27th, they heard about the lawsuit by reading about it in the New York Times.
They said that the conversation had focused around a partnership around real-time data display
with attribution, but that it wasn't about solely paying for access to New York Times data,
as, quote, like any source, their content didn't meaningfully contribute to the training of our
existing models and also wouldn't be sufficiently impactful for future training. But here's
where they say things get fishy. Quote, along the way, they had mentioned seeing some regurgitation
of their content, but repeatedly refused to share any examples, despite
our commitment to investigate and fix any issues. We've demonstrated how seriously we treat
this as a priority, such as in July, when we took down a ChatGPT feature immediately after
we learned it could reproduce real-time content in unintended ways. Interestingly, the regurgitations
the New York Times induced appear to be from years-old articles that have proliferated on multiple
third-party websites. It seems they intentionally manipulated prompts, often including lengthy excerpts
of articles, in order to get our model to regurgitate. Even when using such prompts, our models
don't typically behave the way the New York Times insinuates, which suggests they either
instructed the model to regurgitate or cherry-picked their examples from many attempts.
Despite their claims, this misuse is not typical or allowed user activity and is not a substitute
for the New York Times. Regardless, we are continually making our systems more resistant to adversarial
attacks to regurgitate training data and have already made much progress in our recent models.
So basically, they're saying that this random Wirecutter example was some combination of one,
cherry-picked out of many, many attempts, and two, something that the New York
Times had to work really hard to get ChatGPT to produce, which minimizes their claim
that ChatGPT is reasonably a substitute for the New York Times, which is of course part of their
copyright claim.
Now, the New York Times' only comment in response was from their lead counsel Ian Crosby,
who wrote, the blog concedes that OpenAI used the Times' work along with the work of many
others to build ChatGPT.
As the Times' complaint states, through Microsoft's Bing Chat, recently rebranded as Copilot,
and OpenAI's ChatGPT, defendants seek to free ride on the Times' massive investment in its
journalism by using it to build substitutive products without permission or payment.
That's not fair use by any measure.
So effectively, we've just got the entrenchment of both sides here.
So what did the community think?
Well, opinions were mixed.
Brian Roemmele writes, this is a well-thought-out response.
Matthew Berman says OpenAI just dropped a bold response to the New York Times copyright lawsuit.
They directly hit back with the claim that the New York Times is not telling the whole story.
OpenAI even says NYT manipulated prompts including lengthy excerpts of articles in order to get our
model to regurgitate.
Andrew Ng, the co-founder of Coursera, wrote,
After reading the New York Times lawsuit against OpenAI and Microsoft,
I find my sympathies more with OpenAI and Microsoft than with the New York Times.
The suit, one, claims among other things that OpenAI and Microsoft use millions of copyrighted
NYT articles to train their models.
Two, gives examples in which OpenAI models regurgitated NYT articles almost verbatim.
But the presentation muddies one and two, and I saw a lot of commentary on social media that,
because of what I believed is a muddied presentation, draws a link between them that I'm not sure
what people think it is.
On one, I understand why media companies don't like people training on their documents,
but believe that just as humans are allowed to read documents on the open internet, learn from
them, and synthesize brand new ideas, AI should be allowed to do so too.
I would like to see training on the public internet covered under fair use.
Society will be better off this way.
Whether it actually is will ultimately be up to legislators and the courts.
On two, I suspect a lot of the examples of ChatGPT regurgitating articles nearly verbatim
were due to a RAG-like mechanism where the user prompt causes the system to browse the web,
retrieve a specific article and then print it out. If this is the case, then to OpenAI's credit,
they seem to have already updated their software to make this much less likely, and this is a much
easier problem to fix than if an LLM were to regurgitate texts using only the pre-trained
weights, which as far as I know very rarely happens. To be clear, I believe independent media
is important for democracy and must be protected. I also sympathize with media businesses
worried about generative AI disrupting their business, but I'm not convinced the New York Times
lawsuit is the right way to do this. Usual caveat, I am not a lawyer and not giving legal
advice or any other form of advice here. Now, lawyer Cecilia Ziniti was less impressed. She writes,
TLDR, the blog post is weak, little data and odd citations, a missed opportunity for OpenAI, who has a
good fair use case. To start, two odd choices by OpenAI. One, they use a DALL-E image for the blog icon.
It looks like an indie artist's work on Facebook. Why remind the reader about generative art too?
Second, the blog post author is OpenAI. Better to have a person sign and humanize OpenAI.
Hundreds of OpenAI employees signed the letter for Sam to stay. Not one signed
this. Maybe they didn't want to be deposed. The biggest issue, though, she finds outside of style
is the substance of the fair use part. Cecilia writes, the topic is fair use. OpenAI has a
great chance to win here. LLMs literally transform what they're trained on to new words. Transformative
use is fair use per lots of great cases. But OpenAI skips any mention of actual fair use cases.
Instead, OpenAI cites support from Adobe, IBM, and Grammarly, who all support GenAI because
they do it, surprise; creators no one has ever heard of; authors who are, dot dot dot, lawyers in Berkeley.
Where are any big names? OpenAI could have gotten, say, their investor, Reid Hoffman, author of four books and 250
podcasts, to sign. Instead, they got no one. Why? Ultimately, though, Cecilia points out that this post is
just a PR battle. She writes, so what will happen from this blog post? Substantively, we can expect regurgitation
to be the new hallucination, referring to OpenAI's naming of what the New York Times claims by identifying
it as a problem that can be solved. Legally, however, she says, nothing. OpenAI's court response isn't due for some
weeks. We'll have to wait for the court to decide. Ultimately, I think Cecilia is right,
but I think that there really are two battles happening simultaneously. One is a public opinion
battle and the second is a legal battle. I think that it could be a split decision and that that
split decision could have a lot of impacts. Ultimately, I can't envision any scenario where this
doesn't make it all the way to the Supreme Court. So to some extent, everything before then is
just prelude. Anyways, friends, that is the story from here. This is a battle that is coming big time
in 2024. Until next time, peace.
