This Week in Startups - AI Copyright & Training Data w/ Chris Paniewski | Wilson Sonsini Startup Legal Basics
Episode Date: September 11, 2025

Jason sits down with Wilson Sonsini partner Chris Paniewski for a special Startup Legal Basics on one of the thorniest questions in tech right now: how copyright law applies to AI training data.

Chris has worked on some of the biggest AI deals ever, including Scale AI’s $14B+ partnership with Meta and OpenAI’s $6.5B acquisition of Jony Ive’s design studio, and brings practical, on-the-ground insights from advising leading AI companies.

In this episode, Jason and Chris cover:
Why AI copyright law is unsettled and will take years to shake out
The difference between training data and output in legal terms
How “fair use” really works (and why it’s a defense, not a permission slip)
The risks of scraping vs. licensing, and why open source ≠ free use
How investors are diligencing AI startups around training data
Why startups must think differently once they’re funded vs. hacking in a dorm room

Whether you’re building an AI product, investing in one, or just trying to understand where the law is headed, this conversation breaks down the real legal risks every founder should know.

Timestamps:
(0:00) Jason introduces the Startup Legal Basics series & Chris Paniewski
(1:25) Why AI copyright law is unsettled
(3:40) Training data: scraping vs. licensing
(6:05) Open web ≠ open license; pitfalls around terms of service
(8:15) Investor diligence & risks around training data
(11:00) Open source & Creative Commons: common founder mistakes
(13:25) “Fair use” explained: the four-part test
(15:45) Why most disputes never make it to case law

Check out Wilson Sonsini: https://www.wsgr.com
Check out all of the Startup Basics episodes here: https://thisweekinstartups.com/basics

Follow Chris:
LinkedIn: https://www.linkedin.com/in/christopher-paniewski-09331a59/

Follow Jason:
X: https://twitter.com/Jason
LinkedIn: https://www.linkedin.com/in/jasoncalacanis

Follow TWiST:
Twitter: https://twitter.com/TWiStartups
YouTube: https://www.youtube.com/thisweekin
Instagram: https://www.instagram.com/thisweekinstartups
TikTok: https://www.tiktok.com/@thisweekinstartups
Substack: https://twistartups.substack.com
Transcript
All right, everybody, welcome back to This Week in Startups.
This is our Startup Basics series.
I created this series out of self-interest and to help founders.
The self-interest was I get asked the same questions over and over and over and over again.
And then founders have sometimes the same question.
So I thought, hmm, who can I partner with to just get all these legal issues that come up over and over again?
Well, my attorneys at Wilson Sonsini are the best in the business.
We all know that. So we do the startup series with them, the Startup Basics series with them.
You can visit it at thisweekinstartups.com/basics. Now, normally I have Becky DeGraw.
She's amazing. She's helped me on many fun and some gnarly adventures in legal land and startups.
But today, she introduced me to one of her colleagues. His name is Chris Paniewski. And he's a partner
at Wilson Sonsini, who specializes in, yes, investments, M&A,
for technology companies, but, Chris, you've been working with some of the biggest companies in
the AI space. And we're going to talk about AI today since you're on the front lines, yeah?
Absolutely. Yeah, I'm thrilled to be here. And so I know you worked with Scale AI. They joined forces
with Meta. Jony Ive, of course, who created the iPhone along with Steve Jobs. Steve Jobs,
actually, Larry Sonsini was Steve Jobs' attorney. That's right. We're continuing the legacy.
Continuing the legacy, OpenAI and Jony Ive joined forces.
Let's get started. Copyright and training data. This is something that is critically important.
I have a passionate interest in this as a content creator. And it's taking a while to sort out
what you're allowed to do with other people's content in the framing of AI. And there's really
two components to this. Can I train on somebody's IP? And if so, do I need permission? And do I have to
buy that work? Then there's output. There are many legal cases going on right now. The New York
Times and countless others are, you know, as they should, defending their rights. And then there's
technological innovators who want to make great products and need training data. So, tons of
disclaimers here. This is a hot button issue, a lot of emotions, and it's in play. What do people need
to know because it's also one of these situations where what you can do in America, what you can do in
Israel, what you can do in China, Japan, and other places, very different, correct? Absolutely. And so today,
I think we're focusing on U.S. law issues, but what works in the U.S. may not work elsewhere and vice versa.
So if you're a founder, one of the things you need to think about is where are you going to launch your
product, where are you going to be conducting the training, where are you getting data from,
because the answers and the risk may differ depending on where you are. Now, the gold
standard would be getting permission because if somebody signs a document and says, hey, Jason
wrote this great book, Angel. It's in 12 languages. I would like permission. And I know this because
Microsoft called my publisher, Harper Business, the number one publisher, I think, in the United
States. And I got 2,500 big ones for letting them use my training data. Not everybody feels the same
way. And there have been a couple of court cases. And people often confuse the concept of fair use.
If we're talking about startups, these are commercial enterprises, which eliminates, you know, one of the big protections, which is like educational use.
And it also triggers, I guess, interfering with the original IP owner's rights to exploit their work.
If I want to create an AI, you know, of my book to answer questions based on it, well, that should be my opportunity.
And the law tends to agree with me on that.
But there have been some uses of AI training being fair use, or, I shouldn't even use that term,
fair use. Perhaps "maybe allowed"
would be a better term.
What's the state of the state, Chris?
Yeah.
So unfortunately, everything is being worked through the process of the courts,
and that's going to take a while.
And even when courts do decide things,
it's based on the facts in front of them.
So what may be permissible or permissible under fair use in one circumstance
may not be in another.
So over the course of the next five years,
decade, I think the landscape is going to emerge as to what is permissible and what is not
or what is protected by fair use and what is not. But until that sort of emerges, everyone's
trying to take, you know, reasonable risk based on what their product is, what their market is,
and frankly, what they're looking for their company to do. And, you know, if they're looking for,
you know, an acqui-hire where the buyer doesn't care about the product as much, maybe they can
take more risk. But if they think their exit is going to be based on an amazing product they
create, then they should be mindful of building it from the ground up, taking maybe a little bit
less of a risk appetite. The good news is all this tension in the system, all this opportunity
in the system, is creating some novel solutions. We have Cloudflare saying, hey, we'll let you
put some code on your website where people can agree to a terms of service. This is not, you know, the legal
concept of fair use. It's a contract that people
can go into. There's a couple of companies trying
to be the clearing houses for this.
And there's, of course,
tons of open source information
you can build your
data set on and train your data set on, but you have to
really take ownership of this because
open crawl, I guess is where a lot of people
decided to crawl, but they told everybody,
hey, we've crawled the web,
but we're not giving you a license
because we can't do that on behalf of the other
content holders. So I think that
seems to be where a number of young folks or neophytes,
they could be old and just be inexperienced,
seem to think, oh, I just found this on the web.
Therefore, I can use it.
Just because something's on the open web
does not automatically grant you a license, correct, Chris?
I think that's right.
So just because something is available to you,
that doesn't mean you can do what you like with it.
And so when thinking about information
that's available on the internet,
there are a couple of rules of thumb that we put out there.
So, for example, you shouldn't circumvent a paywall to get access to content,
or you shouldn't circumvent login access or having to agree to a terms of service
in order to gain access to that content.
So that should be a bright line.
Also, if you've agreed to a terms of service as part of accessing a website,
then you've entered into a contract with the owner of that website and the owner of the content.
And so not only do you need to worry about higher-level
legal concepts like copyright for the content at issue, you need to worry about now you have a
contract. And what did you agree to in that contract? And probably in that contract, it says,
you can't go take all the data on that website and do what you please with it. Yeah, and this is,
I think, hard lessons learned by some startups. Some startups will, this could be fatal,
because they, if they've raised a half million dollars, so they've gone through Y Combinator or My
Accelerator and they have 125K, you know, you get one legal letter that you scraped a website and you
circumvented the terms of service and, you know, whatever shenanigans or clever techniques you use,
when that legal letter comes in, when people do their due diligence on your startup for your
seed round or series A, that will come up in due diligence and that could ice a deal. Yeah,
you've seen this kind of thing happen? Yeah, it is normal course for investors' counsel to ask
exactly these questions. What training data did you use? Where did you get the data if you're relying on
fair use, which we can definitely talk about, you know, what is the basis on which you're doing that?
So it is in every single financing now as part of ordinary course diligence where investors'
counsel will put you through their paces and want to, you know, conduct a forensic level of
examination given all of the sort of the legal prof in this area.
What is, if I was making, let's say, an image generator, you know, possible path
for me to go down in terms of either training data or building relationships.
You know, you could find some copyright free, royalty-free images.
And again, read the terms of service, I guess.
But there's also contacting people and saying, hey, here's what we want to do in an experiment.
Would this be okay with you? Yeah.
Yeah. So there's various ways of accessing data.
So I think everyone acknowledges that trying to strike a license,
with every single owner of content out there that you want to train on is difficult.
And so I think the larger publications like Condé Nast and the New York Times or even music publishers
are probably able to create their own market for their own content.
And that's why I think you're seeing licensing deals with those counterparties
because they have the sophistication, they know how to set rates,
and they have an interest in monetizing these assets.
but I would imagine for smaller content creators,
they don't have counsel,
they don't have the ability to do that.
And that goes actually into a fair use analysis
of how much are you affecting the market
by scraping or using this content
and using it for other training purposes.
So I think there's a handful of things to look at
in terms of where to get the data from.
So you could, you could scrape it,
but if you're using scraped content
for a proof of concept,
to see if something works before you actually go out and license it and pay for it,
maybe that's fair use, right?
Maybe that works.
Also, there are open source licenses and there's things like Creative Commons.
And oftentimes people think, oh, it's open source, I can use it freely,
I don't have to worry about the consequences.
That's unfortunately not the case as well.
There's a family of licenses that are called ShareAlike.
And what that actually requires is that, as a condition of using that
content, anything that's a derivative, like the output of the model that's trained on it,
needs to be licensed under the exact same terms. And that means you can't protect it. You can't
protect the output you've generated because you're required to give it to everyone else under the
license terms under which you got it. And so there's multiple layers of analysis here that's
necessary to figure out for any given piece of content, how you should use it, and under what
terms. And there's a distinct difference between doing a project as a couple of folks in a dorm
room versus being venture-backed. You could get a legal letter if you did a little project on
the web and you published it. But if you just had it on your laptop, in my estimation as an investor,
and it's not legal advice, but just my lived experience, you're probably not going to trigger too
many alarm bells unless you did something egregious, like scraped a billion pages. However,
Once you're a funded entity and your, you know, explicit purpose is to generate revenue and
build a business, which when you incorporate is kind of, you know, a bit of a tell, people are
going to look at you a lot different. So there is a little bit of space, I think, in the world
for people to experiment as long as they're, you know, not leveraging it. But once you become an
actual concern, a business concern, I think that's when maybe the Condé Nasts of the world,
the New York Times of the world,
they're going to think about things completely different.
Exactly.
So if you are the first one to stick your neck out,
then maybe you're the one who needs to be made an example of,
and it's worth it to bring that initial claim against you
in order to set precedent and to sort of tell others in the market,
hey, we are rights owners and we are going to enforce our rights.
So that would be the first example.
The other could be you are deep pockets.
So bringing a claim after you,
I could see money because I am a rights holder.
and so either you're just generally well-funded,
or you'll pay to make me go away.
Yeah.
And, you know, I learned a lot of these lessons
when a piece of technology came out
called a blog.
And we had built Engadget and other blogs.
And I would have a blogger once in a while.
They'd grab an image off the web.
Maybe they found the image on Reddit.
Maybe they found it on Google search.
And this was even before Google Search
had licenses and you could, you know,
granular search it.
And then we would get a legal letter.
Hey,
That image is from Car and Driver magazine.
We had the specific one over a very verticalized niche called spy photos, Chris,
which were like some new Corvettes coming out.
And there was a group of photographers who knew that.
They would get tipped off.
They would take pictures of them.
And they actually had a lot of value in the world.
And when we put them on Autoblog, we would be detracting from that.
And Car and Driver had paid for them.
So Car and Driver would contact me.
And it's like, well, it's just a blog.
And then I came up with a very interesting compromise.
through a conversation. I said, what if we used one image of the 40 and we cropped it so it was only
25% of the image and then we linked back to you and in the first sentence we said from Car and Driver,
would that be okay with you? And they said, sure, that'd be fine. That seems fair to us.
Yep. And I created a piece of advice for my startups of, you know, is what you're doing fair.
So there's the fair use test, a four-part test, which I guess you could explain to the audience since you know it so well.
Sure. But then there was also the practicality of putting the "fair" in fair use.
Right. And so what I think what everyone needs to remember for fair use is it's a defense.
And right, so you impermissibly copied something, but the law recognizes certain instances in which that copying without rights from the rights holder is okay and should be allowed.
Right. So each time it's a very fact-specific
inquiry into under these facts, is this defense available to you? And sort of the four
factors, and you mentioned that, are the purpose of the copying. So why are you copying? The nature of
the work. So what is it? The amount. How much did you take? And what is the effect on the market
by virtue of this copying activity? And the first and the fourth are the most important.
And when you think about, I'll just talk about the first and the fourth,
so the purpose and the character: is it transformative?
And so you could think about training on books.
And so if I'm training on books to help for a translation app,
so I get the same book in 15 different languages and I'm training it
so that I can translate other content, right?
So I can say, hey, write me the sentence in Spanish.
The model knows how to translate because it has the same work in multiple languages.
But if you contrast that with the model that's meant to write summaries of the work it's been trained on
or potentially output key elements that the work is known for, that is not transformative.
So training on multiple books in sort of different languages for translation, that kind of use is transformative
and helps you with a fair use argument. If you use it as essentially a substitute for the original,
that cuts against fair use.
The most interesting cases I remember were Star Wars on YouTube, which has its own unique set of copyright claims and systems for adjudicating this, which have become very effective.
If you use somebody else's content and they use Content ID to identify it, let's say a clip from Star Wars, the owner of that gets alerted and they have a choice.
Take it down or collect the monetization from that video. So you've now done work for that person.
and then you get three strikes and your channel gets turned off
and you can appeal them and it's become quite elegant,
I think you would agree.
And the transformative work is critical.
And also if you're not making money in my experience,
if you're not making money and you're doing commentary,
so there are many YouTube channels
that will take a small piece of a song
or take an entire song and do a reaction video to it.
And the nature of giving the reaction to it
will either allow them to do it,
or people will just say, we'll collect the money on it. And in the case of Star Wars,
somebody just talked over the whole movie, The Phantom Menace, and criticized it for two hours.
And they were like, yeah, that works for us. And I think the score from John Williams, they
had a little problem with it, so they've removed the music. But the really interesting thing is,
you know, Chris, in your experience, most of the time, this doesn't go all the way. So we don't
actually have case law because they typically, these things get settled, one way or the other, yeah?
Yeah, so I think that fair use is going to be a gamble. It's going to be judge determined. And are we going to rely on striking a deal where we will know the parameters and everyone will operate under the terms of that license agreement? Or are we going to roll the dice with a judge and we'll live by whatever that ruling is? And some of these cases have been appealed to the Supreme Court as to
what is fair use or not, but you're always gambling when you decide to take it to the court.
All right. Listen, this is part one of a two-part series. We did training data, the input.
But equally important, Chris, you would agree, is the output. And we'll do that in part two of this
amazing series. Wilson Sonsini, what can I tell you? Great law firm. They do my work. That's all you
need to know. We'll see you next time on Startup Basics. Thank you very much.
