This Week in Startups - AI Copyright & Training Data w/ Chris Paniewski | Wilson Sonsini Startup Legal Basics

Episode Date: September 11, 2025

Jason sits down with Wilson Sonsini partner Chris Paniewski for a special Startup Legal Basics on one of the thorniest questions in tech right now: how copyright law applies to AI training data.

Chris has worked on some of the biggest AI deals ever, including Scale AI’s $14B+ partnership with Meta and OpenAI’s $6.5B acquisition of Jony Ive’s design studio, and brings practical, on-the-ground insights from advising leading AI companies.

In this episode, Jason and Chris cover:
Why AI copyright law is unsettled and will take years to shake out
The difference between training data and output in legal terms
How “fair use” really works (and why it’s a defense, not a permission slip)
The risks of scraping vs. licensing, and why open source ≠ free use
How investors are diligencing AI startups around training data
Why startups must think differently once they’re funded vs. hacking in a dorm room

Whether you’re building an AI product, investing in one, or just trying to understand where the law is headed, this conversation breaks down the real legal risks every founder should know.

Timestamps:
(0:00) Jason introduces the Startup Legal Basics series & Chris Paniewski
(1:25) Why AI copyright law is unsettled
(3:40) Training data: scraping vs. licensing
(6:05) Open web ≠ open license; pitfalls around terms of service
(8:15) Investor diligence & risks around training data
(11:00) Open source & Creative Commons: common founder mistakes
(13:25) “Fair use” explained: the four-part test
(15:45) Why most disputes never make it to case law

Check Out Wilson Sonsini: https://www.wsgr.com
Check out all of the Startup Basics episodes here: https://thisweekinstartups.com/basics

Follow Chris:
LinkedIn: https://www.linkedin.com/in/christopher-paniewski-09331a59/

Follow Jason:
X: https://twitter.com/Jason
LinkedIn: https://www.linkedin.com/in/jasoncalacanis

Follow TWiST:
Twitter: https://twitter.com/TWiStartups
YouTube: https://www.youtube.com/thisweekin
Instagram: https://www.instagram.com/thisweekinstartups
TikTok: https://www.tiktok.com/@thisweekinstartups
Substack: https://twistartups.substack.com

Transcript
Starting point is 00:00:00 All right, everybody, welcome back to This Week in Startups. This is our Startup Basics series. I created this series out of self-interest and to help founders. The self-interest was I get asked the same questions over and over and over and over again. And then founders have sometimes the same question. So I thought, hmm, who can I partner with to just get all these legal issues that come up over and over again? Well, my attorneys at Wilson Sonsini are the best in the business. We all know that. So we do the startup series with them, the Startup Basics series with them.
Starting point is 00:00:37 You can visit it at thisweekinstartups.com/basics. Now, normally I have Becky DeGraw. She's amazing. She's helped me on many fun and some gnarly adventures in legal land and startups. But today, she introduced me to one of her colleagues. His name is Chris Paniewski. And he's a partner at Wilson Sonsini who specializes in, yes, investments, M&A, for technology companies, but, Chris, you've been working with some of the biggest companies in the AI space. And we're going to talk about AI today since you're on the front lines, yeah? Absolutely. Yeah, I'm thrilled to be here. And so I know you worked with Scale AI. They joined forces with Meta. Jony Ive, of course, who created the iPhone along with Steve Jobs. Steve Jobs,
Starting point is 00:01:24 actually, Larry Sonsini was Steve Jobs' attorney. That's right. We're continuing the legacy. Continuing the legacy, OpenAI and Jony Ive joined forces. Let's get started. Copyright and training data. This is something that is critically important. I have a passionate interest in this as a content creator. And it's taking a while to sort out what you're allowed to do with other people's content in the framing of AI. And there's really two components to this. Can I train on somebody's IP? And if so, do I need permission? And do I have to buy that work? Then there's output. There are many legal cases going on right now. The New York Times and countless others are, you know, as they should, defending their rights. And then there's
Starting point is 00:02:15 technological innovators who want to make great products and need training data. So, tons of disclaimers here. This is a hot-button issue, a lot of emotions, and it's in play. What do people need to know because it's also one of these situations where what you can do in America, what you can do in Israel, what you can do in China, Japan, and other places, very different, correct? Absolutely. And so today, I think we're focusing on U.S. law issues, but what works in the U.S. may not work elsewhere and vice versa. So if you're a founder, one of the things you need to think about is where are you going to launch your product, where are you going to be conducting the training, where are you getting data from, because the answers and the risk may differ depending on where you are. Now, the gold
Starting point is 00:03:00 standard would be getting permission because if somebody signs a document and says, hey, Jason wrote this great book, Angel. It's in 12 languages. I would like permission. And I know this because Microsoft called my publisher, Harper Business, the number one publisher, I think, in the United States. And I got 2,500 big ones for letting them use my training data. Not everybody feels the same way. And there have been a couple of court cases. And people often confuse the concept of fair use. If we're talking about startups, these are commercial enterprises, which eliminates, you know, one of the big protections, which is like educational use. And it also triggers, I guess, interfering with the original IP owner's rights to exploit their work. If I want to create an AI, you know, of my book to answer questions based on it, well, that should be my opportunity.
Starting point is 00:03:51 And the law tends to agree with me on that. But there have been some uses of AI training being fair use, or I shouldn't even use that. Not fair use; perhaps "maybe allowed" would be a better term. What's the state of the state, Chris? Yeah. So unfortunately, everything is being worked through the process of the courts, and that's going to take a while.
Starting point is 00:04:14 And even when courts do decide things, it's based on the facts in front of them. So what may be permissible, or permissible under fair use, in one circumstance may not be in another. So over the course of the next five years, a decade, I think the landscape is going to emerge as to what is permissible and what is not, or what is protected by fair use and what is not. But until that sort of emerges, everyone's trying to take, you know, reasonable risk based on what their product is, what their market is,
Starting point is 00:04:46 and frankly, what they're looking for their company to do. And, you know, if they're looking for, you know, an acqui-hire where the buyer doesn't care about the product as much, maybe they can take more risk. But if they think their exit is going to be based on an amazing product they create, then they should be mindful of building it from the ground up, taking maybe a little bit less of a risk appetite. The good news is all this tension in the system, all this opportunity in the system, is creating some novel solutions. We have Cloudflare saying, hey, we'll let you put some code on your website where people can agree to a terms of service. This is not, you know, the legal concept of fair use. It's a contract that people
Starting point is 00:05:27 can go into. There's a couple of companies trying to be the clearinghouses for this. And there's, of course, tons of open source information you can build your data set on and train your data set on, but you have to really take ownership of this because Common Crawl, I guess, is where a lot of people
Starting point is 00:05:45 decided to crawl, but they told everybody, hey, we've crawled the web, but we're not giving you a license because we can't do that on behalf of the other content holders. So I think that seems to be where a number of young folks or neophytes, they could be old and just be inexperienced, seem to think, oh, I just found this on the web.
Starting point is 00:06:04 Therefore, I can use it. Just because something's on the open web does not automatically grant you a license, correct, Chris? I think that's right. So just because something is available to you, that doesn't mean you can do what you like with it. And so when thinking about information that's available on the internet,
Starting point is 00:06:24 there are a couple of rules of thumb that we put out there. So, for example, you shouldn't circumvent a paywall to get access to content, or you shouldn't circumvent login access or having to agree to a terms of service in order to gain access to that content. So that should be a bright line. Also, if you've agreed to a terms of service as part of accessing a website, then you've entered into a contract with the owner of that website and the owner of the content. And so not only do you need to worry about
Starting point is 00:06:54 higher-level legal concepts like copyright for the content at issue, you need to worry that now you have a contract. And what did you agree to in that contract? And probably in that contract, it says, you can't go take all the data on that website and do what you please with it. Yeah, and this is, I think, hard lessons learned by some startups. For some startups, this could be fatal, because if they've raised a half million dollars, or they've gone through Y Combinator or my accelerator and they have 125K, you know, you get one legal letter that you scraped a website and you circumvented the terms of service and, you know, whatever shenanigans or clever techniques you use, when that legal letter comes in, when people do their due diligence on your startup for your
Starting point is 00:07:36 seed round or series A, that will come up in due diligence and that could ice a deal. Yeah, you've seen this kind of thing happen? Yeah, it is normal course for investors' counsel to ask exactly these questions. What training data did you use? Where did you get the data? If you're relying on fair use, which we can definitely talk about, you know, what is the basis on which you're doing that? So it is in every single financing now as part of ordinary course diligence where investors' counsel will put you through your paces and want to, you know, conduct a forensic level of examination given all of the sort of the legal prof in this area. What is, if I was making, let's say, an image generator, you know, a possible path
Starting point is 00:08:24 for me to go down in terms of either training data or building relationships. You know, you could find some copyright-free, royalty-free images. And again, read the terms of service, I guess. But there's also contacting people and saying, hey, here's what we want to do in an experiment. Would this be okay with you? Yeah. Yeah. So there's various ways of accessing data. So I think everyone acknowledges that trying to strike a license with every single owner of content out there that you want to train on is difficult.
Starting point is 00:08:59 And so I think the larger publications like Condé Nast and the New York Times or even music publishers are probably able to create their own market for their own content. And that's why I think you're seeing licensing deals with those counterparties because they have the sophistication, they know how to set rates, and they have an interest in monetizing these assets. But I would imagine for smaller content creators, they don't have counsel, they don't have the ability to do that.
Starting point is 00:09:27 And that goes actually into a fair use analysis of how much are you affecting the market by scraping or using this content and using it for other training purposes. So I think there's a handful of things to look at in terms of where to get the data from. So you could, you could scrape it, but if you're using scraped content
Starting point is 00:09:49 for a proof of concept, to see if something works before you actually go out and license it and pay for it, maybe that's fair use, right? Maybe that works. Also, there are open source licenses and there's things like Creative Commons. And oftentimes people think, oh, it's open source, I can use it freely, I don't have to worry about the consequences. That's unfortunately not the case as well.
Starting point is 00:10:14 There's a family of licenses that are called ShareAlike. And what that actually requires, as a condition of using that content, is that anything that's a derivative, like the output of the model that's trained on it, needs to be licensed under the exact same terms. And that means you can't protect it. You can't protect the output you've generated because you're required to give it to everyone else under the license terms under which you got it. And so there's multiple layers of analysis here that's necessary to figure out for any given piece of content, how you should use it, and under what terms. And there's a distinct difference between doing a project as a couple of folks in a dorm
Starting point is 00:10:53 room versus being venture-backed. You could get a legal letter if you did a little project on the web and you published it. But if you just had it on your laptop, in my estimation as an investor, and it's not legal advice, but just my lived experience, you're probably not going to trigger too many alarm bells unless you did something egregious, like scraped a billion pages. However, once you're a funded entity and your, you know, explicit purpose is to generate revenue and build a business, which when you incorporate is kind of, you know, a bit of a tell, people are going to look at you a lot different. So there is a little bit of space, I think, in the world for people to experiment as long as they're, you know, not leveraging it. But once you become an
Starting point is 00:11:38 actual concern, a business concern, I think that's when maybe the Condé Nasts of the world, the New York Times of the world, they're going to think about things completely different. Exactly. So if you are the first one to stick your neck out, then maybe you're the one who needs to be made an example of, and it's worth it to bring that initial claim against you in order to set precedent and to sort of tell others in the market,
Starting point is 00:12:02 hey, we are rights owners and we are going to enforce our rights. So that would be the first example. The other could be that you have deep pockets. So by bringing a claim against you, I could see money because I am a rights holder, and so either you're just generally well-funded, or you'll pay to make me go away. Yeah.
Starting point is 00:12:19 And, you know, I learned a lot of these lessons when a piece of technology came out called a blog. And we had built Engadget and other blogs. And I would have a blogger once in a while. They'd grab an image off the web. Maybe they found the image on Reddit. Maybe they found it on Google search.
Starting point is 00:12:35 And this was even before Google Search had licenses and you could, you know, granular search it. And then we would get a legal letter. Hey, that image is from Car and Driver magazine. We had the specific one over a very verticalized niche called spy photos, Chris, which were like some new Corvettes coming out.
Starting point is 00:12:54 And there was a group of photographers who knew that. They would get tipped off. They would take pictures of them. And they actually had a lot of value in the world. And when we put them on Autoblog, we would be detracting from that. And Car and Driver had paid for them. So Car and Driver would contact me. And it's like, well, it's just a blog.
Starting point is 00:13:09 And then I came up with a very interesting compromise through a conversation. I said, what if we used one image of the 40 and we cropped it so it was only 25% of the image and then we linked back to you and in the first sentence we said from Car and Driver, would that be okay with you? And they said, sure, that'd be fine. That seems fair to us. Yep. And I created a piece of advice for my startups of, you know, is what you're doing fair? So there's the fair use test, a four-part test, which I guess you could explain to the audience since you know it so well. Sure. But then there was also the practicality of putting the fair in fair use. Right. And so I think what everyone needs to remember for fair use is it's a defense.
Starting point is 00:13:55 And right, so you impermissibly copied something, but the law recognizes certain instances in which that copying without rights from the rights holder is okay and should be allowed. Right. So each time it's a very fact-specific inquiry into, under these facts, is this defense available to you? And sort of the four factors, and you mentioned that, are the purpose of the copying. So why are you copying? The nature of the work. So what is it? The amount. How much did you take? And what is the effect on the market by virtue of this copying activity? And the first and the fourth are the most important. And when you think about, I'll just talk about the first and the fourth, so the purpose and the character, is it transformative?
Starting point is 00:14:47 And so you could think about training on books. And so if I'm training on books to help for a translation app, so I get the same book in 15 different languages and I'm training it so that I can translate other content, right? So I can say, hey, write me the sentence in Spanish. The model knows how to translate because it has the same work in multiple languages. But if you contrast that with a model that's meant to write summaries of the work it's been trained on or potentially output key elements that the work is known for, that is not transformative.
Starting point is 00:15:20 So training on multiple books in sort of different languages for translation, the law views as transformative, and that helps you with a fair use argument. If you use it as essentially a substitute for the original, that cuts against fair use. The most interesting cases I remember were Star Wars on YouTube, which has its own unique set of copyright claims and systems for adjudicating this, which have become very effective. If you use somebody else's content and they use Content ID to identify it, let's say a clip from Star Wars, the owner of that gets alerted and they have a choice. Take it down or collect the monetization from that video. So you've now done work for that person. And then you get three strikes and your channel gets turned off and you can appeal them and it's become quite elegant,
Starting point is 00:16:10 I think you would agree. And the transformative work is critical. And also if you're not making money in my experience, if you're not making money and you're doing commentary, so there are many YouTube channels that will take a small piece of a song or take an entire song and do a reaction video to it. And the nature of giving the reaction to it
Starting point is 00:16:31 will either allow them to do it, or people will just say, we'll collect the money on it. And in the case of Star Wars, somebody just talked over the whole movie, The Phantom Menace, and criticized it for two hours. And they were like, yeah, that works for us. And I think the score from John Williams, they had a little problem with it, so they've removed the music. But the really interesting thing is, you know, Chris, in your experience, most of the time, this doesn't go all the way. So we don't actually have case law because typically these things get settled, one way or the other, yeah? Yeah, so I think that fair use is going to be a gamble. It's going to be judge determined. And are we going to rely on striking a deal where we will know the parameters and everyone will operate under the terms of that license agreement? Or are we going to roll the dice with a judge and we'll live by whatever that ruling is? And some of these cases have been appealed to the Supreme Court as to
Starting point is 00:17:33 what is fair use or not, but you're always gambling when you decide to take it to the court. All right. Listen, this is part one of a two-part series. We did training data, the input. But equally important, Chris, you would agree, is the output. And we'll do that in part two of this amazing series. Wilson Sonsini, what can I tell you? Great law firm. They do my work. That's all you need to know. We'll see you next time on Startup Basics. Thank you very much.
