librarypunk - 104 - Intro to AI and Copyright

Episode Date: September 7, 2023

Justin and Jay both had to talk about this for work, so why not do an episode?

Discord: https://discord.gg/jY4jaNgan

Media mentioned:
https://www.plagiarismtoday.com/2023/08/29/appeals-court-rules-ag...ainst-mandatory-deposit/#:~:text=In%20a%20decision%20handed%20down,violation%20of%20the%20takings%20clause.
The Fear Of AI Just Killed A Very Useful Tool | Techdirt
Large language models, explained with a minimum of math and jargon
Introduction to Large Language Models and the Transformer Architecture | by Pradeep Menon
Google and YouTube are trying to have it both ways with AI and copyright - The Verge
Google and Universal Music working on licensing voices for AI-generated songs | Artificial intelligence (AI) | The Guardian
https://www.altexsoft.com/blog/generative-ai/
https://www.copyright.gov/ai/
Feist Publications, Inc., v. Rural Telephone Service Co. - Wikipedia
AI-Generated Art Lacks Copyright Protection, D.C. Court Says (1)
AI-generated art cannot receive copyrights, US court says | Reuters

Transcript
Starting point is 00:00:27 My name is Justin. I'm a scholarly communications librarian. My pronouns are he and they. I'm Sadie. I work in IT at a public library, and my pronouns are they, them. And I'm Jay. I'm a music librarian, and my pronouns are he. And we don't have a guest, because we are doing stuff that we did for work already anyway. And that's the same as episode prep. Everything's a podcast, remember? Everything can be a podcast. So I've been reading a biography of Georges Bataille. Hell yeah. Because we've got to get our Bataille posting in, because this is the Bataille fall. I already read Story of the Eye, so y'all got to catch up to me. I know more about it. I've got a selection of his readings, but I want to get through the biography first, because his biographer says that his works make more sense once you understand who he's responding to and when.
Starting point is 00:01:17 So, and he's doing new translations. Oh, nice. Yeah, the version of Story of the Eye that I read included like an afterword by him and like an outline of a, what he had thought of to be like a sequel, but he like goes through in detail all the things from his life that inspired certain things about Story of the Eye. He's basically kind of like psychoanalyzing himself. It's interesting.
Starting point is 00:01:41 I did see Susan, or was it, it was Barthes actually, Roland Barthes, who said that, like, that removed some of the mystery and intrigue of Story of the Eye, of him going in, by the way, this is about my shitty relationship with my dad in several ways. But yeah, it's interesting. Well, I think the whole thing is it's not supposed to be a narrative novel. It's supposed to be a form-breaking thing.
Starting point is 00:02:05 But anyway, I was just thinking how much everything, like what the Surrealists were doing, was just podcasting, creating little secret societies and journals and limited edition conference-only dictionaries, like a critical dictionary, where they just define, like, 75 words in ways that they find funny. So, RIP, Georges Bataille, you would love posting. I mean, you did love posting. We would love posting on the internet. You'd have loved Something Awful. Oh, God.
Starting point is 00:02:31 I bet Bataille would really like the fucked up corners of AO3. That would be his jam. That would be his jam. He would be, like, best friends with all the, like, the fan fiction girlies. I don't really know how much he valued writing, like, unique things rather than derivative things. But I'm trying to, I'm trying to understand more about his life as a librarian, because he was a librarian his whole life. No shit. He was?
Starting point is 00:02:56 Yeah. But I was just like, me for real. That was a part of the reason he wrote under a pseudonym most of the time too. So The Solar Anus was, like, the first thing he wrote under his own name, once he established his style. And then he writes about anthropology based on stuff he reads. He also writes ostensibly about, like, coins, because that's part of his job. I forget what the study of coins and rare materials is.
Starting point is 00:03:25 but he uses that to talk about some more of his weird religious stuff. Cool. I think he worked there his whole life. Nice. So, yeah, that was Bataille Corner. Or Bataille Corner. Make a drop. Hot Bataille Fall.
Starting point is 00:03:36 We've got a lot more to read. Yeah, I should pick up The Solar Anus. Or maybe, can you, let's put the info of the biography in the notes. So the biography is from Critical Lives. And I believe I mentioned the author before. I don't have his name in front of me. And then the collection I have is an earlier one, so I don't know how good the translations are, because he was saying that they need to redo some translations. So I might pick up some of his newer translations.
Starting point is 00:04:05 I don't know how many are out. And a lot of his stuff, I guess, also hasn't been translated into English. Not all of it, at least, because he's sort of been ignored. I think some stuff is still only available in French. All right, we have news. The D.C. Circuit Court of Appeals rules against mandatory deposit for copyrighted works. So this one was one I put in the Discord. You can join our Discord. There will be a link in the description. I just kind of put it in there
Starting point is 00:04:37 for us to talk about. Essentially, what happens is when a work is published in the United States, the Copyright Office can demand two copies, although I think at this point they only demand one. In order to register a copyright, you definitely have to provide the copy. But since copyright is automatic, that was kind of at the heart of this case. The Copyright Office can still send you a letter of demand. I've actually had someone get a letter of demand for one of the journals they were running out of their library, but for journals it's a little different if it's, like, an online-only journal. So I was like, I'm pretty sure you can ignore it. But if you do not give the Copyright Office the copies that it needs, you get a
Starting point is 00:05:18 $250 fine per item, and then continuing to ignore it would be another $2,500 fine. So you could just pay the fine and not provide the copies, but you also would have to pay for each copy they would have to procure. So you can ask for them to waive this requirement, and part of the Copyright Office's argument was that some of these can be submitted electronically, although that wasn't part of their original argument, which is part of the reason why they lost this decision in the appeals court. The problem is this was ruled as an unlawful taking. It could have been presented as a voluntary exchange where the publisher receives a benefit for participating, or the Copyright Office could have claimed the deposit requirement didn't amount to a taking. But registration is not
Starting point is 00:06:05 required to get copyright, it's automatic, and thus there's no additional benefit from complying other than avoiding the fine. So that's really the only reason to comply at that point, because this author wasn't registering copyright. They were simply publishing with a copyright notice on the books, which is totally fine, and you can do that because you don't have to register copyright to have it. Right, like if you have a website, you normally do that, right? Yeah, but that can also trigger a letter of demand if it's published, and that's what happened in this case. And in the initial ruling, the publisher is a bookseller who sells a lot of rare books, digital versions of rare books, some of which are in the public domain, some of which are not, some of which have original annotations and introductions. And so this is a couple hundred books, and some of them were not available
Starting point is 00:06:54 anymore, so the publisher decided to sue over this and also turned down a settlement offer from the Copyright Office, which I think waived it for some but not all of the books and said that a digital copy would be fine. The publisher said they didn't have every digital copy, so they wouldn't be able to comply with it anyway. So that's why they didn't accept the settlement. The decision did not address whether requiring electronic deposits would be allowed. It only ruled that that was irrelevant in this case. There were a lot of references to raisins, which is a case,
Starting point is 00:07:30 Horne v. Department of Agriculture, which was about the government taking raisins. Horne, H-O-R-N-E. They said corn. That was like, Corn v. Agriculture. Yeah, like when the police take your money, and it's like versus $250.
Starting point is 00:07:49 When they do civil asset forfeiture, like versus corn. Mm-hmm. So there was a requirement that the Department of Agriculture get a certain amount of raisins from producers in order to ensure the stability of the market, because, you know, we have to keep the food market from collapsing and food prices from going wild. There was a similar case with Monsanto, because they were dealing with selling dangerous chemicals, but unlike that, selling produce is well established, and so there was no benefit ruled for the taking of the raisins, which seemed kind of strange. And I thought you could probably argue that the deposit requirement was well established too, since it was a holdover from the older system of copyright where you had to do deposit in order to have copyright. But it's really no longer relevant, especially post-1988, with the Berne Convention saying that there's no requirement to having copyright.
Starting point is 00:08:44 There's no filing requirement or deposit requirement that can inhibit, I can't remember the exact language of it, but that's the international copyright standard. So basically nothing can get in the way of you having a copyright. The most likely outcome is that the Copyright Office is going to change how it handles these kinds of demands in the future. So it'll probably make the process easier, which could be good. And, yeah, I mean, they're probably just going to change how they handle deposit, how they levy the fine, and in what situations they are going to send out these demand letters, probably less often than they used to, or they'll do so in a way that's easier to comply with. And there might not be a fine attached.
Starting point is 00:09:27 I don't know what will happen. But it'll be interesting to see. I don't know if it's been appealed or not. But it was interesting, particularly because the deposit requirement is a holdover, but it's also one of the interesting ways in which the Library of Congress does preservation of things. So you could have a lot of preservation issues. But in terms of like mass culture and things that are created by like Netflix or whatever, the deposit requirement is not like a problem for them because anyone who's registering copyright will still have to do the deposit requirement. This was someone who wasn't registering their copyright.
Starting point is 00:09:58 So AI and copyright. We're going to talk about AI. We're going to talk about copyright. And what I did when I did this presentation at work is I just did a review of some of the copyright basics, which is why I wanted to do that tie-in with the news story, because it highlights a couple things that are important. First of all, copyright is automatic. It doesn't have to be registered. You just have it. Some things can't be copyrighted. So, for example, words or short phrases, like Nike's "Just Do It."
Starting point is 00:10:31 It could not be copyrighted. It has to be trademarked, meaning the specific font and the logo are trademarked for brand recognition. But if you wanted to use the phrase "just do it" in, like, a student newspaper, whatever, totally fine. Copyright is limited, so things fall into the public domain. Also, government works are usually in the public domain. Data can't generally be copyrighted. It has to have some sort of minimum level of creativity and arrangement and interpretation in order to be copyrighted, which is easy enough to do. It's, like, one of the reasons those recipes you see online all have, like, everyone's "9/11 changed the way America looked at things forever" essay attached. Because there really was a recipe like
Starting point is 00:11:14 that. I think that was a real thing; there was a screenshot people had where the recipe opened with them talking about 9/11. By the way, the Blowback podcast is doing a whole season on the war in Afghanistan. Wait, didn't they already do that? No, they did the Iraq War. Right. They're talking about the war in Afghanistan from, like, the Soviet era on. Oh. So yeah. It kind of builds on the war on terror thing, and it's going to be, like, very broad. So I'm looking forward to it. So shout out. It's just good. Yeah, it's a good podcast. Yeah, if you want to learn a lot in a pretty quick amount of time, I would say that's one of the better history podcasts out there. Damn near everything is copyrighted, as long as it's put down in a written format. Ideas can't be copyrighted, but the tangible expression of a thing is copyrighted. But you have fair use rights, which is the four-factor test, and each of the factors is interdependent, and that'll be pretty important for discussions on AI. So the four factors, I might get them out of order, so I'm going to just look it up. Yeah, I always, like, miss one or something.
Starting point is 00:12:18 So the first factor is the character of the use. So this is where transformative use comes in. So like the Google Books case, which, you know, by scanning full works to create an index, that's transformative. Even if Google is doing it with commercial benefits to itself, even if they're copying the whole work, it was determined to be a fair use because it was ultimately transformative and created something new, which is this massive index of books. And that's why the Google Books case was such a big deal. And also, if you're using
Starting point is 00:12:48 it for nonprofit, educational, or personal use, that's in your favor; that tips the first factor toward it being fair use. Commercial uses will then move that back into the argument for the copyright holder. So if you're making money off of it. Yeah, right. But that's the thing, is, like, all of these factors are so interdependent that it's kind of hard to talk about them one at a time. Right. Because, like, within the first factor, the commercial use is important because of a case I'm going to talk about. Right. Because it's like, I always love to point out that a fair use case in art is that one guy who, like, put the guitars on the, like, indigenous people and, like, sold that art.
Starting point is 00:13:31 And it's like, you know, for billions of dollars, right? So you can make money off of things that are fair use. Like, a lot of people think, oh, I made money off of it, therefore it's not a fair use anymore. No, it still is. It's just less in your favor. Yeah. And it will always be intertwined with the other factors in terms of,
Starting point is 00:13:52 does it have an impact on the market in which the copyright holder is selling their work? But like I was saying, the Google Books case was commercial. Like, they sell books on their platform. So it didn't stop it from being a transformative use, even though that first factor, the character of the use, asks whether it's a commercial or non-commercial use. They still won on that first factor. The nature of the work: works that are factual and published
Starting point is 00:14:18 tip the balance in favor of fair use. Works of fiction are less likely to be fair use. Unpublished works also tip the balance toward the copyright holder. I would say particularly the reason for that is the potential that they could publish it in the future and make commercial use of it. So again, like, these are all embedded within each other. How much of the work do you use, the amount? So there is no minimum amount.
Starting point is 00:14:42 There's no maximum amount. You can use the full work. This was the Google Books case. The full work could be used, because the end result wasn't sharing the full book, but it did use the whole book to build the Google Books product and index. So that's kind of the argument about AI, particularly, like, large language models: pointing to the Google Books case. But it might not be good enough. And the fourth is a pretty important one,
Starting point is 00:15:06 which is the effect on the market. Say you make a copy of something and give it away for free, like a textbook. Even though you're not selling it and it's not a commercial use, it is impacting the market of the copyright holder.
Starting point is 00:15:22 And this one is always really heavily weighted in favor of the copyright holder, I would say. Harm doesn't have to be proven. It can be, like, potential harm. They hurt my feelings. This is true. Give me a million dollars.
Starting point is 00:15:35 Yeah, I mean, this was, like, one of the insane things about the Internet Archive case, the fact that the Internet Archive gets donations. Yeah. It was, like, either factor one or factor four where, like, that made it a commercial use or that was affecting the market. It was one of the weirder, more out-there things. So I mean, those are the four basic parts, but they're all interdependent. So just because something is transformative doesn't mean the other factors aren't going to be relevant. So with that out of the way, I wanted to differentiate types of AI, because AI as a term is just really slippery and not very good. On purpose. On purpose, yeah. On purpose. And that's why instead of
Starting point is 00:16:22 talking about AI, I think it's better to learn some of the terminology and just have a little glossary for yourself so that you can talk about what kind of AI you're talking about. And one of the things that's relevant to that is what we talked about last time, which was the Shakespeare and Prosecraft thing, which was very basic digital humanities stuff, stuff that you could do with no neural networks, with nothing that is machine learning, which at its most basic is an algorithm that creates its own algorithms. Which is cool. So you don't even need to do that.
Starting point is 00:16:59 And so to do something like the Shakespeare or Prosecraft projects, I don't know if Prosecraft used it, but you don't need any of that, because he was doing this back in 2017. Yeah, he was doing this back in 2017, before, like, this kind of large language model was even readily available for most people. The main ones, I think, that are relevant to talk about are large language models. Yeah. Which is what ChatGPT is. And GPT is a generative pre-trained transformer. And I'll get into what that means.
Starting point is 00:17:29 I mean, generative and pre-trained, pretty straightforward. A transformer is the technology with which large language models work. So the cool thing about, like, a language model: the way you would have been taught about AI probably, like, 10 years ago or whatever, would be, a human has to label these things, and that allows the machine to start doing it. It needs pre-labeled data. It's like a URI. It's like linked data. Yeah. Where, like, both the human and the computer can understand what's going on.
Starting point is 00:18:00 Yeah, basically. This is more relevant actually with images, but with large language models, I feel like I'm not explaining this well without my slides. Somebody pull those up. Another thing that I pointed out in my slides, so, we're doing this episode because both me and Justin had to do this independently at work.
Starting point is 00:18:19 A thing I was careful to do in mine was calling it specifically a composition tool. So it's not AI. It's a composition tool. It's, like, an AI composition tool, or, like, AI composition software. Because, again, people don't know what a lot of those other terms mean, but it takes away the, like, spookiness of the term AI, right? It's like, no, it's a type of tool. It's a composition tool.
Starting point is 00:18:45 All it's doing is composing things. It's not thinking. It's just spitting things out. This is why I felt like I was off track, because I was starting to describe image training data, and actually I was reaching for the wrong thing. So the first thing to explain about large language models is word vectors, I guess, is the best place to start. So a vector is a point in space, like map coordinates. So Washington, D.C. is at 38.9, -77, and New York is at 40.7, -74. So you can tell 38.9 and 40.7 are close to each other. Therefore, D.C. and New York are close to each other.
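What "close" means there is just distance between points. As a quick illustration of the idea from the episode, here's a toy Python sketch (the Los Angeles point is an extra example added for contrast, and the coordinates are rough):

```python
import math

# Rough (latitude, longitude) points for each city.
# Los Angeles is an added example, not mentioned in the episode.
coords = {
    "washington_dc": (38.9, -77.0),
    "new_york": (40.7, -74.0),
    "los_angeles": (34.1, -118.2),
}

def distance(a, b):
    """Euclidean distance between two 2-D points."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

dc_to_ny = distance(coords["washington_dc"], coords["new_york"])
dc_to_la = distance(coords["washington_dc"], coords["los_angeles"])
print(dc_to_ny < dc_to_la)  # True: D.C. is far closer to New York
```

A word vector works the same way, only with hundreds of coordinates per word instead of two.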
Starting point is 00:19:24 So that's sort of what it's doing with words, except instead of two dimensions, like a map, it's hundreds of dimensions, which is hard for us to think about, but it's easier for a computer, because it's turning these things into vectors. So cat is represented in English as C-A-T, a sequence of letters. Cat would be represented by a vector that's, like, 300 numbers long. So, like, 0.0074, 0.0030, negative 0.0105. You could have hundreds of those. And that is explaining the relationship of the word cat to other words, like a spatial relationship. So language models turn words into a spatial relationship so we can
Starting point is 00:20:08 imagine, like, a word space where words are crisscrossing each other and the vectors are next to each other, and that helps the computer understand, oh, these words are similar or linked in some way. So in this space, words close to cat are dog, kitten, pet. Whereas if you did this alphabetically, cat would be next to catalog, which is not the kind of association the computer is using. The vectors are explaining semantic relationships. The process of creating those vectors is called word embedding. And I've got a pretty cool Medium post, I'll put it in the notes, where it tries to map English words to German words, like their translations, quote unquote. And some words are perfect overlaps, but because a language model
Starting point is 00:20:56 looks for semantic meanings, there are certain words that are translations of each other but don't quite mean the same thing. So, for example, market and Markt are not perfect overlaps, but article and Artikel are pretty much perfect overlaps. So because some of those vectors didn't quite overlap, the machine is understanding that these words have slightly different semantic meanings in different situations. So now that you have this series of connections, it allows the computer to create analogies. So say we took those hundreds of numbers that are the vector for the word biggest, subtracted all of the numbers that represent the word big, and added small.
Starting point is 00:21:46 The word that's going to be closest to the resulting vector we create is smallest. So now it knows analogies: big is to biggest as small is to smallest. So you can have all kinds of analogies, like Swiss is to Switzerland as Cambodian is to Cambodia, so nationalities. Paris is to France as Berlin is to Germany, capitals. So it doesn't need to just have linguistic relationships like big and biggest. It can have semantic relationships. So the more information you put into it, the more it understands and can draw these comparisons. But this is one of the first places, well, not the first place, but it's a very early place where
Starting point is 00:22:20 biases in the model can come in. So it can draw false analogies. So in some word vector models, if you took the vectors for doctor, took out all the ones for man, and added in the ones for woman, some models would yield nurse. So it's made a mistake. It didn't understand that there was no reason to output the word nurse there. It also means that homonyms have different vectors. So bank of a river will have different numbers representing it than bank where you put your money. And don't you love our language?
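The analogy arithmetic described above, biggest minus big plus small landing near smallest, can be sketched with toy vectors. These three-dimensional numbers are made up purely for illustration (real embeddings are learned from text and have hundreds of dimensions), but the mechanics, vector subtraction plus a nearest-neighbor search, are the same:

```python
# Toy 3-D "embeddings": invented numbers, not a real model's output
vectors = {
    "big":      [1.0, 0.1, 0.0],
    "biggest":  [1.0, 0.9, 0.0],
    "small":    [-1.0, 0.1, 0.0],
    "smallest": [-1.0, 0.9, 0.0],
    "cat":      [0.0, 0.0, 1.0],
}

def nearest(target, exclude):
    """Return the word whose vector is closest (Euclidean) to target."""
    best, best_dist = None, float("inf")
    for word, vec in vectors.items():
        if word in exclude:
            continue
        dist = sum((a - b) ** 2 for a, b in zip(target, vec)) ** 0.5
        if dist < best_dist:
            best, best_dist = word, dist
    return best

# biggest - big + small should land near "smallest"
query = [bg - b + s for bg, b, s in
         zip(vectors["biggest"], vectors["big"], vectors["small"])]
print(nearest(query, exclude={"biggest", "big", "small"}))  # smallest
```

And a biased association like doctor minus man plus woman yielding nurse falls out of exactly the same arithmetic when the training data happens to encode it.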
Starting point is 00:22:53 Yeah. I do, though. I can't imagine having to do this in, like, Japanese, where there's tons more homophones. That's the basics of word vectors. And once you're creating these vectors, you can start talking about what a large language model is. And each layer of a large language model is a transformer. So you've got layers and layers. So what will happen is you can put in any type of text. So that's why it doesn't have to be pre-labeled, which is what I was getting at earlier. So you can put in the sentence, John wants his
Starting point is 00:23:13 bank to cash the, and just, like, leave it at that. Put it into the first layer of a transformer. It will then start adding vectors to these words and understanding the connections between them, and then spit that out to the next layer of the transformer. So that first transformer layer is going to say, okay, wants and cash are both verbs. Cool. So then it's going to add that data and then spit that out to the next transformer. And now that next transformer is going to say, okay, John wants his, okay, his is John,
Starting point is 00:23:50 bank, okay, financial institution, because cash is a verb, and so on and so on. So GPT-3 had 96 transformer layers. So it takes a lot of computing resources; the larger the model, the more it needs. And also, for it to sound more human, it needs more layers and more data, which is why AI computing, or large language model computing, GPT computing in particular, is becoming so expensive and is probably going to be beyond the reach of most people. It needs a massive
Starting point is 00:24:38 amount of computing power for all these transformers, and it needs a massive amount of data to have context. So going back to that sentence: if there was a whole novel about John, it might embed in that word John that John is married to Mary, right? So in the layers of the transformers, it adds basically tags to each of these words to explain context. So that's how it figures out, you know, who John is, things about him. And that's also why, when you want to know what an AI got wrong, it's really hard to find out, because we don't know where it labeled something wrong or why it labeled something the way it did, which is kind of, like, the creepiest part about LLMs. If you want to figure out why it did something wrong, you just have to start looking at all the tags it created for these words to create associations and going, oh, here's what it did, it mistagged something, or, we think these are the tags it's using. We don't really know what the
Starting point is 00:25:24 machine is thinking when it's adding these tags to words, these vectors to words. And, to be fair, I have not read very much on how large language models work, so this may be a very basic question, but is this where, like, the whole black box thing comes in, what the transformers, like, what each layer of a transformer does? Yeah, I don't think the transformers are, like, openly available for people to play with, but I'm really not sure. Especially with GPT, because it's pre-trained, generative pre-trained transformers, it's already got tons of context. And then you give it, you know, who is John married to? And it will go, I don't know. But then if you gave it, like, a
Starting point is 00:26:10 story, it would start to figure it out; it can give you better answers based on how much it has. So to be accurate, it needs scale, both in computing power and input. And LLMs learn by trying to predict the next word in a passage, so nearly any writing can be used. So you can plug in all of Wikipedia, for example. That's the difference between a large language model and an early language model: early language models needed that human labeling, whereas now it's just based on tons of computing and tons of information to get more accurate-sounding answers. So that's kind of the first area where training data comes in. Yeah. Oh, I was just going to say, it's statistics. Yeah. It's just, it's just
Starting point is 00:26:51 statistics. That's all this is, is, like, kind of figuring out things through massive computing power. Which, to me at least, is the reason why artificial intelligence is such a misnomer. There's not actually any intelligence going on there. It's statistics. Yeah. And the whole thing is, once you have, like, a machine learning thing where it's creating its own algorithms, then you just simply don't know why it's doing something, and if it does it wrong, you have to go digging to find out what it did wrong, because you're no longer programming the damn thing. It's just sort of letting itself run again and again because of all this, like, cheap computing ability. And then we have to go dig through it as a human and figure out why it does that. Yeah. So the first area that copyright comes
Starting point is 00:27:42 into with large language models is, like, training. Where does training data come from? Is it stuff that was copyrighted? Is it stuff that was accessed illegally? So for copyrighted works, I think the assumption most people have is that these things are always going to train on the open web, and they're always going to train on pirated material. But I don't think that's really what's going to happen. Google, for instance, is working with Universal Music Group to license materials just to avoid litigation, because there are only, like, 10 companies. All they need is, like, a handful of handshakes, and most of the copyrighted content in the world could just be run through someone's AI. So they might not even try to make
Starting point is 00:28:21 a fair use argument anymore, because they're aware that they're open to litigation at this point. They're just going to license with the biggest players. So it won't matter if you were scraping from copyrighted sources at that point. It just depends on how many of them are going to go, well, we're building it ourselves, so we're not going to license to you. And then that'll create silos of content that will be trained on. But I don't know if Universal Music Group is planning on making their own AI. They've got a close relationship with Google, so Google might build the best AI music in the future, because they might have access to the most music sources. I think that could probably happen for a lot of these companies.
Starting point is 00:29:00 They'll just work with each other to build the products that they want. And also, because it's so expensive to do the computing, it seems impossible for anyone who's not already extremely rich. Like, if you saw that thing going around that ChatGPT costs $700,000 a day to run, or something. Yeah. Like, it's just insane amounts of computing power needed. But this was something that Johnny pointed out to me as I was working on getting some of this stuff done, which is the thing about ChatGPT, and just OpenAI in general, which is the company that makes all the GPTs, is that their real value has been capturing the platform that everyone uses. So if everyone's using ChatGPT, that is giving them reinforcement learning from human feedback, RLHF. And that is really, really valuable, because that's users talking to the machine
Starting point is 00:29:50 and that's really, really valuable training data. So both the inputs and the outputs, if you go into like the license agreement or the terms of service for chat GPT, it says, look, whatever we output, you can use commercially, but anything you input and anything we output can be used for training data unless you opt out.
Starting point is 00:30:07 And they're just figuring not enough people are going to opt out for that to matter. But that also explains why everyone and their brother is setting up their own little AI chatbots because they all want this reinforcement learning from human feedback process running. So that's why Google has one going. That's why Bing has one going.
Starting point is 00:30:22 They want you to sit there and argue with it so that it's going to get very, very valuable feedback data on who is talking to it. So that's why everyone's rolling this stuff out, even though the copyright questions are all like, you know, could open them up to liability. And that's kind of the same thing too. Like, you know, if you're using like Facebook, you've already given them a license to everything you write on there.
Starting point is 00:30:42 If you're using, you know, Instagram, all your images, you've given Instagram a license. I think that that kind of platform power is also where they're going to get a lot of power to build more powerful large language models and also image generation. So bringing it back to copyright, though, the fair use analysis is dependent on all four factors, not just that the use is transformative. So if you're reading like TechDirt, like Mike Masnick's website, he'll always point to Authors Guild versus Google, which is that the training data is fair use because Google Books was allowed to copy full works. Also, like, a word can't be copyrighted, and the computing unit of a large language model is the word or subword, like big and biggest. That's a token, like a subpart of a word, and those tokens are what are broken down and computed. And so it's computing each of these words
Starting point is 00:31:35 kind of individually, but the relative position of the word gives meaning to the transformer, right? It knows that, like, oh, bank in this case means a financial institution because cash, in this case, was a verb. But in Andy Warhol Foundation for the Visual Arts v. Goldsmith, the relevant part of that case is that whether or not the secondary work was sufficiently transformed to protect against copyright infringement must also consider whether it's a commercial use or not. And since all of these AI models are now being commercially used, that's going to impact the fair use factors, the more these things are commercialized and the more of a market impact they have. So that's why I think it's pretty interesting that these licensing agreements might
Starting point is 00:32:16 just happen between the copyright holders and the tech companies because they don't want to deal with a fair use argument. They'll just say, well, we've got a license to use it, and now we don't need any adjustment to copyright laws. So that's like the main thing about LLMs is those layers of transformers and generative pre-trained transformers. Now, for images, those are interesting, because those are generative adversarial networks, which is a really fun idea. So you have two neural nets. One is a generator, whose job is to generate fake samples from a random input vector, and a discriminator, which has a set of training data that it is
Starting point is 00:33:01 comparing against what the generator creates. And its job is to determine if the data being sent to it is the same as the training data it was given. And it's a zero-sum game. So they just keep going. Yeah. It sounds like just a back and forth game of, is this anything? Is this anything?
Starting point is 00:33:22 Is this anything? It is just 20 questions until it guesses what's behind the veil. Until it's, Johnny said something like, it's basically just running the material through an algorithmic, like, cheese grater. Because you're just like, if I put a photo of Jay behind a curtain and a machine asks, is this Jay? Is this Jay? And it just does that until it can trick the discriminator and say, yes, that's Jay.
Starting point is 00:33:48 And it's just algorithmically figured out what was behind the curtain the whole time. So you're running it just through a cheese grater. I don't know if they said cheese grater or if they said something else, but it was something like that. I imagine cheese going through like a grater and getting through to the other side and turning back into something roughly block shaped. So it looks slightly different, but it's still a block of cheese. So yeah, the game continues until the generator can fool the discriminator.
Starting point is 00:34:13 So the discriminator needs preexisting data, which may be copyrighted. I think originally this model was created so that you didn't need training data, but for like image generation, it needs something that's already labeled and everything to figure out, is this whatever? Is this the thing? So those are GANs, generative adversarial networks. So those are two neural nets fighting each other. And then the large language models are the layers of transformers. I'm just picturing like mecha.
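That zero-sum, back-and-forth game can be sketched in a few lines. This is a toy one-dimensional version with made-up numbers, nowhere near a real image GAN (those use deep networks on pixels), but the adversarial loop is the same shape: a generator makes fakes from random noise, a discriminator compares them against training data, and each side nudges its parameters against the other.

```python
import numpy as np

# Toy 1-D GAN sketch (all numbers invented for illustration).
# Generator: fake = w*z + b, from noise z.  Discriminator: D(x) = sigmoid(u*x + c),
# trying to score real samples high and fakes low.

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

w, b = 1.0, 0.0          # generator parameters
u, c = 0.0, 0.0          # discriminator parameters
lr, batch = 0.05, 64

for step in range(2000):
    real = rng.normal(4.0, 0.5, batch)   # the "training data" the discriminator holds
    z = rng.normal(0.0, 1.0, batch)
    fake = w * z + b

    # Discriminator ascent on log D(real) + log(1 - D(fake))
    d_real, d_fake = sigmoid(u * real + c), sigmoid(u * fake + c)
    u += lr * np.mean((1 - d_real) * real - d_fake * fake)
    c += lr * np.mean((1 - d_real) - d_fake)

    # Generator ascent on log D(fake): try to fool the discriminator
    d_fake = sigmoid(u * fake + c)
    grad_x = (1 - d_fake) * u            # push fakes toward "looks real" scores
    w += lr * np.mean(grad_x * z)
    b += lr * np.mean(grad_x)

fake_mean = float(np.mean(w * rng.normal(0.0, 1.0, 1000) + b))
print(f"fake sample mean after training: {fake_mean:.2f} (real mean 4.0)")
```

On this toy problem the generator's samples drift toward the real distribution; once the discriminator can no longer tell real from fake, the game is at its equilibrium, which is the "zero-sum" part described above.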
Starting point is 00:34:47 Like the word adversarial is for some reason just really amusing me here. It's just like two giant mecha going at it until a picture forms that makes sense. Yeah. So the fun thing about this is the copyright office has been putting up information on like, how do you register copyright with things that are AI generated, either through GANs or LLMs? And the main point is that AI-generated material does not have a human author and therefore cannot be copyrighted with the AI as the author. If you've like not done anything to it.
Starting point is 00:35:19 Like if you use it as like a jumping off point, that starts changing things. Human artists can modify AI art in such a way that those modifications are copyrighted. And as of like March this year, AI involvement must be disclosed in copyright registration. So before that, like, you don't have to retroactively go back and do it. But it does have to be disclosed unless it's de minimis use, which goes back to: if a machine is creating something that wouldn't be copyrightable by a human because it's not creative enough, then you don't have to disclose that an AI did it. So I could ask the AI to give me the vectors for a blue circle, and I could use that, but I wouldn't have to disclose that
Starting point is 00:35:59 because that's not like a copyrightable amount of creativity anyway. So other examples would be like having an AI fix background noise in audio, which is something that Adobe already has software for, or having AI check grammar and spelling. Those are both de minimis use and they don't have to be disclosed. And I would say that's the thing that a lot of these are good at doing. They're good tools for this kind of task. Yeah. There's a lot of these like de minimis things or like scaffolding kind of work.
Starting point is 00:36:29 Yeah. I also think that's why it's going to have to be incorporated into like teaching. Students are always, I mean, we don't know how long these tools are going to be available for free to people. So I don't know how long this is even going to be an active problem because, again, they're only convincing because they're eating money. They can only crank out good essays if you're like paying for it eventually. Like if they had to operate with actual money, like running a business, then you would have to have a subscription or something. Jay, you're making the white people face.
Starting point is 00:37:02 I just keep thinking of something to say. No, I just like that sometimes. I always think that I look like Christian Bale when he smiles with his mouth closed because I've got such a big chin. My mouth like goes all the way up here. I've always thought I look kind of like Christian Bale when I do that. Yeah. Okay. Just gives the impression you're ready to say something.
Starting point is 00:37:20 Like talk to my manager or something. No. Although I could go into some of the stuff about like, because I had to do a lot of research on this specifically for like what should like faculty do with students using it in class. I don't know if you wanted me to talk about that at all. Yeah. I also want to get some water because I've been talking too long.
Starting point is 00:37:37 Okay. You go do that. So yeah, I also had to do a little, like, I've had to do three of these presentations, all about 10 to 15 minutes. Well, I'm doing my third tomorrow at time of recording because, like, again, I work at a music conservatory. And so there's not a lot of writing and research that happens at said conservatory.
Starting point is 00:37:53 But sometimes there are classes that, like, there's a class called musicians portfolio where you like make a website to put your crap on, right? Or like, you know, some of these other classes that might have a little bit of writing or kind of like a project, artist statement, whatever kind of thing. And because the majority of our students are English language learners, we've been getting more and more things. There are people turning in assignments that are obviously just ChatGPT, right? Or we've even had professors encourage students like, hey, this aspect of this assignment is not what I'm grading you on. So like, you know, the scaffolding of a website.
Starting point is 00:38:32 So use ChatGPT or whatever AI to create that scaffolding of your website, because that's not what you're being graded on. So that's already been happening at my music conservatory. And so they're like, Jay, you're literally the only one here who knows anything about this. Please educate us. So this has been my job all summer. And within the classroom, there are like kind of three ways that AI is probably going to show up, if you are an educator listening to this. And the first will be as part of the creative process.
Starting point is 00:39:03 Right. So this could be like a student uses AI compositional tools like to literally help compose. or perform music somehow. It could be part of their recital. They could be making a commentary on it. It could be part of the actual thing that is being created. It could be a scaffolding tool.
Starting point is 00:39:20 So this is, you know, write me an outline for this topic. You know, draft my artist statement for me. Write this policy for me. You know, whatever, like that kind of basic shit where it's more an outline that you're jumping off from kind of thing. Again, it's really good at this kind of thing. This is what AI tools are really good at is this bullet point right here. And then the third is cheating.
Starting point is 00:39:46 So not quite plagiarism, because, again, it's not necessarily plagiarizing anything, but it would definitely be considered a violation of academic integrity. It's like a way that teachers can say that and it be more accurate. And with all of the things I was looking through, like, don't fucking use AI checkers. They're incorrect, and also, don't, you know, play into the whole student data surveillance industrial complex, right? But also it's like, if you have an assignment where a student could submit something done entirely in AI and fulfill the requirements of that assignment, maybe it wasn't a good assignment. And you should like, maybe this will encourage educators to rethink
Starting point is 00:40:29 their assignments and what they're actually grading on. How can they be more involved in the process with the student? Could assignments be more iterative and have more feedback? You know, that kind of thing. So like, I hope this actually makes faculty rethink how they're doing assignments and grading, what they're actually looking for. Are they just making students do bullshit busy work? Are they making TAs do busy work? Right? You know, that kind of thing. Oh, and then, like, be really intentional with your syllabus statements. So I don't necessarily think universities or educators or even courses should have
Starting point is 00:41:06 blanket AI policies because what you might think is okay or not okay might depend on the assignment itself unless it is for like a political reason. No, we're not using AI because XYZ reasons. But I found this document and it's linked
Starting point is 00:41:23 in my little presentation in the notes, that was in like an article or whatever, that is just like a collection of syllabus statements that is constantly being updated. And there's a couple good ones that have different statements that you can use depending on: are you allowing it with attribution, are you allowing it without attribution, are you not allowing it, or, you know, whatever. Or like, there was one from Plymouth State University that I really liked, that's like, students can
Starting point is 00:41:51 use it, but they're responsible for any offensive or incorrect or biased information, whether they created it or the AI did. And also, if they use it, they have to cite it and answer these reflection questions of like, did I ask further prompts? What did I edit? Did I learn anything? You know, that kind of thing. I liked that. And these syllabus statements come from all disciplines, humanities, comp sci, all kinds of things. So it wasn't just one certain realm or discipline. And so that was sort of like the most actionable thing that I think educators could do, is to be more intentional with their syllabus statements and not just go, AI bad. I mean, if you want to say AI bad, you can do that, but like, give a reason why and be intentional about that.
Starting point is 00:42:35 I didn't have to go too much into copyright, even though I could have. I was like, I probably know more about it than anyone in this school, ask me questions if you need to. I've got more copyright stuff too about registration. But on the AI checkers thing, like, the thing that people were doing for a while was feeding student essays into ChatGPT and asking if it wrote it. I hope the explanation of the transformer process explains why that wouldn't work, because it's estimating what the next word should be. That's what a GPT does, is it just guesses what the next word is supposed to be. It doesn't know if it wrote something or not.
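To make the "it just guesses the next word" point concrete, here's a toy next-word model, a simple bigram counter rather than a transformer, over a made-up corpus. The point carries over: the model only scores likely continuations; it keeps no record of what it has generated, so "did you write this?" isn't a question it can actually answer.

```python
from collections import Counter, defaultdict

# Toy next-word predictor (a bigram count model, far simpler than a GPT).
# It only estimates what word tends to come next; it has no memory of
# its own past output, so it can only ever emit a *likely* answer.

corpus = "the cat sat on the mat the cat ate the fish".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1          # count which word follows which

def next_word(prev: str) -> str:
    """Return the most frequent continuation seen after `prev`."""
    return follows[prev].most_common(1)[0][0]

print(next_word("the"))   # 'cat' -- seen twice after 'the', vs once each for mat/fish
```

A real GPT does the same kind of guess with a vastly bigger model over tokens, but the asymmetry is identical: prediction forward, no verification backward.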
Starting point is 00:43:13 So it's just going to tell you, yes, I wrote it, or no, I didn't write it. And that's kind of also what these AI checkers do. I don't know how they're working. I don't know if they're running it through an AI or if they're just saying this language sounds stilted, which is like, hey, what does that prove? So it causes a lot of problems. Plus, if you're feeding student data into GPT, remember those terms of service I mentioned: that student's copyrighted work has now been submitted by you into GPT illegally
Starting point is 00:43:43 and is now being trained on by ChatGPT. I just wanted to point out that this reminds me a lot of spam checking and block lists, which I've been dealing with a lot lately. Just because it's stilted, it's like, okay, is this stilted because it's spam, which has obviously been using ChatGPT kind of things lately? Is it stilted because it's GPT or whatever?
Starting point is 00:44:17 Or is it stilted because it's a non-English speaker? And especially if, like, it's a non-English speaker, like if you're getting email from a patron, right? So I've had to bring that up. I've also seen where someone submitted a CV or a cover letter and someone responded back, hi, no, please stop. Like, we are rejecting this because you obviously used AI to write this. And the person was like, I'm autistic.
Starting point is 00:44:44 Like, or like the lack of tone markers even in writing, that an autistic person might just be blunt and straightforward and not do flourishy bullshit that non-autistic people might do, so that person thought that an AI wrote it. It's like, hmm, that's not great. Yeah. And so like a lot of times when we're doing spam checks, when they actually get through our filter, it's basically judging whether or not the context actually makes sense. So like, is this just spam from some company that's trying to do promotional shit and they scraped your, you know, email off of LinkedIn, or is there something more malicious here that could end up being, you know, like malware or whatever? But it just makes me think,
Starting point is 00:45:35 because like those kinds of things make mistakes all of the time. So why would we think that with AI, you could just chuck something in there and be like, oh yes, it will be able to tell me accurately whether or not it wrote this? Even though, I don't know, I just, that kind of thing bothers me. Yeah. I kind of forgot where I was going on that. Sorry, guys. I thought I had something else about teaching, but maybe not. Back to the Copyright Office and copyright registration. So one of the top cases you'll see there, one of the registration claims you'll see, is the Zarya of the Dawn decision, which is a comic that was using art created by Midjourney, but it didn't disclose it because it was kind of before the disclosure rules. And so the Copyright Office made it modify the registration so that the parts of it that were AI were listed, and the fact that the
Starting point is 00:46:12 layout and the writing and the word bubbles and the comic panels, those were all done by the artist. And those can be copyrighted. But if I wanted to use that AI generated art, then I could, because it's not copyrightable, because the artist, even though they did the prompts and everything, did not have a significant creative hand in doing that. So it's simply not copyrightable. There was another case recently, Thaler v. Perlmutter. Thaler is a guy who loves insisting that
Starting point is 00:46:53 AI should be like a patent holder and a copyright holder, or at least have the AI listed as the author. So you'll probably hear his name again. This case from August 18, 2023, was that a two-dimensional artwork generated by AI was not eligible for copyright protection. The court upheld the Copyright Office's decision to deny registration because he submitted it as a work made for hire. So he was the copyright owner, but the work was made for hire by the AI that he created. So he insists on having the AI be the creator and not listing himself as the creator. So his whole thing is, he also had a patent case, and the Copyright Office is saying no. The fact that you're arguing that copyright's intent is to promote science and useful arts, that's kind of his argument,
Starting point is 00:47:45 it doesn't outweigh the fact that copyright law only applies to natural humans. So it doesn't apply to artificial, fictitious humans like companies. Companies can own copyright, but they're not the author. Those are works made for hire. So a human always has to be the author. And other cases on training data are still pending. But the human authorship requirement is pretty well explained by the Copyright Office now, where a human has to be involved.
Starting point is 00:48:15 So that's why the Naruto monkey selfie case was always about AI, and I've always said this, that that was always going to come up in, like, AI discussions, and it did. Because even though a human set up the tripwire, the framing of the photo, the angle, the composition of the shot, none of that was triggered by the artist. It was triggered by the macaque. So other examples would be: a painting by an elephant could not be copyrighted. Or appearances in natural material, so animal skin, rocks, driftwood. So if you see an image in a natural formation, you can't copyright it. This also applies to gardens. So like someone tried to
Starting point is 00:48:55 copyright their garden layout or something. It said, no, that's just an effect of nature taking its course. And also, authors have to be natural humans, they cannot be extra-dimensional or spiritual beings. So someone tried to copyright a song written by the Holy Spirit and put the Holy Spirit as the author. It doesn't matter if you were inspired. Like the Urantia Book, the Urantia Book is copyrighted, but you can't put space aliens as the author, like interdimensional beings. Fun side tangent about that. One of my favorite classes of name authority records is the spirit name authority records, where, so, we might have a, like, William Shakespeare name authority record. We might have a William Shakespeare, in parentheses, spirit, name
Starting point is 00:49:39 authority record, because it's someone writing a book but claiming that they are possessed by the spirit of something else, or that the spirit of something else is speaking through them as the true author of the book. And so we have a bunch of these spirit name authority records as well. I thought it would be fun to do a blog of finding books that are written under a spirit name authority record, and read them and do reviews on them. Yeah. Because it's just like, it was like my favorite thing ever in cataloging class. I was like, come again.
Starting point is 00:50:17 And you can also have copyright in things that are dictated to you, because the act of writing down and copy editing is considered sufficient enough for copyrightability, but you will have to claim the copyright for yourself. You can't do it on behalf of a ghost. Oh, so John Milton doesn't hold the copyright of whatever the fuck he dictated to his daughters while blind? Yeah, how does this work for like blind authors who have people dictate things? Well, because he's a natural person, he's the author. If he's dictating it, he's authoring it. And that would be considered authorship. So what are you meaning then?
Starting point is 00:50:49 So the transcription and copy editing of interpreting extradimensional voices is considered... Oh, we should talk about ghosts. Okay. We never stop talking about ghosts. I thought we had moved on from ghosts.
Starting point is 00:51:10 And if you are writing it down, that's considered enough for copyright protection, if you are the author. I see. Basically, the Copyright Office is saying like, yeah, sure, whatever, buddy. Okay, yeah, sure. There was a ghost here. Okay, yeah, you stuck your face in a hat and came out with the Book of Mormon. Yeah, okay.
Starting point is 00:51:26 Yeah, but since you wrote it down and did all the copy editing, you get to be the copyright holder. But only natural persons can hold the copyright. So if Thaler had not insisted that the AI was the author, he might have had a chance at registering the copyright. If he had said, like, I wrote the program, it did things that I told it to do, and the output is therefore my output, he might have been able to copyright at least part of it. And this goes back, Jay, you'll like this, this goes back to Burrow-Giles Lithographic Co. v. Sarony, which is where copyright gets applied to photography.
Starting point is 00:52:00 And so what happened was the famous portrait of Oscar Wilde, one of my favorite things, was being copied and illegally sold by the Burrow-Giles lithographic company, I think. Maybe I got those mixed up.
Starting point is 00:52:14 But their argument was photography does not have enough creativity because it's a natural process of chemical reactions and the judge said no,
Starting point is 00:52:23 the scene, the posing, the shot composition, the prompts, all of those things have a creative aspect to them and therefore are
Starting point is 00:52:33 copyrightable. And interestingly, something they said was maybe if there was some natural scene in which nothing was set up by the author, then you would have a case, which is exactly what happened in the Naruto case. Yeah. Or like if you dropped it, like if a camera like got knocked off of a table by accident and went off and took a picture, right? Yeah. No intent. Yeah. Whereas if you knock the camera off on purpose, hoping that it will take a picture as a way of a shot composition, I think that would then change it. Yeah, could. Definitely could. And then the rest I had was just about AI and
Starting point is 00:53:07 teaching, which we already covered. But yeah, the violation of student copyright is a big thing. And also, just the nature of how AI tries to guess things means it's unreliable as a judge of whether or not something was AI written. Yeah. I would also say, like, an important thing regarding using this in teaching, or for what educators need to know, is, again, there are other considerations beyond just academic integrity, and making a purposeful statement about, like, even though this is useful, we're not going to do it. Like, I still think that's a valid thing that people can do, is even knowing, you know, it could actually be a good tool for this and they might use it in this way and yadda yadda, you can still say no, you know, especially considering that this is a huge thing with both of the strikes,
Starting point is 00:53:58 the WGA and the SAG-AFTRA strike right now, is the use of AI to remove their livelihoods, right? Thinking politically as well, because this is a political issue as well as a technological one. All technology is a political issue. Exactly. Now the cyber is so big. I love that he called it the cyber. That's like a government thing. It's just the thing he picked up, but he just says everything weird.
Starting point is 00:54:26 I love it. So, potential applications of AI in libraries, which I didn't really get to when I talked about this at work. But like products trained on proprietary data, so like I said, people cutting deals. So articles. I think the first place we'll really see this is in law. So LexisNexis in particular, I think, is going to start coming out with AI tools trained on their data in order to give better tools to big law, right? So people who can afford it. I would imagine this would be popular with the database vendors as well, because they could do like enhanced search results, or even a different version of the apply-related-words or similar-subjects features, like AI-enhanced searching. Yeah, or generate citations that are checked against actual things, like CrossRef. CrossRef's stuff is pretty open, so you can query CrossRef however you want.
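As a sketch of that kind of citation checking, here's what a lookup against CrossRef's public REST API can look like. The `query.bibliographic` parameter is a real CrossRef works-endpoint parameter, but the sample response below is a made-up fragment in the shape the API returns, so this runs offline without actually calling the service.

```python
import json
from urllib.parse import urlencode

# Sketch of checking a citation against the CrossRef REST API.
# Build a query URL for the public /works endpoint:
def crossref_query_url(citation: str, rows: int = 3) -> str:
    params = urlencode({"query.bibliographic": citation, "rows": rows})
    return f"https://api.crossref.org/works?{params}"

url = crossref_query_url("Feist Publications v. Rural Telephone 1991")
print(url)

# Parsing the top hit from a (mocked) response body -- the DOI and title
# here are invented, but the nesting matches CrossRef's JSON shape:
sample = json.loads('{"message": {"items": [{"DOI": "10.0000/example",'
                    '"title": ["An Example Record"], "score": 42.0}]}}')
top = sample["message"]["items"][0]
print(top["DOI"], top["title"][0])
```

A real tool would fetch that URL, compare the top hit's metadata and relevance score against the citation a student or an AI produced, and flag citations that match nothing, which is exactly the check a hallucinated reference fails.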
Starting point is 00:55:23 Good API. Yeah. And also data brokers. So again, LexisNexis. But remember how I said it needs lots of data in order to make accurate guesses? If you start feeding in all of our crime prediction shit that Palantir runs off of, it can make more and more plausible associations. So if that data is being fed into AI systems as training data, that's a really scary thing. And again, you don't have to wait for like copyright to save you. This whole copyright, like what we've been saying from the beginning, copyright becoming more strict isn't going to
Starting point is 00:55:57 save you, because these big companies are just going to cut a deal anyway. And they've already got everything you own. Like everything you post on Instagram, it doesn't matter. Like, they've already got it. They've got a license to it. There's only five websites, and every one of them has terms of service that say, we own what you type in. I think Adobe had some internal documents that were saying, we're aware that we are facing liability, so let's just start cutting deals to keep running Firefly, or whatever their AI system is called.
Starting point is 00:56:25 And it's like, yeah, because they don't need fair use as an argument. They don't need copyright as a problem, because like so much of human creativity is just owned by two or three companies. And then the other companies are just the tech companies that want to use it, or they have access to it through hosting. You know, think about how much stuff Facebook owns, like all of Instagram, all of Facebook, all of the stupid Threads app. There's probably other stuff they own that I forget about. Don't they own like Snapchat or something?
Starting point is 00:56:54 Just all of that. That's huge platform capture. And they could just train anything on that. That took a long time to get through, and that's just really the introduction. But I think it was useful to go through how the technology works so that you understand where copyright comes in, and why copyright is not an effective tool for fighting this anyway.
Starting point is 00:57:12 But I am interested to see what happens with some of the training data cases, where people are like, you used my stuff and I didn't say you could. And I'm interested to see what happens with that, because I think they'll just take it on the chin and then just license it next time. I really don't think Google Books is going to be a good enough precedent for this, because this is just too commercial in nature, is the main thing, and it has too much of a market impact. But that's everything.
Starting point is 00:57:34 What we got? Plug the Discord. Ender subject. New episode out. I'm also going to be on an episode of Hereby Media soon. Hereby Media and, oh, I was thinking, if people wanted to ask questions about library school, like going to it, going through it, I thought that could be a fun episode. I don't know if we've taken, like, who had a good time in grad school. But I haven't figured out how to collect the questions, if we should use the question box or if we should have people email, because I don't know if everyone's going to want their email attached,
Starting point is 00:57:54 or if we'll just ask in the Discord. So if I think of something, I'll put it in here and then I'll put the link in the notes. You can do like a combo of those three things. I guess.
Starting point is 00:58:16 Yeah. So it's to check all of them. Yeah, if we're going to do that, then that's what I'll do. All right. Good night.
