Latent Space: The AI Engineer Podcast - AI Fundamentals: Datasets 101

Starting point is 00:00:09 Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-residence at Decibel Partners. I'm joined by my co-host, swyx, writer and editor of Latent Space. Today, finally, we have the second episode of our 101 track. This has been a long time coming. So last time we did Benchmarks 101, I think we got a lot of mileage out of that. We understood a lot about the benchmarks, and we talked with a lot of our guests over the previous episodes,

Starting point is 00:00:37 about how they evaluate their models. And today we wanted to dive into datasets, what they are, how they're constructed, and why they matter. I guess I should go into why we wanted to do this episode. It's a little bit weird to separate datasets and benchmarks. So we did benchmarks first, but a lot of the benchmarks were datasets. So pretty much they're one and the same thing, right? And I think where they start to diverge is a cause of significant interest. But mostly, actually, I wanted to focus on datasets

Starting point is 00:01:09 for one primary reason, which is that many people say that GPT is trained on all of the internet. So first of all, this is actually not true. And second of all, it actually causes some potential misperceptions. I say potential because there is some legitimate debate about this. There are misperceptions about us running out of data. And we can discuss the pros and cons of whether or not we are running out of data. It's been named the token crisis by academics and quite a lot of commentators on AI. So in the show notes, we're going to link to a paper, "To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis."

Starting point is 00:01:47 And then I'm also going to link to an opposite view from OpenAI, with Ilya Sutskever talking about how they're not anywhere close to running out of the data that they want to train on. So whenever there's such an interesting divergence between practitioners and academics, I think it's a worthwhile thing to dive into. And just in general, I think there's a lot of foundational knowledge that people skip over when they assume that everyone knows what datasets we're talking about. Yeah, I was going to say, I think also in terms of the knowledge that the models have,

Starting point is 00:02:20 if you say it's been trained on the whole internet, you would assume it knows everything on the internet, but it obviously doesn't. You can go in there and ask about people who have online presences and are not actually in the knowledge base of the model. And this also helps when thinking about what data to use to fine-tune them. So if you understand what's in the model, if you're trying to build a verticalized model for a specific use case, you can better figure out what's actually going to be meaningful versus what was already present in the first training run. Yeah, just for some comparison, let's say the total size of the internet, some people have estimated at around 5 billion gigabytes.

Starting point is 00:02:58 Most of the datasets that we are going to talk about today are in the hundreds of gigabytes range, and it's growing every single day. There's always new data being created every single day, and there's always new modalities to claim that data from. So a lot of the whispers behind OpenAI's Whisper are that they're actually transcribing YouTube, which is just a source of extra tokens, and we'll have to explain what tokens are.

Starting point is 00:03:24 The first thing is about whether or not we're running out of data. The second issue is this divergence between datasets and benchmarks. And I wanted to dive into this specifically because they used to be essentially one and the same thing. In a very standard machine learning tutorial, you would do something like the Iris dataset, and then you would do train, test, and validation splits, and you would basically evaluate data

Starting point is 00:03:51 based on samples from the data itself. But more recently, we actually have decoupled benchmarking from the datasets they're trained on, except for the calculation of loss. I think in our Discord, in the Latent Space Discord, we've actually been doing a small paper club where we've gone over some of the foundational papers. And we actually recently went over the BERT paper, which is the Bidirectional Transformers paper that was a predecessor to T5 and is a predecessor to all the large language models today.

Starting point is 00:04:21 And in BERT, they actually invented this concept of masking, which meant that datasets could create their own training objectives, which I actually think is super interesting. So basically, what you have is, for example, a sentence. And out of the sentence, you mask one word, and you ask the model to predict that word, and you grade the model based on whether or not it's able to predict that word. This basically starts to go from supervised learning, where you have a dataset that you're trying to train on, to sort of self-supervised learning, where you can just kind of let loose on an unlimited set of data. And so this basically lets you scale as much as you want on the data side, as much data as you have,

Starting point is 00:05:01 which I think is just really interesting and foundational. Like you don't have deep learning without self-supervised learning. You don't have a good training objective until you have the concept of masking. And once you have really, really good masking, then you start to find algorithms for that. And it turns out that deep learning is the way to do it, to achieve loss unseen by any other algorithm. And then once you have masking, you can predict. And then you can generate.
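The masking idea described above can be sketched in a few lines. This is a simplified illustration, not BERT's actual procedure: it hides exactly one whitespace-separated word, whereas BERT masks a fraction of sub-word tokens. The names `make_masked_example` and `grade`, and the example sentence, are hypothetical.

```python
import random

MASK = "[MASK]"


def make_masked_example(sentence, rng):
    """Hide one word of the sentence.

    Returns (masked_words, position, target_word). The raw text supplies
    both the input and the label, so no human annotation is needed.
    """
    words = sentence.split()
    position = rng.randrange(len(words))
    target = words[position]
    masked = words[:position] + [MASK] + words[position + 1:]
    return masked, position, target


def grade(prediction, target):
    """Score a model's guess for the hidden word: right or wrong."""
    return prediction == target


rng = random.Random(0)  # fixed seed so the example is repeatable
masked, position, target = make_masked_example("the cat sat on the mat", rng)
print(" ".join(masked), "->", target)
```

Because every sentence can be turned into a training example this way, the amount of usable training data is limited only by how much text is available.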

Starting point is 00:05:29 And then, you know, everything kind of follows from there. But that is the dataset and not a benchmark, right? Because you can have all of Wikipedia as a dataset, but nobody bothers to benchmark on Wikipedia because that's not a reasonable

Starting point is 00:05:41 benchmark to derive any score on. But it turns out that training on a dataset leads to higher evaluation on benchmarks that we covered in the last episode, like, what was it? HellaSwag. Like HELM. BIG-bench.

Starting point is 00:05:55 MMLU, HELM. Yeah, exactly. So I just think, like, that's fascinating, that we've just summed up, you know, now it seems so retroactively obvious, but we've just summed up maybe about 10 years of progress on deep learning. Yeah, and finally, the most important thing about datasets is, one, understanding how big they are and then using scaling laws to work yourself back into what size model you can train with them. So we talked about the Chinchilla scaling laws and we'll cover that later.

Starting point is 00:06:23 But if you want to train a model that is 100 billion parameters, you cannot just pick any amount of data. It needs to be a lot of it. So understanding, Common Crawl, how many tokens is that? C4, how many tokens is that? That helps you understand, okay, this is what I can get off the shelf. This is what I need to provide. And we'll dive into more of the datasets later.

Starting point is 00:06:45 But first, I just wanted to do a quick explanation of what tokens actually are. So the thing you read on the OpenAI docs is like one token is like three quarters of a word. So when I first read it, I was like, oh, you're just doing character splitting, but that's not really how it works. So basically, one token is an integer that can be, you know, up to, I actually don't know what the highest number would be, but it's an integer representation of words.

Starting point is 00:07:11 And the same word can also have different representations based on where it is in a sentence, for example. One of the big things that Transformers did to be more efficient is the space is included in the token. So if you're doing a long sentence, instead of having one token for each space in it, the space is inside the token of the word, so it really cuts down on how many you actually need to go through. But the funny thing is that then you have different representations for the same word. So if you take the word red, like the color, there's one token for lowercase red with a space in front, that's token 2266. Then you have red with a space in front with a capital R, that's 2297,

Starting point is 00:07:56 then you have red with a capital R, no space, that's token 7738. So you can see that when you say a trillion tokens, for example, it's not one trillion different English words. So that's also one of the main things that you've got to be careful of. You can have a dataset that is a lot of tokens, but has potentially a lot of repetition that is not as helpful. And also when you're doing things like a logit bias,

Starting point is 00:08:23 in OpenAI where you can deprioritize certain tokens, you have to find all the tokens for that. So the example that they use is, if you want to do a recipe for a cake with no eggs, you have to set the logit bias of both egg and, like, space egg, egg space tokens, not just one. So that's one thing to keep in mind. If you're coming from a non-ML background,

Starting point is 00:08:46 that's probably one of the first gotchas that you have when talking about datasets. And just for more examples, OpenAI has a tokenizer tool that we're going to link in the show notes that you can use to see what any particular phrase translates to. So I just plugged in, for example, latent space. Latent space is three tokens. It's L-A-T, that's the first token, lat. And then the second token is E-N-T, ent.
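The logit bias gotcha mentioned a moment ago (one word, several tokens) can be sketched as follows. The vocabulary and its IDs here are made up for illustration and are not real tokenizer IDs; in practice they would come from the tokenizer of the model being called. The only assumption about the API is that a logit bias is a mapping from token ID to a bias value, with a strongly negative value suppressing the token.

```python
# Hypothetical toy vocabulary: these IDs are invented for this example.
TOY_VOCAB = {
    "egg": 101,
    " egg": 102,
    "Egg": 103,
    " Egg": 104,
    "eggs": 105,
    " eggs": 106,
}


def surface_variants(word):
    """Spellings of one word that a tokenizer may treat as distinct tokens."""
    forms = {word, word.capitalize(), word + "s", word.capitalize() + "s"}
    # Each form can also appear with a leading space folded into the token.
    return sorted(forms | {" " + form for form in forms})


def build_logit_bias(word, vocab, bias=-100):
    """Map every known token ID for the word's variants to the same bias."""
    return {vocab[v]: bias for v in surface_variants(word) if v in vocab}


logit_bias = build_logit_bias("egg", TOY_VOCAB)
print(logit_bias)
```

Biasing only one of these IDs would leave the other spellings free to appear, which is the trap described above.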

Starting point is 00:09:11 And then the last one is space and then SPACE. And that's the last token. So latent space is three tokens, and some combination of that will form other words as well. And it's an interesting tokenization scheme. As far as I understand, by the way, I think the upper limit on tokenizers is between 50,000 to 80,000 tokens, which is amazing. It means that you can basically make up any sentence or language from these individual tokens. It's a small number.

Starting point is 00:09:39 It's actually like five digits of numbers, not millions and millions of tokens. And keep in mind that they speak multiple languages. So you have to actually tokenize the languages as well as numbers, as well as symbols, emojis. I actually am pretty amazed at the depth of tokenization. And honestly, there was actually a well-known flaw with GPT-3 where you could actually do a quick test to see if you're talking to a bot or not, right? One of the reasons that GPT-3 is just not good at math

Starting point is 00:10:10 is because they don't tokenize numbers individually and they don't represent numbers the same way that humans do. Humans represent numbers digit by digit. But if I type into this tokenizer, GPT-3, like I type in one, that's one token. If I type in one, two, which is 12, that's a different token. And that's a single token still. And if I type in one, two, three, so it's 123, that's also another single token. But if I type in one, two, three, four, that is now two tokens, made up of 12 and 34.

Starting point is 00:10:40 So it's breaking up one, two, three, four like an English word, rather than a mathematical representation of, you know, one thousand, two hundreds, three tens, and four ones. And that's one of the reasons it doesn't do math very well, because it looks at things as though it's a word rather than numbers. Yeah. Yeah, and you mentioned the language thing. That's another good point. Some languages are actually less token efficient than others.
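To make the contrast above concrete, here is a small sketch comparing the two views of 1234. The chunking `["12", "34"]` is the split reported in the episode, hard-coded here rather than produced by a real tokenizer; `place_values` is a hypothetical helper.

```python
def place_values(n):
    """Decompose a non-negative integer into (digit, place) pairs."""
    digits = str(n)
    return [(int(d), 10 ** (len(digits) - i - 1)) for i, d in enumerate(digits)]


# How a person reads 1234: each digit carries its own magnitude.
human_view = place_values(1234)
print(human_view)  # [(1, 1000), (2, 100), (3, 10), (4, 1)]

# How the episode reports the tokenizer splitting it: two opaque chunks.
# Nothing in the pieces says that "12" stands for twelve hundreds.
token_view = ["12", "34"]
print(token_view)
```

The place-value pairs sum back to the number, while the token pieces only concatenate back to the string.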

Starting point is 00:11:05 For example, Spanish actually requires more tokens for the same sentence. And they also have this issue where the lower the number of the token, the more common the token is. And if you tokenize a Spanish phrase and you have syllables like men in it, the token value of the token is very low because the English language is used very often. So you can have predictions that are a little weird when you use different languages, and we'll kind of go into language-specific datasets later as well. Yeah. There's a famous article about this called "Why is GPT-3 15.77x more expensive for certain languages?" And so the English bias is real.

Starting point is 00:11:47 So "I want a pizza." It's four tokens in English. And then in French, I don't speak French, but "je veux une pizza." That's seven tokens. And in Chinese, it's 15 tokens. I want to eat a pizza. I don't actually know what these characters are. But anyway, so it's interesting, right?
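Since API pricing is per token, the counts quoted above translate directly into relative cost. A minimal sketch, using only the three counts from the episode (4, 7, and 15 tokens); `relative_cost` is a hypothetical helper name.

```python
# Token counts for the same pizza sentence, as quoted in the episode.
token_counts = {"English": 4, "French": 7, "Chinese": 15}


def relative_cost(counts, baseline="English"):
    """Token count of each language as a multiple of the baseline language."""
    base = counts[baseline]
    return {language: n / base for language, n in counts.items()}


for language, multiple in relative_cost(token_counts).items():
    print(f"{language}: {multiple:.2f}x the English token count")
```

The same multiple applies to context-window usage: a sentence that takes more tokens also leaves less room for everything else in the prompt.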

Starting point is 00:12:06 Like all of this is represented within the token space that GPT has trained. And it's actually pre-trained. This is one of the first examples of pre-training being useful, because ultimately language models, or the transformers that power the language models, are transforming a set of numbers to a different set of numbers, or predicting the next token ID in a sequence of numbers. The transformers themselves don't actually know what word they're predicting. All they know is they're given some datasets of number after number after number, and in fact, like, trillions of them. And then their job is to predict the next number given a sequence of numbers. And then we take the tokenizers to convert those numbers into words or images or audio. We referenced scaling laws a little bit, and basically this comes out of research around what the optimal sizing of a model should be for a given dataset. I think in 2020, OpenAI published the first scaling law, which was the Kaplan paper,

Starting point is 00:13:06 and that was estimated to be about 1.7 tokens per parameter. So what that means is the reason that GPT-3 is a 175 billion parameter model is because they had around about 300 billion tokens to train with. So they worked backwards from the dataset size that they had of 300 billion. They said, okay, based on this, the largest possible model that we can train is 175 billion. Let's go for that. That was the state of the art at that time. And as far as OpenAI is concerned, larger models were always going to be better because of understanding around emergence and capabilities and honestly just research around what AGI could be. And just for a

Starting point is 00:13:45 reference, GPT-2 was about 100 times smaller than that. So that's super interesting. And then the year after that, Chinchilla came out of Google DeepMind, where they actually optimized for a different metric, which is compute-optimal training. So a given compute budget, a given amount of days and hours running a certain number of GPUs, so that gives you a certain compute budget in terms of FLOPs. If you hold that budget constant, so let's say, you know, that's roughly a few days or a few months of compute. That is actually very material for a research team or anyone because that translates directly into dollars, right? Like how much time are you renting on the shared GPUs? So for a given compute budget, what is the best model that you can train? And the

Starting point is 00:14:33 number they came up with was 20 tokens per parameter. So that's actually 10x what the Kaplan laws were doing. What DeepMind did was they had sort of a replica of GPT-3 that they called Gopher. And then they created another replica called Chinchilla and showed that despite being about 10x smaller, they were able to match or beat Gopher. And the assertion was that they would also beat GPT-3 despite being 10x smaller. What most people conclude from this is basically that GPT-3 was over-parameterized. There were way too many parameters. 175 billion was way too much. And in fact, to train GPT-3 as a full 175 billion parameter model, you need 3.5 trillion tokens, not 300 billion, which I think is just fascinating and a little

Starting point is 00:15:19 bit depressing. It means like we just need a lot more data. Yeah. And it just shows you how early this whole foundation model space is, you know, like these papers are not coming every 10 years, you know? They're coming like every 18, 24 months. And the other thing you mentioned, with the cost of compute, you know, if you think that 1.7x is good, you probably don't want to burn GPUs for like a much larger training scale,

Starting point is 00:15:45 you know, and now the next thing is the LLaMA optimal, which is 200 tokens per parameter. So to train GPT-3, you need 35 trillion tokens, which is like, if you take all the books published each year, like all of Kindle Unlimited, all of that stuff, it's only 100 billion tokens. So to get to like 35 trillion, you need a lot of data. But again, is this the new optimal? We also don't know, because it's not easy to find ways to train models at this size. Like, it's not easy to find the compute. It's not easy to find the data. So once we record Datasets 201 in a couple of years, we're probably going to say, it was crazy that we were using like 20x, you know? It's actually like this iteration now.
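The three ratios quoted so far (1.7, 20, and 200 tokens per parameter) make for quick back-of-envelope math. The sketch below simply multiplies parameters by the ratio, which is the rule of thumb used in the conversation rather than the fitted formulas from the papers; the dictionary labels and the `tokens_needed` helper are hypothetical names.

```python
# Tokens-per-parameter rules of thumb as quoted in the episode.
TOKENS_PER_PARAMETER = {
    "Kaplan (2020)": 1.7,
    "Chinchilla": 20,
    "LLaMA-style": 200,
}

GPT3_PARAMETERS = 175_000_000_000


def tokens_needed(n_parameters, ratio):
    """Training tokens implied by a tokens-per-parameter rule of thumb."""
    return n_parameters * ratio


for name, ratio in TOKENS_PER_PARAMETER.items():
    tokens = tokens_needed(GPT3_PARAMETERS, ratio)
    print(f"{name}: about {tokens / 1e12:.2f} trillion tokens for 175B parameters")
```

This reproduces the figures in the discussion: roughly 0.3 trillion, 3.5 trillion, and 35 trillion tokens for a 175 billion parameter model.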

Starting point is 00:16:34 One thing I want to basically mention is what, as far as I can tell, the researchers that I'm following are saying about this stuff. Because we've asked this question on the podcast every opportunity that I've had. And essentially, we're going from compute optimal, which is like at training time, to inference optimal, which is at inference time, right?

Starting point is 00:16:53 LLaMA was designed for an inference-optimal situation where we start caring about the latency of the inferences. And basically that just means the size of the models has to be smaller. They cannot be hundreds of billions of parameters. And they're all, you know,

Starting point is 00:17:08 round about this double-digit parameter size now, maybe even single-digit with the 7B and 3B-type models from MPT that we talked about with Jonathan Frankle. This is a nice evolution. It basically balances practical requirements. It's very funny coming from a software background studying AI because in software, performance means speed. Whereas in machine learning, performance does not mean speed. Performance means capabilities and evaluations on benchmarks, where inference time doesn't matter. But now inference time does matter because AI is crossing over from research into software, into practical application.

Starting point is 00:17:46 So now we do care about things like inference costs and memory that you need to run all these systems. So I just think it's just fascinating that, like, LLaMA optimal is like purposely overtraining Chinchilla. Like DeepMind showed that Chinchilla is compute optimal. And now we're saying we just don't care. Like we will purposely overtrain and be suboptimal there in order to be more optimal in inference because inference is more important now.

Starting point is 00:18:08 Yeah, exactly. The question is like, optimal for what? You know, if you're writing a paper, you're optimizing for like training and kind of building the proof. If you're training a model for a production use case, the training cost is actually just a small part of the overall lifetime cost of the model. So the pendulum kind of keeps swinging depending on the application. Maybe I'll go on to the next bit, which is LLMs as databases. Okay, this is interesting. So we have this concept of tokens. And then we have the concept of training in a compute budget. Basically, the training process can be abstractly viewed as a way to compress the datasets. For example, we have actually a nice conversion ratio between billions of parameters and the amount of data that they generate. So just as a rule of thumb, right?

Starting point is 00:18:58 Let's say each parameter is 8 bits. Usually it's 16, right? We always talk about FP16, like in our podcast with George Hotz. But let's just say 8 bits is one parameter. Then 175 billion parameters is 175 gigabytes, right? One billion parameters is one gigabyte. That is actually the definition of a gigabyte, which is one billion bytes.

Starting point is 00:19:19 So that's super intuitive. And so therefore, at full floating point precision, 16 bits, 175 billion parameters uses 350 gigabytes to store parameters. That's how much memory you technically need to do that inference and just to load the model itself. And most graphics cards will not even have that. Even the professional grade, like A100 cards,

Starting point is 00:19:43 would be like 80 gigabytes for a single one of them. So to fit 350 gigabytes, you need to network all these things together with fairly high bandwidth. But they're trained on 3,000 gigabytes of data. So that is 3,000 gigabytes of data being compressed into 350 gigabytes of data. And that is a form of compression because from there you can sort of, it's lossy compression, but it's a lossy compression in a way that learns how to decompress itself, such that when you ask it to spit out some facts, it spits out something

Starting point is 00:20:10 in the approximate neighborhood of what you started with. So I think that's an interesting analogy. Some people don't like this idea that LLMs are databases, but it's super cool. Another very prominent example of this I remember from last year was Stable Diffusion, which compresses all the images that it creates and all the images that it trains on into, I think, something like two to four gigabytes. Like it has knowledge of flying saucers and horses and humans and beaches and, you know, computers in images, all in like a downloadable file that you can host on your laptop, which is crazy. Yeah. I think people get very surprised by how many things the model knows, you know.
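The memory arithmetic from a moment ago fits in a few lines. This sketch uses only the figures quoted in the episode (175 billion parameters, 8 or 16 bits each, 80 GB per A100 card, roughly 3,000 GB of training data) and counts weights only; activations and other runtime overhead are ignored. The function name `model_size_gb` is hypothetical.

```python
import math

GIGABYTE = 1_000_000_000  # one billion bytes, as defined in the episode


def model_size_gb(n_parameters, bits_per_parameter):
    """Storage needed for the weights alone, in gigabytes."""
    return n_parameters * bits_per_parameter / 8 / GIGABYTE


gpt3_parameters = 175_000_000_000
size_8bit = model_size_gb(gpt3_parameters, 8)    # 175 GB
size_16bit = model_size_gb(gpt3_parameters, 16)  # 350 GB

# Minimum number of 80 GB cards just to hold the 16-bit weights.
cards_needed = math.ceil(size_16bit / 80)

# Rough "compression ratio" of training data to weights.
compression_ratio = 3000 / size_16bit

print(size_8bit, size_16bit, cards_needed, round(compression_ratio, 1))
```

Under these assumptions the 16-bit model needs at least five 80 GB cards, and the weights are a bit under one ninth the size of the data they were trained on.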

Starting point is 00:20:54 And if you were putting the raw, again, going back to how much memory you need, if you put the raw data in memory, it would be like impossible to actually run. But especially if you think about 16-bit versus 8 versus 4, all the quantization work is like compressing the compression, you know. Now all of a sudden you can put datasets and knowledge that was impossible to put on a certain device. Like I can run the RedPajama 3 billion model on my phone, and my phone has like 64 gigabytes of storage. If I were to put all the data that went into the training of it, it would run my phone way out of storage. Instead, it's only like a few gigabytes of model. So that's another interesting thing, especially as we think about using these models at the edge and how much stuff

Starting point is 00:21:43 we can get there on time. But that's for another episode, Quantization 101. And for those who are, again, coming from our George Hotz episode, at around the one hour mark he talks about compressing humanity, a person's consciousness or knowledge or all your life experiences. How much information, how many bits of information is that? He thinks it's two gigabytes. That's probably too much compression. You'd probably be a really bad copy of yourself. But there is a point at which you should probably be able to replicate yourself as a digital twin, which is the whole mind-uploading phenomenon that we discussed. Cool. I think we sort of maybe knocked on that a little bit too much, in fact. I want to basically verbally go through this chart of tokens and scale, because when we talk about,

Starting point is 00:22:28 we say things like billions of tokens, trillions of tokens, billions of parameters. It's just very, very big numbers that we don't know how to estimate. And this chart that we had from this person, srush, on Twitter, actually was super helpful to me. I was actually planning to make something like this, but this guy already made it. So I'm just going to quote him. So maybe we'll just kind of go through the token chart. So this is just memorizing tokens in terms of orders of magnitude.

Starting point is 00:22:55 Being able to do token math, I think, is super important, and order of magnitude math is super important when it comes to deep learning, right? Because everything is so big that the significant digits don't really matter. It's just the order of magnitude that really matters. Okay, so 10 to the power of zero, which is one, that's on the order of like a hello world, like individual word tokens, right? And then 10 to the power of one, which is like 10 tokens, would be like, you know, one sentence, one phrase or whatever like that.

Starting point is 00:23:22 10 to the power of 2, that's 100. That would be the "Blank Space" chorus, apparently, by Taylor Swift. Yeah, this is definitely a Swiftie because he talks about Taylor Swift later. 10 to the power of 3, that's 1,000 tokens. That's the Wikipedia article on Fermi estimation. 10 to the power of 4. Again, that's the Taylor Swift article on Wikipedia. 10 to the power of 5, that is the GPT-3 paper itself, including the appendices.

Starting point is 00:23:48 10 to the power of 6 is one year of The New Yorker. So that's a big jump, right? 10 to the power of 5 to 10 to the power of 6. That's a one order of magnitude jump, but we've gone from a single paper to one year's work. 10 to the power of 7 is the whole of Encyclopedia Britannica. 10 to the power of 8 is the number of Reddit posts per month. 10 to the power of 9 is English Wikipedia, all of Wikipedia. 10 to the power of 10 is the number of WhatsApp messages per hour. 10 to the power of 11 is the number of published books per year.

Starting point is 00:24:16 Not the number, the amount of tokens inside of published books per year. And then finally, we get to 10 to the power of 12, which is the order of magnitude that large language model datasets operate at. And so 10 to the power of 12 is 1 trillion. I was mostly surprised by the fact that Taylor Swift's Wikipedia page is longer than the Fermi estimation one. But the amount of data that you need for these models is great. And yet, I think the 1 trillion tokens is kind of like the MPT-30B that was released yesterday. So when we publish this episode, it's going to be pretty new.
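The chart just read out can be kept as a small lookup table for order-of-magnitude "token math". The entries below paraphrase the chart as described in the episode; the helper names `order_of_magnitude` and `nearest_reference` are hypothetical.

```python
# Order of magnitude (power of ten) -> rough reference point from the chart.
TOKEN_SCALE = {
    0: "a single word",
    1: "one sentence or phrase",
    2: "the 'Blank Space' chorus",
    3: "the Wikipedia article on Fermi estimation",
    4: "the Taylor Swift article on Wikipedia",
    5: "the GPT-3 paper, including appendices",
    6: "one year of The New Yorker",
    7: "Encyclopedia Britannica",
    8: "Reddit posts per month",
    9: "English Wikipedia",
    10: "WhatsApp messages per hour",
    11: "books published per year",
    12: "large language model training datasets",
}


def order_of_magnitude(n_tokens):
    """Power of ten of a positive token count, computed without floats."""
    return len(str(int(n_tokens))) - 1


def nearest_reference(n_tokens):
    """Reference point for a token count, capped at the top of the chart."""
    return TOKEN_SCALE[min(order_of_magnitude(n_tokens), max(TOKEN_SCALE))]


print(nearest_reference(300_000_000_000))    # GPT-3's training tokens
print(nearest_reference(1_000_000_000_000))  # a trillion-token dataset
```

Counting digits instead of calling a logarithm keeps the result exact for very large integers, which is all that matters when only the order of magnitude is of interest.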

Starting point is 00:24:51 But that's the amount of tokens that they used for there. Yes. We're trying to keep this episode current. It's an evergreen episode. Yeah, that's how much was used there. But it just gives you, again, a way to reason about this. You know, so when you read online next time, it's like a trillion tokens, you understand this is like 100 times all of English Wikipedia,

Starting point is 00:25:11 which is a lot of text to collect. Took us years to get that online. There's a question about whether we're running out, right? Like, are there more orders of magnitude to this? And arguably there are, but, you know, that is one of the issues that we'll discuss at the end. Just to recap, because I'm so excited about this, because this to me is an evolution in terms of smaller models, right? GPT-3, the breakthrough that got a lot of us interested, was 300 billion tokens for 175 billion parameters, right? That's the 1.7 ratio from Kaplan.

Starting point is 00:25:41 But then LLaMA is 1.2 trillion tokens for a 7 billion parameter model. So one order of magnitude higher tokens for one order of magnitude lower parameters. That's crazy. It really is. And I think, like, again, we already made this point, but it's like we're just really early in terms of how much data we need, what model size we need, so I wouldn't preclude use cases today based on the size of these models. And also for an enterprise, the other thing, the analogy that I like with databases is that when you get a database like, you know, Postgres, MongoDB, there's nothing in it. There's kind of like a cold start problem.

Starting point is 00:26:18 Like you install the database, you need to start putting stuff in it. Now machine learning in the past used to be similar, where before you even start using it, you need to collect a lot of proprietary data. You train your own model, and then you start to do the production. The way it works now is that the foundation model labs and researchers, like OpenAI, Anthropic and the likes, they've already done all the work for you.

Starting point is 00:26:41 So you already get these models, and Meta. Of course, we cannot miss one of our open source paladins. Once you get that at a company level, what you need to understand is, okay, I have this model that knows a lot. How do I prompt it and how do I fine-tune it to make it good for my use case? And then run my own inference on that fine-tuned plus prompted model. So the scale of data that you need as an enterprise is like so much lower. Like you don't need to collect billions and billions of examples and tokens.

Starting point is 00:27:14 The model, for example, for code, the model is already really good at code. You just need to give it, you know, dozens of examples, you know, maybe 100 if you want to be really thorough, and you're going to get really good performance. And I know, Shawn, you were a fan of Karpathy's presentation at the Build conference. Yeah, it was really insightful and authoritative, I guess, coming from Andrej. And recently at the Microsoft Build conference, Karpathy had a State of GPT talk that I think was very well received. And he is just really good at outlining the important things

Starting point is 00:27:50 in a lot of the mess that we have to wade through in order to understand language models. And so he basically outlined this famous slide, which is the GPT assistant training pipeline, that outlined essentially four kinds of datasets that go into making something like ChatGPT. And obviously he's extremely authoritative on that. And I just think it's useful to have this slide in your head. Again, refer to the show notes if you want to see the image. But the four datasets are raw internet, which is just the raw datasets that we pull from Common Crawl and Wikipedia and books and all that. And the second one is the demonstrations dataset,

Starting point is 00:28:26 where you're demonstrating ideal assistant responses, so basically prompt and response pairs. Then comparisons, so basically comparing outputs between, you know, output A, output B, and seeing which one's better, and just sort of reward modeling that on the language model. And then finally doing reinforcement learning with prompts. And so those are different kinds of data that have to be collected in different ways.
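The four kinds of data just listed differ in shape as well as in source. The record types below are a hedged sketch of what one example of each might look like; the class and field names are hypothetical and are not taken from the talk or from any particular library.

```python
from dataclasses import dataclass


@dataclass
class RawDocument:
    """Pretraining: plain text scraped from the web, books, and so on."""
    text: str


@dataclass
class Demonstration:
    """Supervised fine-tuning: a prompt with an ideal assistant response."""
    prompt: str
    ideal_response: str


@dataclass
class Comparison:
    """Reward modeling: two candidate outputs and which one was preferred."""
    prompt: str
    output_a: str
    output_b: str
    preferred: str  # "a" or "b"


@dataclass
class RLPrompt:
    """Reinforcement learning: a prompt only; the model writes the response."""
    prompt: str


example = Comparison(
    prompt="Explain what a token is.",
    output_a="A token is an integer ID for a piece of text.",
    output_b="Tokens are words.",
    preferred="a",
)
print(example.preferred)
```

Seen this way, a comparison record needs only a single judgment from a labeler, while a demonstration requires someone to write a full response.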

Starting point is 00:28:49 There are different orders of magnitude of them as well. So there's trillions of low quality, large quantity data from the raw internet. And then there's tens of thousands of demonstrations, hundreds of thousands of comparisons, because it's easier to just choose A or B, and then tens of thousands of prompts. So that's basically the kinds of datasets

Starting point is 00:29:08 that they think are representative of ChatGPT. Like what goes into building something like ChatGPT. Do we want to, at last, get to the datasets part? I know we had a little bit also on instruction tuning, but we have a bunch of things to get through, so maybe we want to start there. I'll just quickly mention that instruction tuning was another paper coming out of OpenAI,

Starting point is 00:29:30 and obviously a very important part. That is a subset of the demonstration stuff that we're talking about. There is some debate from the LIMA paper, L-I-M-A, about how much data we actually need. And so we interviewed Databricks, and Mike Conover is a very good friend of the pod, and they collected like 15,000 pieces of data

Starting point is 00:29:50 to instruction tune themselves. Open Assistant from Yannic also collected tens of thousands of responses to instruction tune on. And there's always some research that indicates that maybe, you know, it's not as much as we think we need. So LIMA is something that we'll call out as interesting there. But yeah, we can move on to the major datasets because this is Datasets 101. It doesn't have to be Today in Datasets, which can be a whole different podcast.

Starting point is 00:30:16 So first we start with Common Crawl. That is the OG, that is the bread and butter of every single dataset, including the image ones that we'll talk about later. So it was founded by this guy, Gil Elbaz, and I actually did some research on him, and then you fact-checked my research. So this guy, Gil, actually has his own Wikipedia page so you can look him up.

Starting point is 00:30:33 He basically started the predecessor to AdSense and sold it to Google and worked at Google for a while. And this was like in the late 1990s, early 2000s. And he saw firsthand how important Google's crawling was for Google and basically quit Google and started Common Crawl. Like basically quit Google and started empowering competitors to Google, which is kind of

Starting point is 00:31:01 interesting and scandalous. You did some interesting. What did you find? That's kind of like the beginning of the open web. So mostly the data was used for like surfacing pages. And then the whole big data thing kind of came to be. And one of Gil's ideas was like, okay, this data is not only good for like Google search, like indexing, there's a lot more work that you can do with it.

Starting point is 00:31:24 We're like, I think, 20 years into it almost. Yeah, 2008 was when they first published the first dataset. And it's one of the biggest ones out there. So there's 3.1 billion web pages, 400 terabytes of content, 43 million hosts. Only 46% of the content is in English, going back to our discussion before. It's run as a nonprofit.

Starting point is 00:31:46 So there's a lot of unique things about it. I don't know if today you will see a nonprofit getting started just to provide massive amounts of data to the public, especially in this world of AI where everybody's hiding the data that they have. So maybe we do need it, but I'm not sure if we're going to get it. I think it's actually super interesting how it got started itself. This is obviously a very significant effort that all research, all NLP research and all language modeling is downstream of.

Starting point is 00:32:15 They were started in 2008, but I think they were stealth for four years because the earliest example I can find of them releasing any of the data was in 2013. Sorry, they called it 2012, but the press release was 2013. And it looked like they used to crawl once a year, crawl as far as they know all the internet. And now they crawl once every two months, right? And it's just an interesting example of a nonprofit-driven approach that people don't really question or look into, but it's actually secretly driving all of the LLMs that we

Starting point is 00:32:47 have today. There's some issues that are very well known in Common Crawl. So what Common Crawl actually says is that they only cover a fraction of the web. It's a nonprofit, works on nonprofit resources, doesn't cover all of the web. And all language models are trained on Common Crawl. Therefore, our language models are not trained on all the web. It is also a biased sampling. It's definitely biased towards the United States.

Starting point is 00:33:12 There's a lot of data quality issues. I think we talked about this in some of our other episodes, where the labels for some of the languages that Common Crawl has might be completely off. Like I think someone mentioned the Arabic tagging issues: if you actually look at all those pages that are tagged as Arabic, they're not Arabic at all. So just like really, really basic coding errors or, you know, nobody checks these trillions of pages that are being crawled.

Starting point is 00:33:40 So it's just really, really difficult. If you have a robots.txt that blocks Google, you will also block Common Crawl. If the page is too big because of just the sheer amount of data that's on it or the images on it, the pages are deleted, or if they're duplicated on multiple sites because of spammers, it's just really, really difficult. Or if your pages are written as SPAs in JavaScript, because Common Crawl doesn't render JavaScript. It only executes limited JavaScript. So, for example, much of Facebook is not in Common Crawl, right?
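Editor's sketch: the robots.txt point can be checked with Python's standard urllib.robotparser. This is only an illustration under stated assumptions: the two robots.txt bodies and the example.com URL are made up, and the helper name allowed is hypothetical. CCBot is the user agent Common Crawl's crawler identifies itself with. A blanket rule shuts out every compliant crawler, Common Crawl included, while a rule that names only Googlebot leaves other crawlers unaffected.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks every compliant crawler.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Hypothetical robots.txt that names a single crawler only.
BLOCK_GOOGLEBOT_ONLY = "User-agent: Googlebot\nDisallow: /\n"

# Made-up page URL used purely for illustration.
PAGE = "https://example.com/some/page"

if __name__ == "__main__":
    for label, rules in [("block all", BLOCK_ALL),
                         ("block Googlebot only", BLOCK_GOOGLEBOT_ONLY)]:
        for agent in ("Googlebot", "CCBot"):
            print(label, agent, allowed(rules, agent, PAGE))
```

So whether blocking Google also blocks Common Crawl depends on how the rule is written: a wildcard rule does, a Googlebot-specific rule does not.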

Starting point is 00:34:12 And this is increasingly a problem with the closed gardens or the walled gardens of the internet, right? As information migrates from the open web into apps like Discord and Slack and Facebook, it's just not available to Common Crawl. Now, then, Common Crawl has its own clean subset, so to speak, which is Google's C4, which was created during the training of their T5 model. Jonathan Frankle on our podcast called it weirdly good. I think that kind of explains a lot about datasets. Sometimes they're good and we don't know why. And C4 is made by using a few heuristics. So it's about 10%, I think, of Common Crawl.

Starting point is 00:34:55 Like it's much smaller. And it tries to filter Common Crawl in different ways. So one, it's using this open source thing called List of Dirty, Naughty, Obscene, and Otherwise Bad Words, which is 402 terms written in English and one emoji, which you can guess which emoji it is. The list was created by Shutterstock, actually. They basically wanted to avoid bad words being auto-filled in their search. So they created this blacklist of things that they wouldn't auto-fill for the user search, which I think is a funny way to end up being one of the foundational pieces of modern large language models.
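Editor's sketch: the blocklist heuristic described above amounts to dropping any page that contains a listed term. The version below is a minimal illustration, not the actual C4 pipeline. The two placeholder terms, the tokenizer regex, and the function names are all assumptions made for the example.

```python
import re

# Placeholder terms standing in for the real blocklist (hypothetical values).
BLOCKLIST = {"badword", "worseword"}

# Very rough word tokenizer; an assumption for this sketch.
WORD_RE = re.compile(r"[a-z0-9']+")


def contains_blocked_term(text, blocklist=BLOCKLIST):
    """Return True if any whole word of text appears in the blocklist."""
    tokens = set(WORD_RE.findall(text.lower()))
    return not tokens.isdisjoint(blocklist)


def filter_pages(pages, blocklist=BLOCKLIST):
    """Keep only the pages that contain no blocked term."""
    return [page for page in pages if not contains_blocked_term(page, blocklist)]


if __name__ == "__main__":
    sample = ["A clean page about patents.", "This page has a BadWord in it."]
    print(filter_pages(sample))
```

A filter like this is blunt: one listed word removes the whole page, regardless of context, which is part of why what survives and what gets dropped can look surprising.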

Starting point is 00:35:33 The other thing is there's a lot of stuff that didn't get filtered out, like certain pieces of 4chan, like Kiwi Farms, things like that. Again, the episode is not about, you know, giving our judgment on the data. It's just about what's actually in them. And there was a Washington Post article that kind of went through the whole list that we're going to link in the show notes. The other thing that I found fascinating is that if you look at what domains are in the C4 dataset, the patents.google.com website is like twice as large as the second one, which is Wikipedia. And then Wikipedia is like three times as large as the third one, which is scribd.com.

Starting point is 00:36:14 So there's, you know, it's obviously like less than half of a percentage point, but it's still interesting to see very formal kind of like text as the largest represented one. But yeah, C4 is another obviously core data set. So if you're looking to train your foundation models, that's one you should check out. Yeah. And in fact, a lot of models will list both Common Crawl and C4 as part of their datasets. And it will be a very, very heavy weight. It will be something like 30 to 60% of the amount of the token budget that they have,

Starting point is 00:36:50 which is super important. I mean, this is the starting point of all of our language models. It's really, really fascinating. Going off of the list, basically trying to reintroduce where GPT3 gets its datasets from. So if you pull out the GPT3 paper, we're basically going in order of explaining which of the data sets and telling a little bit of the story behind each of the data sets, right? Wikipedia is the next dataset, and obviously it's a very high-quality data set

Starting point is 00:37:19 because a lot of people have spent a lot of hours editing them. But except for the fact that Wikipedia itself has its own bias, right? I always have this fun fact that I pulled from Google here. 77% of Wikipedia articles are written by 1% of Wikipedia's editors, meaning there's just an extreme, extreme skew in terms of the representation of the kind of people that write Wikipedia articles and the decisions that are being made, right? In particular, this one guy, Steven Pruitt, who has constantly made the rounds as the highest ranking Wikipedia editor. He's made over 5 million edits and has made at least one edit to one-third of all English Wikipedia articles. So if you want to basically seriously affect machine learning datasets or large language models,

Starting point is 00:38:08 you should edit Wikipedia, is what I'm saying. That's funny. It's not just Wikipedia, you know. Yeah, Reddit. Yeah, exactly. WebText is another major data set that was also used in GPT-2. It's about 45 million links and the text of those web pages. The way they collected it is basically scrape every URL from every Reddit submission up to December,

Starting point is 00:38:31 2017, that had at least three upvotes just to make it somewhat, I guess, like, less spammy. And they've removed all of Wikipedia from it. So there's no Wikipedia in this. It's all Reddit links that are not Wikipedia, up to 2017. And I forget when the release date of this was. And then they did another round of heuristic-based cleaning, which again is like, who knows what that means, which makes it kind of complicated to then scale these models, these datasets, right? Because if we knew the cleaning process, then we could say, oh, let's take all

Starting point is 00:39:03 submissions from like 2010 and clean them up, but we don't actually have the step-by-step rules for some of them. That said, it's been replicated by the Eleuther organization, right? Eleuther is one of the organizations that are consistently shouted out by our guests as doing really good work in LLMs. And so they've replicated WebText from OpenAI. So OpenAI, as far as I know, did not release WebText. They did not really release the rules around them, but the Eleuther organization created OpenWebText, which is an open source reproduction of WebText. And so we're going to leave in the show notes the link to OpenWebText2, which is the latest reproduction of WebText. We also mentioned the issue with Reddit,

Starting point is 00:39:43 which is also a very hot topic right now with them shutting down their APIs. But let's keep moving on in terms of data sets. Next we'll go on to books. So this is just basically, as far as I can tell, open source books, quote unquote open source books. It's a data set of 196,000 books in plain text for training large language models such as GPT. And it's also included in the Pile, which we'll talk about later. But one of the interesting factors in the books dataset is that apparently the copyright for some of them is not clear. So for example, when Jonathan Frankle and Abhi trained MPT-7B, they actually got attacked by this person who was basically saying, how can MPT-7B or MPT-30B be commercially

Starting point is 00:40:31 licensed if the data set it was trained on had some issues with copyright, right? And they went back and forth from it. We linked to that discussion on our episode page. But it's an interesting thing. Like if any of these are compromised, like if they're not copyright-free books, then that's an issue. So we need to actually make sure that our data sets are clear such that our models are clear. Yeah. And then copyright expires. So hopefully the more time goes by, the more data we get. Yes, yes. But like unless Disney says they want more time and then Disney will extend the copyright.

Starting point is 00:41:06 But as far as I understand, I think Sherlock Holmes copyright recently expired. So a lot of people are making like Sherlock Holmes fanfic now because you can start using that. And I think maybe like Winnie the Pooh maybe. I don't know. I forget what the recent expiry was. But every year there's there's an expiry day and a bunch of stuff becomes public domain, which is useful for, I guess, training and children's book storytelling. Okay, then we go from everything we've covered so far as natural language data sets,

Starting point is 00:41:35 but then we go from there to code datasets. And there's a lot of different code datasets. Salesforce CodeGen, I think I would point to as a predecessor to what I'm going to talk about. The most significant one in my mind is The Stack from Eleuther. It's basically a scrape of a GitHub archive, 6 terabytes of permissively licensed code data. So the beautiful thing about code datasets is that most of the code on GitHub will include a license file which will tell you what license they have.

Starting point is 00:42:02 And so you can just kind of pick the set of licenses that you think are permissively licensed and you can kind of just get them out. So the raw data was 102 terabytes from 153 million repos, and that's 320 times larger than Wikipedia. Then you clean them for file extensions. So, for example, you can take out the images and take out blob files for whatever reason. And then you're left with 69 terabytes.

Starting point is 00:42:25 Then you clean out the licenses, and that's 90% of code that's thrown away because they're not permissively licensed or they have no license at all. Please, please, please, please, by the way, if you open source any code on GitHub, please add a license file so that issues like this don't come up. And so we end up with six terabytes of permissive code data that anybody can train on. It's great. Yeah. And they didn't stop there with datasets.
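Editor's sketch: the license step described above can be pictured as an allowlist filter over repository metadata. This is a minimal illustration, not the actual pipeline. The Repo record, the allowlist contents, and the sample sizes are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative allowlist of license identifiers (an assumption, not the real list).
PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"}


@dataclass
class Repo:
    name: str
    license: Optional[str]  # None means no license file was found
    size_bytes: int


def keep_permissive(repos: List[Repo], allowlist=PERMISSIVE) -> List[Repo]:
    """Keep repos whose detected license is on the allowlist."""
    return [r for r in repos
            if r.license is not None and r.license.lower() in allowlist]


def kept_fraction(repos: List[Repo], kept: List[Repo]) -> float:
    """Fraction of total bytes that survives the license filter."""
    total = sum(r.size_bytes for r in repos)
    return sum(r.size_bytes for r in kept) / total if total else 0.0


if __name__ == "__main__":
    repos = [
        Repo("a", "MIT", 10),
        Repo("b", "GPL-3.0", 60),
        Repo("c", None, 30),
    ]
    kept = keep_permissive(repos)
    print([r.name for r in kept], kept_fraction(repos, kept))
```

With the figures quoted in the episode, 6 of 69 terabytes surviving works out to roughly 9 percent kept, which lines up with the "90% thrown away" remark.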

Starting point is 00:42:47 There's also the Pile, which is like my favorite dataset name. It's sort of like a pile of data. It's 825 gigabytes from basically like 22 smaller datasets combined. So in there you'll find PubMed data, arXiv papers, GitHub data, the Free Law Project, the Ubuntu IRC channel, Hacker News, YouTube. There's a bunch of things in there. The thing that we know about this from Stella from Eleuther, from one of her tweets, is that only about one-third of the contents are duplicated, and we'll chat a little bit later about deduplication and whatnot, but they actually deliberately up-sampled the original data set to include some of the duplicate data. So even without that, the data set is quite large,

Starting point is 00:43:37 but at 825 GB of kind of mixed data, it's another one of the core data sets people use. Yeah, so obviously we can keep going and going and going. There's no end to these datasets. Alessio just mentioned a whole bunch of interesting data sets that we don't have time to go into, right? But you can always research them if you want to. We're just trying to give a 101. But no 101 would be complete without mention of other modalities of datasets. So I'll just mention two of them, and then hopefully we can just kind of go on to issues. Otherwise, this will be a 10-hour lecture.

Starting point is 00:44:10 LAION is the Eleuther for images. LAION stands for Large-scale Artificial Intelligence Open Network. That is the sister organization that came out of Eleuther that Stability and Emad Mostaque worked with to create Stable Diffusion, right? So this is all the predecessors of Stable Diffusion. And it actually came out of COVID, right? Everyone was sort of bored sitting at home and looking at DALL-E and going, why don't we have an open source replication of DALL-E?

Starting point is 00:44:37 And so the first thing that you do when trying to replicate a model is you go collect data, right? And so LAION in 2021 collected 400 million images from where? From Common Crawl, right? Common Crawl keeps coming up. It's the OG. It's so GOATed. And then in 2022, they released LAION-5B, which is 5 billion images for data sets. They've also released an aesthetic subset of the LAION-5B dataset.

Starting point is 00:45:03 And basically, there's a lot of filtering when it comes to images. Obviously, there's a lot of porn that you need to filter out and not safe for work. And also copyrighted images, right? You have to make some decisions around, do you want images of Spider-Man? And I was just at the Figma conference yesterday where Figma was very, where Adobe was very proudly showing off Adobe Firefly, which has no idea what Spider-Man is. And for them, it's a feature because for the kind of companies that are using Adobe to create

Starting point is 00:45:30 images, they need to not run into trouble with Disney. So it's fantastic. It's very hard to get images of Spider-Man out of your image dataset. And then Whisper is the other one. So instead of images, we need transcription, like ASR. So Whisper, if you look at the paper from OpenAI, again, it's LibriSpeech, Common Voice, VoxForge, Switchboard, and Fisher Corpus.

Starting point is 00:45:54 That's all I know about them. There's a lot. Whisper is so good. I feel like nobody's saying, let's do another Whisper. You know, I think the NLP and text part is the most active one right now. Yeah. So for those who want to research more datasets, I think the best place to go is probably Hugging Face Hub.

Starting point is 00:46:10 A lot of people for training datasets, they also go to Kaggle. I maintain a list of useful big datasets in my repo of useful resources, so I'll link that in the show notes. But yeah, I think that's going to be the high-speed tour over all datasets. So again, we'll come back to the key question, like, why are data sets important? First of all, like, we have to know the fundamentals about data sets to figure out whether or not we're running out of it or how we're using it. But also fundamentally, I'm a little bit concerned about the ratio of dataset producers to dataset consumers, right? Everyone wants the glory of training language models and saying, like, I trained this model that does X, but not that many people are interested in cleaning data, right?

Starting point is 00:46:52 Like, it was just a common meme in the enterprise sort of data science world, the machine learning world, that everyone wants to be data scientists. No one wants to be a data janitor. Right. So I don't know if you've, like, run into any of these conversations in your line of work. Yeah, that makes sense. I think, like, especially now, there's a lot of pressure on companies to use AI. And everybody wants to use AI. Nobody wants to do the work of getting the data ready to make AI useful for their company. So it definitely resonates.

Starting point is 00:47:20 Yeah. And companies that are sitting on top of a lot of data are actually realizing that they are sitting on a lot of gold. I call this data is the new oil part two, right? Bloomberg recently came out with BloombergGPT because guess what? They have the license on a few decades worth of Bloomberg financial news reports. So if you ever need to generate or do any reasoning around financial data, Bloomberg has the best data set by far. And it's closed source and you have to pay Bloomberg to do it. Right. Notion, if you think about it, Notion has 22 million users and all their users use their Notion as a knowledge store. Right. So Notion has a tricky issue because Notion doesn't own their customers' data. They just hold it for them. And so they don't actually have the right to train on them. And so it's like this tricky dance between them. But if you enable, for example, customers to fine tune on their own company data within

Starting point is 00:48:13 Notion just with a single click, that becomes extremely valuable as well. So individual companies are realizing their moats. Just yesterday, the Stack Overflow CEO was saying, like, you know, they're joining Reddit in terms of shutting down their API access because they realize they have good data as well and they want to be paid. So, and that's a part of the issues that we'll go into. But finally, we should also cover the counterculture movement for open datasets, like open replication of datasets.

Starting point is 00:48:38 And Yannic Kilcher, I think, is one of the leaders here, as well as Eleuther, for reproducing the instruction-tuning datasets that people will need to train their own ChatGPT. It's pretty interesting that YouTube influencers are coming to the rescue of open source, because there's no other source of influence that is powerful enough to compete with OpenAI except for YouTube. Yeah, I think that was fair, Sean. Maybe we want to run through some of the issues and kind of things to keep in mind for the datasets. Yes. Okay, so I put this first because it's also fairly current. There's always this issue of data set quality, right? So when I ask researchers like, hey, why do you think we're running out of data? There's obviously hundreds of petabytes of data produced every single day. Why don't we just use that as data? And the typical answer to me is, well, that data is low quality, right? Which is true. And you need diverse sets of data as well. So for example, one of Stella's tweets about how we're not running out of data says, oh, there's like, you know, hundreds of terabytes of legal filings that are generated every single year. But all those legal filings have the same format. We don't learn very much by going over, you know, 1,000 pages of parking fines or whatever, right?

Starting point is 00:49:52 Like, they have to be unique and actually useful and high information. And so curating those datasets and making them useful is becoming more and more important. And the way that we know this is actually from something that happened recently, which is Microsoft started training this small language model, phi-1, that is 1.3 billion parameters. So not even a large model anymore. It's just one billion parameters. So this is the size of GPT-2, on a dataset size of 7 billion tokens. So again, way smaller. This is a 7 to 1 ratio, not 200 to 1. Way, way, way smaller. And it's basically comparable in terms of the benchmarks to state-of-the-art models that are something like 10 times bigger, right?
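Editor's sketch: the ratio mentioned above is training tokens divided by parameters. The snippet below uses only the two figures stated in the episode, 7 billion tokens and 1.3 billion parameters. The function name is hypothetical.

```python
def tokens_per_parameter(tokens: float, parameters: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / parameters


# Figures as stated in the episode.
PHI1_TOKENS = 7e9
PHI1_PARAMETERS = 1.3e9

if __name__ == "__main__":
    exact = tokens_per_parameter(PHI1_TOKENS, PHI1_PARAMETERS)
    rounded = tokens_per_parameter(PHI1_TOKENS, 1e9)  # parameters rounded to one billion
    print(f"{exact:.1f} tokens per parameter, {rounded:.0f} if rounded to 1B parameters")
```

The spoken "7 to 1" corresponds to rounding the model to one billion parameters. With 1.3 billion the figure is a little over 5 to 1, which does not change the point being made.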

Starting point is 00:50:39 Like, it's comparable on HumanEval, because it's a code generation model. It's comparable on HumanEval and MBPP, which is their own benchmarks. And so it's just like very interesting. I think that's something that's an emerging area of research, like how to improve the quality of data such that you train smaller models with fewer tokens, right? That is the final step of this evolution. But also Falcon 40B, the model that came out of the UAE that is now the top open source model. That was also trained on a proprietary new data set that the UAE government collected as well.

Starting point is 00:51:11 And so that's just super interesting that you are not competing on size anymore. You're competing on quality. Yeah. The other issue that we mentioned, obviously, is copyright and privacy. We're not going to go over the same issues again, but there's a couple of interesting things going on. So there's the Stable Diffusion litigation, which is from the same counsel that is doing the GitHub Copilot litigation. Basically, the whole idea is that, hey, this is not really fair use. You cannot really use my data to do this.

Starting point is 00:51:39 So they're basically suing the model trainers on whether or not they should be able to use their work, even though they didn't specifically license it, you know, to not be used. There's kind of like an ethical question there. The interesting thing here is that now some of the AI providers are siding with the user. So, for example, if you use GitHub Copilot and Copilot generates GPL code, which is licensed, and in theory you should not use it. It's basically like a contamination license. If you use GPL code in your code base, all of your code base becomes GPL and you need to open source it. I actually wrote a very long post on the history of open source licenses,

Starting point is 00:52:18 which we'll link in the show notes. But basically, GitHub is saying, we'll literally pay for your lawyers and, like, we'll send lawyers to fight on your behalf. So it's really interesting how the risk piece is not being cleared by saying, hey, we definitely 100% do not use copyrighted data. They're basically saying maybe we do, and if we did, we messed up and we're going to pay for it, but ideally we're not doing it. And there's also different articles out there that I mentioned before by the Washington Post and companies like that where a lot of the training data comes from newspapers, comes from magazines, and people have not always opted in to having that as training data for

Starting point is 00:53:00 the model. So some of the work then comes out in the inference. So yeah, that's kind of like another interesting piece of development. And again, if you're training your own model, you should be really careful. If you're using an off-the-shelf model, you should also be somewhat careful. But it seems like there's a lot more insurance on that. Yeah. Licensing issues also come in a form of terms of use, which is not an official license,

Starting point is 00:53:25 but it's something you agree to when you use services. So OpenAI has this very famous clause in their terms of service, which basically states that you can't use OpenAI output as input for training your models, which is exactly what the Alpaca and Vicuna students did at Stanford to train their models that now compete with ChatGPT. So this is why in our conversation with Mike Conover, he was very excited about RedPajama, which is an open source replication of LLaMA, because LLaMA also has similar licensing issues. LLaMA doesn't allow you to use it commercially, right?

Starting point is 00:54:02 All these licensing issues, copyright issues, permissions issues are emerging areas that are being litigated. People are coming up with different ways to license this stuff. So, for example, Hugging Face has this RAIL license, Responsible AI License, that is different from MIT, different from Apache 2. And that's the license that Stable Diffusion is under. But it has never been litigated in a court of law. It is not accepted by the OSI as open source. So it's just unclear. Like, can you use it?

Starting point is 00:54:31 You have to consult your lawyer, quote, unquote, which is a real cop-out to basically say nobody knows until some judge rules when a case is brought up. So that's a licensing thing. There's a lot of work there, too, like Hugging Face has built like a PII removal pipeline for like their development. You can also go on Hugging Face and check if your information is in The Stack and whatnot. But again, we're going to put all of this in the show notes, so we don't bore you live with all the details.

Starting point is 00:55:02 Yeah, just so people know, we have 17 pages of show notes that we've been collecting for two months. So it's just a little bit of a crazy thing to compress into one hour, but let's try to do that. All right. The next few issues are to do with quality as well. So we'll talk about deduplication and filtering. And basically, there have been studies done on the amount of duplicates in OpenWebText and in C4, and there's still quite a significant amount. It's pretty interesting because every time you duplicate something, you're exposing the model to that set of raw data again without knowing it.

Starting point is 00:55:33 And therefore, it's basically going to try to memorize that text way more because it's just been exposed to it a lot more. And you're just basically wasting compute because that's not the kind of training that you want. So there's a bunch of research here that is reflective of studying these datasets and identifying those duplicates and removing them. But just the impact of this is really interesting. So we have this paper here that basically states that a sequence that is present 10 times in the training data is on average generated 1,000 times more often than a sequence that is present only once. So basically there is a disproportionate amount of weight that is placed on repeated information, which makes sense.
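Editor's sketch: the simplest form of the deduplication discussed above is exact-match removal after light normalization. This is an illustration only. The normalization rule and function names are assumptions, and it does not reproduce the method of the paper mentioned, which may also target near-duplicates and repeated substrings.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedupe(docs):
    """Return docs with exact duplicates (after normalization) removed, keeping first occurrences."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    corpus = ["Hello  world", "hello world", "Something else entirely"]
    print(dedupe(corpus))
```

Hashing keeps memory use small because only digests are stored, not the documents themselves. It only catches copies that are identical after normalization, so a page with one changed word would slip through.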

Starting point is 00:56:09 If you show a model repeated information, it's going to overfit to that set of information. But it's on the order of 1,000 times more frequent, and that is a concern when trying to train general purpose models. The other thing is contamination. It's especially related to benchmarks. So, for example, one of the things that GPT-4 showed is like, oh, they do very well on Codeforces code puzzles. But you'll see that the model, for example, does 10 out of 10 on the pre-2020 problems,

Starting point is 00:56:41 and then on the more recent ones post-training cutoff date, it does zero out of 10. And it's basically, I think I looked at the scoring, it's like worse than a person doing it wrong on purpose, which is, you know, it's actually pretty impressive. So when you're training a model, understanding what goes into it also helps you understand how to benchmark it. Like if you're benchmarking it against things that the model has learned, it's not super helpful. So that's another thing to keep in mind. Yeah. And this is why also, you know, releasing models and showing how you eval your models or releasing data sets, right? That's the topic of this episode.
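Editor's sketch: one common way to look for benchmark contamination is to check whether long word n-grams from a benchmark item also appear in a training document. The version below is a minimal illustration under stated assumptions: whitespace tokenization, a window of 8 words, and made-up example strings. It is not the procedure any particular lab uses.

```python
def ngrams(text: str, n: int):
    """Set of word n-grams in text, using lowercase whitespace tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(train_doc: str, benchmark_item: str, n: int = 8) -> bool:
    """True if the training document shares at least one n-gram with the benchmark item."""
    return not ngrams(train_doc, n).isdisjoint(ngrams(benchmark_item, n))


if __name__ == "__main__":
    # Made-up benchmark problem and training documents.
    problem = "write a function that returns the sum of two integers a and b"
    leaked = "forum post: write a function that returns the sum of two integers a and b in python"
    clean = "a blog post about common crawl and how web pages are collected every month"
    print(is_contaminated(leaked, problem), is_contaminated(clean, problem))
```

A check like this only works if the training data is available to inspect, which is why releasing datasets matters for trusting benchmark numbers.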

Starting point is 00:57:17 It's very important. So when Falcon 40B came out, they went right to the top of the Hugging Face Leaderboard on the benchmarks that Hugging Face Leaderboard provides. But we actually don't know if Falcon 40B's datasets were contaminated with the things that are evaluated on. If you just copy paste the exact results of the test that you're going to be tested on, of course you're going to do extremely well on it. There's actually a current thread of people.

Starting point is 00:57:41 Now that the model at least has been released, the dataset has not been released, but the model, as far as I can tell, has been released, people are replicating those evals and finding that it actually is falling shorter than claimed. So I haven't done it myself, so I can't really claim to know one way or the other. But I think it's just one of those things where you have to have some healthy skepticism

Starting point is 00:57:59 because there's a lot of people trying to game benchmarks, and the easiest way to game benchmarks is to conveniently forget to remove the benchmarks from your data sets. Final point, because we're running out of time, dataset imbalance. Obviously, we're all talking in English. The world is very English-centric, but there's other languages in the world. And we already talked about the tokenization issues, which will cost more, right?

Starting point is 00:58:22 Because all these APIs are charged based on the number of tokens generated. And if your tokenizer uses more tokens per language, then it will cost more to generate in those languages. Some amount of that is honestly not the fault of anybody because the language is just more complex, right? As a Chinese speaker, I'm well aware that the average Chinese person is required to learn 10,000 Chinese words to be considered literate, which is absurd. Right. But obviously, China is making a lot of progress on language modeling as well. So there's actually a lot of papers coming out of China for Chinese datasets and English

Starting point is 00:58:58 to Chinese conversion datasets, right, which is a lot of the original translation benchmarks. So there's some Chinese data sets that we've outlined here, the CMRC, DuReader and ChID, all of which we'll link in the show notes. Is there anything else that you wanted to comment on in particular? No, I think that's a lot of it. Actually, one of my friends and former co-founder, Andrea, he's working on an Italian language model. So I'm curious to see more of them come online. And I think like in the episode we're going to release with the Practical AI crossover one,

Starting point is 00:59:29 we talked about how language is also used differently in different countries. So some are very oral driven. So like a language model that is only text is not as important, so the voice data is also crucial. Well, again, Datasets 201. We'll get back to it. But I think we're already at one hour 10. So I think we covered a lot.

Starting point is 00:59:50 Yeah, yeah. Hopefully that was a good overview, especially for people who like keep hearing about things like Common Crawl, keep hearing things about contamination and keep hearing things about tokens even. And this is a ground-up reintroduction to these concepts. We are recording these 101 episodes in order to be evergreen, right? That you can listen to this a year from now, and hopefully it'll still not be out of date.

Starting point is 01:00:12 Who knows? Hopefully. We can't keep going at this pace. But I really want to emphasize, yeah, datasets are great. Let's spend more time applauding dataset creators because we're downstream of them. Yeah, if you want to train on the Latent Space podcast,

Starting point is 01:00:26 please go for it. We got all the transcriptions and the show notes. So we're doing our part. We're doing our part. All right, everyone. Thanks for listening. All right. Bye-bye.
