Tech Brew Ride Home - Thu. 07/25 – Reddit’s Throwing Elbows Again

Starting point is 00:00:00 On April 4th, 2023, around 2 in the morning, a man was found stabbed multiple times on a sidewalk in downtown San Francisco. Hey, who did this to you? What happened next turned the story into a political firestorm. Reports have identified the victim as Bob Lee, the founder of Cash App. From Bloomberg Podcasts, this is Foundering, the Killing of Bob Lee, beginning April 16. Welcome to the Techmeme Ride Home for Thursday, July 25th, 2024. I'm Brian McCullough. Today, once again, Reddit looks like it's not worried about upsetting people. New generative search on Bing, new models from Mistral, and a new video model from Stability, but did Runway train its video models on YouTube videos? We might have a smoking gun there. But what if the dream of synthetic data for AI training is just a mirage? Here's what you missed today in the world of tech. Reddit appears to be blocking search engines that don't rely on Google's indexing. Bing, DuckDuckGo, and others are not showing recent results from Reddit. Quoting 404 Media. Google is now the only search engine that can surface results from Reddit,

Starting point is 00:01:18 making one of the web's most valuable repositories of user-generated content exclusive to the Internet's already dominant search engine. If you use Bing, DuckDuckGo, Mojeek, Qwant, or any other alternative search engine that doesn't rely on Google's indexing, and search Reddit by using site:reddit.com, you will not see any results from the last week. DuckDuckGo is currently turning up seven links when searching Reddit, but provides no data on where the links go or why, instead only saying that, quote, we would like to show you a description here, but the site won't allow us. Older results will still show up, but these search engines are no longer able to crawl Reddit, meaning that Google is the only search engine that will turn up results from Reddit going forward.

Starting point is 00:02:00 Searching for Reddit still works on Kagi, an independent paid search engine that buys part of its search index from Google. The news shows how Google's near monopoly on search is now actively hindering other companies' ability to compete at a time when Google is facing increasing criticism over the quality of its search results. And while neither Reddit nor Google responded to a request for comment, it appears that the exclusion of other search engines is the result of a multi-million dollar deal that gives Google the right to scrape Reddit for data to train its AI products. "They're killing everything for search but Google," Colin Hayhurst, CEO of the search engine Mojeek, told me on a call. Robots.txt files are just instructions, which crawlers can and have ignored. But according to Hayhurst, Reddit is also actively blocking its crawler.

Starting point is 00:02:45 Reddit has been upset about AI companies scraping the site to train large language models and has taken public and aggressive steps to stop them from continuing to do so. Last year, Reddit broke a lot of third-party apps beloved by the Reddit community when it started charging for access to its API, making many of those third-party apps too expensive to operate. Earlier this year, Reddit announced that it signed a $60 million deal with Google, allowing it to license Reddit content to train its AI products. Reddit's robots.txt used to include a bunch of jokes, like forbidding the robot Bender from Futurama from scraping it, and specific pages that search engines were and were not allowed to access,

Starting point is 00:03:21 like .rss was allowed, while /login was not allowed. Today, Reddit's robots.txt is much simpler and more strict. In addition to a few links to Reddit's new public content policies, the file simply includes instructions which basically mean no user agent or bot should scrape any part of the site. Reddit appears to have updated its robots.txt file around June 25th after Mojeek's Hayhurst noticed its crawler was getting blocked. That announcement said that, quote, good faith actors like researchers and organizations such as the Internet Archive will continue to have access to Reddit content for non-commercial use, and that we are selective about who we work with and trust with large-scale access to Reddit content, end quote.
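For readers following along in the transcript, here is a minimal sketch of how a well-behaved crawler would interpret the two styles of robots.txt described above. It uses Python's standard urllib.robotparser module. The two sample files, the "ExampleBot" user agent, and the is_allowed helper are illustrative assumptions, not Reddit's actual file contents, and, as Hayhurst notes, robots.txt is only advisory, so this shows what a compliant crawler would conclude rather than how any blocking is enforced.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical "older style" file: most paths open, a few specific ones closed.
# This is an illustration, not a copy of Reddit's historical robots.txt.
OLD_STYLE = """\
User-agent: *
Disallow: /login
"""

# Hypothetical "strict" file: every user agent is told to stay away from every path.
STRICT_STYLE = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.reddit.com/r/technology/"
    login = "https://www.reddit.com/login"
    for name, rules in (("old", OLD_STYLE), ("strict", STRICT_STYLE)):
        print(name, "page:", is_allowed(rules, "ExampleBot", page))
        print(name, "login:", is_allowed(rules, "ExampleBot", login))
```

Under the strict file, every path is disallowed for every user agent, which matches the description of instructions that basically mean no bot should scrape any part of the site.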

Starting point is 00:04:01 It also links to a guide on accessing Reddit data, which plainly states Reddit considers search or website ads as a commercial purpose, and that no one can use Reddit data without permission or paying a fee, end quote. Microsoft, by the way, confirmed to Barry Schwartz at Search Engine Land that Bing stopped crawling Reddit after Reddit updated its robots.txt file on July 1st. Meanwhile, Microsoft unveiled Bing generative search, which shows AI-generated answers with the sources used to create them, currently available to a small subset of users, quoting Windows Central. At the very top of the page will be an AI-generated answer created by large and small language models that have reviewed millions of sources to provide the most accurate answer.

Starting point is 00:04:47 It will break down that answer into a document index that can provide more information about particular subjects within that search query if you'd like to learn more. The search engine will also list the sources that the AI-generated text was created from below the answer and will even present traditional search results in a sidebar on the right for those who are uninterested in Bing's curated AI experience. Microsoft says it continues to evaluate the impact that AI in search is having on websites in terms of direct traffic and readership. There's a growing concern in the industry that websites that create content for free will eventually go out of business if AI bots scrape that content to present it directly in a chat window or a search page. This new AI search experience has been built from the ground up with this concern in mind.

Starting point is 00:05:28 Microsoft claims this new experience maintains the same number of clicks to websites that traditional search does, but time will tell if that's true, end quote. Which means it must be time to do a whip around to discuss the newest models folks have released. Mistral has announced Mistral Large 2, the new generation of its flagship model, with 123 billion parameters, quoting VentureBeat. However, in an important caveat, the model is only licensed as open for non-commercial research uses, including open weights, allowing third parties to fine-tune it to their liking. Those seeking to use it for commercial/enterprise-grade applications will need to obtain a separate license and usage agreement from Mistral, as the company states in its blog post and an X post from research scientist

Starting point is 00:06:17 Devendra Singh Chaplot. While having a lower number of parameters, or internal model settings that guide its performance, than Llama 3.1's 405 billion, it still nears the former's performance. Available on the company's main platform and via cloud partners, Mistral Large 2 builds on the original Large model and brings advanced multilingual capabilities with improved performance across reasoning, code generation, and mathematics. It is being hailed as a GPT-4 class model with performance closely matching GPT-4o, Llama 3.1 405B, and Anthropic's Claude 3.5 Sonnet across several benchmarks, end quote. And Stability AI unveiled Stable Video 4D, a model based on its Stable Video Diffusion model that takes video input and generates videos from eight new perspectives.

Starting point is 00:07:03 Also quoting VentureBeat. While there is a growing set of gen AI tools for video, including OpenAI's Sora, Runway, Haiper, and Luma AI, among others, Stable Video 4D is something a bit different. Stable Video 4D builds on the foundation of Stability AI's existing Stable Video Diffusion model, which converts images into videos. The new model takes this concept further by accepting video input and generating multiple novel view videos from eight different perspectives. "We see Stable Video 4D being used in movie production, gaming, AR, VR, and other use cases where there is a need to view dynamically moving 3D objects from arbitrary camera angles,"

Starting point is 00:07:41 Varun Jampani, team lead, 3D research at Stability AI, told VentureBeat. Jampani noted that Stable Video 4D is a first-of-its-kind network where a single network does both novel view synthesis and video generation. Existing works leverage separate video generation and novel view synthesis networks for this task. He also explained that Stable Video 4D is different from Stable Video Diffusion and Stable Video 3D in terms of how the attention mechanisms work. "We carefully design attention mechanisms in the diffusion network which allow generation of each video frame to attend to its neighbors at different camera views or timestamps, thus resulting in better 3D coherence and temporal smoothness in the output videos,"

Starting point is 00:08:21 Jampani said, end quote. Shot and chaser: 404 Media has a source and has seen an internal document that they say reveals AI startup Runway scraped thousands of videos from YouTube creators and brands, including Disney and Vice News, to train its Gen 3 AI video generation tool. Quote, the model, initially codenamed Jupiter and released officially as Gen 3, drew widespread praise from the AI development community and technology outlets covering its launch when Runway released it in June. Last year, Runway raised $141 million from investors, including Google and Nvidia, at a $1.5 billion valuation. When TechCrunch asked Runway co-founder Anastasis

Starting point is 00:09:09 Germanidis in June where the training data for Gen 3 came from, he would not offer specifics. "We have an in-house research team that oversees all our training, and we use curated internal data sets to train our models," Germanidis told TechCrunch. The spreadsheet of training data viewed by 404 Media and our testing of the model indicates that part of its training data is popular content from the YouTube channels of thousands of media and entertainment companies, including The New Yorker, Vice News, Pixar, Disney, Netflix, Sony, and many others. It also includes links to channels and individual videos belonging to popular influencers and content creators including Casey Neistat, Sam Kolder, Benjamin Hardman, Marques Brownlee,

Starting point is 00:09:48 and numerous others. While 404 Media couldn't confirm that every single video included in the spreadsheet was used to train Gen 3 (it's possible that some content was filtered out later or that not every single link on the spreadsheet was scraped), the training data reveals specifics about the generative AI industry, which has been repeatedly accused of training models on copyrighted material. Runway did not respond to multiple requests for comment via email, LinkedIn, and its official Discord channel. When reached for comment, Google, which operates YouTube and is a Runway investor, pointed us to a Bloomberg story from April in which the company told the publication that OpenAI training its AI video generator, Sora, with YouTube videos would violate YouTube's rules. "Our previous comments on this still stand," a Google spokesperson told 404 Media in an email when asked about Runway scraping YouTube videos.

Starting point is 00:10:36 "There was a company-wide effort to compile videos into spreadsheets to serve as training data," a former Runway employee told 404 Media. After the list of videos was compiled, Runway scraped the videos using open source software, specifically youtube-dl, which has a proxy configuration option. Runway purchased proxies from a provider, the source said, which gives customers an IP address to route download requests through in order to not get blocked by YouTube. 404 Media granted the source in this article anonymity because they feared professional retribution. "The channels in that spreadsheet were a company-wide effort to find good quality videos to build the model with," the former employee said. "This was then used as input to a massive web crawler

Starting point is 00:11:15 which downloaded all the videos from all those channels using proxies to avoid getting blocked by Google," end quote. The document contains 14 spreadsheets, each labeled with different categories. One of the spreadsheets contains what appears to be a list of 117 terms like Beach, Doctor, and Rain, and the names of Runway employees next to each of those terms. The former employee told 404 Media that these names were either people tasked by others to find videos related to the keywords or the employees themselves, noting that they were working on that keyword. Next to the term rainbow and the employee name, someone wrote a note that said, quote, no channels or playlist dedicated to it, but found good individual videos for fine-tuning, end quote.

Starting point is 00:11:52 Notes in the documents show that the company was trying to obtain videos that had a specific type of subject matter and camera work, with a diverse set of people in them. The High Camera Movement sheet contains 177 links to YouTube channels, including the official Call of Duty channel, filmmaker Josh Newman's channel, Unreal Engine, and Vans channels. A spreadsheet titled Cinematic Masterpieces contains 206 links to individual channels and videos of especially high quality, including animated shorts and student films. On that sheet, a note next to a link to the Defi Studio YouTube channel says, the Holy Grail of Car Cinematics so far. A sheet titled Single Great Videos for Fine-Tuning is a stockpile of another

Starting point is 00:12:30 253 videos along with a column for topics like waxing eyebrows, ice sculpting, smiling, and screaming. The non-YouTube source sheet also contains a link to an archive of Studio Ghibli films, several anime piracy sites, and a fan site for Xbox game clips, as well as a now offline movie piracy site called AZI Movies that has a note with it from someone at Runway, quote, of stuff in here, end quote. And finally, pair that with this. Researchers suggest that using synthetic data created by AI systems to train other AI systems could lead to the rapid degradation of AI models and a collapse over time, quoting the FT. The use of computer-generated data to train artificial intelligence models risks causing them to produce nonsensical results,

Starting point is 00:13:24 according to new research that highlights looming challenges to the emerging technology. Leading AI companies, including OpenAI and Microsoft, have tested the use of so-called synthetic data, information created by AI systems to then also train large language models, as they reach the limits of human-made material that can improve the cutting-edge technology. Research published in Nature on Wednesday suggests the use of such data could lead to the rapid degradation of AI models. One trial using synthetic input text about medieval architecture descended into a discussion of jackrabbits after fewer than 10 generations of output. The work underlines why AI developers have hurried to buy troves of human-generated data for training

Starting point is 00:14:02 and raises questions of what will happen once those finite sources are exhausted. "Synthetic data is amazing if we manage to make it work," said Ilia Shumailov, lead author of the research, "but what we are saying is that our current synthetic data is probably erroneous in some ways. The most surprising thing is how quickly this stuff happens," end quote. The paper explores the tendency of AI models to collapse over time because of the inevitable accumulation and amplification of mistakes from successive generations of training. The speed of the deterioration is related to the severity of shortcomings in the design of the model, the learning process, and the quality of data used. The early stages of collapse

Starting point is 00:14:39 typically involve a, quote, loss of variance, which means majority subpopulations in the data become progressively overrepresented at the expense of minority groups. In late stage collapse, all parts of the data may descend into gibberish, end quote. So given the Llama news this week, interesting to see this interaction. Sam Lessin threaded the original Techmeme link of this story to say, quote, the whole story of using AI to generate training data to keep training AI has always been a head scratcher to me. It never made sense to me why that would work versus just drift rapidly into nonsense. But for a long time, lots of people seem to believe it. I am glad to see that the research is coming back in line with what intuitive expectations would be, end quote. To which Mark Zuckerberg

Starting point is 00:15:24 himself responded, quote, distilling models into smaller models that are almost as smart, but only a fraction as expensive to run, clearly works, and is a lot of what I expect people to use Llama 3.1 405B to do. There's also evidence that you can further train these smaller models to surpass the intelligence of the teacher model, although that's not necessarily using synthetic data from the teacher, end quote. Also, there's this Crunchbase article that says in the first half of this year, generative AI startups raised $500 million across 198 angel or seed deals. Ride Home AI Fund was in about a dozen of those by my count.

Starting point is 00:16:12 And add one more because we made our final bet just yesterday. Closing the investment tomorrow, the AI Fund has officially deployed all of its capital in a little over a year, quick and dirty, just like Chris and I said we'd do. Talk to you tomorrow.
