Tech Brew Ride Home - Wed. 07/19 – More Proof The Big Folks Are Winning AI (For Now)

Starting point is 00:00:00 On April 4th, 2023, around 2 in the morning, a man was found stabbed multiple times on a sidewalk in downtown San Francisco. Hey, who did this to you? What happened next turned the story into a political firestorm. Reports have identified the victim as Bob Lee, the founder of Cash App. From Bloomberg Podcasts, this is Foundering, the Killing of Bob Lee, beginning April 16. Welcome to the Techmeme Ride Home for Wednesday, July 19th, 2023. I'm Brian McCullough. Today, Microsoft announced aggressive pricing for their AI products and got rewarded with an all-time high stock market valuation. More price aggression in the streaming wars. Meta releases Llama 2. Google is asking for employees to

Starting point is 00:00:51 voluntarily work on air-gapped machines, and let me introduce you to the concept of synthetic data for LLMs. Here's what you missed today in the world of tech. Let me squeeze this headline in here real quick. Microsoft and Activision Blizzard have agreed to extend their merger agreement to October 18th, pending the outcome of negotiations with UK regulators. Quoting The Verge, Microsoft Vice Chair and President Brad Smith says the three-month extension is designed, quote, to provide ample time to work through the final regulatory issues, end quote. Both Microsoft and Activision Blizzard have also agreed to a higher termination fee and new commercial arrangements for the transaction. A termination fee payable if Microsoft or Activision walks away from the deal is now set at $3.5 billion

Starting point is 00:01:40 if the deal doesn't close by August 29th, and it jumps to $4.5 billion if September 15th passes without a finalization. Activision has also agreed to potentially, quote, hold separate the company or certain assets of the company or to implement other lawful alternatives to consummate the merger with UK regulators. This is a key part, as it may allow the merger to go ahead with restrictions from the UK's Competition and Markets Authority, end quote. Speaking of Microsoft, Microsoft's stock closed up 4% yesterday at $359.49 a share, which represents an all-time, all-time, all-time high. Microsoft's stock is up actually around 50% just since the beginning of the year. But why did it rise so much just yesterday?

Starting point is 00:02:33 Probably because the company announced Copilot for Microsoft 365 will cost $30 per user per month for business accounts, and they also debuted Bing Chat Enterprise, offering improved privacy. So what was it I was saying yesterday about how so far all of the value of the AI moment seems to be accruing to the big incumbents? Basically, Microsoft signaled that it thinks its new AI tools are so useful as add-ons to their productivity suites that folks will be willing to pay $360 a year for them. And Wall Street believes folks might, in fact, actually pay that. Quoting Engadget. Revealed in March, Microsoft 365 Copilot is the company's vision of the future of work. The GPT-4-powered suite of tools lets you generate Office content using natural language

Starting point is 00:03:21 text prompts. For example, you can ask PowerPoint to create a presentation based on a Word document, generate a proposal from spreadsheet data or summarize emails and draft responses in Outlook, all by typing simple commands. By grounding answers in business data like your documents, emails, calendar, chats, meetings, and contacts, and combining them with your working context, the meeting you're in now, the emails you've exchanged on a topic, the chats you had last week, Copilot delivers richer, more relevant, and more actionable responses to your questions, Frank X. Shaw, Microsoft's chief communications officer, wrote in an announcement today. Microsoft began testing Copilot with a small group of select enterprise partners earlier this year,

Starting point is 00:04:03 but hasn't yet announced when all business customers will gain access. However, announcing its pricing could mean that that date is fast approaching. The $30 per month pricing will apply to Microsoft 365 E3, E5, Business Standard, and Business Premium customers. The company still hasn't announced Copilot consumer pricing or availability. Meanwhile, Bing Chat Enterprise is Microsoft's more security-minded variant of the popular AI chatbot that launched for consumers in February. Since launching the new Bing in February, we've heard from many corporate customers who are excited to empower their organizations with powerful new AI tools, but are concerned that their company's data will not be protected, Shaw wrote.

Starting point is 00:04:44 That's why today we are announcing Bing Chat Enterprise, which gives organizations AI-powered chat for work with commercial data protection. What goes in and comes out remains protected, giving commercial customers managed access to better answers, greater efficiency, and new ways to be creative, end quote. And quoting CNBC. Microsoft's Copilot subscription service adds AI to the company's popular Office products such as Word, Excel, and Teams. It will cost an additional $30 per month and could increase

Starting point is 00:05:11 monthly prices for enterprise customers as much as 83%, bringing in additional revenue through recurring subscriptions. The announcement shows how Microsoft is continuing to build on its suite of Office software, making it more attractive for businesses that are seeking to add AI into their workflows. Microsoft has been pouring money into generative AI largely through a multibillion-dollar investment in OpenAI, the creator of ChatGPT, end quote. And I want to hit this up again since we've been discussing it this week. Streamers raising prices, because even though this isn't a price raise, it is an aggressive attempt to make more money by another name.

Starting point is 00:05:54 Netflix has removed its $10 basic tier, taking it away. This was the most affordable ad-free tier for U.S. and UK customers. It's going away for new and rejoining members, which signals again that Netflix thinks that cost-conscious customers should just do the ad tier instead. Quoting Cord Busters. With the basic tier off the table for new subscribers, those who wish to watch Netflix without adverts will now have to pay at least £10.99 a month in the UK

Starting point is 00:06:24 or $15.49 a month in the U.S. Audiences in the UK, U.S. and likely other regions soon are now facing a new viewing norm: enduring occasional ad breaks during their Netflix marathons if they desire an affordable streaming experience. End quote. So how many times have we said on this show over the years, we're just blowing up the cable bundle only to reconstitute it? Not only are we headed back to a world of paying $130 a month for TV, but also, it looks like it'll increasingly be paying $130 a month for TV with ads again. The whole last 20 years is just going to end up feeling like a hallucination.

Starting point is 00:07:09 Meta yesterday released Llama 2, its open source large language model with double the context length, for free for research and commercial use. It's that commercial use that is key, but also open source that is key. Quoting Ars Technica, on Tuesday, Meta announced Llama 2, a new open source family of AI language models, notable for its commercial license, which means the models can be integrated into commercial products, unlike its predecessor. They range in size from 7 to 70 billion parameters, and reportedly,

Starting point is 00:07:42 quote, outperform open source chat models on most benchmarks we tested, according to Meta. According to Meta, its Llama 2 pre-trained models, the bare-bones models, are trained on 2 trillion tokens and have a context window of 4,096 tokens, fragments of words. The context window determines the length of the content the model can process at once. Meta also says that the Llama 2 fine-tuned models developed for chat applications similar to ChatGPT have been trained on over 1 million human annotations. While it can't match OpenAI's GPT-4 in performance, Llama 2 apparently fares well for an open source model. According to Jim Fan, a senior AI scientist at NVIDIA, quote, 70B is close to GPT-3.5 on reasoning tasks, but there is a significant gap on

Starting point is 00:08:30 coding benchmarks. It's on par or better than PaLM 540B on most benchmarks, but still far behind GPT-4 and PaLM 2-L. More details on Llama 2's performance benchmarks and construction can be found in a research paper released by Meta on Tuesday. Although Llama 2 is open source, Meta did not disclose the source of the training data used in creating the Llama 2 models, which Mozilla Senior Fellow of Trustworthy AI Abeba Birhane pointed out on Twitter. Lack of training data transparency is still a sticking point for some LLM critics because the training data that teaches these LLMs what they know often comes from an unauthorized scrape of the Internet with little regard for privacy

Starting point is 00:09:09 or commercial impact. Meta says it, quote, made an effort to remove data from certain sites known to contain a high volume of personal information about private individuals in the Llama 2 research paper, but it did not list what those sites are. Currently, anyone can request access to download Llama 2 by filling out a form on Meta's website. Ars Technica submitted a request for the download and received a download link about an hour later, suggesting that the list may be manually screened, end quote. This is wild, but also it makes sense. Sources are telling CNBC that Google is internally piloting an opt-in program where some Google employees will be restricted to internet-free PCs to reduce the risk of cyber attacks.

Starting point is 00:09:59 Quote, the company originally selected more than 2,500 employees to participate, but after receiving feedback, the company revised the pilot to allow employees to opt out, as well as opening it up to volunteers. The company will disable internet access on the select desktops, with the exception of internal web-based tools and Google-owned websites like Google Drive and Gmail. Some workers who need the internet to do their job will get exceptions, the company stated in materials. In addition, some employees will have no root access, meaning they won't be able to run administrative commands or do things like install software. Google is running the program to reduce the risk of cyber attacks, according to internal materials. Googlers are frequent targets of

Starting point is 00:10:37 attacks, one internal description viewed by CNBC stated. If a Google employee's device is compromised, the attackers may have access to user data and infrastructure code, which could result in a major incident and undermine user trust, the description added. Turning off most internet access ensures attackers cannot easily run arbitrary code remotely or grab data, the description explained. The program comes as companies face increasingly sophisticated cyber attacks. Last week, Microsoft said Chinese intelligence hacked into company email accounts belonging to two dozen government agencies, including the State Department, in the U.S. and Western Europe in a, quote, significant breach. Google has been pursuing U.S. government contracts since launching a

Starting point is 00:11:16 public sector division last year, end quote. Finally today, back to AI and back to a sign of how quickly this stuff is moving. It's moving so fast that the internet now is no longer good enough as source material for the AI. Microsoft, OpenAI, Cohere, and others are apparently testing the use of synthetic data as they are increasingly finding generic data from the web is no longer good enough for training LLMs. Although, I don't know, was it ever? Or was this just the easiest stuff to get at the beginning? Quoting the Financial Times.

Starting point is 00:11:58 Currently, LLMs that power chatbots such as OpenAI's ChatGPT and Google's Bard are trained primarily by scraping the Internet. Data used to train these systems includes digitized books, news articles, blogs, search queries, Twitter and Reddit posts, YouTube videos, and Flickr images, among other content. Humans are then used to provide feedback and fill gaps in the information in a process known as reinforcement learning by human feedback, RLHF. But as generative AI software becomes more sophisticated, even deep-pocketed AI companies are running out of easily accessible and high-quality data to train on.

Starting point is 00:12:32 Meanwhile, they are under fire from regulators, artists, and media organizations around the world over the volume and provenance of personal data consumed by the technology. At an event in London in May, OpenAI's chief executive Sam Altman was asked whether he was worried about regulatory probes into ChatGPT's potential privacy violations. Altman brushed it off, saying he was, quote, pretty confident that soon all data will be synthetic data, end quote. Generic data from the web is no longer good enough to push the performance of AI models, according to developers. If you could get all the data that you needed off the web, that would be fantastic, said Aidan Gomez, chief executive of $2 billion LLM startup Cohere. In reality, the web is so noisy and messy that it's not really representative of the data

Starting point is 00:13:13 that you want. The web just doesn't do everything we need, end quote. To dramatically improve their performance and be able to address challenges in science, medicine, or business, AI models will require unique and sophisticated data sets. These will either have to be created by world experts, such as scientists, doctors, authors, actors, or engineers, or acquired as proprietary data from large corporations such as pharmaceuticals, banks, and retailers. However, human-created data is extremely expensive, Gomez said. The new trend of using synthetic data sidesteps this costly requirement. Instead, companies can use AI models to produce text, code, or more complex information related to health care or financial fraud. This synthetic data is then used to train advanced

Starting point is 00:13:59 LLMs to become ever more capable. According to Gomez, Cohere, as well as several of its competitors, already use synthetic data, which is then fine-tuned and tweaked by humans. Synthetic data is already huge, even if it's not broadcast widely, he said. For example, to train a model on advanced mathematics, Cohere might use two AI models talking to each other, where one acts as the math tutor and the other is the student. They're having a conversation about trigonometry, and it's all synthetic, Gomez said. It's all just imagined by the model. And then the human looks at this conversation and goes in and corrects it if the model said something wrong. That's the status quo today, end quote. Two recent studies from Microsoft Research showed that synthetic data could be used to train models that were smaller and simpler than state-of-the-art software, such as OpenAI's GPT-4 or Google's PaLM 2.

Starting point is 00:14:48 One paper described a synthetic data set of short stories generated by GPT-4, which only contained words that a typical four-year-old might understand. This data set, known as TinyStories, was then used to train a simple LLM that was able to produce fluent and grammatically correct stories. The other paper showed that AI could be trained on synthetic Python code in the form of textbooks and exercises, which they found performed relatively well on coding tasks. Startups such as Scale AI and Gretel AI have sprung up to provide synthetic data as a service. Gretel, set up by former U.S. intelligence analysts from the National Security Agency and the CIA, works with companies including Google, HSBC, Riot Games, and Illumina to augment their existing data with synthetic versions that can help train better AI models. The key component of synthetic data, according to Gretel chief executive Ali Golshan, is that it preserves the privacy of all individuals in a data set while still maintaining its statistical integrity. Well-crafted synthetic data can also remove biases and imbalances in existing data, he added.

Starting point is 00:15:49 Hedge funds can look at black swan events and, say, create 100 variations to see if our models crack, Golshan said. For banks where fraud typically constitutes less than a hundredth of a percent of total data, Gretel's software can generate, quote, thousands of edge case scenarios on fraud and train AI models with it, end quote. Critics point out that not all synthetic data will be carefully curated or reflect or improve on real-world data. As AI-generated text and images start to fill the Internet, it is likely that AI companies crawling the web for training data will inevitably end up using raw data produced by primitive versions of their own models, a phenomenon known as dogfooding. Research from universities including Oxford and Cambridge recently warned that training AI models on their own raw outputs, which may contain falsehoods or fabrications, could corrupt and degrade the technology over time, causing irreversible defects, end quote.

Starting point is 00:16:50 So usually when I ask you, the hive mind, for help on something, I get about a dozen responses, maybe 30, which is always great. The whole reason I found the developer who is working with me on this AI side project is because when I asked for recommendations, you all delivered in a big way. So when I asked yesterday for some alpha test volunteers, I expected I'd get, you know, the usual 30 or so responses. Instead, I got over 200, which is great. Much appreciated. But there's no way I can ask all of you to do this testing. There's no way I need that many testers. Really, only about a dozen should do it. Tonight is the crunch meeting with the developer to see if we're ready to deploy. So a dozen or so of you might get an email from me either tomorrow or by Saturday. If you don't get an email about

Starting point is 00:17:40 helping me test my AI experiment, don't take it personally. I just don't need quite that much help from that many people. But I still want you to know that even if you don't hear from me, I am deeply, deeply appreciative that so many of you were willing to help. Talk to you tomorrow.
