Tech Brew Ride Home - Wed. 07/19 – More Proof The Big Folks Are Winning AI (For Now)
Episode Date: July 19, 2023
Microsoft announced aggressive pricing for their AI products and got rewarded with an all-time high stock market valuation. More price aggression in the streaming wars. Meta released Llama 2. Google i...s asking for employees to voluntarily work on air gapped machines. And let me introduce you to the concept of “synthetic data” for LLMs.
Links:
Microsoft and Activision Blizzard extend merger agreement to October (The Verge)
Microsoft will charge businesses $30 per user for its 365 AI Copilot (Engadget)
Microsoft closes at record after revealing pricing for new A.I. subscription (CNBC)
Netflix Shakes Up Pricing: ‘Basic’ Tier Axed in UK, US (Cord Busters)
Meta launches Llama 2, an open source AI model that allows commercial applications (Ars Technica)
Google restricting internet access to some employees to reduce cyberattack risk (CNBC)
Why computer-made data is being used to train AI models (FT)
Transcript
Welcome to the Techmeme Ride Home for Wednesday, July 19th, 2023. I'm Brian McCullough. Today, Microsoft announced
aggressive pricing for their AI products and got rewarded with an all-time high stock market valuation.
More price aggression in the streaming wars. Meta releases Llama 2. Google is asking for employees to
voluntarily work on air-gapped machines, and let me introduce you to the concept of synthetic data for
LLMs. Here's what you missed today in the world of tech. Let me squeeze this headline in here real quick.
Microsoft and Activision Blizzard have agreed to extend their merger agreement to October 18th, pending
the outcome of negotiations with UK regulators. Quoting The Verge, Microsoft Vice Chair and President
Brad Smith says the three-month extension is designed, quote, to provide ample time to work
through the final regulatory issues, end quote. Both Microsoft and Activision Blizzard have also agreed
to a higher termination fee and new commercial arrangements for the transaction. A termination
fee payable if Microsoft or Activision walks away from the deal is now set at $3.5 billion
if the deal doesn't close by August 29th, and it jumps to $4.5 billion if September 15th passes
without a finalization. Activision has also agreed to potentially, quote, hold separate the company
or certain assets of the company or to implement other lawful alternatives to consummate the merger
with UK regulators. This is a key part, as it may allow the merger to go ahead with restrictions from
the UK's Competition and Markets Authority, end quote.
Speaking of Microsoft, Microsoft's stock closed up 4% yesterday at $359.49 a share,
which represents an all-time, all-time, all-time high. Microsoft's stock is up actually around
50% just since the beginning of the year. But why did it rise so much just yesterday?
Probably because the company announced Copilot for Microsoft 365 will cost
$30 per user per month for business accounts, and they also debuted Bing Chat Enterprise, offering improved
privacy. So what was it I was saying yesterday about how so far all of the value of the AI
moment seems to be accruing to the big incumbents? Basically, Microsoft signaled that it thinks
its new AI tools are so useful as add-ons to their productivity suites that folks will
be willing to pay $360 a year for them. And Wall Street believes folks might, in fact,
actually pay that. Quoting Engadget. Revealed in March, Microsoft 365 Copilot is the company's vision of
the future of work. The GPT-4-powered suite of tools lets you generate Office content using natural language
text prompts. For example, you can ask PowerPoint to create a presentation based on a Word document,
generate a proposal from spreadsheet data or summarize emails and draft responses in Outlook,
all by typing simple commands. By grounding answers in business data like your documents, emails,
calendar, chats, meetings, and contacts, and combining them with your working context, the meeting you're
in now, the emails you've exchanged on a topic, the chats you had last week,
Copilot delivers richer, more relevant, and more actionable responses to your questions,
Frank X. Shaw, Microsoft's chief communications officer, wrote in an announcement today.
Microsoft began testing Copilot with a small group of select enterprise partners earlier this year,
but hasn't yet announced when all business customers will gain access.
However, announcing its pricing could mean that that date is fast approaching. The $30 per month pricing
will apply to Microsoft 365 E3, E5, Business Standard, and Business Premium customers. The company still
hasn't announced Copilot consumer pricing or availability. Meanwhile, Bing Chat Enterprise
is Microsoft's more security-minded variant of the popular AI chatbot that launched for consumers
in February. Since launching the new Bing in February, we've heard from many corporate customers
who are excited to empower their organizations with powerful new AI tools,
but are concerned that their company's data will not be protected, Shaw wrote.
That's why today we are announcing Bing Chat Enterprise,
which gives organizations AI-powered chat for work with commercial data protection.
What goes in and comes out remains protected,
giving commercial customers managed access to better answers,
greater efficiency, and new ways to be creative, end quote.
And quoting CNBC.
Microsoft's Copilot subscription service adds AI to the company's popular Office
products such as Word, Excel, and Teams. It will cost an additional $30 per month and could increase
monthly prices for enterprise customers as much as 83%, bringing in additional revenue through
recurring subscriptions. The announcement shows how Microsoft is continuing to build on its suite of office
software, making it more attractive for businesses that are seeking to add AI into their
workflows. Microsoft has been pouring money into generative AI largely through a multi-billion
dollar investment in OpenAI, the creator of ChatGPT, end quote.
And I want to hit this up again since we've been discussing it this week:
streamers raising prices. Because even though this isn't a price raise,
it is an aggressive attempt to make more money by another name.
Netflix has removed its $10 basic tier, taking it away.
This was the most affordable ad-free tier for U.S. and UK customers.
It's going away for new and rejoining members,
which signals again that Netflix thinks that
cost-conscious customers should just do the ad tier instead.
Quoting Cord Busters.
With the basic tier off the table for new subscribers,
those who wish to watch Netflix without adverts will now have to pay at least £10.99 a month in the UK
or $15.49 a month in the U.S.
Audiences in the UK, US, and likely other regions soon are now facing a new viewing norm,
enduring occasional ad breaks during their Netflix marathons
if they desire an affordable streaming experience.
End quote. So how many times have we said on this show over the years, we're just blowing up the
cable bundle only to reconstitute it. Not only are we headed back to a world of paying $130 a month
for TV, but also, it looks like it'll increasingly be paying $130 a month for TV with ads again.
The whole last 20 years is just going to end up feeling like a hallucination.
Meta yesterday released Llama 2,
its open source large language model with double the context length, free for research and
commercial use.
It's that commercial use that is key, but also open source that is key.
Quoting Ars Technica, on Tuesday, Meta announced Llama 2, a new open source family of AI language
models, notable for its commercial license, which means the models can be integrated into
commercial products, unlike its predecessor.
They range in size from 7 to 70 billion parameters, and reportedly,
quote, outperform open source chat models on most benchmarks we tested, according to Meta.
According to Meta, its Llama 2 pre-trained models, the bare-bones models, are trained on 2 trillion
tokens and have a context window of 4,096 tokens, fragments of words.
The context window determines the length of the content the model can process at once.
Meta also says that the Llama 2 fine-tuned models developed for chat applications similar to ChatGPT
have been trained on over 1 million human annotations. While it can't match OpenAI's GPT-4 in performance,
Llama 2 apparently fares well for an open source model. According to Jim Fan, a senior AI scientist at
NVIDIA, quote, 70B is close to GPT-3.5 on reasoning tasks, but there is a significant gap on
coding benchmarks. It's on par or better than PaLM 540B on most benchmarks, but still far behind
GPT-4 and PaLM 2-L. More detail
on Llama 2's performance benchmarks and construction can be found in a research paper released by
Meta on Tuesday. Although Llama 2 is open source, Meta did not disclose the source of the training
data used in creating the Llama 2 models, which Mozilla Senior Fellow of Trustworthy AI
Abeba Birhane pointed out on Twitter. Lack of training data transparency is still a sticking point
for some LLM critics because the training data that teaches these LLMs what they know often
comes from an unauthorized scrape of the Internet with little regard for privacy
or commercial impact. Meta says it, quote, made an effort to remove data from certain sites known to
contain a high volume of personal information about private individuals in the Llama 2 research paper,
but it did not list what those sites are. Currently, anyone can request access to download
Llama 2 by filling out a form on Meta's website. Ars Technica submitted a request for the download
and received a download link about an hour later, suggesting that the list may be manually screened,
end quote. This is wild, but also it makes sense.
Sources are telling CNBC that Google is internally piloting an opt-in program where some Google
employees will be restricted to internet-free PCs to reduce the risk of cyber attacks.
Quote, the company originally selected more than 2,500 employees to participate, but after
receiving feedback, the company revised the pilot to allow employees to opt out, as well as
opening it up to volunteers. The company will disable internet access on the select
desktops, with the exception of internal web-based tools and Google-owned websites like Google Drive and
Gmail. Some workers who need the internet to do their job will get exceptions, the company stated
in materials. In addition, some employees will have no root access, meaning they won't be able
to run administrative commands or do things like install software. Google is running the program
to reduce the risk of cyber attacks, according to internal materials. Googlers are frequent targets of
attacks, one internal description viewed by CNBC stated. If a Google employee's device is compromised,
the attackers may have access to user data and infrastructure code, which could result in a major
incident and undermine user trust, the description added. Turning off most internet access ensures
attackers cannot easily run arbitrary code remotely or grab data, the description explained.
The program comes as companies face increasingly sophisticated cyber attacks.
Last week, Microsoft said Chinese intelligence hacked into company email accounts belonging to two dozen
government agencies, including the State Department, in the U.S. and Western Europe in a, quote,
significant breach. Google has been pursuing U.S. government contracts since launching a
public sector division last year, end quote. Finally today, back to the AI and back to a sign of how
quickly this stuff is moving. It's moving so fast that the internet now is no longer good enough
source material for the AI. Microsoft, OpenAI, Cohere, and others are apparently testing the
use of synthetic data as they are increasingly finding generic data from the web
is no longer good enough for training LLMs.
Although, I don't know, was it ever?
Or was this just the easiest stuff to get at the beginning?
Quoting the Financial Times.
Currently, LLMs that power chatbots such as OpenAI's ChatGPT and Google's Bard
are trained primarily by scraping the Internet.
Data used to train these systems includes digitized books, news articles, blogs,
search queries, Twitter, and Reddit posts,
YouTube videos, and Flickr images, among other content.
Humans are then used to provide feedback and fill gaps in the information in a process known as
reinforcement learning by human feedback, RLHF.
But as generative AI software becomes more sophisticated, even deep-pocketed AI companies are running out of easily accessible and high-quality data to train on.
Meanwhile, they are under fire from regulators, artists, and media organizations around the world over the volume and provenance of personal data consumed by the technology.
At an event in London in May, OpenAI's chief executive Sam Altman was asked whether he was
worried about regulatory probes into ChatGPT's potential privacy violations. Altman brushed it off,
saying he was, quote, pretty confident that soon all data will be synthetic data, end quote.
Generic data from the web is no longer good enough to push the performance of AI models,
according to developers. If you could get all the data that you needed off the web,
that would be fantastic, said Aidan Gomez, chief executive of $2 billion LLM startup Cohere.
In reality, the web is so noisy and messy that it's not really representative of the data
that you want. The web just doesn't do everything we need, end quote. To dramatically improve their
performance and be able to address challenges in science, medicine, or business, AI models will require
unique and sophisticated data sets. These will either have to be created by world experts,
such as scientists, doctors, authors, actors, or engineers, or acquired as proprietary data from
large corporations such as pharmaceuticals, banks, and retailers. However, human-created data is
extremely expensive, Gomez said. The new trend of using synthetic data sidesteps this costly
requirement. Instead, companies can use AI models to produce text, code, or more complex information
related to health care or financial fraud. This synthetic data is then used to train advanced
LLMs to become ever more capable. According to Gomez, Cohere, as well as several of its competitors,
already use synthetic data, which is then fine-tuned and tweaked by humans. Synthetic data is already
huge, even if it's not broadcast widely, he said. For example, to train a model on advanced mathematics,
Cohere might use two AI models talking to each other, where one acts as the math tutor and the other
is the student. They're having a conversation about trigonometry, and it's all synthetic, Gomez said.
It's all just imagined by the model. And then the human looks at this conversation and goes in and
corrects it if the model said something wrong. That's the status quo today, end quote.
Two recent studies from Microsoft Research showed that synthetic data could be used to train models that were smaller and simpler than state-of-the-art software, such as OpenAI's GPT-4 or Google's PaLM 2.
One paper described a synthetic data set of short stories generated by GPT-4, which only contained words that a typical four-year-old might understand.
This data set, known as TinyStories, was then used to train a simple LLM that was able to produce fluent and grammatically correct stories.
The other paper showed that AI could be trained on synthetic Python code in the form of textbooks and exercises, which they found performed relatively well on coding tasks.
Startups such as Scale AI and Gretel AI have sprung up to provide synthetic data as a service.
Gretel, set up by former U.S. intelligence analysts from the National Security Agency and the CIA, works with companies including Google, HSBC, Riot Games, and Illumina to augment their existing data with synthetic versions that can help train better AI models.
The key component of synthetic data, according to Gretel chief executive Ali Golshan,
is that it preserves the privacy of all individuals in a data set while still maintaining its statistical integrity.
Well-crafted synthetic data can also remove biases and imbalances in existing data, he added.
Hedge funds can look at black swan events and, say, create 100 variations to see if our models crack,
Golshan said.
For banks where fraud typically constitutes less than 100th of a percent of total data,
Gretel's software can generate, quote, thousands of edge case scenarios on fraud and train AI models with it, end quote.
Critics point out that not all synthetic data will be carefully curated or reflect or improve on real-world data.
As AI-generated text and images start to fill the Internet, it is likely that AI companies crawling the web for training data will inevitably end up using raw data produced by primitive versions of their own models, a phenomenon known as dogfooding.
Research from universities including Oxford and Cambridge recently warned that training AI models on their own raw outputs,
which may contain falsehoods or fabrications, could corrupt and degrade the technology over time, causing irreversible defects, end quote.
So usually when I ask you, the hive mind, for help on something, I get about a dozen responses, maybe 30, which is always great.
The whole reason I found the developer who is working with me on this AI side project is because when I
asked for recommendations, you all delivered in a big way. So when I asked yesterday for some
alpha test volunteers, I expected I'd get, you know, the usual 30 or so responses. Instead,
I got over 200, which is great. Much appreciated. But there's no way I can ask all of you to do
this testing. There's no way I need that many testers. Really, only about a dozen should do it.
Tonight is the crunch meeting with the developer to see if we're ready to deploy. So a dozen or so
of you might get an email from me either tomorrow or by Saturday. If you don't get an email about
helping me test my AI experiment, don't take it personally. I just don't need quite that much help
from that many people. But I still want you to know that even if you don't hear from me,
I am deeply, deeply appreciative that so many of you were willing to help. Talk to you tomorrow.
