The AI Daily Brief: Artificial Intelligence News and Analysis - The Significance of OpenAI's New Data Partnerships Program

Episode Date: November 10, 2023

OpenAI closed the week with yet another announcement -- this time about a new data partnerships program through which they hope to expand the data that AI models have access to. NLW explores why it matters. ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI.  Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/

Transcript
Starting point is 00:00:00 Today on the AI Breakdown, we're looking at the new OpenAI Data Partnerships program as well as questions around their relationship with Microsoft. The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Go to breakdown.network for more information about our YouTube channel, our Discord, and our newsletter. Welcome back to the AI Breakdown Brief, all the AI headline news you need in around five minutes. Well, not content to just launch a set of revolutionary new products in the form of custom GPTs, the Assistants API, and the new GPT-4 Turbo, OpenAI has concluded the week with an announcement around its new approach to data partnerships. The company writes, we are introducing OpenAI data partnerships where we'll work together with organizations to produce public and private
Starting point is 00:00:51 datasets for training AI models. To ultimately make AGI that is safe and beneficial to all of humanity, we'd like AI models to deeply understand all subject matters, industries, cultures, and languages, which requires as broad a training dataset as possible. Including your content can make AI models more helpful to you by increasing their understanding of your domain. We're already working with many partners who are eager to represent data from their country or industry. For example, we recently partnered with the Icelandic government to improve GPT-4's ability to speak Icelandic by integrating their curated datasets. We've also partnered with nonprofit organization Free Law Project, which aims to democratize
Starting point is 00:01:25 access to legal understanding by including their large collection of legal documents in AI training. So I actually think that while this is sort of a small announcement relative to all the features that we've got this week, there is a lot embedded here, and probably more than it seems like on the surface alone. So why might OpenAI be interested in this? Reason one. Just getting access to more data is a good thing. Now, we've been in a paradigm up till now, more or less, where the datasets that large language models are trained on are just datasets that are already out there natively. The companies training these LLMs have more or less just had to go out and get what is publicly available to them, or in some cases, rarer, frankly, licensed different
Starting point is 00:02:01 datasets that are available. And so the knowledge of the world of these LLMs, which are becoming increasingly interwoven with so many different functions across business and society, is dictated by just what happened to be available before, or more specifically, what the makers of LLMs, like OpenAI, could find and get access to. I think in many ways these OpenAI data partnerships represent an inflection point, where we're moving away from a world where LLMs just use whatever they can get their hands on, to a world in which LLMs, or the companies that train them more specifically, are very intentional about pulling a specific breadth of different datasets into their models. This means actively going out and recruiting and finding those datasets where they are,
Starting point is 00:02:38 which in many cases is going to be in private stores that aren't easily available unless they are made available. So one part of this is the never-ending quest for more data and the related shift from a world where LLM makers just grab whatever they can to a world in which they're specifically going out and trying to get valuable datasets to be a part of what they're building. Now, the second thing that's interesting about this is also a paradigm shift. We have firmly moved into the "ask forgiveness" portion of "ask forgiveness, not permission" when it comes to acquiring data for AI training. Obviously, if you've been listening to this show and you've heard about any one of the numerous lawsuits around copyright and AI model training practices, you know that there are a lot of unsolved questions
Starting point is 00:03:15 when it comes to what AI companies should and shouldn't have access to or how they should and shouldn't be able to access certain types of data. I think then viewed in this light, this represents a shift to these companies, again, trying to be above board and get access to data with the permission of the data holders. A third part of this, which I think is also a little bit related to policy, is an attempt to deal with some of the issues that many people in Washington and Brussels and other areas of political power have been concerned about, which is reifying the bias that exists in existing data sets in this next generation of technology. Basically, if the publicly available data that most of these models are going to be trained on contains
Starting point is 00:03:51 within it implicit bias, or even just big gaps in its understanding that amount to an effective bias because entire groups' histories or identities are just not part of the training set, then this is a way to proactively go and try to fill those gaps. Now, a fourth piece of this is what happens when we run out of data to train these models on. This is a lot less ludicrous than it sounds, given how big these models are getting. This is kind of an overturning-every-stone approach, I think. And while it doesn't mean that OpenAI doesn't see value in synthetic data, for example, it certainly means they're trying to get as much of the real-world data as they can get their hands on as a part of their development of frontier models. Now, a fifth interesting piece of this
Starting point is 00:04:27 has to do with what type of data they're looking for, and specifically the fact that they don't just want text, but they also are looking for images, audio, or video. They write, we're interested in large-scale datasets that reflect human society and that are not easily accessible online to the public today. We're particularly looking for data that expresses human intention, e.g., long-form writing or conversations rather than disconnected snippets, across any language, topic, or format. Lastly, OpenAI says that they can work with people on an open basis, where data becomes widely available, or they can work to keep certain data private. Overall, they say, we are seeking partners who want to help us teach AI to understand our world
Starting point is 00:05:00 in order to be maximally helpful to everyone. Together, we can move towards AGI that benefits all of humanity. Now, speaking of all of humanity, over the last couple days, it seems like all of humanity has been trying to use OpenAI's new custom GPTs. For the first time, basically since ChatGPT launched, there were big periods of time over the last 48 hours or so where the service was just simply unavailable. The problem, according to an OpenAI outage page, was, quote, we're dealing with periodic outages due to an abnormal traffic pattern reflective of a DDoS attack. We are continuing work to mitigate this. Now, on Thursday, hackers from the group Anonymous Sudan claimed responsibility on Telegram for the DDoS attack. This group has also
Starting point is 00:05:38 orchestrated DDoS attacks against Microsoft, as well as organizations in Sweden in an attempt to screw up their NATO application. Cybersecurity company Truesec has previously labeled Anonymous Sudan as a Russian-backed hacker group. Now, another story that had to do with ChatGPT came from Snap's AR event held on Thursday. The Verge writes, ChatGPT is powering a new kind of Snapchat lens. Now, of course, Snap is no stranger to artificial intelligence features. For some time now, they've had something called My AI, which is a chatbot that users can talk to just like they're messaging a friend.
Starting point is 00:06:08 Now, in a demo of the new updated Lens Studio for developers, Snapchat showed off the ability for those developers to create filters that incorporate ChatGPT. The example they gave was a solar system-themed filter where the user asked how far away is Neptune and where the lens, again powered by ChatGPT, could actually answer that question. Still, the biggest news about ChatGPT and other big tech companies was that due to security concerns, Microsoft, which of course owns almost half of OpenAI, had for a time restricted employee access to ChatGPT. In an update on an internal website, Microsoft said, due to security and data concerns a
Starting point is 00:06:42 number of AI tools are no longer available for employees to use. In a comment, they said to reporters, while it is true that Microsoft has invested in OpenAI and that ChatGPT has built-in safeguards to prevent improper use, the website is nevertheless a third-party external service. That means you must exercise caution using it due to risks of privacy and security. This goes for any other external AI service such as Midjourney or Replika as well. Now, adding some amount of intrigue, it wasn't very long before Microsoft reinstated access to ChatGPT. And in a statement to CNN, they said that the temporary blockage was actually just a mistake. A spokesperson said, we were testing endpoint control systems for LLMs and inadvertently turned them on for all employees. We restored service shortly after we
Starting point is 00:07:20 identified our error. As we've said previously, we encourage employees and customers to use services like Bing Chat Enterprise and ChatGPT Enterprise that come with greater levels of privacy and security provisions. Now, a lot of the chatter that I saw on Twitter slash X about this tried to make it seem like it was exemplary of a growing wedge or divide between Microsoft and OpenAI. And frankly, I'm just not sure. I think it's certainly possible, and we've seen some moves that would suggest, or at least reinforce, the fact that these still do remain independent companies. For example, Microsoft's partnership with Databricks, as well as just the weird frenemy status, where OpenAI has a much stronger incentive to get users for its ChatGPT Enterprise tools directly rather than
Starting point is 00:07:58 having to go through Microsoft where they get a much lower portion of revenue. What I mean when I say I'm not sure is that I'm not sure if this represents something new or just the inherent weirdness of this very new type of relationship between these companies that is somewhere perhaps uncomfortably between an investment and an acquisition. OpenAI's Sam Altman did try to make it clear that any scuttlebutt around retaliation, such as the rumors that they were blocking Microsoft 365 in response to this, was simply not true. And speaking of tension in the world of big tech partnerships with emergent AI labs, if anyone has a better sense of how the love triangle between Anthropic, Google, and Amazon is going to work, please let me know in the comments or in our Discord.
Starting point is 00:08:36 Remember, a couple of months ago, Anthropic announced this big investment and partnership with Amazon, where they were talking about how they would be using Amazon's chips. But then a few weeks later, they announced another big investment, this time from Google. And now they've said that they're going to use Google's chips. Bloomberg writes, AI startup Anthropic to use Google chips in expanded partnership. Basically, Anthropic is going to be a guinea pig for Google Cloud's new TPU v5e chips. And apparently they've agreed to spend more than $3 billion on Google's cloud computing services over the next four years. Now, of course, this is an extension of a partnership, given that they've used Google Cloud services since 2021. However, when Anthropic announced their deal with
Starting point is 00:09:09 Amazon, Amazon had said that Anthropic plans to run, quote, the majority of its workloads on AWS. Anthropic then said to Bloomberg that the company was taking a, quote, multi-cloud approach, meaning that it won't be exclusive with any one provider. When Bloomberg asked Google Cloud CEO Thomas Kurian about this, he said that he wasn't bothered by Anthropic's work with AWS. Quote, large companies always want to choose multiple clouds. It helps them use the best of each. We're used to competing and also collaborating with other cloud providers. So, I don't know, man. But right now, Anthropic's threading the needle, so more power to them. Lastly, today, just to be clear, in case you haven't seen it yet, if you are a ChatGPT Plus subscriber, you now have access to the
Starting point is 00:09:47 custom GPT builder. I would highly recommend you get in there and start trying it out. I'm fairly certain that within about six months, the vast majority of users' interactions with ChatGPT are going to be mediated by one of these custom application type things. However, for now, that is going to do it for the brief. If you want to hear about the biggest announcement of the day, which is, of course, Humane's AI Pin, you'll have to come back for the main episode. Hello, friends. Well, as you heard from that section, I had intended to do the main part of the episode about the Humane Pin, which just formally launched yesterday. However, just after I recorded the brief, I had a few family emergency type things, nothing overly serious,
Starting point is 00:10:25 but things distracting enough that I am going to have to move that Humane episode to one of the weekend shows. So for today, we will just be putting out this brief. Apologies for the shorter content, but I will catch you back here tomorrow. Thanks for listening as always. Peace.
