The AI Daily Brief: Artificial Intelligence News and Analysis - ChatGPT Vision: 8 Ways People Are Using It Already

Episode Date: September 30, 2023

ChatGPT Vision hasn't even been broadly rolled out yet, and already people who do have access are showing off some amazing use cases. Before that on the Brief: have the AI phone wars begun?
TAKE OUR SURVEY ON EDUCATIONAL AND LEARNING RESOURCE CONTENT: https://bit.ly/aibreakdownsurvey
ABOUT THE AI BREAKDOWN: The AI Breakdown helps you understand the most important news and discussions in AI.
Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe
Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown
Join the community: bit.ly/aibreakdown
Learn more: http://breakdown.network/

Transcript
Starting point is 00:00:00 Today on the AI Breakdown, we're looking at some of the most impressive use cases of GPT-4 Vision before it's even fully come out. Before that on the Brief, have the AI phone wars begun? The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Go to breakdown.network for more information about our Discord, our YouTube channel, and our newsletter. Welcome back to the AI Breakdown Brief, all the AI headline news you need in around five minutes. We pick up a story that we started talking about earlier this week. The Information reported that Jony Ive, the designer behind Apple's most iconic products, the iPod, the iMac, the iPhone, has been having conversations with Sam Altman about an AI device, some AI hardware
Starting point is 00:00:46 company. Now, obviously, this got a lot of people excited. There is always this allure of new hardware, even more than software sometimes. And the news got a little more meat on the bone when the Financial Times reported yesterday that not only are these guys talking, and not only is Masayoshi Son from SoftBank involved, but that a fundraise worth around a billion dollars from SoftBank has been in the conversation. Now, the explicit language around this is the iPhone of artificial intelligence. Of course, what that means is not clear. But according to FT sources, it sort of seems like Sam Altman had this idea and then went out and recruited Ive, and that they've been holding brainstorming sessions in San Francisco. The best information
Starting point is 00:01:26 we can get about what they're thinking is what the Financial Times writes: they hope to create a more natural and intuitive user experience for interacting with AI, in the way that the iPhone's innovations in touchscreen computing unleashed the mass market potential of the mobile internet. Now, as for what that turns into, whether it's a particular design or even a particular type of device, there are apparently many different ideas on the table. In terms of how likely this is to move forward, the sources characterize the discussions as serious, but there isn't any deal that has been agreed, and it still could be months before anything is formally announced. Now, the way that The Information summed this news up was calling it the beginning of the AI phone wars.
Starting point is 00:02:04 Kate Clark writes, For years, Silicon Valley has been trying to figure out if hardware companies can generate venture-scale returns given the example of high-profile venture-backed disappointments, like Essential Products and Magic Leap. They go on, phones have proved particularly tricky. The dominance of Apple and Google with mobile app developers has made it very hard for competing startups and even huge companies like Microsoft and Amazon to break into that market. The question is whether the advancements OpenAI has
Starting point is 00:02:28 made in AI can upset the balance and change the equation for Altman and Ive's potential new business. Now, as you might expect, then, there is a fair bit of skepticism around this. Entrepreneur, writer, and investor Sam Lessin writes, these aspiring tech platform companies always with the "we need to build a phone." Geez. Lessin also wrote, mark my words, from repeated personal experience, when Masa buys, you do whatever you can to sell. Everything. On the flip side, others who purport to have more information are pretty excited. Brian Roemmele writes, As my clients have known for over three years, OpenAI is building devices.
Starting point is 00:03:02 Now, former Apple designer Jony Ive is on board. Wait till you see what this is. World-changing. Alas, for us, with that intriguing comment, we have to move on to our next topic. Google has made it easier for publishers to opt out of their content becoming fodder for AI training. And this is really interesting, particularly in the context of Google. The reason that publishers let Google crawl their site is because there is value in them being indexed for search. Publishers want to be found. They want people to come to their site. There's an entire industry,
Starting point is 00:03:32 SEO, that's entirely designed around that fact. However, Google is now separating these out, giving publishers the ability to still opt in to being indexed for search while opting out of their content becoming training data. Now, for Google, this represents a couple things. One, them trying to be good stewards and give publishers more control, but two, it's probably also trying to get out ahead of regulatory queries and legal battles by showing that they are in good faith giving people more tools to opt out should they not want to have their content be a big part of AI training data sets. OpenAI, as you guys know, has done something similar. When they announced their own web crawler, they also announced the way that publishers
Starting point is 00:04:07 could block that crawler, and many publishers, including the New York Times, CNN, Reuters, and Medium, have chosen to do so. Now, interestingly, it appears that for some of these publishers, those tools for simply blocking web crawlers may not go far enough. Medium CEO Tony Stubblebine said, I'm not a hater, but I also want to be plain spoken that the current state of generative AI is not a net benefit to the internet. They're making money on your writing without asking for your consent, nor are they offering you compensation and credit. AI companies have leeched value from writers in order to spam internet readers. Now, I am not here
Starting point is 00:04:38 endorsing that view, but fair enough. But then he goes on, Medium is not alone. We are actively recruiting for a coalition of other platforms to help figure out the future of fair use in the age of AI. I've talked to many big companies. These are the big organizations that you could probably guess, but they aren't ready to publicly work together. Now, TechCrunch reported this as a nascent media coalition to block AI crawlers, and that certainly seems like kind of what this is. Ultimately, I think this is going to be a question that gets settled in courts, around to what extent and how fair use applies to AI data training, but certainly coalitions of companies can use soft power to try to influence outcomes as well. Over in the world of Amazon,
Starting point is 00:05:14 Bedrock is now generally available. Bedrock, they write, is a fully managed service that offers a choice of high-performing foundation models from leading AI companies, along with a broad set of capabilities to build generative AI applications, simplifying development while maintaining privacy and security. Basically, rather than offering their own foundation model, Amazon has been focused on giving the enterprises they work with the chance to build new models from scratch or customize existing models to suit their enterprise needs. That service is now apparently available to anyone who wants to use it. Now, here's an interesting twist. When Facebook renamed itself to Meta, there was a fair bit of skepticism around Zuckerberg's vision. Metaverse, to many,
Starting point is 00:05:54 seemed like a buzzword that was destined to be thrown in the trash along with other big terms from the crypto top as soon as that cycle hemorrhaged. But it appears that the metaverse may be back. Yesterday, Lex Fridman tweeted, here's my conversation with Mark Zuckerberg, his third time on the podcast. But this time we talked in the metaverse as photorealistic avatars. This was one of the most incredible experiences of my life. It really felt like we were talking in person, but we were miles apart. It's hard to put into words how awesome this was for someone like me who values the intimacy of in-person conversation. It gave me a glimpse of an exciting future with many new possibilities and fascinating questions about the nature of reality and human connection. Now, of course,
Starting point is 00:06:29 that was shared with a video that has now been seen around 10 million times. And in that video, you can see that they've moved far away from the weird little digital avatars with no feet to actual photorealistic representations of the person you're speaking with. The reactions have been really positive. Sriram Krishnan writes, this is one of the most mind-blowing things I've seen. It's not even Uncanny Valley anymore, just stunning. Sar Haribhakti writes, Zuck is on his I Told You So Revenge Tour. And Raoul Pal writes,
Starting point is 00:06:55 The Exponential Age accelerates again. Moving on to our penultimate topic, Rob Joyce, the director of cybersecurity at the National Security Agency, has announced that the NSA is creating a new center for AI security. The NSA calls this a crucial mission as AI capabilities are increasingly acquired,
Starting point is 00:07:12 developed, and integrated into U.S. defense and intelligence systems. Said Army General Paul Nakasone, we maintain an advantage in AI in the United States today. That AI advantage should not be taken for granted. Now, in terms of how they plan to use AI, Nakasone said, AI helps us, but our decisions are made by humans, and that's an important distinction. At the end of the day, decisions will be made by humans and humans in the loop. What's interesting is that this follows the announcement just a couple days ago that the CIA is itself working on a version of ChatGPT, but for the 18 intelligence agencies that make up the U.S. intelligence apparatus. Again, as much hemming and hawing and debating as there is on Capitol Hill about
Starting point is 00:07:47 the right policies regarding AI, the military establishment, at least, is moving fully ahead. Lastly, today, a really interesting tweet from Andrej Karpathy. Karpathy is, of course, at OpenAI, and he argues that we shouldn't be thinking about LLMs as chatbots, but instead as, quote, the kernel process of a new operating system. As he puts it, today it orchestrates input and output across modalities, text, audio, and vision; code interpreter, the ability to write and run programs; browser and internet access; and embeddings database for files and internal memory storage and retrieval. He ends the thought, TLDR, looking at LLMs as chatbots is the same as looking at early computers as calculators.
Starting point is 00:08:24 We're seeing an emergence of a whole new computing paradigm, and it is very early. I think anyone who's really spent the time thinking from first principles about what we might be doing with these technologies in the long run can certainly agree that they are even more than they seem today. But for now, that is going to do it for today's AI Breakdown Brief. Next up, the main AI Breakdown. Today we are looking at the fruits of one of the more exciting AI product announcements recently, which is, of course, ChatGPT with Vision.
Starting point is 00:08:51 Today we're going to go through eight different ways that people are already discovering how to use this new tool, in hopes that they give you ideas for how you might use it when it becomes widely available. Now, one caveat: unfortunately, I myself have not had a chance to experiment with this yet. I, alas, have not been gifted this incredible tool. So this is all a curation from what other people have done. And because of that, and because access is limited right now, you will notice that some of the folks who are sharing their results show up across a number of these different categories. Let's start first and foremost with just the obvious and most basic use case of visual research.
Starting point is 00:09:25 Rowan Cheung showed an example of a photo looking out from the mouth of a cave on what looks like a lush tropical environment. He writes, where is this? ChatGPT responds, the image appears to be taken from inside a cave overlooking a coastline with a distinctly curving road. Based on the scenery and the characteristics of the landscape, it strongly resembles the view from Makapu'u Point on the island of Oahu in Hawaii. Now, I've seen other people post similar demonstrations where they give, for example, a photo of a landscape or a city and ask where is this. I've seen other people experiment with just visual recognition tasks, asking what type of an animal is in a shot, for example. And so far, at least, ChatGPT with Vision seems to
Starting point is 00:09:59 perform that pretty well. Now, especially with the integration into the mobile app, I actually think this is going to be a use case that people use a lot. It feels to me like a very standard part of travel in the future could be pointing your ChatGPT app at something you're looking at and saying, what is that, or tell me about that. But now let's move on to some slightly different use cases. One that's in the column of just creative, quirky, and fun is GPT-4 Vision for interior design. Pietro Schirano, who you're going to hear about a lot in this video as he has done a ton of experiments, writes, I love how it's incorporating what it knows about me in the suggestion because of custom instructions. So basically, he has posted a picture of a room and says, how could I improve it?
Starting point is 00:10:38 ChatGPT gives a number of suggestions to enhance the room, from color to lighting to plants to art. Now, in terms of custom instructions, that feature is the one in which you can give ChatGPT more information about yourself, so it has that as context when it answers future queries. And one of the places that that comes up here is in the art suggestion. ChatGPT writes, given your background in classical studies in art, perhaps adding some artwork on the walls could be a great personal touch. They could be prints of classical artworks or something contemporary to create a blend of old and new. Now, Pietro also shows off our third use case that people are experimenting with. And frankly, if I had to pick what people are most excited about, it's this.
Starting point is 00:11:16 That is the use case of building websites and coding. Pietro writes, from image to live website using GPT-4 Vision and Replit in less than a minute, things are about to get so interesting. So basically, Pietro shares a video of him posting an example UI as a photo and saying, replicate this exactly, don't skip anything, write the code, from which he's able to export it and get it into an IDE in an incredibly short amount of time. McKay Wrigley did something similar. He writes,
Starting point is 00:11:43 I gave ChatGPT a screenshot of a SaaS dashboard and it wrote the code for it. This is the future. Now, nearly 7 million people have watched this video to see how GPT-4 with Vision moves from just a screenshot to an actual working prototype, but McKay wasn't done there. He also tweeted,
Starting point is 00:11:59 You can give ChatGPT a picture of your team's whiteboarding session and have it write the code for you. This is absolutely insane. And sure enough, in that video, which has just under 10 million views, McKay shows the whiteboard that's actually in his room, posts an image of it to GPT-4, and says,
Starting point is 00:12:15 you're an expert software developer. This was my team's whiteboarding session for our onboarding flow. You need to write the code for this. Take a deep breath and think step by step about how you will do this. Now write the complete code for this, working one step at a time. You'll notice the language that we talked about in an earlier episode this week of taking a deep breath and thinking step by step, which apparently increases the success of results dramatically.
Starting point is 00:12:35 But let's take a step back here, because all three of these examples are sort of similar and show one of the most powerful uses of this new technology. When people talk about why AI, even though it will disrupt the jobs of today, will enable new jobs, I think you start to get a glimpse of that watching these types of demos. The reduction in the barrier between idea and execution is so monumental here. That from a silly, hard-to-interpret image on a whiteboard there can be working code within minutes is just unlike anything we've seen. It's hard for me to imagine that that doesn't increase the quantity of what we produce. Now, I could stop here, and I think ChatGPT Vision would still be exciting, but there are many, many other use cases that people are exploring, so we will move on. Fourth on our list, reading and explaining diagrams. Now, there are so many examples of this that are posted, but one of my favorite ones comes from John Stokes and Sean Spriggins. It is this unbelievably information-dense
Starting point is 00:13:33 slide, apparently from the Pentagon, titled Integrated Defense Acquisition Technology and Logistics Lifecycle Management System. And to really get the full effect of this, if you're not watching the video, if you're listening to this as the podcast, I suggest you go check it out. There have to be 3,000 words on this page and hundreds of boxes all flowing between each other, and yet ChatGPT is able to make some sense of it. Now, one of the things that's interesting about being able to understand diagrams is that some diagrams are also entirely different types of information. For example, Marco Mascorro posted the electronic schematics of the Arduino design, and ChatGPT with Vision was able to understand that it was an electronic circuit
Starting point is 00:14:08 and explain how the different components interconnected and worked. Now, another example of breaking down a diagram suggests a fifth use case, which is education. McKay once again writes, ChatGPT breaks down this diagram of a human cell for a ninth grader. McKay posts a picture of the type that you might see in any sort of standard science textbook, and ChatGPT gives a ton of additional information about what each of the different components is and what it does in the context of the cell. Now, what he also shows in this video is that it's not just the initial result, but that you can interact with ChatGPT to ask for further clarification.
Starting point is 00:14:39 This sort of dialogue between machine and person is like a non-argumentative, Socratic dialogue, but coming to AI. Now, the flip side, of course, is that education systems are going to have to have a real rethink when it comes to homework. Peter Yang posted a worksheet from Math-Aids.com with a bunch of addition problems into GPT Plus and says, give me the answers. ChatGPT says, certainly, let's solve these addition problems.
Starting point is 00:15:03 Peter's comment sharing it to Twitter is, kids will never do homework again. I actually had a conversation today about the fact that if teachers can figure out exercises that are actually valuable for kids to do, that aren't something that ChatGPT can do, it's likely to mean that that's a much more valuable use of those kids' time, frankly, when it comes to learning. Now, from here, we move into some higher order type of use cases. I'm calling use case number six higher order interpretation. And in some ways, it's a variation on the theme of explaining diagrams. But one of the examples, once again from Pietro, shows that there's a lot more than just image recognition going on here. The image that Pietro inputted to ChatGPT was a four-panel cartoon,
Starting point is 00:15:44 in which three people say, I'm glad we all agree, each thinking about a different shape, one thinking about a square, one thinking about a circle, one thinking about a triangle. The second panel appears to show the images revealed, at which point the three people realize that they actually didn't agree. A third panel seems to show a transmutive process where the shapes combine to become a different shape, to which all the participants say aha, and a fourth panel repeats the message from the first, I'm glad we all agree, but with each of the participants actually thinking about the same shape. When Pietro asked, what do you think is the meaning of this image? ChatGPT responds, the image portrays the concept of group dynamics and perspectives. It's then
Starting point is 00:16:20 able to articulate what happens in each of the panels and how they relate to one another, and comes to the conclusion that, quote, overall, it seems to highlight the importance of communication, understanding, and alignment in group settings. It suggests that even if individuals think they are aligned, without clear communication, misunderstandings can occur. But with effort and discussion, a shared understanding can be achieved. But what's so different about this example in particular is just how much interpretation and understanding of group dynamics and things like that go into this. It's not just an electronic circuit which can be recalled as complex as it might be. This really does feel like it requires higher order thinking, and is in that way a pretty fundamentally different
Starting point is 00:16:59 use case than what we've seen before. Somewhat related to that higher order thinking, one more from Pietro. He writes, using GPT-4 Vision to name never-before-seen architectural styles created with Midjourney. It excels at identifying diverse elements and assigning names to these distinctive creations. So I'm calling the seventh category of use cases creative expression. The images that Pietro shares are a little hard to see, but they look like they have big marble stone, sumptuous classic bedroom furniture, but combined with interesting modern touches and lights. ChatGPT says, observing the blend of traditional Greco-Roman motifs and elements with sleek modern lines, innovative lighting, and contemporary furnishings, I would suggest the name Athenian modernism.
Starting point is 00:17:37 It then goes on to explain why it chose that name, but I think it's pretty perfect. And once again, it feels a bit higher order than just interpreting what's in a photo from the real world. And this gets us to our eighth and final use case, easily the most important. Once again, we turn to Peter Yang, who has presented ChatGPT with Vision an image of the most confusing set of street parking rules that you have ever seen. No parking 11 to 1 Tuesday, street cleaning. Tow-away, school days. No stopping Monday to Friday.
Starting point is 00:18:03 School day exceptions. Tow-away, school days. This is six feet tall, at least, of parking rules. Peter posts the image in and says, it's Wednesday at 4 p.m. Can I park at the spot right now? Tell me in one line. ChatGPT says, yes, you can park for up to one hour starting at 4 p.m. Peter writes, I will never get a parking ticket again.
Starting point is 00:18:22 Now, the question comes up, of course, are there things that ChatGPT with Vision can't do? Are there any areas where people have been disappointed? I'm sure that we're going to get a lot more of that once more people have their hands on it. But for some initial thoughts, we turn to a blog post from Roboflow, from James Gallagher and Piotr Skalski, all about their first impressions of GPT-4 Vision. On some tests it did well, including visual question answering and object detection, but it wasn't perfect in optical character recognition. They posted a slightly blurry image of a tire and said,
Starting point is 00:18:50 read the serial number, return only number, no additional text. They say GPT-4V was unable to correctly identify the serial number in an image of a tire. Some numbers were correct, but there were several errors in the results from the model. When it came to CAPTCHAs, they write, we found that GPT-4V was able to identify that an image contained a CAPTCHA, but often failed the test. In a traffic light example, GPT-4V missed some boxes that contained traffic lights. There were also some mistakes on crosswords. They write, the model appeared to read the clues correctly, but misinterpreted the structure of the board. As a result, the provided answers were incorrect. The same limitation, they say, was exhibited in their Sudoku tests. Now, these may seem like
Starting point is 00:19:25 minor quibbles with such an impressive piece of technology, and they are. I present them only to try to give a more robust picture and a reminder of the fact that, as incredible as this is, it isn't perfect and there are still advances to be made. But overall, it is a fairly huge update, and it makes sense why many people inside OpenAI think that this is in some ways the biggest product launch that they've had since ChatGPT came out in the first place. Anyways, guys, hope you are as excited as I am about getting your hands on GPT-4 with Vision. I know I will be eagerly refreshing until the day it actually shows up.
Starting point is 00:19:57 Appreciate you listening as always, and until next time, peace.
