The AI Daily Brief: Artificial Intelligence News and Analysis - ChatGPT Vision: 8 Ways People Are Using It Already
Episode Date: September 30, 2023
ChatGPT Vision hasn't even been broadly rolled out yet, and already people who do have access are showing off some amazing use cases. Before that on the Brief: have the AI phone wars begun?
TAKE OUR SURVEY ON EDUCATIONAL AND LEARNING RESOURCE CONTENT: https://bit.ly/aibreakdownsurvey
ABOUT THE AI BREAKDOWN
The AI Breakdown helps you understand the most important news and discussions in AI.
Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe
Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown
Join the community: bit.ly/aibreakdown
Learn more: http://breakdown.network/
Transcript
Today on The AI Breakdown, we're looking at some of the most impressive use cases of GPT-4 Vision before it's even fully come out.
Before that on the Brief, have the AI phone wars begun?
The AI Breakdown is a daily podcast and video about the most important news and discussions in AI.
Go to Breakdown.network for more information about our Discord, our YouTube channel, and our newsletter.
Welcome back to The AI Breakdown Brief, all the AI headline news you need in around five minutes.
We pick up a story that we started talking about earlier this week. The Information
reported that Jony Ive, the designer behind Apple's most iconic products (the iPod, the iMac,
the iPhone), has been having conversations with Sam Altman about an AI device, or some AI hardware
company. Now, obviously, this got a lot of people excited. There is always this allure of new
hardware, sometimes even more than software. And the news got a little more meat on the bone
when the Financial Times reported yesterday that not only are these guys talking, and not
only is Masayoshi Son from SoftBank involved, but that a fundraise worth around a billion
dollars from SoftBank has been in the conversation. Now, the explicit language around this
is "the iPhone of artificial intelligence." Of course, what that means is not clear. But according
to FT sources, it seems like Sam Altman had this idea, then went out and recruited
Ive, and they've been holding brainstorming sessions in San Francisco. The best information
we can get about what they're thinking is what the Financial Times writes: they hope to create a
more natural and intuitive user experience for interacting with AI, in the way that the iPhone's
innovations in touchscreen computing unleashed the mass market potential of the mobile internet.
Now, as for what that turns into, whether a particular design or even a particular type of
device, there are apparently many different ideas on the table. In terms of how likely this is to move
forward, the sources characterize the discussions as serious, but no deal
has been agreed, and it could still be months before anything is formally announced. Now, the way
that The Information summed this news up was by calling it the beginning of the AI phone wars.
Kate Clark writes,
For years, Silicon Valley has been trying to figure out if hardware companies can generate
venture-scale returns given the example of high-profile venture-backed disappointments,
like Essential Products and Magic Leap.
They go on, phones have proved particularly tricky.
The dominance of Apple and Google with mobile app developers has made it very hard for competing
startups and even huge companies like Microsoft and Amazon to break into that market.
The question is whether the advancements OpenAI has
made in AI can upset the balance and change the equation for Altman and Ive's potential
new business. Now, as you might expect, there is a fair bit of skepticism around this.
Entrepreneur, writer, and investor Sam Lessin writes,
these aspiring tech platform companies, always with the "we need to build a phone." Geez.
Lessin also wrote, mark my words, from repeated personal experience: when Masa buys,
you do whatever you can to sell. Everything. On the flip side, others who purport to have more
information are pretty excited. Brian Roemmele writes,
As my clients have known for over three years, OpenAI is building devices.
Now, former Apple designer Johnny Ive is on board.
Wait till you see what this is.
World-changing.
Alas, with that intriguing comment, we have to move on to our next topic.
Google has made it easier for publishers to opt out of their content becoming fodder for AI training.
And this is really interesting, particularly in the context of Google.
The reason that publishers let Google crawl their site is because there is value in them being indexed for search.
Publishers want to be found. They want people to come to their site. There's an entire industry,
SEO, that's designed around that fact. However, Google is now letting
publishers stay opted in to being indexed for search while opting out of their
content becoming training data. Now, for Google, this represents a couple of things. One, it's trying
to be a good steward and give publishers more control, but two, it's probably also trying to get out
ahead of regulatory scrutiny and legal battles by showing that it is in good faith giving people more
tools to opt out should they not want their content to be a big part of AI training data sets.
OpenAI, as you guys know, has done something similar.
When they announced their own web crawler, they also announced the way that publishers
could block that crawler, and many publishers, including the New York Times, CNN, Reuters,
and Medium have chosen to do so.
Now, interestingly, it appears that for some of these publishers, those tools for simply
blocking web crawlers may not go far enough.
Medium CEO Tony Stubblebine said, I'm not a hater, but I also want to be plainspoken
that the current state of generative AI is not a net benefit to the internet. They're making money
on your writing without asking for your consent, nor are they offering you compensation and credit.
AI companies have leeched value from writers in order to spam internet readers. Now, I am not here
endorsing that view, but fair enough. But then he goes on, Medium is not alone. We are actively
recruiting for a coalition of other platforms to help figure out the future of fair use in the age of
AI. I've talked to many big companies. These are the big organizations that you could probably guess,
but they aren't ready to publicly work together. TechCrunch
reported this as a nascent media coalition to block AI crawlers, and that certainly seems like
kind of what this is. Ultimately, I think this is going to be a question that gets settled in court,
over to what extent and how fair use applies to AI training data, but certainly coalitions of
companies can use soft power to try to influence outcomes as well. Over in the world of Amazon,
Bedrock is now generally available. Bedrock, they write, is a fully managed service that offers
a choice of high-performing foundation models from leading AI companies, along with a broad set of
capabilities to build generative AI applications, simplifying development while maintaining privacy
and security. Basically, rather than focusing on offering its own foundation model, Amazon has focused
on giving the enterprises it works with the chance to build new models from scratch or
customize existing models to suit their enterprise needs. That service is now apparently available to
anyone who wants to use it. Now, here's an interesting twist. When Facebook renamed itself to Meta,
there was a fair bit of skepticism around Zuckerberg's vision. The metaverse, to many,
seemed like a buzzword that was destined to be thrown in the trash along with other big terms from
the crypto top as soon as that cycle collapsed. But it appears that the metaverse may be back.
Yesterday, Lex Fridman tweeted, here's my conversation with Mark Zuckerberg, his third time on the
podcast. But this time we talked in the metaverse as photorealistic avatars. This was one of the
most incredible experiences of my life. It really felt like we were talking in person, but we were
miles apart. It's hard to put into words how awesome this was for someone like me who values the
intimacy of in-person conversation. It gave me a glimpse of an exciting future with many new
possibilities and fascinating questions about the nature of reality and human connection. Now, of course,
that was shared with a video that has now been seen around 10 million times. And in that video,
you can see that they've moved far away from the weird little digital avatars with no feet
to actual photorealistic representations of the person you're speaking with. The reactions have been
really positive. Sriram Krishnan writes, this is one of the most mind-blowing things I've seen.
It's not even uncanny valley anymore, just stunning.
Sar Haribhakti writes,
Zuck is on his I-told-you-so revenge tour.
And Raoul Pal writes,
The exponential age accelerates again.
Moving on to our penultimate topic,
Rob Joyce, the director of cybersecurity
at the National Security Agency,
has announced that the NSA is creating
a new center for AI security.
The NSA calls this a crucial mission
as AI capabilities are increasingly acquired,
developed, and integrated into U.S. defenses and intelligence systems.
Said Army General Paul Nakasone,
we maintain an advantage in AI in the United States today. That AI advantage should not be taken for granted.
Now, in terms of how they plan to use AI, Nakasone said, AI helps us, but our decisions are made by humans,
and that's an important distinction. At the end of the day, decisions will be made by humans, and humans in the
loop. What's interesting is that this follows the announcement just a couple days ago that the CIA
is itself working on a version of ChatGPT, but for the 18 intelligence agencies that make up the
U.S. intelligence apparatus. Again, as much hemming and hawing and debating as there is on Capitol Hill about
the right policies regarding AI, the military establishment, at least, is moving full steam ahead.
Lastly, today, a really interesting tweet from Andrej Karpathy. Karpathy is, of course, at OpenAI,
and he argues that we shouldn't be thinking about LLMs as chatbots, but instead as, quote,
the kernel process of a new operating system. As he puts it, today it orchestrates input and
output across modalities (text, audio, and vision), code interpreter (the ability to write and run programs),
browser and internet access, and an embeddings database for files and internal memory storage and retrieval.
He ends the thought:
TL;DR, looking at LLMs as chatbots is the same as looking at early computers as calculators.
We're seeing an emergence of a whole new computing paradigm, and it is very early.
I think anyone who's really spent the time thinking from first principles about what we might
be doing with these technologies in the long run can certainly agree that they are even more than
they seem today.
But for now, that is going to do it for today's AI Breakdown Brief.
Next up, the main AI Breakdown.
Today we are looking at the fruits of one of the more exciting AI product announcements recently,
which is, of course, ChatGPT with Vision.
Today we're going to go through eight different ways that people are already discovering
how to use this new tool in hopes that they give you ideas for how you might use it when
it becomes widely available.
Now, one caveat: unfortunately, I myself have not had a chance to experiment with this yet.
I, alas, have not been gifted this incredible tool.
So this is all a curation from what other people have done.
And because of that, and because access is limited right now, you will notice that some of the folks who are sharing their results show up across a number of these different categories.
Let's start first and foremost with just the obvious and most basic use case of visual research.
Rowan Cheung showed an example of a photo looking out from the mouth of a cave onto what looks like a lush tropical environment.
He writes, where is this?
ChatGPT responds, the image appears to be taken from inside a cave overlooking a coastline with a distinctly curving road.
Based on the scenery and the characteristics of the landscape, it strongly resembles the
view from Makapu'u Point on the island of Oahu in Hawaii. Now, I've seen other people post
similar demonstrations where they give, for example, a photo of a landscape or a city and ask where
it is. I've seen other people experiment with simple visual recognition tasks, asking what type
of animal is in a shot, for example. And so far, at least, ChatGPT with Vision seems to
perform pretty well on those. Now, especially with the integration into the mobile app, I actually think
this is going to be a use case that people use a lot. It feels to me like a very standard part of
travel in the future could be pointing your ChatGPT app at something you're looking at and saying,
what is that, or tell me about that? But now let's move on to some slightly different use cases.
One that's in the column of just creative, quirky, and fun is GPT-4 Vision for interior design.
Pietro Schirano, who you're going to hear about a lot in this video as he has done a ton of
experiments, writes, I love how it's incorporating what it knows about me in the suggestion because
of custom instructions. So basically, he has posted a picture of a room and asked, how could I improve it?
ChatGPT gives a number of suggestions to enhance the room, from color to lighting to plants to art.
Now, custom instructions is the feature in which you can give ChatGPT more information about yourself, so it has that as context when it answers future queries.
And one of the places that comes up here is in the art suggestion.
ChatGPT writes, given your background in classical studies in art, perhaps adding some artwork on the walls could be a great personal touch.
They could be prints of classical artworks or something contemporary to create a blend of old and new.
Now, Pietro also shows off our third use case that people are experimenting with.
And frankly, if I had to pick what people are most excited about,
it's this one.
That is the use case of building websites and coding.
Pietro writes, from image to live website using GPT-4 Vision and Replit in less than a minute,
things are about to get so interesting.
So basically, Pietro shares a video of him posting an example UI as a photo and saying,
replicate this exactly, don't skip anything, write the code, from which he's able to export it
and get it into an IDE incredibly quickly.
Mckay Wrigley did something similar.
He writes,
I gave ChatGPT a screenshot of a SaaS dashboard
and it wrote the code for it.
This is the future.
Now, nearly 7 million people have watched this video
to see how GPT-4 with Vision moves from just a screenshot
to an actual working prototype,
but Mckay wasn't done there.
He also tweeted,
You can give ChatGPT a picture of your team's whiteboarding session
and have it write the code for you.
This is absolutely insane.
And sure enough, in that video,
which has just under 10 million views,
Mckay shows an image of the whiteboard that's actually in his room,
posts an image of it to GPT-4,
and says,
you're an expert software developer.
This was my team's whiteboarding session for our onboarding flow.
You need to write the code for this.
Take a deep breath and think step by step about how you will do this.
Now write the complete code for this, working one step at a time.
You'll notice the language that we talked about in an earlier episode this week
of taking a deep breath and thinking step by step,
which apparently improves results dramatically.
But let's take a step back here, because all three of these examples are similar and represent one of the most powerful uses of this new technology.
When people talk about why AI, even though it will disrupt the jobs of today, will enable new jobs, I think you start to get a glimpse of that watching these types of demos.
The reduction in the barrier between idea and execution here is monumental: going from a silly, hard-to-interpret image on a whiteboard to working code within minutes is unlike anything we've seen.
It's hard for me to imagine that that doesn't increase the quantity of what we
produce. Now, I could stop here, and I think ChatGPT Vision would still be exciting, but there are
many, many other use cases that people are exploring, so we will move on. Fourth on our list:
reading and explaining diagrams. Now, there are so many examples of this that have been posted, but one of
my favorites comes from John Stokes and Sean Spriggins: this unbelievably information-dense
slide, apparently from the Pentagon, titled Integrated Defense Acquisition Technology and Logistics
Lifecycle Management System. To really get the full effect, if you're not watching
the video and are listening to this as a podcast, I suggest you go check it out. There have to
be 3,000 words on this page and hundreds of boxes all flowing between each other, and yet
ChatGPT is able to make some sense of it. Now, one of the things that's interesting about being
able to understand diagrams is that some diagrams encode entirely different types of information.
For example, Marco Mascorro posted the electronic schematics of an Arduino design,
and ChatGPT with Vision was able to understand that it was an electronic circuit
and explain how the different components interconnected and worked.
Now, another example of breaking down a diagram suggests a fifth use case, which is education.
Mckay once again writes, ChatGPT breaks down this diagram of a human cell for a ninth grader.
Mckay posts a picture of the type that you might see in any standard science textbook,
and ChatGPT gives a ton of additional information about what each of the different components is
and what they do in the context of the cell.
Now, what he also shows in this video is that it's not just the initial result,
but that you can interact with ChatGPT to ask for further clarification.
This sort of dialogue between machine and person
is like a non-argumentative Socratic dialogue, but with AI.
Now, the flip side, of course,
is that education systems are going to have to have a real rethink
when it comes to homework.
Peter Yang posted a worksheet from mathaids.com
with a bunch of addition problems into ChatGPT Plus and said,
give me the answers. ChatGPT says, certainly, let's solve these addition problems.
Peter's comment sharing it to Twitter is, kids will never do homework again. I actually had a
conversation today about the fact that if teachers can figure out exercises that are actually
valuable for kids to do and that ChatGPT can't do, those are likely to be
a much more valuable use of those kids' time, frankly, when it comes to learning.
Now, from here, we move into some higher-order use cases. I'm calling use case number six
higher-order interpretation. In some ways, it's a variation on the theme of explaining diagrams.
But one of the examples, once again from Pietro, shows that there's a lot more than just image
recognition going on here. The image that Pietro inputted to ChatGPT was a four-panel cartoon,
in which three people say, I'm glad we all agree, each thinking about a different shape,
one thinking about a square, one thinking about a circle, one thinking about a triangle.
The second panel appears to show the images revealed, at which point the three people realize that
they actually don't agree. A third panel seems to show a process of transformation where the shapes
combine to become a different shape, to which all the participants say aha, and a fourth panel
repeats the message from the first, I'm glad we all agree, but with each of the participants
actually thinking about the same shape. When Pietro asked, what do you think is the meaning of this
image? ChatGPT responds, the image portrays the concept of group dynamics and perspectives. It's then
able to articulate what happens in each of the panels and how they relate to one another, and comes to
the conclusion that, quote, overall, it seems to highlight the importance of communication,
understanding, and alignment in group settings. It suggests that even if individuals think they
are aligned, without clear communication, misunderstandings can occur. But with effort and discussion,
a shared understanding can be achieved. What's so different about this example in particular
is just how much interpretation and understanding of group dynamics goes into it.
It's not just an electronic circuit, which, as complex as it might be, can be recalled. This really does
feel like it requires higher-order thinking, and is in that way a pretty fundamentally different
use case than what we've seen before. Somewhat related to that higher-order thinking, one more from
Pietro. He writes, using GPT-4 Vision to name never-before-seen architectural styles created with Midjourney.
It excels at identifying diverse elements and assigning names to these distinctive creations.
So I'm calling the seventh category of use case creative expression. The images that Pietro shares
are a little hard to see, but they look like they have big marble stonework and sumptuous classic bedroom
furniture, combined with interesting modern touches and lights. ChatGPT says,
observing the blend of traditional Greco-Roman motifs and elements with sleek modern lines,
innovative lighting, and contemporary furnishings, I would suggest the name Athenian modernism.
It then goes on to explain why it chose that name, which I think is pretty perfect. And once again, it
feels a bit higher-order than just interpreting what's in a photo from the real world.
And this gets us to our eighth and final use case, easily the most important.
Once again, we turn to Peter Yang, who presented ChatGPT with Vision an image
of the most confusing set of street parking rules that you have ever seen.
No parking 11 to 1 Tuesday, street cleaning.
Tow-away school days.
No stopping Monday to Friday.
School day exceptions.
Tow-away school days.
This is at least six feet of parking rules.
Peter posts the image and says, it's Wednesday at 4 p.m.
Can I park at the spot right now?
Tell me in one line.
ChatGPT says, yes, you can park for up to one hour starting at 4 p.m.
Peter writes, I will never get a parking ticket again.
Now, the question comes up, of course: are there things
that ChatGPT with Vision can't do? Are there any areas where people have been disappointed?
I'm sure that we're going to hear a lot more about that once more people have their hands on it.
But for some initial thoughts, we turn to a blog post from Roboflow, by James Gallagher and
Piotr Skalski, all about their first impressions of GPT-4 Vision.
On some tests it did well, including visual question answering and object detection,
but it wasn't perfect at optical character recognition.
They posted a slightly blurry image of a tire and said,
read the serial number. Return only the number, no additional text.
They say GPT-4V was unable to correctly identify the serial number in an image of a tire. Some numbers were
correct, but there were several errors in the results from the model. When it came to CAPTCHAs, as they
write, we found that GPT-4V was able to identify that an image contained a CAPTCHA, but often
failed the test. In a traffic light example, GPT-4V missed some boxes that contained traffic lights.
There were also some mistakes on crosswords. They write, the model appeared to read the clues
correctly, but misinterpreted the structure of the board. As a result, the provided answers were
incorrect. The same limitation, they say, was exhibited in their Sudoku tests. Now, these may seem like
minor quibbles with such an impressive piece of technology, and they are. I present them only
to give a more robust picture and a reminder that as incredible as this is, it isn't perfect,
and there are still advances to be made. But overall, it is a huge update, and it makes sense
why many people inside OpenAI think that this is, in some ways, the biggest product launch they've
had since ChatGPT came out in the first place. Anyways, guys, I hope you are as excited as I am
about getting your hands on GPT-4 with Vision.
I know I will be eagerly refreshing
until the day it actually shows up.
Appreciate you listening as always,
and until next time, peace.
