The AI Daily Brief: Artificial Intelligence News and Analysis - The 7 Use Case Categories of ChatGPT-4 Vision
Episode Date: October 15, 2023
ChatGPT-4 Vision is one of the biggest AI product updates of the last few months and people are still just exploring all the ways it can be used. NLW explores a recent framework from Greg Kamradt for the 7 categories of use for GPT-4V, including: Describe, Interpret, Recommend, Convert, Extract, Assist, Evaluate. Read more: https://twitter.com/GregKamradt/status/1711772496159252981 TAKE OUR SURVEY ON EDUCATIONAL AND LEARNING RESOURCE CONTENT: https://bit.ly/aibreakdownsurvey ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/
Transcript
Today on the AI Breakdown, we're looking at a framework of seven categories of use cases for the incredibly powerful new ChatGPT-4 Vision.
The AI Breakdown is a daily podcast and video about the most important news and discussions in AI.
Go to breakdown.network for more information about our YouTube, our Discord, and our newsletter.
Welcome back to the AI Breakdown.
One of the things that I feel very strongly about when it comes to how to learn AI tools is that there are two incredibly important things.
The first is simply to see what other people are doing with the tools that you're interested in using.
The use cases and case studies that people come up with are pretty much the best way to get a sense of what a tool can do.
However, all of that is useless unless you actually try it yourself.
Now, within that, I also find that it's incredibly valuable to have frameworks that help organize one's thoughts about different categories of ways to use a tool.
And so I was incredibly excited to see when Greg Kamradt, who is an excellent YouTuber and content creator in this
space, put together a framework of seven different categories of use for GPT-4 Vision. He breaks the
use cases down into describe, interpret, recommend, convert, extract, assist, and evaluate. So what I'm
going to do in today's episode is share my own example of each of these categories of use to give you a
sense of what's possible. I'll also share the subcategories where relevant that Greg has come up with.
First up, we have Describe, and for this I posted a picture of me holding my son up to pick apples,
which was taken a few weeks ago at an apple orchard in New York.
I said simply what's happening in this image.
ChatGPT responds, in the image there is a man lifting a young child towards an apple tree.
The child wearing a hat and dressed in a plaid shirt and shorts is reaching out to pick or touch
one of the apples on the tree.
The man wearing a cap and a plaid shirt has a tattoo on his arm and is holding a tote bag
with the text Rose Hill Farm and a rose design on it.
They are surrounded by apple trees filled with ripe apples and some other green plants.
The setting appears to be an apple orchard.
All of course correct, but I decided to challenge it a little further and said any guesses as to
what time of year this is or what part of the USA it might have been taken in.
ChatGPT guessed that because apple-picking season is in the fall, it was probably in the fall.
Reinforcing that was the fact that we were wearing long sleeves, but not jackets.
And when it came to location, it was a little less sure, suggesting only that New England
and the Pacific Northwest were some places where apples often grew.
Now, this is sort of the layer one use case of this.
Interestingly, it's the one that many demos point to, but it's probably the one
that people will use the least in practice because they can already identify what's going on in an
image themselves. Next up, we have interpret. Greg sums that up as explaining the meaning or providing
more context. He points to subcategories including medical interpretation, technical interpretation of
diagrams and schema, image analysis, content analysis, artistic interpretation, and data
interpretation. So for this, I did a couple tests. One of them is a sort of complex slide about
the EU AI Act's risk-based approach to artificial intelligence regulation. Basically, the EU has
determined that different types of uses require different types of regulation, and that's based on how
risky they are. They often, as in this case, use a pyramid image to show that, with the least risky
examples on the bottom and the most risky examples on the top. Now, I don't think this one makes sense
to read all the way through, but it does a good job extracting the key information. It understands the
pyramid of risks, it's able to contextualize it with the language around it, and I was actually
trying to find an even more complex version of this chart because I've seen some really powerful
examples of ChatGPT Vision being able to extract information from what is incredibly information-dense imagery.
But I wanted to do one more interpret test, and so I basically took a describe-type prompt,
but then asked ChatGPT with Vision to provide larger context. So I copied in the famous Pablo
Picasso painting Guernica and asked, can you please provide the societal context for this work of art?
ChatGPT nailed it. The artwork you've shared, it wrote, is Guernica by Pablo Picasso.
Painted in 1937, it is one of Picasso's most famous works and stands as a powerful political
statement as well as an artistic achievement. Now, it then goes through what the context was
of the painting and what inspired it. It talks about various interpretations of the art,
and it talks about the world into which it came. Now, I think here you can see how,
going beyond just description, this can be a really useful tool, especially for an educational
use case. Being able to drop in images of art and get all of that information is a really
interesting and useful way to collect information. Now, what about the next category, recommend,
or as Greg puts it, to offer critiques or suggest changes? The two subcategories were critiques and feedback
and recommended actions, and so I fed in a Midjourney-created two-by-two grid of circular white
background symbols for a podcast focused on artificial intelligence. For context, I said, I have a podcast
focused on AI. I'm looking for a symbol to use on the cover art of the show. Imagine you're a
listener. Can you weigh the pros and cons of these options and make a recommendation of one to use?
Now, of course, if you are listening, not watching, this will be a little bit more difficult,
but I'll just do a brief review so you get a sense of how ChatGPT was quote-unquote thinking.
First of all, it gave each of them a name based on the core visual identity.
The first, for example, it called robot with circuitry background, which is pretty much exactly
what it looks like. The pros, it said, were clearly represents AI and robotics. The circuitry design
implies complexity and intricacy. Friendly appearance can make the podcast seem approachable.
Cons, robots are a somewhat cliched representation of AI. Also might give the impression
that the podcast is mainly about robotics rather than broader AI topics. Now, I think those
are super strong critiques, and so I was fairly impressed with that one. The next option, it called
circular circuitry with eye, and it sort of does look like an eye in the middle of a circuit board.
The pros, it said, were the eye can symbolize AI's ability to see, analyze, and understand.
Circuitry in a circular design is aesthetically pleasing and symbolizes interconnectedness.
Looks modern and might appeal to a tech-savvy audience.
Cons might be interpreted as surveillance which could have negative connotations.
Also a con less explicit in its representation of AI.
Honestly, this is one of the categories that I thought it did the best job relative to what
I had expected.
Now, when it came to what recommendations it would make, it hedged a little bit and gave
two. It said if you're covering a broad range of topics and you want a general audience with a
friendly and approachable image, go with the robot with circuitry. If it's more tech-oriented
and you're looking for a modern, sophisticated look, go with circular circuitry with eye. This is
definitely a use case where, after having done this experiment, I can see myself using it more.
Now we move to a really fun category, the category that Greg calls convert, convert images
into other forms or generate something new. His subcategories are design to implementation,
media to text, visual to narrative, visual to graphic. For this, I dropped in a very, very
quickly hand-sketched diagram of an idea for a fake AI education company. I had lines pointing
out from a top bubble that said AI education company. I had lines coming out that said online
courses, mobile app, case studies, and community. I asked, could you please make a graphic from this
quick sketch? Now, I kind of expected it to kick back something with DALL-E 3, but what it actually
came back with was an SQL schematic. It said for a proper graphic design, you need to use a graphic
design software or services, but I hope this gives a clear idea based on the sketch you shared.
But to try to get something out of this conversion, I asked it to then from there, write up a quick
description based on it that I could send to prospective investors, which it did, and then from there,
recommend a simple landing page site map based on that, which it once again did: homepage, header,
hero section, about us, features. And then finally, I asked it to write the code for that landing page,
and once again, it did that.
And so I think what's powerful here is that in the course of probably three or four total
minutes, I went from a very chicken scratch handwritten thing to the code for a website, which
would be at least close to an approximation of what that chicken scratch drawing was representing
in the first place.
Next up, another really fun category, extract.
Greg describes this as extract entities within the image or provide structured output.
His subcategories were handwriting extraction, formal document extraction, and qualitative extraction.
Now, it being spooky season and me being a history buff, I posted in a fragment of a letter
from Cotton Mather from around the time of the Salem Witch Trials.
I asked, could you please transcribe the words written here?
Do you know anything about the context of the note based on that?
Now, this was a particularly hard document because, in addition to just very old 1691 writing,
it was also on double-sided paper where the ink bled through,
so I think that made it extra hard for GPT-4 Vision.
It was able to get around 50%, I would say, of the lines,
which frankly was not necessarily enough to really understand.
Although I certainly found that with the 50% that it was able to interpret,
it made it a little bit easier to guess the in-betweens.
What it did get was the date and the person, September 2nd, 1691,
your sincere and most humble servant, C. Mather. From there, it said it appears to be a letter or
note from the late 17th century, and the signature at the bottom suggests it was written by
C. Mather, likely referring to Cotton Mather, a prominent New England Puritan minister, prolific
author, and pamphleteer. He is often remembered for his role in the Salem Witch Trials.
The context of this note, given its date and the content, could relate to a wide range of
issues Mather was involved in, but without more specifics, it's hard to pinpoint its exact context.
This obviously wasn't a slam dunk, but then again, the text didn't really make explicit mention
of the witch trials, so all it really had to go on was that date. I think it was still pretty strong for it to
identify correctly who the author was, and it definitely feels to me like this is a very promising use
case for anyone who's working with these sort of old documents. The sixth and penultimate category
that Greg lists is assist: offer solutions based on the image. This could include task solution
and solver, explanatory assistance, or strategy recommendations. Remember when GPT-4 Vision premiered,
one of the examples they gave was someone who was trying to assemble their bike, and who was able to take
pictures and then ask GPT-4 Vision if they were putting parts in the right spot. For this, I took an image
of an old NES, the very first generation of Nintendo, and I asked what I would need to install this on a
modern TV. Now, I actually like these old ancient systems better than new ones, so we have a bunch
of them kicking around here, an original Nintendo, a Super Nintendo, and of course an N64 for GoldenEye.
I will also tell you that I have frequently not bought all the correct pieces that I need to
get it to run on my TV and had to race to Amazon to figure out what I actually needed. On the other hand,
when I put this image into ChatGPT, it gave me the full list of exactly what I'd need: an AV-to-
HDMI converter, an HDMI cable, the power supply, the TV with HDMI input, the controllers,
the game cartridge, and optionally, an RF-to-coaxial adapter. It also, without me asking, gave
a set of instructions for setting it up. I tend to think, especially on mobile, this type of use
case is going to be the default for many, many people, and the thing that has them pull out
their app to use it more than anything else. The ability to take a picture in the real world
of something that you need help with and get that help is
where ChatGPT could start to replace calling your tech-savvy friend or whatever else you would have done in the
past. Greg's last category is evaluate, subjective judgment based on the image. He includes as subcategories
aesthetic evaluation, subject evaluation, and accuracy check. Now, I wasn't exactly sure how it was going
to handle this. Subjective interpretation is, of course, subjective. And so what I did is create a
landscape painting on Midjourney, and I asked it simply, how would you improve this landscape painting?
Now, part of why I thought this was an interesting question is that it is so inherently a subjective question,
what does improve mean when improvement is in the eye of the beholder?
But what it did is actually give a really technical explanation, not of what could make it better, per se,
but of what I could do if there were different goals that I had that I might want to accentuate.
For example, it suggested depth and atmosphere changes.
Increase the atmospheric perspective by slightly fading out the background mountains and trees to enhance the feeling of depth.
This can be done by adding a light blue or gray wash to distant objects.
Contrast, boosting the contrast in certain areas can help draw the viewer's eye.
For instance, emphasizing the highlights on the water where sunlight might hit it can give
the illusion of sparkling, moving water. And so on and so forth, it went through a ton of different
suggestions, warmth, texture, wildlife, reflections, variety of vegetation, sky, even narrative elements,
under which it wrote introducing a small element that tells a story, like a lone cabin,
a bridge, or a boat, can give viewers more to engage with. And what's cool and I think reflective of
this, is that basically these say, if you are trying to achieve X,
if X would be an improvement in your estimation, here's what you can do to actually achieve that.
That's a type of subjective analysis that's actually incredibly valuable.
And my guess is that for many artists or writers it's going to be more valuable than the type of feedback
that they're normally used to getting from other people, which is I would like it better if.
So we'll wrap there.
Those are the seven use cases.
Again, describe, interpret, recommend, convert, extract, assist, and evaluate.
Let me know if you've found other use cases for GPT-4 Vision that weren't represented in here.
I would love to see what you are working on.
Share examples.
Come to the Discord and do it.
We're at bit.ly/aibreakdown, and have fun creating.
Until next time, guys.
Peace.
