The AI Daily Brief: Artificial Intelligence News and Analysis - The 7 Use Case Categories of ChatGPT-4 Vision

Starting point is 00:00:00 Today on the AI Breakdown, we're looking at a framework of seven categories of use cases for the incredibly powerful new ChatGPT-4 Vision. The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Go to breakdown.network for more information about our YouTube, our Discord, and our newsletter. Welcome back to the AI Breakdown. One of the things that I feel very strongly about when it comes to how to learn AI tools is that there are two incredibly important things. The first is simply to see what other people are doing with the tools that you're interested in using. The use cases and case studies that people come up with are pretty much the best way to get a sense of what a tool can do. However, all of that is useless unless you actually try it yourself.

Starting point is 00:00:50 Now, within that, I also find that it's incredibly valuable to have frameworks that help organize one's thoughts about different categories of ways to use a tool. And so I was incredibly excited to see that Greg Kamradt, who is an excellent YouTuber and content creator in this space, put together a framework of seven different categories of use for GPT-4 Vision. He breaks the use cases down into describe, interpret, recommend, convert, extract, assist, and evaluate. So what I'm going to do in today's episode is my own example of each of these categories of use to give you a sense of what's possible. I'll also share the subcategories, where relevant, that Greg has come up with. First up, we have describe, and for this I posted a picture of me holding my son up to pick apples, which was taken a few weeks ago at an apple orchard in New York.

Starting point is 00:01:38 I said simply, "What's happening in this image?" ChatGPT responds, "In the image there is a man lifting a young child towards an apple tree. The child, wearing a hat and dressed in a plaid shirt and shorts, is reaching out to pick or touch one of the apples on the tree. The man, wearing a cap and a plaid shirt, has a tattoo on his arm and is holding a tote bag with the text Rose Hill Farm and a rose design on it. They are surrounded by apple trees filled with ripe apples and some other green plants. The setting appears to be an apple orchard."

Starting point is 00:02:02 All, of course, correct, but I decided to challenge it a little further and said, "Any guesses as to what time of year this is or what part of the USA it might have been taken in?" ChatGPT guessed that because apple picking season is in the fall, it was probably in the fall. Reinforcing that was the fact that we were wearing long sleeves, but not jackets. And when it came to location, it was a little less sure, suggesting only that New England and the Pacific Northwest were some places where apples often grew. Now, this is sort of the layer one use case of this. Interestingly, it's the one that many demos point to, but it's probably the one

Starting point is 00:02:28 that people will use the least in practice because they can already identify what's going on in an image themselves. Next up, we have interpret. Greg sums that up as explaining the meaning or providing more context. He points to subcategories including medical interpretation, technical interpretation of diagrams and schema, image analysis, content analysis, artistic interpretation, and data interpretation. So for this, I did a couple tests. One of them is a sort of complex slide about the EU AI Act's risk-based approach to artificial intelligence regulation. Basically, the EU has determined that different types of uses require different types of regulation, and that's based on how risky they are. They often, as in this case, use a pyramid image to show that, with the least risky

Starting point is 00:03:10 examples on the bottom and the most risky examples on the top. Now, I don't think this one makes sense to read all the way through, but it does a good job extracting the key information. It understands the pyramid of risks, it's able to contextualize it with the language around it, and I was actually trying to find an even more complex version of this chart because I've seen some really powerful examples of ChatGPT Vision being able to extract information from incredibly information-dense imagery. But I wanted to do one more interpret test, and so I basically took a describe-type prompt, but then asked ChatGPT with Vision to provide larger context. So I copied in the famous Pablo Picasso painting Guernica and asked, "Can you please provide the societal context for this work of art?"

Starting point is 00:03:46 ChatGPT nailed it. "The artwork you've shared," it wrote, "is Guernica by Pablo Picasso. Painted in 1937, it is one of Picasso's most famous works and stands as a powerful political statement as well as an artistic achievement." Now, it then goes through what the context of the painting was and what inspired it. It talks about various interpretations of the art, and it talks about the world into which it came. Now, I think here you can see how, going beyond just description, this can be a really useful tool, especially for an educational use case. Being able to drop in images of art and get all of that information is a really interesting and useful way to collect information. Now, what about the next category, recommend?

Starting point is 00:04:20 As Greg puts it, that means to offer critiques or suggest changes. The two subcategories were critiques and feedback and recommended actions, and so I fed in a Midjourney-created two-by-two grid of circular, white-background symbols for a podcast focused on artificial intelligence. For context, I said, "I have a podcast focused on AI. I'm looking for a symbol to use on the cover art of the show. Imagine you're a listener. Can you weigh the pros and cons of these options and make a recommendation of one to use?" Now, of course, if you're listening, not watching, this will be a little bit more difficult, but I'll just do a brief review so you get a sense of how ChatGPT was quote-unquote thinking. First of all, it gave each of them a name based on the core visual identity.

Starting point is 00:04:54 The first, for example, it called robot with circuitry background, which is pretty much exactly what it looks like. The pros, it said, were: clearly represents AI and robotics. The circuitry design implies complexity and intricacy. Friendly appearance can make the podcast seem approachable. Cons: robots are a somewhat cliched representation of AI. Also might give the impression that the podcast is mainly about robotics rather than broader AI topics. Now, I think those are super strong critiques, and so I was fairly impressed with that one. The next option, it called circular circuitry with eye, and it sort of does look like an eye in the middle of a circuit board. The pros, it said, were: the eye can symbolize AI's ability to see, analyze, and understand.

Starting point is 00:05:28 Circuitry in a circular design is aesthetically pleasing and symbolizes interconnectedness. Looks modern and might appeal to a tech-savvy audience. Cons: might be interpreted as surveillance, which could have negative connotations. Also a con: less explicit in its representation of AI. Honestly, this is one of the categories where I thought it did the best job relative to what I had expected. Now, when it came to what recommendations it would make, it hedged a little bit and gave two. It said if you're covering a broad range of topics and you want a general audience with a

Starting point is 00:05:52 friendly and approachable image, go with the robot with circuitry. If it's more tech-oriented and you're looking for a modern, sophisticated look, go with circular circuitry with eye. This is definitely a use case where, after having done this experiment, I can see myself using it more. Now we move to a really fun category, the category that Greg calls convert: convert images into other forms or generate something new. His subcategories are design to implementation, media to text, visual to narrative, and visual to graphic. For this, I dropped in a very, very quickly hand-scratched diagram of an idea for a fake AI education company. I had lines pointing out from a top bubble that said AI education company. I had lines coming out that said online

Starting point is 00:06:28 courses, mobile app, case studies, and community. I asked, "Could you please make a graphic from this quick sketch?" Now, I kind of expected it to kick back something with DALL-E 3, but what it actually came back with was an ASCII schematic. It said, "For a proper graphic design, you need to use a graphic design software or services, but I hope this gives a clear idea based on the sketch you shared." But to try to get something out of this conversion, I asked it to then, from there, write up a quick description based on it that I could send to prospective investors, which it did, and then from there, recommend a simple landing page site map based on that, which it once again did: homepage, header, hero section, about us, features. And then finally, I asked it to write the code for that landing page,

Starting point is 00:07:03 and once again, it did that. And so I think what's powerful here is that in the course of probably three or four total minutes, I went from a very chicken-scratch handwritten thing to the code for a website, which would be at least close to an approximation of what that chicken-scratch drawing was representing in the first place. Next up, another really fun category, extract. Greg describes this as extract entities within the image or provide structured output. His subcategories were handwriting extraction, formal document extraction, and qualitative extraction.

Starting point is 00:07:30 Now, it being spooky season and me being a history buff, I posted in a fragment of a letter from Cotton Mather around the Salem witch trials. I asked, "Could you please transcribe the words written here? Do you know anything about the context of the note based on that?" Now, this was a particularly hard document because, in addition to just very old 1691 writing, it was also on double-sided paper where the ink bled through, so I think that made it extra hard for GPT-4 Vision. It was able to get around 50%, I would say, of the lines,

Starting point is 00:07:56 which frankly was not necessarily enough to really understand. Although I certainly found that with the 50% that it was able to interpret, it made it a little bit easier to guess the in-betweens. What it did get was the date and the person: "September 2nd, 1691," and "your sincere and most humble servant, C. Mather." From there, it said it appears to be a letter or note from the late 17th century, and the signature at the bottom suggests it was written by C. Mather, likely referring to Cotton Mather, a prominent New England Puritan minister, prolific author, and pamphleteer. He is often remembered for his role in the Salem Witch Trials.

Starting point is 00:08:27 The context of this note, given its date and the content, could relate to a wide range of issues Mather was involved in, but without more specifics, it's hard to pinpoint its exact context. This obviously wasn't a slam dunk, but then again, the text didn't really make explicit mention of the witch trials, so all it really had to go on was that date. I think it was still pretty strong for it to identify correctly who the author was, and it definitely feels to me like this is a very promising use case for anyone who's working with these sorts of old documents. The sixth and penultimate category that Greg lists is assist: offer solutions based on the image. This could include task solution and solver, explanatory assistance, or strategy recommendations. Remember, when GPT-4 Vision premiered,

Starting point is 00:09:02 one of the examples they gave was someone who was trying to assemble their bike, and who was able to take pictures and then ask GPT-4 Vision if they were putting parts in the right spot. For this, I took an image of an old NES, the very first generation of Nintendo, and I asked what I would need to install this on a modern TV. Now, I actually like these old ancient systems better than new ones, so we have a bunch of them kicking around here: an original Nintendo, a Super Nintendo, and of course an N64 for GoldenEye. I will also tell you that I have frequently not bought all the correct pieces that I need to get it to run on my TV and had to race to Amazon to figure out what I actually needed. On the other hand, when I put this image into ChatGPT, it gave me the full list of exactly what I'd need. An AV to

Starting point is 00:09:41 HDMI converter, an HDMI cable, the power supply, the TV with HDMI input, the controllers, the game cartridge, and, optionally, an RF to coaxial adapter. It also, without me asking, gave a set of instructions for setting it up. I tend to think, especially on mobile, this type of use case is going to be the default for many, many people, and the thing that has them pull out their app to use it more than anything else. The ability to take a picture in the real world of something that you need help with, and get that help, is where ChatGPT could start to replace calling your tech-savvy friend or whatever else you would have done in the past. Greg's last category is evaluate: subjective judgment based on the image. He includes as subcategories

Starting point is 00:10:17 aesthetic evaluation, subject evaluation, and accuracy check. Now, I wasn't exactly sure how it was going to handle this. Subjective interpretation is, of course, subjective. And so what I did is create a landscape painting on Midjourney, and I asked it simply, "How would you improve this landscape painting?" Now, part of why I thought this was an interesting question is that it is so inherently a subjective question: what does improve mean when improvement is in the eye of the beholder? But what it did is actually give a really technical explanation, not of what could make it better, per se, but of what I could do if there were different goals that I had that I might want to accentuate. For example, it suggested depth and atmosphere changes.

Starting point is 00:10:54 Increase the atmospheric perspective by slightly fading out the background mountains and trees to enhance the feeling of depth. This can be done by adding a light blue or gray wash to distant objects. Contrast: boosting the contrast in certain areas can help draw the viewer's eye. For instance, emphasizing the highlights on the water where sunlight might hit it can give the illusion of sparkling, moving water. And so on and so forth. It went through a ton of different suggestions: warmth, texture, wildlife, reflections, variety of vegetation, sky, even narrative elements, under which it wrote, "Introducing a small element that tells a story, like a lone cabin, a bridge, or a boat, can give viewers more to engage with." And what's cool and I think reflective of

Starting point is 00:11:28 this, is that basically these say, if you are trying to achieve X, if X would be an improvement in your estimation, here's what you can do to actually achieve that. That's a type of subjective analysis that's actually incredibly valuable. And my guess is that, for many artists or writers, it's going to be more valuable than the type of feedback that they're normally used to getting from other people, which is, "I would like it better if." So we'll wrap there. Those are the seven use cases. Again: describe, interpret, recommend, convert, extract, assist, and evaluate.

Starting point is 00:11:56 Let me know if you've found other use cases for GPT-4 Vision that weren't represented in here. I would love to see what you are working on. Share examples. Come to the Discord and do it. We're at bit.ly slash AI breakdown. And have fun creating. Until next time, guys. Peace.
