The AI Daily Brief: Artificial Intelligence News and Analysis - AI that Can See the World? Meet MiniGPT-4 an Open Source Image-to-Text Model

Episode Date: April 19, 2023

We've had many examples of text-to-image but fewer AI models that can interpret images. MiniGPT-4 is a new open source model that can look at an image of food and give you the recipe, look at a whiteboard mockup of a website and give you the code, look at a picture of a person and their dog at sunset and write a poem. Subscribe to the YouTube channel here: https://www.youtube.com/@TheAIBreakdown

Transcript
Starting point is 00:00:00 The episode you're about to hear was first released as a YouTube video on Wednesday, April 19th. This AI Breakdown is all about MiniGPT-4, new open source software just debuted by researchers that can take an image and actually describe it in words. And more than just describe it, it can infer things from it, turn it into working code, or even write a poem about it. Over the last six months, we've had our minds blown about AI's ability to take words and turn them into images. But what if it could go in the reverse order, where AI could take images and turn them into words? What if AI could look at food and turn it into a recipe?
Starting point is 00:00:47 What if it could look at an image of a whiteboard and turn it into working code? That's been on the table for companies and projects like OpenAI for a while. But this week we got an open source project called MiniGPT-4, just research at this stage, which shows exactly that happening in practice. So in this video, we're going to talk about MiniGPT-4, where it comes from, what it does differently, and then look at some demos to see if it actually works. MiniGPT-4 is a research project that was just announced a couple of days ago.
Starting point is 00:01:22 The researchers call it "Enhancing Vision-Language Understanding with Advanced Large Language Models." And effectively, as I mentioned in the intro, what they're trying to do is to allow AI to work in the opposite direction of tools like Midjourney, to be able to see images and then describe them, interact with them, generate information on the basis of them. Now, they've done a few things differently, which are a little bit more technical than I want to get into in the context of this video, but suffice it to say that they've trained on a smaller data set, but more, and that's producing some really interesting results. For those of you who are interested in the technicals, let me know, and maybe we'll do a deeper video on some of that
Starting point is 00:02:01 aspect of the research, but for now I want to focus on what the use cases are and what's coming out of this. Nate Chan here really sums it up. Ask questions about pictures, he says. What is this? Tell me what's wrong with my plant. Give me a recipe for this dish. Write the code for this rough mock-up. Now these all come from the MiniGPT-4 demos that they show, and so I'll walk you through these kind of quickly. They show identifying problems from photos and providing solutions. So this is an image of a plant that looks sick. The person says, what is the issue with my plant? What should I do now?
Starting point is 00:02:37 MiniGPT-4 says the image shows a leaf with brown spots on it. They're likely caused by a fungal infection which can be treated with fungicide. It then gives a set of steps to follow to treat the plant. The next example they show is a product advertisement. The prompter says, we design new mugs as shown in the image. Can you help us write an ad? MiniGPT-4 says, sure. Introducing our new line of mugs,
Starting point is 00:02:59 featuring adorable cat designs. These mugs are perfect for coffee lovers and cat enthusiasts alike. Another use case, which I think many of us Pinteresters will be very excited about, is food recipe generation. The prompter drops in a set of beautiful-looking lobster tails that are stuffed and says, how should I make something like this? MiniGPT-4 figures out what it is and suggests both ingredients as well as the steps to the recipe. This one could change how people build websites, right?
Starting point is 00:03:28 Generating website code from handwritten text, and then the rendered website. So this is a whiteboard that says a joke website, content of joke one, punchline one, and the prompter asks to turn this into HTML and JavaScript, which MiniGPT-4 then does. There are even more creative use cases that the researchers have stumbled upon. So this is an example where the prompter says, can you write me a beautiful poem about this image? It's a dog sitting with presumably its owner on a hill during a beautiful sunset. MiniGPT-4 said, sure, here is a poem about the image you provided.
Starting point is 00:04:04 A man sits on a hill with his dog by his side. The sun sets behind them casting a golden light. The city below is bustling, but they are at peace, enjoying the moment in each other's embrace. Now, we can get into whether it's a good poem or not, but it's a poem, and that's pretty remarkable. Then it can also, and this is the last example they use, discover unusual content. So the prompter posts an image of a cactus sitting on ice in a lake. It seems clearly unreal, and the prompter says, describe the image. MiniGPT-4 says the image shows a cactus plant standing in the middle of a frozen lake and then goes on to describe other aspects of it.
Starting point is 00:04:41 When the prompter says, can this happen in the real world? MiniGPT-4 says, no, this image is not common in the real world. While it is possible for cacti to grow in cold climates, it is rare to see one standing in the middle of a frozen lake. So these are the demos they presented, but really the question of course is what can people repeat or do? And this isn't a full-fledged release. This is just a demo release, so that should be kept in mind. But the initial responses from people have been pretty impressed. Simon Willison says MiniGPT-4 is pretty astonishing. An AI chatbot you can use to ask questions about an image,
Starting point is 00:05:15 a feature that's been promised but not yet shipped by GPT-4. So in his post, he shows an image of an ice cream that's in a heart shape with sprinkles on top, sitting on a black napkin on some sort of table. He says to MiniGPT-4, what is this thing? MiniGPT-4 responds, this is a pink ice cream cone with sprinkles on top. It is sitting on a black napkin on a table. Another person who tried this out was Bro Ski on Twitter. They say, so I tried out MiniGPT-4 and used an image of Lionel Messi and asked the LLM which soccer player it is on the image and for what soccer club he plays for on the image because Messi right now plays for PSG. The results are astonishing. Basically, MiniGPT-4 was able to correctly identify Messi and what team he was playing
Starting point is 00:05:59 for in that image based on his jersey. Bro Ski also had another test saying, so I tried out MiniGPT-4 again and asked it to describe this Midjourney image I made a while back. This might be the perfect way to do reverse image prompting for Midjourney. Now, of course, Midjourney actually just released a similar feature, which they called Describe. So using the describe command on Midjourney, you can now ask Midjourney to figure out how it would prompt to get the result of a particular image. It's a really great way to reverse engineer prompts that you like or images that you're impressed with. It's a great learning tool, and I think it's probably going to build into Midjourney workflows going forward. But this is another
Starting point is 00:06:37 example of a similar type of technology. Now, from a technical perspective, Shiva Kintali here says, forget about LLMs. I'm suggesting a new name, MMLMs: mighty might language models. Remember the new mantra, train a smaller model for a longer time. The latest breakthrough is MiniGPT-4. So Shiva here is saying that this is a different category. It's a new approach to training that involves a smaller model for a longer time. But all of this, of course, made me want to try it out. And like I said, this is just a demo, so we have no idea if it will be as performant as the tests we saw or the tests that they suggested in their demo, but let's try it out. I've got here a picture of the David, and we're going to
Starting point is 00:07:20 upload it and start the chat. So let's ask it, what is this and where is it? Let's see how long it takes to figure it out. This is a statue of David, a biblical figure known for his victory over Goliath. The statue is located in the Accademia Gallery in Florence, Italy. So it obviously got it right. Now let's try one of those creative responses. I have no idea if we can get it to write a poem, but let's see. What type of emotion does this artwork inspire in visitors who come to see it? I have no idea how it'll handle that question because it's so subjective. The statue of David in the Accademia Gallery in Florence is likely to inspire feelings of awe and admiration in visitors.
Starting point is 00:08:03 The statue is a masterpiece of Renaissance art and is considered one of the greatest achievements of the humanist movement. The depiction of David as a strong, confident, and heroic figure is likely to evoke feelings of pride and inspiration in visitors. The statue is also likely to inspire feelings of wonder and amazement at the skill and artistry of the sculptor, Michelangelo. Pretty impressive. Now, let's do one more. Can you write a poem about this sculpture that captures some of those emotions? In the halls of fame, a statue stands tall. The David of Michelangelo, a work of art, the marble form, so lifelike and grand, inspires awe and admiration, a sense of pride, a symbol of strength and courage, a heroic guide.
Starting point is 00:08:47 The mastery of the sculptor, a work of art, skill and craftsmanship, a sight to behold. In the Accademia Gallery, a place of fame. The statue, the David statue stands, a source of inspiration and awe. A symbol of humanism, a work of Renaissance art, a testament to the power of the human spirit. In the statue of David, we see the strength and courage, a source of inspiration, a symbol of Michelangelo's enduring power. Pretty cool stuff, I have to say, but let's see if we can replicate one of the other demos where we post a picture of food and see if it can come up with a recipe. So let's upload and write, how would I make what's in this photo? This is a Greek salad with tomatoes, cucumbers, red onion, feta cheese, and olives in a white bowl. It is topped with a
Starting point is 00:09:29 vinaigrette dressing. So it's a pretty good summation of what we saw in that demo. Anyways, I think we could get a lot more into these, but let's do a quick summary. I think why people are so excited about this is a couple parts. First of all, it works. You can see from that demo, even though it was very nascent and simple and this isn't really a full production model or anything like that, that we were able to replicate some of the demos that the researchers had presented, which is really cool. Second, I think people are really excited about the fact that this is a feature that companies like OpenAI have talked about for a while, but it's being shown in practice by researchers doing it in a slightly different way and in an open source way.
Starting point is 00:10:10 Yesterday we talked about whether open source can actually compete with closed source projects, and when we see things like this, it makes it seem a heck of a lot more viable. Anyways, MiniGPT-4 is definitely the type of thing that feels like it will be commonplace going into the future and something that we just expect to be part of what we can do, even though it was unimaginable so recently. Anyways, that's my sum up of MiniGPT-4. I will certainly keep you guys posted with any new use cases or experiments that I see. But for now, thanks for watching. Until next time, peace.
