The AI Daily Brief: Artificial Intelligence News and Analysis - Text-to-Image AI That Can Actually Spell!? Meet DeepFloyd IF

Starting point is 00:00:00 The AI Breakdown you're about to hear was originally released as a YouTube video on Monday, May 1st. In it, we discussed DeepFloyd IF, a new text-to-image generation model that can actually spell better than a preschooler. If you've followed along with things like Midjourney and DALL-E and Stable Diffusion, DeepFloyd IF is something you're going to want to check out. For a little while now, there has been buzz around something called DeepFloyd. You can see this image on your screen here from April 19th, with the prompt "letters made of clouds that says 'really soon' above beautiful ocean." And there is in fact an image of an ocean that has clouds sitting on top of it that say "really soon."

Starting point is 00:00:45 Now, if you've ever played with Midjourney or DALL-E or any of these text-to-image generators, one of the things that you know is that they have a very, very hard time with words, with spelling, even with getting the characters of a specific language right. You will often find a picture that looks photorealistic except for a billboard or a sign on the wall with absolute gibberish or nonsense. And it appears that DeepFloyd has been working on a solution to that. These are all teaser images that have come out.

Starting point is 00:01:17 Well, this is a new model from a team that's connected to the folks at Stability AI. Stability AI, of course, brought you Stable Diffusion and, more recently, StableLM plus StableVicuna. They're a very busy team, it seems. And Emad here, the CEO of Stability AI, writes, "DeepFloyd IF is the most advanced image gen model out there with an FID-30K score of 6.66, beating DALL-E 2, Imagen, Parti, and more. We'll get even better with optimizations and more to come." So this is a new text-to-image generation model. Stability AI just announced this on the 28th of April.
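For context on the FID-30K number in that quote: FID, the Fréchet Inception Distance, compares the statistics of features extracted from real images and from generated images, and lower is better. The sketch below is a simplified illustration, not the evaluation code behind the 6.66 figure. It assumes each feature dimension is an independent Gaussian (a diagonal covariance), whereas the real metric uses Inception-v3 features with full covariance matrices, computed over 30,000 samples in the FID-30K setting. The function name and the toy feature values are made up for illustration.

```python
from math import sqrt
from statistics import fmean, pvariance


def frechet_distance_diagonal(real, generated):
    """Squared Frechet distance between two feature sets.

    Assumption: every feature dimension is an independent Gaussian, so the
    covariance term reduces to (std_real - std_generated) ** 2 per dimension.
    """
    total = 0.0
    # zip(*rows) turns a list of feature vectors into per-dimension columns.
    for real_dim, gen_dim in zip(zip(*real), zip(*generated)):
        mean_gap = fmean(real_dim) - fmean(gen_dim)
        std_gap = sqrt(pvariance(real_dim)) - sqrt(pvariance(gen_dim))
        total += mean_gap ** 2 + std_gap ** 2
    return total


# Toy two-dimensional "features" for two images per set (hypothetical values).
real_features = [[0.0, 1.0], [2.0, 3.0]]
shifted_features = [[1.0, 1.0], [3.0, 3.0]]

print(frechet_distance_diagonal(real_features, real_features))     # 0.0
print(frechet_distance_diagonal(real_features, shifted_features))  # 1.0
```

Identical feature sets score zero, and the score grows as the generated features drift away from the real ones, which is why a lower FID is read as closer to real images.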

Starting point is 00:01:58 They write, Today Stability AI, together with its multimodal AI research lab DeepFloyd, announced the research release of DeepFloyd IF, a powerful text-to-image cascaded pixel diffusion model. DeepFloyd IF is a state-of-the-art text-to-image model released on a non-commercial, research-permissible license that provides an opportunity for research labs to examine and experiment with advanced text-to-image generation approaches.
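To unpack "cascaded pixel diffusion": a cascade generates a small image first and then passes it through further stages that each raise the resolution. The sketch below only illustrates that resolution flow. It rests on several assumptions: the 64, 256, and 1024 pixel sizes come from the public model description rather than from this episode, the function names are hypothetical, and a plain nearest-neighbor upsample stands in for each stage, which in the real model is a diffusion model conditioned on the text prompt.

```python
def upsample_nearest(image, factor):
    """Repeat every pixel `factor` times along both axes (stand-in for a super-resolution stage)."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def run_cascade(base_image, factors=(4, 4)):
    """Pass a low-resolution image through successive upscaling stages.

    Returns the final image and the side length after every stage.
    """
    image = base_image
    sizes = [len(image)]
    for factor in factors:
        image = upsample_nearest(image, factor)
        sizes.append(len(image))
    return image, sizes


# A blank 64x64 grid stands in for the output of the base stage.
base = [[0] * 64 for _ in range(64)]
final_image, sizes = run_cascade(base)
print(sizes)  # [64, 256, 1024]
```

The design idea is that the base stage only has to get composition and content right at low resolution, while the later stages concentrate on adding detail.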

Starting point is 00:02:21 In line with other Stability AI models, Stability AI intends to release a DeepFloyd IF model fully open source at a future date. Now, this goes through all of the features, and the first two have to do with text. Deep text prompt understanding: the generation pipeline utilizes the large language model T5-XXL as a text encoder. A significant amount of text-image cross-attention layers also provides better prompt and image alignment. Second bullet, application of text description into images: incorporating the intelligence of the T5 model,

Starting point is 00:02:52 DeepFloyd IF generates coherent and clear text alongside objects of different properties appearing in various spatial relations. Until now, these use cases have been challenging for most text-to-image models. Now, this article from 1b.a.i gets into a little bit more around what makes DeepFloyd IF different. They write, Beyond the fact that it's open source and it wouldn't fail the first round of an elementary school spelling bee, IF boasts better performance on nuances that other generative models

Starting point is 00:03:19 typically struggle with. Here we're talking about spatial awareness and composition. If you prompt some diffusion models with specific instructions about what objects are in front of what other objects or what material they're made of, they often struggle. This is especially true for complex prompts with multiple objects described with multiple adjectives. They can often be mixed up or sometimes ignored altogether. The researchers who trained IF used less style data than some other generative models, so if you're looking to create an anime Abe Lincoln, you'll want to look elsewhere.
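The text-image cross-attention layers mentioned in the feature list are the mechanism that lets each part of the image look at the words of the prompt. The sketch below is a toy, single-head version of scaled dot-product cross-attention under loud assumptions: there are no learned projection matrices, the vectors are two-dimensional, and the function names are invented for illustration. In the real model the text side would be T5-XXL embeddings and the image side would be features inside the diffusion network.

```python
from math import exp, sqrt


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (shifted by the max for numerical stability)."""
    peak = max(scores)
    exps = [exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def cross_attention(image_queries, text_keys, text_values):
    """Each image position (query) takes a weighted mix of the text token values."""
    dim = len(text_keys[0])
    outputs = []
    for query in image_queries:
        # Similarity between this image position and every text token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / sqrt(dim)
            for key in text_keys
        ]
        weights = softmax(scores)
        # Weighted sum of the text values, one output component at a time.
        mixed = [
            sum(weight * value[i] for weight, value in zip(weights, text_values))
            for i in range(len(text_values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Two hypothetical text tokens and two image positions.
text_keys = [[1.0, 0.0], [0.0, 1.0]]
text_values = [[1.0, 0.0], [0.0, 1.0]]
image_queries = [[10.0, 0.0], [0.0, 0.0]]

for mixed in cross_attention(image_queries, text_keys, text_values):
    print([round(component, 3) for component in mixed])
```

The first image position matches the first token strongly and so takes almost all of its value, while the second position has no preference and blends both tokens equally.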

Starting point is 00:03:48 Lastly, a lot of attention was paid when training IF to make it safe. Generative models have significant potential to create harmful or explicit content. The researchers here walked us through the laudable steps they took to remove racy or violent imagery from their training data. This is never a bad idea and doubly so for open source efforts. So they identify here a few different things that make this model interesting and potentially different. One is this idea of spatial awareness and composition, where you can instruct it using language around which objects are supposed to be in the background or the foreground, or basically just organizing objects in the picture. Now, in terms of how DeepFloyd IF was trained, there are a couple of interesting datasets that were included. One is called LAION, L-A-I-O-N, which is a

Starting point is 00:04:28 5-billion-sample dataset of image-text pairs. So this is going to help IF understand text in context and be able to produce that in a way that other models can't. They also trained on something called CLEVR, C-L-E-V-R, which has a lot of images that are basically useful for that spatial awareness and composition that we talked about. So for those of you listening, the example that they give is a shot that's a white background with a number of different geometric objects. There's a couple of cubes, a sphere, a cylinder, et cetera, et cetera. And this is a way to help the model understand more about that spatial and compositional nuance. But of course, what we're really interested in is what it can do in practice. And one of the things the team calls out is inpainting. So one of the people on the team

Starting point is 00:05:13 here shows a picture of kind of Abraham Lincoln crossed with Vincent van Gogh, with a hat appearing on their head. Another example is this image-to-image translation, where the same picture, in this case a woman in the foreground with sort of a river, mountain, forest scene in the background, switches between different styles. So one of the styles is sort of a paper cutout model. One of the styles is Legos. One of the styles is a little bit more anime. So basically taking the same image, but turning it into a set of different styles. Still, the thing that people are most excited about, beyond a shadow of a doubt, is the fact that this model can actually get text into images, which is just something we haven't had yet.

Starting point is 00:05:54 Javi Lopez here tweets, "New generative AI toy, DeepFloyd IF." His prompt is a neon sign of an American motel at night with the sign J-A-V-I-L-O-P. Sure enough, in the picture attached, there's a neon sign that says J-A-V-I-L-O-P motel, just like he asked it to. And so, of course, I wanted to give it a try. And so what I did is I gave the same prompt to Midjourney version 5 and to IF to see what would come out. The first prompt was a film camera photo

Starting point is 00:06:21 of a 1960s Southern California beachside burger restaurant, hazy afternoon. I love this sort of nostalgia stuff. So you can see this is what Midjourney came back with. Amazing image, I mean beautiful, captured exactly what I was going for from a vibe standpoint, but basically nonsense characters.

Starting point is 00:06:37 There's sort of a recognizable B, but then a letter that doesn't exist in any language that I'm familiar with. And that's pretty normal, right? This is pretty typical for what you see from Midjourney. Now, meanwhile, with the DeepFloyd IF version, the first thing that came back is perhaps a little less polished, although it certainly looks like it came from a film camera of that era, but it says B-U-R-R-G-E-R-R on the sign. So, "Burrgerr," a lot closer. It's legible text.

Starting point is 00:07:03 It's actually English characters, and it's close to the word that I was going for. Now, I gave it a little bit of extra help in the next prompt and actually told it to do the same thing but with a sign that said "Burger," and sure enough, it got it right that second time. So it's the same sort of hazy Southern California burger stand, but the sign on top of it says B-U-R-G-E-R. So that was test one. Test two, I wrote a punk-looking girl with green hair standing outside on Wall Street in New York City, holding a cardboard sign that says "buy Bitcoin." Again, here's the Midjourney one, super cool image, but the letters on the cardboard sign say B-B-B-D-Y-T-B-L-T-Y. So not exactly "buy Bitcoin," although at least they're English letters, right, in this case.

Starting point is 00:07:46 Back to DeepFloyd IF. Sure enough, a green-haired, punk-looking girl on a New York City-looking street with a sign that says "buy Bitcoin" in completely legible writing. Now, I'm sure most of you who are watching will notice that the face is kind of wonky, with one side sort of tilted and the eyes a little bit misshapen, and the right hand on this person has two sets of knuckles at least and a number of different fingers. So you're getting into some of the other challenges with this sort of text-to-image prompting, but the sign that says "buy Bitcoin," at least, is really clear. My third prompt was a humanoid robot typing on a computer in a futuristic lab with a neon sign on the wall that says "The AI Breakdown." Now, admittedly, this one was a little bit tougher

Starting point is 00:08:29 because it was more complex. Midjourney just ignored the sign entirely, very cool image of a robot in a futuristic-looking lab. But meanwhile, DeepFloyd IF gave it a real go. You have the backside of a robot looking at a computer screen that says "The AI," B-E-A-O-O-O-O-W, and then it fades off the screen. So a good college try, right? I tried to narrow that prompt down with a neon sign on a brick wall that says "The AI Breakdown" in glowing pink letters. Let's see if it can produce some marketing material, right? Well, this is Midjourney, no recognizable letters, really, kind of an R, I guess, and an A and sort of an L in there, but it says Ravilfian or something. Cool-looking image, but obviously not getting what we were going for. Meanwhile, DeepFloyd kind of came close, but not quite again.

Starting point is 00:09:19 You see on the sign here that a T and an H are combined in the V, and it got the AI, but then in terms of the "Breakdown," it says B-O-T-K-L-O-W, so not exactly clear. And I wonder to what extent this has to do with the fact that "The AI Breakdown" is such a specific custom thing. It's not drawing from lots of examples of the word, as with "burger" in my previous example, or even "Bitcoin." So the point here is that even though this model is hugely improved in terms of its ability to output text, it is not perfect. But I think the conclusion of this, and what makes DeepFloyd IF so exciting, is that it's easy, when you see something like Midjourney V5, to get lost in how excellent it is and sort of ignore all the areas that it has left to develop. But it's been a year

Starting point is 00:10:07 of these models existing in a really usable way. And the fact that this new DeepFloyd IF model is showing such huge progress on the text question means that you have to think that this ability to put text into images as a part of text-to-image generators has to be around the corner for many more models than just DeepFloyd IF. David Vorick sums up the week here, saying, "DeepFloyd has been released. RLHF Vicuna has been released, and a new self-learning AI called WizardLM has been published. Open source AI is doing well this week."

Starting point is 00:10:37 If you want a little preview of things that I'm going to talk about on the AI Breakdown this week, a lot of these are on the list. Anyways, guys, that's it for today. Hope you enjoyed this conversation. Go check out DeepFloyd IF. It's available on Hugging Face. I'll include a link in the show notes or the video description if you're watching it. Until next time, peace.
