The AI Daily Brief: Artificial Intelligence News and Analysis - Text-to-Image AI That Can Actually Spell!? Meet DeepFloyd IF
Episode Date: May 1, 2023
If you've ever used Midjourney, DALL-E, Stable Diffusion or another text-to-image generator, you'll know that words are a weakness. Text (such as on signs) tends to be gibberish. DeepFloyd IF has started to solve that problem and it's doing it open source.
Referenced in the video:
https://twitter.com/DeepFloydIF
https://twitter.com/EMostaque/status/1652295961404645376
https://stability.ai/blog/deepfloyd-if-text-to-image-model
https://twitter.com/hardmaru/status/1651822596844048385
https://the-decoder.com/deepfloyd-if-is-a-crazy-good-text-to-image-model-and-open-source/
https://wandb.ai/geekyrakshit/deepfloyd/reports/A-Gentle-Introduction-to-DeepFloydAI-s-New-Diffusion-Model-IF--VmlldzozNTY3Nzc4
https://twitter.com/javilopen/status/1652387049268297729
https://huggingface.co/DeepFloyd
https://twitter.com/DavidVorick/status/1652070967412129793
Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown
The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/
Transcript
The AI Breakdown episode you're about to hear was originally released as a YouTube video on Monday, May 1st.
In it, we discussed DeepFloyd IF, a new text-to-image generation model that can actually spell better than a preschooler.
If you've followed along with things like Midjourney, DALL-E, and Stable Diffusion, DeepFloyd IF is something you're going to want to check out.
For a little while now, there has been buzz around something called DeepFloyd.
You can see this image on your screen here from April 19th that says, "letters made of clouds that says 'really soon' above beautiful ocean."
And there is, in fact, an image of an ocean with clouds sitting on top of it that spell out "really soon."
Now, if you've ever played with Midjourney or DALL-E or any of these text-to-image generators, one of the things you know is that they have a very, very hard time with words, with spelling, even with getting the characters of a specific language right.
You will often find a picture that looks photorealistic except for a billboard or a sign on the wall with absolute gibberish or nonsense.
And it appears that DeepFloyd has been working on a solution to that.
These are all teaser images that have come out.
Well, this is a new model from a team that's connected to the folks at Stability AI.
Stability AI, of course, brought you Stable Diffusion and, more recently, StableLM plus StableVicuna. They're a very busy team, it seems.
And Emad here, the CEO of Stability AI, writes, "DeepFloyd IF is the most advanced image gen model out there with an FID-30K score of 6.66, beating DALL-E 2, Imagen, Parti, and more. We'll get even better with optimizations and more to come."
So this is a new text-to-image generation model.
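For anyone unfamiliar with the metric in that tweet: FID (Fréchet Inception Distance) compares the statistics of generated images against those of real ones, and lower is better. The real metric fits Gaussians with full covariance matrices to Inception network features. The sketch below is a simplified illustration only, assuming diagonal covariances so that it runs on the Python standard library. The function name and the toy numbers are made up for the example and have nothing to do with the 6.66 figure.

```python
import math


def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Frechet distance between two Gaussians with DIAGONAL covariance.

    Assumption for illustration: real FID uses full covariance matrices of
    Inception features; here each feature dimension is treated independently.
    """
    # Distance between the two mean vectors.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # Trace term; for diagonal covariances it reduces to a per-dimension sum.
    cov_term = sum(
        v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2)
    )
    return mean_term + cov_term


if __name__ == "__main__":
    # Toy "real" and "generated" feature statistics (hypothetical values).
    real_mu, real_var = [0.0, 0.0], [1.0, 1.0]
    gen_mu, gen_var = [1.0, 0.0], [1.0, 4.0]
    print(frechet_distance_diag(real_mu, real_var, gen_mu, gen_var))
```

Identical statistics give a distance of zero, and the value grows as the means or the spreads drift apart, which is why a lower score is read as "closer to real images."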
Stability AI just announced this on the 28th of April.
They write, "Today Stability AI, together with its multimodal AI research lab DeepFloyd, announced the research release of DeepFloyd IF, a powerful text-to-image cascaded pixel diffusion model.
DeepFloyd IF is a state-of-the-art text-to-image model released on a non-commercial, research-permissible license that provides an opportunity for research labs to examine and experiment with advanced text-to-image generation approaches.
In line with other Stability AI models, Stability AI intends to release a DeepFloyd IF model fully open source at a future date."
Now, this goes through all of the features, and the first two have to do with text.
First, deep text prompt understanding: "The generation pipeline utilizes the large language model T5-XXL as a text encoder. A significant amount of text-image cross-attention layers also provides better prompt and image alliance."
Second bullet, application of text description into images: "Incorporating the intelligence of the T5 model, DeepFloyd IF generates coherent and clear text alongside objects of different properties appearing in various spatial relations. Until now, these use cases have been challenging for most text-to-image models."
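To make "cascaded pixel diffusion" a bit more concrete: a cascade generates a small image first and then passes it through successive upscaling stages. DeepFloyd IF's stages are described as going from 64x64 to 256x256 to 1024x1024 pixels. The sketch below only mimics that data flow. `base_stage` and `upscale_nearest` are hypothetical stand-ins (a deterministic pattern and nearest-neighbour resizing), not the actual diffusion models, and the T5 text encoder is left out entirely.

```python
def base_stage(prompt, size=64):
    """Hypothetical placeholder for the text-conditioned base model.

    Returns a size x size grid of grey values derived from the prompt,
    purely so the later stages have something to upscale.
    """
    seed = sum(ord(c) for c in prompt)
    return [[(seed + x + y) % 256 for x in range(size)] for y in range(size)]


def upscale_nearest(image, factor):
    """Nearest-neighbour upscaling; a stand-in for a learned super-resolution stage."""
    out = []
    for row in image:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def cascade(prompt, size=64, factors=(4, 4)):
    """Run the base stage, then each upscaler in turn, recording the resolutions."""
    image = base_stage(prompt, size)
    sizes = [len(image)]
    for factor in factors:
        image = upscale_nearest(image, factor)
        sizes.append(len(image))
    return image, sizes


if __name__ == "__main__":
    _, sizes = cascade("a neon sign that says burger")
    print(sizes)  # resolutions after each stage
```

The point of the design is that the hard, prompt-dependent work happens at low resolution, where it is cheaper, and the later stages only add detail.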
Now, this article from wandb.ai gets into a little bit more around what makes DeepFloyd IF different.
They write, "Beyond the fact that it's open source and it wouldn't fail the first round of an elementary school spelling bee, IF boasts better performance on nuances that other generative models typically struggle with.
Here we're talking about spatial awareness and composition. If you prompt some diffusion models with specific instructions about what objects are in front of what other objects, or of what material they're made of, they often struggle.
This is especially true for complex prompts with multiple objects described with multiple adjectives. They can often be mixed up or sometimes ignored altogether.
The researchers who trained IF used less style data than some other generative models, so if you're looking to create an anime Abe Lincoln, you'll want to look elsewhere.
Lastly, a lot of attention was paid when training IF to make it safe. Generative models have significant potential to create harmful or explicit content. The researchers here walked us through the laudable steps they took to remove racy or violent imagery from their training data. This is never a bad idea, and doubly so for open source efforts."
So they identify here a few different things that make this model interesting and potentially different. One is this idea of spatial awareness and composition, where you can instruct it using language around which objects are supposed to be in the background or the foreground, or basically just organize the objects in the picture.
Now, in terms of how DeepFloyd IF was trained, there are a couple of interesting datasets that were included. One is called LAION, L-A-I-O-N, which is a dataset of 5 billion image-text pairs. So this is going to help IF understand text in context and be able to produce that in a way that other models can't.
They also trained on something called CLEVR, C-L-E-V-R, which has a lot of images that are basically useful for that spatial awareness and composition that we talked about. So for those of you listening, the example that they give is a shot with a white background and a number of different geometric objects. There are a couple of cubes, a sphere, a cylinder, et cetera, et cetera. And this is a way to help the model understand more about that spatial and compositional nuance.
But of course, what we're really interested in is what it can do in practice. And one of the things the team calls out is inpainting. So one of the people on the team here shows a picture of kind of Abraham Lincoln crossed with Vincent van Gogh, with a hat appearing on their head.
Another example is this image-to-image translation, where the same picture, in this case a woman in the foreground with sort of a river, mountain, and forest scene in the background, switches between different styles. One of the styles is sort of a paper cutout model. One of the styles is Legos. One of the styles is a little bit more anime. So basically taking the same image but turning it into a set of different styles.
Still, the thing that people are most excited about, beyond a shadow of a doubt, is the fact that this model can actually get text into images, which is just something we haven't had yet.
Javi Lopez here tweets, "New generative AI toy: DeepFloyd IF."
His prompt is "a neon sign of an American motel at night with the sign J-A-V-I-L-O-P."
Sure enough, in the picture attached, there's a neon sign that says "J-A-V-I-L-O-P Motel," just like he asked it to.
And so, of course, I wanted to give it a try.
What I did is I gave the same prompts to Midjourney version 5 and to IF to see what would come out.
The first prompt was "a film camera photo of a 1960s Southern California beachside burger restaurant, hazy afternoon."
I love this sort of nostalgia stuff.
So you can see this is what Midjourney came back with.
Amazing image, I mean beautiful, captured exactly what I was going for from a vibe standpoint, but basically nonsense characters.
There's sort of a recognizable B, but then a letter that doesn't exist in any language that I'm familiar with.
And that's pretty normal, right? This is pretty typical for what you see from Midjourney.
Now, meanwhile, with the DeepFloyd IF version, the first thing that came back is perhaps a little less polished, although it certainly looks like it came from a film camera of that era, and it says B-U-R-R-G-E-R-R on the sign.
So, "burger," a lot closer. It's legible text. It's actually English characters, and it's close to the word that I was discussing.
Now, I gave it a little bit of extra help in the next prompt and actually told it to do the same thing but with a sign that said "Burger," and sure enough, it got it right that second time. So it's the same sort of hazy Southern California burger stand, but the sign on top of it says B-U-R-G-E-R. So that was test one.
Test two, I wrote "a punk-looking girl with green hair standing outside on Wall Street in New York City, holding a cardboard sign that says 'buy Bitcoin.'"
Again, here's the Midjourney version: super cool image, but the letters on the cardboard sign say B-B-B-D-Y-T-B-L-T-Y. So not exactly "buy Bitcoin," although at least they're English letters, right, in this case.
Back to DeepFloyd IF. Sure enough, a green-haired, punk-looking girl on a New York City-looking street with a sign that says "buy Bitcoin" in completely legible writing.
Now, I'm sure most of you who are watching will notice that the face is kind of wonky, with one side sort of tilted and the eyes a little bit misshapen, and the right hand on this person has at least two sets of knuckles and a number of different fingers. So you're getting into some of the other challenges with this sort of text-to-image prompting, but the sign that says "buy Bitcoin," at least, is really clear.
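One rough way to put a number on "a lot closer" is edit distance: the count of single-character insertions, deletions, or substitutions needed to turn the rendered sign into the requested text. The sketch below is a standard Levenshtein implementation applied to the spellings read out above. It is an illustration, not a metric anyone in the episode used, and the function name is arbitrary.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if the letters match
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # (model, text on the rendered sign, text that was wanted), as spelled out in the episode
    results = [
        ("DeepFloyd IF", "BURRGERR", "BURGER"),
        ("Midjourney", "BBBDYTBLTY", "BUYBITCOIN"),
    ]
    for model, rendered, wanted in results:
        print(model, rendered, wanted, edit_distance(rendered, wanted))
```

Under this measure, BURRGERR is two edits away from BURGER: drop one of the doubled R's in the middle and the extra R at the end.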
My third prompt was "a humanoid robot typing on a computer in a futuristic lab with a neon sign on the wall that says 'The AI Breakdown.'"
Now, admittedly, this one was a little bit tougher because it was more complex. Midjourney just ignored the sign entirely: very cool image of a robot in a futuristic-looking lab. But meanwhile, DeepFloyd IF gave it a real go. You have the backside of a robot looking at a computer screen that says "The AI," B-E-A-O-O-O-O-W, and then it fades off the screen. So a good college try, right?
I tried to narrow that prompt down with "a neon sign on a brick wall that says 'The AI Breakdown' in glowing pink letters." Let's see if it can produce some marketing material, right?
Well, this is Midjourney: no recognizable letters, really. Kind of an R, I guess, and an A and sort of an L in there, but it says "Ravilfian" or something. Cool-looking image, but obviously not getting what we were going for.
Meanwhile, DeepFloyd kind of came close, but not quite, again. You see on the sign here that a T and an H are combined in the V, and it got the "AI," but then in terms of the "Breakdown," it says B-O-T-K-L-O-W, so not exactly clear.
And I wonder to what extent this has to do with the fact that "The AI Breakdown" is such a specific, custom thing. It's not drawing from lots of examples of the word, as with "burger" in my previous example, or even "Bitcoin."
So the point here is that even though this model is hugely improved in terms of its ability to output text, it is not perfect.
But I think the conclusion here, and what makes DeepFloyd IF so exciting, is this: when you see something like Midjourney V5, it's easy to get lost in how excellent it is and sort of ignore all the areas that it has left to develop. But it's been a year of these models existing in a really usable way. And the fact that this new DeepFloyd IF model is showing such huge progress on the text question means you have to think that this ability to put text into images, as a part of text-to-image generators, has to be around the corner for many more models than just DeepFloyd IF.
David Vorick sums up the week here, saying, "DeepFloyd has been released, RLHF Vicuna has been released, and a new self-learning AI called WizardLM has been published. Open source AI is doing well this week."
If you want a little preview of things that I'm going to talk about on The AI Breakdown this week, a lot of these are on the list.
Anyways, guys, that's it for today. Hope you enjoyed this conversation. Go check out DeepFloyd IF. It's available on Hugging Face. I'll include a link in the show notes, or the video description if you're watching it. Until next time, peace.
