a16z Podcast - AI, Design, and the Power of Open Models

Starting point is 00:00:05 It's not about how good a model is in the general sense. It's about how good is this model for my use case. For a lot of design and marketing use cases, we need editable design, not a single flat image. It's super impressive, honestly, reaching the level of things like Nano Banana or GPT Image with an open-source model. Why did you think that was important?

Starting point is 00:00:27 We really want our models to have taste. Every artist, they can really customize this model to the nuances of their style, the texture of their canvas, and really get 2K output, and hopefully make that part of their workflow. One thing we were always wondering is that this release, the open source model, is so small. It's 9.3 billion parameters. Like previously, SOTA is probably like 80 billion parameters.

Starting point is 00:00:53 It's like a 9x difference. How did you do it? We focused on... Image generation has improved dramatically over the last few years. The next challenge is not simply creating images, but giving users more control over what gets created and how. That includes everything from typography and layouts to editing, customization, and workflows that fit into professional creative processes.

Starting point is 00:01:19 Yoko Li and Justine Moore speak with Ideogram founder and CEO, Mohammad Norouzi, about image generation, open-weight models, design tools, and the future of creative AI. So today we're excited to have Mohammad, CEO and founder of Ideogram, a Toronto-based generative AI company that just released their first open-weights image model. Congrats on a huge release. Congratulations. Thanks for having me.

Starting point is 00:01:46 We're really excited to talk through something that everyone has been buzzing about, which is the fact that the model is open weights. The previous Ideogram models have been closed source, so we'd love to hear how you made the call to make it open this time. What has happened is there has been a lot of progress in the industry. And we used to do everything. Basically, we had our own first-party app, as well as our own first-party API,

Starting point is 00:02:11 and model development itself is a lot of work. And we decided to focus a little more on the model side. We think that's where a lot of potential exists. We still want to continue to own the interaction with the users. We think there's a lot of important feedback we can get from the users directly. But then we want to focus more on building the model. And by releasing the weights, we're actually extending ourselves and working with inference providers,

Starting point is 00:02:40 working more directly with large enterprises. They have every ability to customize the models or host them on-prem or optimize them for a device. And we would love to work with the best chip makers to really optimize the model, the best inference providers. So this is basically us saying, hey, we are very serious about building the foundation model and we would like to work with you wherever you are,

Starting point is 00:03:09 whether you're an app developer or a chip maker or an inference provider. I think you already kind of touched on this. The new open source model is very exciting in that it unlocked a lot of new use cases. It's very photorealistic. I think it can generate up to 2K with a smaller model too. Obviously, there's very precise layout control as well. Do you want to talk about some of the net-new use cases

Starting point is 00:03:32 that are unlocked by this model? So this is actually a foundation based on which we're going to release some more exciting features next. It's just the first release. We're just testing the waters, figuring out how to work with Hugging Face and the open source community, ComfyUI, etc. What I'm personally most excited about is something we haven't released yet, which is editable text and layout control. And I really believe for a lot of design and marketing use cases, we need editable design, not a single flat image.

Starting point is 00:04:08 And we haven't released that yet. We kind of show the teaser in our video, but I'm personally most excited about that. On the technical side, what we've done is we went really detailed on the prompting. And if you look at our prompts, it's like thousands of words: each element in the image, where it is in the image. We have layout control with bounding boxes for a number of elements. And that's one of the key innovations here that unlocks a lot of, again, design use cases

Starting point is 00:04:40 because you clearly want font control, you want layout control. And this model is very versatile, allows you to really fix certain elements, fix positioning, and control the image generation in every detail possible. Amazing. And one of the other things I immediately noticed about the model was how you can render super long texts, like paragraphs of text completely accurately, which you either give the model in the prompt

Starting point is 00:05:08 or you ask the model to come up with something and it does it really well. And it's super impressive, honestly, reaching the level of things like Nano Banana or GPT Image with an open-source model. Was that something you guys really focused on and sort of why did you think that was important? I don't know if you remember,

Starting point is 00:05:25 but our very first model was released three years ago. And at the time, image generation was synonymous with garbled text and there were memes about DALL-E 2 generating travel posters with incorrect city names. Yeah. Which is fun to look at. So I remember at the time we were just a few people building these models and the question was how can we differentiate?

Starting point is 00:05:48 What's unique about our model? And we said, okay, text generation, accurate text is something we have. And then we released it and we were really surprised. It's just so many people were so excited about text generation, and then we realized, oh, actually, that's the whole graphic design and storytelling industry. Text is a very important part of image generation, and that became a very important part of our brand.

Starting point is 00:06:12 So if you search Ideogram, people talk about the quality of typography, the quality of text. We are known for really stylized typography, for logo, T-shirt design, graphic design in general. And so we continue to push forward. Our previous model wasn't really beating the state of the art in text generation, but we continued to focus on that and we had a bunch of research breakthroughs. And with this model, despite the fact that it's very tiny, the text generation is very, very accurate.

Starting point is 00:06:45 One of the things that stood out to us, which is what the community has been chatting about, is how there's new ways of processing data as you were training the model, which is like you kind of let the model learn what a bounding box is and how to do the layering and color palettes. Do you want to talk more about some of the innovations you had during the training process? What made this model so good with these distinguishing features? Yeah, it's kind of difficult to exactly describe what resulted in such an amazing model. I think a lot of it is focus and evaluation.

Starting point is 00:07:22 Evaluating image models is actually a very difficult thing to do. There are a lot of benchmarks out there, but people look at that and they're like, okay, this doesn't correlate with pixel fidelity that I care about, the realism. You don't really want novice users to judge the quality of these models because they may be looking at small monitors that aren't really adjusted for color accuracy. And we always cared so much about quality for the realism and, again, text accuracy. So throughout training, we always measure text accuracy and we ablate

Starting point is 00:07:58 very detailed changes to the model and data and see how that results in performance. So I would say a lot of it is really listing all the possible changes and very carefully tuning each element of the model and seeing what happens. Obviously, we try to gather as much data as possible. One of the standard recipes in the industry is that we take images and we turn them into text,

Starting point is 00:08:28 using visual language models. The very first models we were training three, four years ago would be based on the alt text that you can find on the internet. That is, each image on the internet may have an alt text field associated with it, which describes what's in the image. But the problem is the alt text is often very short or inaccurate. And what we do now is we train models to go from image to text, and in this case, image to text with detailed bounding box

Starting point is 00:08:58 information, detailed element information. If we care about text, then we really want to make sure all the text in the image is correctly described. And then we go from text to image backwards. So it's kind of interesting. We gather all the images from the Internet. Some of them may have alt text. Some of them may not have alt text.

Starting point is 00:09:16 And then we use AI to go from image to text. And then we train another AI model to go from text to image. So that's one of the key recipes that results in very good models. I saw a lot of JSON prompting in your technical blog, which is very unique. And as I was trying the model, it seems like it was translating the text, the prompt, to a JSON representation with implicit structure. Right. Do you think JSON is the representation for image models going forward, or do you think there's another representation there? It's a very good question.
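
As a rough illustration of the recipe described above (image to structured text, then text to image), the data side can be sketched as follows. This is a minimal sketch under stated assumptions, not Ideogram's actual pipeline or prompt schema: the `Element` fields, the `caption_image` stub, and the file name are all hypothetical, and a real system would call a trained vision-language model where the stub returns fixed data.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional, Tuple


@dataclass
class Element:
    """One described element of an image (hypothetical schema)."""
    description: str
    # Bounding box as (x0, y0, x1, y1), normalized to the range [0, 1].
    bbox: Tuple[float, float, float, float]
    # Exact text rendered inside the element, if any.
    text: Optional[str] = None


def caption_image(image_path: str) -> List[Element]:
    """Stand-in for an image-to-text model; returns fixed example data."""
    return [
        Element("headline at the top of a travel poster",
                (0.1, 0.05, 0.9, 0.2), "VISIT TORONTO"),
        Element("skyline illustration", (0.0, 0.25, 1.0, 1.0)),
    ]


def build_training_pair(image_path: str) -> dict:
    """Pair a structured JSON caption with its image for text-to-image training."""
    elements = caption_image(image_path)
    prompt = json.dumps({"elements": [asdict(e) for e in elements]})
    return {"prompt": prompt, "image": image_path}


pair = build_training_pair("poster_0001.png")
print(pair["prompt"])
```

The point of the sketch is the direction of the data flow: the caption is generated from the image, so every image gets a detailed, structured description even when its alt text is missing or poor.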

Starting point is 00:09:48 I don't know if you've seen, the open source community is a little upset because of the safety image that shows up. They always are. But so Reddit was really lashing out at our engineers, and one of our people said, oh, we might fix this. And they were like, we might? Oh.

Starting point is 00:10:09 But the fact is the community needs to also read the documentation and bear with us. This model is only trained with JSON prompting. And you have to provide JSON with that particular structure for you to get good quality output. So I don't know if this is a feature or a bug, we did have some safety built into the

Starting point is 00:10:30 model, but that is also detecting incorrect prompts. So if you just give it a one-word prompt, then you get this "image is blocked by safety" image back, but that's because your prompt is not a well-specified JSON. Now, we don't want people to write in JSON. We don't think that's a natural way of interacting with these models. But I do strongly believe that we need to use all the AI innovation to build the best

Starting point is 00:11:01 image generation and editing models. And there has been a lot of progress in language models in the text space. So the question is if you want to go from some vague idea to an image, what's the exact

Starting point is 00:11:17 process? How much of the thinking happens in the language space, and how much of the thinking happens in the actual kind of pixel generation space? I know you're an artist. So you should probably tell us when, for example, we always had this prompt, we test our models based on this prompt. So if you have meaning of life as your prompt, then do you want a diffusion model to decide what the meaning of life is?

Starting point is 00:11:45 Or do you want a language model to kind of think and go back and forth and come up with a description of a scene that's explaining the meaning of life? And that's kind of the context of JSON prompting: it's the intermediate representation. We think language models can describe images in that format and then image generation can happen. In general, we see a lot of editing happening in the field

Starting point is 00:12:12 and that's the new frontier. So I don't think we should expect the interaction to be only through text or JSON, but it's a combination of JSON and image, if I were to make a guess. Awesome. So it sounds like a lot of it is basically taking often a relatively simple prompt that someone puts into the model and then translating it on the back end with the magic prompt into JSON

Starting point is 00:12:35 so that the model can make something probably more detailed and interesting than the person with the short prompt may have even imagined. Exactly. And then I think everybody else does it too. OpenAI does it, Google does it, but then they don't give you the actual input to the model. Yeah.

Starting point is 00:12:51 But again, for professional use cases, you don't want to just roll the dice and then get some other completely different image interpretation of your prompt. Right. We show you the actual input to the model as well in the JSON format. And we think that that will foster more innovation and creativity. And for people who want control or like consistency, I think that would be key. Yeah. And to Justine's point, like, exactly, like, what is the implication for the professional use

Starting point is 00:13:18 cases? Like, with the JSON prompting, what can they do more easily now? What's this capability compared to before? Maybe zooming out a bit, the world is changing, right? I'm sure your work has changed a lot. My work has changed a lot. I'm actually writing PRs now, which is very exciting.

Starting point is 00:13:38 Amazing. We're in collaboration with AI. AI does the writing. And then for creative professionals as well, the world is changing. They are very excited to, okay, a large number of them are very excited. I think more and more creatives are excited about adopting AI and they see the potential. They think ideation is still the most important part of the creative process and humans are very good at putting context into these models and their understanding of

Starting point is 00:14:11 situation, creativity will result in the best ideas. I'm very excited about the future where, you know, every kid will have these models at their fingertips, and then they can be much more creative, and we're going to experience a much more beautiful world. Now, as it comes to JSON prompting, again, because the JSON prompt describes every detail in the scene, you can take it and change one element in the scene, and that results in very, very consistent output. So you could be describing like a tiny detail in the corner of the image and then leave everything else the same. And we think this also has a big implication on editing. We haven't released our editing models yet, but they will also utilize the same JSON prompting

Starting point is 00:15:02 approach. And it's just more control. And with layout as well, you can imagine for every brand, you have brand guidelines in terms of, okay, the size of text, the font of text. And we think this kind of foundation allows us to really get into a lot of the enterprise use cases. Amazing. And we've been talking about some of the things you've focused on with this model. Obviously, there's always trade-offs in model training in terms of what you want the model to be really amazing at and what you focus on. Would love to hear what you focused on for this release.
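
The single-element edit described in the JSON prompting discussion above can be sketched like this. It is only an illustration: the prompt layout, the `id`, `font`, and `bbox` fields, and the `edit_element` helper are assumptions made for the example, not the model's documented schema.

```python
import copy

# Hypothetical structured prompt; field names are illustrative only.
base_prompt = {
    "style": "flat vector illustration",
    "elements": [
        {"id": "headline", "text": "SUMMER SALE",
         "font": "bold sans-serif", "bbox": [0.1, 0.05, 0.9, 0.25]},
        {"id": "badge", "text": "50% OFF",
         "font": "bold sans-serif", "bbox": [0.7, 0.75, 0.95, 0.95]},
    ],
}


def edit_element(prompt: dict, element_id: str, **changes) -> dict:
    """Return a copy of the prompt with one element updated, the rest untouched."""
    edited = copy.deepcopy(prompt)
    for element in edited["elements"]:
        if element["id"] == element_id:
            element.update(changes)
            return edited
    raise KeyError(f"no element with id {element_id!r}")


edited = edit_element(base_prompt, "badge", text="30% OFF")

# Collect the ids of elements that differ between the two prompts.
changed = [
    before["id"]
    for before, after in zip(base_prompt["elements"], edited["elements"])
    if before != after
]
print(changed)
```

Because every other field is carried over unchanged, the two prompts differ in exactly one place, which is the property the conversation credits for consistent output and for brand-guideline control over font and layout.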

Starting point is 00:15:38 And also, do you consider more capability, like, we want to be the top for this specific prompted here or something like that, or is it more thinking about the end user and holistically, what are the different vectors where they want the model to be performing? Right.

Starting point is 00:15:56 So we care about a couple things. One is graphic design in general. Again, text rendering is part of that. We think basically graphic design is everywhere. Like, we go in a city, you open your eyes, you see billboards, you see storefronts, they all have text. And actually, it's much more important.

Starting point is 00:16:15 I guess photography is part of graphic design, but graphic design is actually the frontier for a lot of business use cases for storytelling. So we definitely focused a lot on graphic design since the release of our first model, which was good at text rendering. And in addition, we think taste is extremely important, and we really want our models to have taste,

Starting point is 00:16:37 and it's very hard to explain it. Yeah. What exactly taste is. One element of taste is kind of going outside of the norm a little bit and not conforming to the average opinion, which is a little against being on top of the leaderboard. Right.

Starting point is 00:16:57 Yeah. Which is kind of interesting. Your own leaderboards. Yeah. We just care about ultimately, we worked with all the arenas. We hope all of the arenas will improve in detecting the nuances of images and image quality. but we care about our own internal evaluation

Starting point is 00:17:17 and unfortunately we see that AI is not very good at doing the actual taste evaluation yet so we work with designers and we have side by side comparisons between different versions of the model as well as other models to really push on the taste so we really care about taste I think there's still so much more to do obviously yeah I was going to add my follow-up was going to be do you have like one vibe guy or vibe woman internally who's like the taste arbiter because it can be hard to measure taste

Starting point is 00:17:48 but it sounds like you have a group of designers, which is probably better, I think. Yeah, if we need to find that tastemaker. Yeah. And then I guess one thing we're always wondering is that this release, the open source model, is so small. It's 9.3 billion parameters. You know, like previously, SOTA is probably like 80 billion parameters. It's like 9x of a difference, and then you can run it on a single GPU instead of having a lot of compute footprint, which really opened up opportunity for people to use it. So the question is, how did you do it?

Starting point is 00:18:25 We focused on the details of the model, and we know we can't win on scaling. I used to work for Google. I don't think even if we raised 10x the amount we've raised so far, we can beat Google in terms of the number of chips that we can dedicate to each model training. So instead, we focused on innovation. We think there's still so much more to do to innovate.

Starting point is 00:18:55 We are also focusing on differentiation. I don't think a lot of labs are focusing on design, graphic design in particular, editable text that I'm talking about. And then we also decided to go open weight to really partner with a lot of other platforms to be at least another option for people who care about design. And so, yeah, so we focus on the small model primarily

Starting point is 00:19:21 because we think there's still so much to do. We think now is actually a good time for us to scale. Given the quality of the model at 9.3 billion parameters, you should imagine what if this model is 100x bigger, and there are mixture-of-experts architectures that don't make the model necessarily slower, but they make the model a lot more powerful. So I think that's one new frontier for us to kind of scale this model, 10x, 100x.
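
The size comparison in this exchange is simple arithmetic, sketched below. The 9.3 billion and 80 billion figures are the ones quoted in the conversation (the latter only as a rough estimate); the bytes-per-parameter table is a standard rule of thumb, and the result covers weights only, ignoring activations and any auxiliary components such as text encoders, so actual memory needs will be higher.

```python
PARAMS_SMALL = 9.3e9   # parameter count quoted for the open-weights model
PARAMS_LARGE = 80e9    # rough figure quoted for earlier state-of-the-art models

# Storage cost per parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


ratio = PARAMS_LARGE / PARAMS_SMALL
print(f"size ratio: {ratio:.1f}x")

for precision, nbytes in BYTES_PER_PARAM.items():
    small = weight_gb(PARAMS_SMALL, nbytes)
    large = weight_gb(PARAMS_LARGE, nbytes)
    print(f"{precision}: {small:.1f} GB vs {large:.1f} GB")
```

The ratio comes out a little under 9x, and at 16-bit precision the smaller model's weights need roughly 18.6 GB versus 160 GB, which is why one fits on a single GPU and the other does not.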

Starting point is 00:19:49 I imagine because it's a smaller model, as you mentioned, like, it's harder to win on scaling and counting number of chips, but it is possible to win on a specific domain or optimizing for a different, you know, a thing in a different domain. So what was the tradeoff for the research team when training this model to decide what to focus on? Right. So one thing that you kind of alluded to is the fact that this can run on consumer GPU now. And we think there is a new frontier that, you know, you do a lot of editing on your phone, a lot of image generation on your phone. And it's not only about pushing quality at, you know, 100 billion parameter, one trillion parameter range.

Starting point is 00:20:32 We think it's really important to have small models that can run on device. Obviously, a lot of companies care about privacy, and we are really excited to partner with the industry to push the kind of small model size quality further. Now, in terms of the research team, it's an interesting question whether you can focus on a very small, narrow field in image generation. I sort of believe that you need a general understanding of the world,

Starting point is 00:21:07 in order to even be good at logo generation or be good at illustration style. But once you have a general base, then you can customize the model for certain use cases and it can be the best at that particular use case. So we're really excited about customization. We think that's a new frontier. And again, that's an important reason

Starting point is 00:21:30 for releasing the model with open weights. That is, every artist who has at least like 50 pieces of art or hopefully like a little more, they can really customize this model to the nuances of their style, the texture of their canvas and really get 2K output and hopefully make that part of their workflow and be augmented with AI and be a lot more productive and creative. And we actually have worked with some artists in residence

Starting point is 00:22:04 who said to us, okay, this at least made me 3x faster in making this comic book. So that's one frontier. And another frontier is enterprise again, because it's not about how good a model is in the general sense. It's about how good is this model for my use case, right? At the end of the day, I may not care about many of the general purpose use cases. I may not care about character consistency, for example, as an enterprise. But even though this model is small, we think it can be the best model for particular use cases, whether that's a certain artist or whether that's an enterprise.

Starting point is 00:22:45 Were you getting a lot of demand from enterprises when it was a closed model that wanted to fine-tune it on their own data? Were there use cases where you're like, we really need to open source it now because there's so many cool things that people want to do? Yeah, yeah. So some of the enterprises we work with are very sensitive. They don't want to talk about using AI in their visual side of things. But what we've seen over and over again is companies come to us and say, we try these generic models and they don't meet our design bar. They don't follow our style. They don't follow our brand guidelines.

Starting point is 00:23:25 And once we train custom models for them, they are like, wow, this understands my brand DNA now. We can use this for design ideation or we can use this for marketing. And we thought with the open-weight release, we can give a glimpse of customization to, you know, developers within enterprise and kind of scale this side of our business further. And we're really excited about that. For enterprises, as you alluded to, I mean, customization is really top of mind, whether that's a brand kit or it's something that is stylistically distinct,

Starting point is 00:24:07 but hard to encode that style into just a doc. So I imagine customization on top of an open source model is the best way to go. So the question becomes like, what is the ramp-up for the customers or the artists to start, you know, post-training or fine-tuning on top of the Ideogram model? What do they have to do? So one thing I should say is we will work with the open source community

Starting point is 00:24:31 to make it as customizable as possible. But because we are the model developer, we have some secret sauces that can make it even better. So what will happen is there will be different ways of customizing. One is in the open source, based on the quantized model that's already released. The other is we already have a product that allows you to customize by just uploading a certain number of images to our custom model training app. And we haven't released the 4 version of that yet, but we are hoping to release that as well. And then the third category is when enterprises work with us, and we really describe these

Starting point is 00:25:13 detailed prompts for every image. We work with their design team to understand what words they want to use because each team has a different set of keywords. Each company may have certain mascots who have certain names. And so our annotation team gets involved and spends a lot of time curating and cleaning data. And we think depending on your size and your budget, you should still be able to customize the model, maybe use the open source at the low budget, and then you can come and talk to us so that we can build a model for you at the high budget, but then it depends

Starting point is 00:25:50 on really the ROI that you have in mind for your models. One of the things I think people have been talking about a lot, both for enterprise and actually for consumer, is fine-tuning. versus image editing. I actually think they don't necessarily have to be competitive. Like some people use image editing as a way of fine-tuning. Like they say, take this image and put it into this style. Others think it's much more efficient and consistent to just fine-tune a model to generate in that style. I know you alluded to wanting to release an editing model further down the line.

Starting point is 00:26:23 So I would love to hear how you guys are thinking about that. Yeah, I agree. I think editing is very powerful. We both agree. We all agree. When editing launched last year, so many new possibilities opened up. And the nice thing about editing is it's quick. You don't have to train a model.

Starting point is 00:26:42 You just take a style or an existing image and you make some changes. And it's part of your iterative workflow. Because with every creative that we worked with, it's never one-shot prompting. Often, okay, you get something and then you're going to go and fix certain details after the first generation. Now that's for editing. But then customization gives you really freedom to not prompt at all, right? Because sometimes it's very hard for you to say, oh, I want to get inspired by this single image and edit that one. You may have some general style in mind and want to ideate in the context of that general style.

Starting point is 00:27:27 Or you may have a character that has many detailed degrees of freedom or characteristics: the left side may be different from the right side, and like a certain outfit. And it's very hard to really put all of those images as the input to your editing model and it often fails. So we think customization can give you a lot more powerful adherence to your characters and allows for an easier iteration on ideation. So I agree with you.

Starting point is 00:27:59 I don't think they are mutually exclusive, but they're both very powerful. With the JSON prompting and editing and model fine-tuning, the composability aspect of the model is just huge. There are so many ways you could customize it. One hot topic in the industry, in the research community, is the agentic loop for creative tools, right? So it used to be the creativity tools.

Starting point is 00:28:24 The consumption layer is always a UI. As a human, I look at it, and then I make modifications. Now, so much of that may become like an API request that the agent makes. How do you see that? What does the API entail compared to how humans use it? And to your earlier point, I want to say something, which is we seem to compare image generation with language models a lot. And for example, in the language model space, even though customization exists, it's not that like every company customizes their language models.

Starting point is 00:29:01 But I think that actually misses the point. When you look at visual representation of a brand, you immediately recognize the differences between brands. But if you look at their written communication, can you say, oh, is this Andreessen Horowitz or is this Sequoia? I mean, you probably can tell. I know so fast. I get the point.

Starting point is 00:29:24 But most people will not be able to immediately look at the text. That's true. I'm going to say. So there is a lot more diversity in the visual world, and that's very exciting for customization. And there are a lot of unique ways of interacting with the models that kind of goes back to your earlier question, a lot of 3D manipulation. So for that kind of use case, the input will be some 3D representation of the joints or position of the objects. Then you may have a completely different, you know, stylistic variation with the style being the input. So there are a lot of different types of interactions that you want to enable. And for that reason, it's much different

Starting point is 00:30:10 from the language space where the input is always text and like you can kind of more or less convert everything to text. We're so excited about agents. We have our own MCP. We use them a lot internally. What's really exciting is when you want to release a new feature, you can go into your agent and then ask it to connect to the API and generate a bunch of images, and then you can go and find the best ones, and like in a couple hours you have your landing page up and running. So we are very, very excited about agentic workflows. We think we're just at the beginning. To your point, we need evaluation as part of the loop. We don't want to have to look at every image.

Starting point is 00:30:53 And then editing will be part of the agentic interaction too. How we exactly want to compose these different pieces to accomplish a goal is still to be discussed. But we have API and we have MCP and we really want to enable the agentic workflow. And we think every company is trying to figure it out as well. So we're really excited about that. I guess for the API business, so much of a design is iteration. It's a long tail of a design.

Starting point is 00:31:23 It's no longer just you prompt something, you get an image and call it a day, right? It's so much of a get an image, use the edit model to, you know, edit it, see if it works well. If it doesn't, get another image with the JSON prompt, which is, you know, easier for control. What are the net new use cases you have seen after launching the model of how people, like, compose different API calls on Ideogram? What's interesting is, yes, we have these agents. A lot of them live in a chatbot. And that's not enough for iteration, unfortunately.

Starting point is 00:31:59 So I think what's unique about that, you can really scale creativity, like kind of give it some high-level direction and ask it to go and explore many different approaches and come back with hundreds of thousands of designs that can be easily looked at and then you get a better sense of, okay, I want to explore more in this direction. So the language model interaction allows for kind of very large scale exploration of creative possibilities.

Starting point is 00:32:30 But once you know what you want, you need a UI. You need a UX to be able to go and edit, whether it's regional editing or text-based editing. I think at the end of the day you want your canvas, you want to be able to point to things, and then you want to also talk to it with natural language. And it's actually very hard work because models are changing and now you're also designing the user interface at the same time. So kudos to the best designers who understand how these models work and are trying to figure this part out.

Starting point is 00:33:04 There's still a lot of work to do. In terms of the things we've seen from the model, again, design is something that's coming up a lot. And I saw a tweet before coming here that somebody said, I have no design training. And I got this design in two minutes. And it actually looked really, really nice. That was one example.

Starting point is 00:33:32 And then people are really excited about the art possibilities. Because this model was trained with very unique style descriptions. As part of training, we actually stripped that from the JSON prompt because it became too much. But the model has a lot of artistic possibilities and many different styles are embedded into the model. And if you've seen some of the frontier models, actually, that score very highly in the leaderboards,

Starting point is 00:34:05 they don't have a lot of kind of design variation. They always produce the same exact look. And I believe that's because they did a lot of reinforcement learning training. We actually have done very little reinforcement learning. So this is a very raw model. Now with that, the outcome is you need to be much more precise with your prompting, but you can get a lot of different styles from the model. And people seem to be very excited about that aspect of the model,

Starting point is 00:34:35 especially in the art community. I think when you talk both about the design and the art, it really brings back the taste point you made earlier, because for so many sorts of designs, you're trying to communicate some sort of idea, whether it's like an infographic or an ad or a logo or whatever. And it kind of needs to stand out and be distinct to you or have a unique style.

Starting point is 00:34:57 And I've totally noticed what you said. With a lot of the frontier, the historical frontier image models, like you're scrolling the feed and you just see, like, I've now seen this style like 50 times. I've seen it 100 times. It doesn't like catch my eye anymore. And it feels like now, when I prompt Ideogram 4, I often get something that makes me stop and be like, wow, this is

Starting point is 00:35:18 different than anything I've seen before coming out of an image model. And like, this is doing an amazing job of both communicating what I want to communicate and also holding someone's attention. Yeah. So we try to enable very different styles. And that was one of our goals. We still want to produce tasteful output. But that doesn't mean we have to force a complex output on you. If you want a minimalist design, you should be able to get that. Actually, our minimalist is too minimalist in my opinion.

Starting point is 00:35:53 I was saying we should ban the word minimalist from the output. But you get what I mean. It's like the model can do many different things and that's by design. We know what's the first Ideogram model we're going to post-train. It would be an A16Z marketing brand style

Starting point is 00:36:10 art deco. A16Z art deco. Yes. Yeah, let's do it. For the new branding. We need that. Yeah, that's good. That's very exciting. Yeah.

Starting point is 00:36:17 So I guess the question also becomes like, I kind of asked you the representation question before, which is here's a JSON representation. As an artist, obviously, if you abstract it out, you know, far enough, all the lines are pixels. So you could say that the composability is on the pixel level, which is actually different from the diffusion representation. It's like denoising and, you know, they're operating in a different space. Where does this lead to if you travel down the JSON path granular enough? Does it lead to pixels?

Starting point is 00:36:48 Does it lead to SVGs? Does it lead to, like, language or something else? That's a very good question. So in general, the recipe for building more powerful models, in my opinion, is making the task as straightforward as possible for the diffusion model. That is, specify the exact details of the image. And so now if you kind of make that extreme, then it becomes the pixels themselves. So the diffusion model doesn't have to do anything.

Starting point is 00:37:17 Now the catch is what we would like to do is to get the language model to produce that intermediate representation. And language models, as of now, they aren't very good with continuous output. They aren't very good with kind of pixel values or very high dimensional vector representations. So I guess the constraint here is that the representation has to be, you know, tokens of like, I mean, depending on your language model power, maybe it can be a million tokens. Maybe that's too extreme. For us, it's about 4,000 tokens. And that's where we still use natural language because these large language models are trained with natural language. So they are very good with natural language.
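
As a rough illustration of the constraint described above, the sketch below serializes a structured prompt as JSON and checks it against a token budget. Only the figure of about 4,000 tokens comes from the conversation. The field names, the `approx_token_count` helper, and the four-characters-per-token heuristic are assumptions made for illustration; they are not Ideogram's actual schema or tokenizer.

```python
import json

TOKEN_BUDGET = 4000  # "about 4,000 tokens", as stated in the conversation


def approx_token_count(text: str) -> int:
    """Crude estimate of token count.

    Assumes roughly 4 characters per token, rounded up. This is a
    heuristic for illustration only, not a real tokenizer.
    """
    return (len(text) + 3) // 4


def fits_budget(prompt: dict, budget: int = TOKEN_BUDGET) -> bool:
    """Return True if the serialized prompt stays within the token budget."""
    serialized = json.dumps(prompt, ensure_ascii=False)
    return approx_token_count(serialized) <= budget


# Hypothetical structured prompt; the keys are illustrative only.
example_prompt = {
    "description": "Minimalist poster for a jazz festival",
    "elements": [
        {"type": "text", "content": "JAZZ NIGHT", "position": "top center"},
        {"type": "image", "content": "saxophone silhouette", "position": "center"},
    ],
}

if __name__ == "__main__":
    print(fits_budget(example_prompt))
```

A real system would count tokens with the language model's own tokenizer, since the budget depends on how that model splits text.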

Starting point is 00:38:03 But it may become closer to HTML, for example. That's okay because, again, large language models are trained with HTML and they know the tokens. But is it going to be designing your own version of HTML, or would you align to HTML? This is actually kind of alluding to the editable model that I'm talking about, and we've had a lot of back and forth: are we going to have our own JSON for different, you know, text elements and buttons and stuff, or are we going to use HTML? And it seems like HTML makes more sense, just because these large language models have already been trained on HTML, as opposed to us introducing a new JSON structure. But I would say, to answer your question, that representation needs to be easy for the language

Starting point is 00:38:48 model with the particular design that we have right now, which is a language model does some expansion of the ideas, and then the image model takes those expanded descriptions and turns them into images. We'd love to hear, if there's anyone who's interested in working at Ideogram or working with you guys as a customer or to fine-tune a model, what's the best way for them to get in touch with you and the team? First of all, we would love to work with more engineers, like cracked engineers. We have a very tiny team. You see what we were able to produce with such a tiny team. And if you want high agency, you know, if you want your work to matter and you want to be part of the academic and open source ecosystem,

Starting point is 00:39:38 then this is the perfect time to join us. Now, in addition, enterprises see the potential and we would like to work with the most creative brands out there to help them produce the best designs, produce the most provocative ads. And also, we would like to partner with other startups or other companies at different levels of the stack. This is open weight, so we can make it win-win.

Starting point is 00:40:06 And we would love to offer a different option to companies who want more control and data privacy and sovereignty. So we would love to work with other enterprises as well across the stack. And then the best way would be, like, we have this email, partnerships at Ideogram. You can DM me on Twitter or LinkedIn, and I'm very active on both platforms. Today, if I want to fine-tune my own style, where should I go on Ideogram? Is it come to ask you? Is there a call to action? We should tell people like this year.

Starting point is 00:40:44 There is actually a model tab. If you log into Ideogram, there's a model tab and then you can go and upload your images and train your model. It's a little more expensive. It's 60 bucks for two model trainings per month. But I think for professionals, that's totally worth it. Absolutely. And how many images do you need to get started? I think you need at least 15. So for an enterprise, if they want to fine-tune their own model on Ideogram, like what's your guidance on what they should upload to Ideogram to start fine-tuning a model? I think for enterprise, again, we have some sales forms they can fill out and then come talk to us.

Starting point is 00:41:26 Because we see that there are many differences in what different companies want. Some companies care more about editing. Some companies want more, like on the marketing side, to automate some of their ads. And we should talk first and then figure out what's the best solution for that. All right. Great. Awesome. Thanks so much for joining us.

Starting point is 00:41:46 Thank you so much. Thanks for listening to this episode of the A16Z podcast. If you like this episode, be sure to like, comment, subscribe, leave us a rating or review and share it with your friends and family. For more episodes, go to YouTube, Apple Podcasts, and Spotify. Follow us on X at A16Z and subscribe to our substack at A16Z.com. Thanks again for listening and I'll see you in the next episode. As a reminder, the content here is for informational purposes only. It should not be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security and is not directed at any investors or potential investors in any A16Z fund.

Starting point is 00:42:29 Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.
