Latent Space: The AI Engineer Podcast - AI Engineering for Art — with comfyanonymous, of ComfyUI
Episode Date: January 4, 2025

Applications for the NYC AI Engineer Summit, focused on Agents at Work, are open!

When we first started Latent Space, in the lightning round we’d always ask guests: “What’s your favorite AI product?”. The majority would say Midjourney. The simple UI of prompt → very aesthetic image turned it into a $300M+ ARR bootstrapped business as it rode the first wave of AI image generation.

In open source land, StableDiffusion was congregating around AUTOMATIC1111 as the de-facto web UI. Unlike Midjourney, which offered some flags but was mostly prompt-driven, A1111 let users play with a lot more parameters, supported additional modalities like img2img, and allowed users to load in custom models. If you’re interested in some of the SD history, you can look at our episodes with Lexica, Replicate, and Playground.

One of the people involved with that community was comfyanonymous, who was also part of the Stability team in 2023 and decided to build an alternative called ComfyUI, now one of the fastest growing open source projects in generative images and the preferred partner for folks like Black Forest Labs’s Flux Tools on Day 1. The idea behind it was simple: “Everyone is trying to make easy to use interfaces. Let me try to make a powerful interface that's not easy to use.”

Unlike its predecessors, ComfyUI does not have an input text box. Everything is based around the idea of a node: there’s a text input node, a CLIP node, a checkpoint loader node, a KSampler node, a VAE node, etc. While daunting for simple image generation, the tool is amazing for more complex workflows since you can break down every step of the process, and then chain many of them together rather than manually switching between tools.
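
To make the node idea concrete, here is a minimal sketch of a node graph that caches each node's output and only re-runs what sits downstream of a change. It is a toy written for illustration, not ComfyUI's actual engine or API: the `NodeGraph` class, its method names, and the string-returning stand-ins for the checkpoint loader, text encoder, sampler, and decoder are all assumptions.

```python
from typing import Any, Callable, Dict, Tuple


class NodeGraph:
    """Toy node graph (hypothetical, not ComfyUI's API)."""

    def __init__(self) -> None:
        # name -> (function, names of the nodes feeding it)
        self.nodes: Dict[str, Tuple[Callable[..., Any], Tuple[str, ...]]] = {}
        self.cache: Dict[str, Any] = {}
        self.runs: Dict[str, int] = {}

    def set_node(self, name: str, fn: Callable[..., Any], *inputs: str) -> None:
        """Add or replace a node, dropping cached results that depend on it."""
        self.nodes[name] = (fn, inputs)
        self._invalidate(name)

    def _invalidate(self, name: str) -> None:
        # Forget this node's output and, recursively, every consumer's output.
        self.cache.pop(name, None)
        for other, (_, inputs) in self.nodes.items():
            if name in inputs:
                self._invalidate(other)

    def run(self, name: str) -> Any:
        """Evaluate a node, reusing cached upstream results where possible."""
        if name not in self.cache:
            fn, inputs = self.nodes[name]
            self.cache[name] = fn(*(self.run(i) for i in inputs))
            self.runs[name] = self.runs.get(name, 0) + 1
        return self.cache[name]


# Stand-in nodes: real ones would load weights, encode text, sample, and decode.
graph = NodeGraph()
graph.set_node("checkpoint", lambda: "model")
graph.set_node("prompt", lambda: "a fox on a mountain")
graph.set_node("encode", lambda text: f"cond({text})", "prompt")
graph.set_node("sample", lambda model, cond: f"latent({model}, {cond})",
               "checkpoint", "encode")
graph.set_node("decode", lambda latent: f"image({latent})", "sample")

first = graph.run("decode")
# Changing only the prompt leaves the (expensive) checkpoint load cached.
graph.set_node("prompt", lambda: "a fox in the snow")
second = graph.run("decode")
```

In this sketch the checkpoint node runs once across both generations, while the prompt, encode, sample, and decode nodes run twice.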
You can also re-start execution halfway instead of from the beginning, which can save a lot of time when using larger models.

To give you an idea of some of the new use cases that this type of UI enables:

* Sketch something → Generate an image with SD from sketch → feed it into SD Video to animate
* Generate an image of an object → Turn into a 3D asset → Feed into interactive experiences
* Input audio → Generate audio-reactive videos

Their Examples page also includes some of the more common use cases like AnimateDiff, etc. They recently launched the Comfy Registry, an online library of different nodes that users can pull from rather than having to build everything from scratch. The project has >60,000 GitHub stars, and as the community grows, some of the projects that people build have gotten quite complex.

The most interesting thing about Comfy is that it’s not a UI, it’s a runtime. You can build full applications on top of image models simply by using Comfy. You can expose Comfy workflows as an endpoint and chain them together just like you chain a single node. We’re seeing the rise of AI Engineering applied to art.

Major Tom’s ComfyUI Resources from the Latent Space Discord

Major shoutouts to Major Tom on the LS Discord, who is an image generation expert and offered these pointers:

* “best thing about comfy is the fact it supports almost immediately every new thing that comes out - unlike A1111 or forge, which still don't support flux cnet for instance.
It will be perfect tool when conflicting nodes will be resolved”
* AP Workflows from Alessandro Perilli are a nice example of an all-in-one train-evaluate-generate system built atop Comfy
* ComfyUI YouTubers to learn from:
  * @sebastiankamph
  * @NerdyRodent
  * @OlivioSarikas
  * @sedetweiler
  * @pixaroma
* ComfyUI Nodes to check out:
  * https://github.com/kijai/ComfyUI-IC-Light
  * https://github.com/MrForExample/ComfyUI-3D-Pack
  * https://github.com/PowerHouseMan/ComfyUI-AdvancedLivePortrait
  * https://github.com/pydn/ComfyUI-to-Python-Extension
  * https://github.com/THtianhao/ComfyUI-Portrait-Maker
  * https://github.com/ssitu/ComfyUI_NestedNodeBuilder
  * https://github.com/longgui0318/comfyui-magic-clothing
  * https://github.com/atmaranto/ComfyUI-SaveAsScript
  * https://github.com/ZHO-ZHO-ZHO/ComfyUI-InstantID
  * https://github.com/AIFSH/ComfyUI-FishSpeech
  * https://github.com/coolzilj/ComfyUI-Photopea
  * https://github.com/lks-ai/anynode
* Sarav: https://www.youtube.com/@mickmumpitz/videos (applied stuff)
* Sarav: https://www.youtube.com/@latentvision (technical, but infrequent)
* look for comfyui node for https://github.com/magic-quill/MagicQuill
* “Comfy for Video” resources
  * Kijai (https://github.com/kijai) pushing out support for Mochi, CogVideoX, AnimateDiff, LivePortrait etc
  * Comfyui node support like LTX https://github.com/Lightricks/ComfyUI-LTXVideo, and HunyuanVideo
  * FloraFauna AI and Krea.ai
* Communities: https://www.reddit.com/r/StableDiffusion/, https://www.reddit.com/r/comfyui/

Full YouTube Episode

As usual, you can find the full video episode on our YouTube (and don’t forget to like and subscribe!)

Timestamps

* 00:00:04 Introduction of hosts and anonymous guest
* 00:00:35 Origins of Comfy UI and early Stable Diffusion landscape
* 00:02:58 Comfy's background and development of high-res fix
* 00:05:37 Area conditioning and compositing in image generation
* 00:07:20 Discussion on different AI image models (SD, Flux, etc.)
* 00:11:10 Closed source model APIs and community discussions on SD versions
* 00:14:41 LoRAs and textual inversion in image generation
* 00:18:43 Evaluation methods in the Comfy community
* 00:20:05 CLIP models and text encoders in image generation
* 00:23:05 Prompt weighting and negative prompting
* 00:26:22 Comfy UI's unique features and design choices
* 00:31:00 Memory management in Comfy UI
* 00:33:50 GPU market share and compatibility issues
* 00:35:40 Node design and parameter settings in Comfy UI
* 00:38:44 Custom nodes and community contributions
* 00:41:40 Video generation models and capabilities
* 00:44:47 Comfy UI's development timeline and rise to popularity
* 00:48:13 Current state of Comfy UI team and future plans
* 00:50:11 Discussion on other Comfy startups and potential text generation support

Transcript

Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.

swyx [00:00:12]: Hey everyone, we are in the Chroma Studio again, but with our first ever anonymous guest, Comfy Anonymous, welcome.

Comfy [00:00:19]: Hello.

swyx [00:00:21]: I feel like that's your full name, you just go by Comfy, right?

Comfy [00:00:24]: Yeah, well, a lot of people just call me Comfy, even when they know my real name. Hey, Comfy.

Alessio [00:00:32]: Swyx is the same. You know, not a lot of people call you Shawn.

swyx [00:00:35]: Yeah, you have a professional name, right, that people know you by, and then you have a legal name. Yeah, it's fine. How do I phrase this? I think people who are in the know, know that Comfy is like the tool for image generation and now other multimodality stuff. I would say that when I first got started with Stable Diffusion, the star of the show was Automatic1111, right? And I actually looked back at my notes from 2022-ish, like Comfy was already getting started back then, but it was kind of like the up and comer, and your main feature was the flowchart.
Can you just kind of rewind to that moment, that year and like, you know, how you looked at the landscape there and decided to start Comfy?

Comfy [00:01:10]: Yeah, I discovered Stable Diffusion in 2022, in October 2022. And, well, I kind of started playing around with it. And back then I was using Automatic, which was what everyone was using back then. And so I started with that because, when I started, I had no idea how Diffusion works. I didn't know how Diffusion models work, how any of this works, so.

swyx [00:01:36]: Oh, yeah. What was your prior background as an engineer?

Comfy [00:01:39]: Just a software engineer. Yeah. Boring software engineer.

swyx [00:01:44]: But like any image stuff, any orchestration, distributed systems, GPUs?

Comfy [00:01:49]: No, I was doing basically nothing interesting. CRUD, web development? Yeah, a lot of web development, just, yeah, maybe some basic automation stuff. Okay. Yeah, no, like, no big companies or anything.

swyx [00:02:08]: Yeah, but like already some interest in automations, probably a lot of Python.

Comfy [00:02:12]: Yeah, yeah, of course, Python. But I wasn't actually used to the node graph interface before I started ComfyUI. I just thought, oh, what's the best way to represent the Diffusion process in the user interface? And then, well, naturally, this is the best way I've found. And this was with the node interface. So how I got started was, yeah, so basically October 2022, I hadn't written a line of PyTorch before that. So it's completely new. What happened was I kind of got addicted to generating images.

Alessio [00:02:58]: As we all did. Yeah.

Comfy [00:03:00]: And then I started experimenting with the high-res fix in Auto. For those that don't know, the high-res fix exists because the Diffusion models back then could only generate at low resolution.
So what you would do, you would generate a low-resolution image, then upscale, then refine it again. And that was kind of the hack to generate high-resolution images. I really liked generating higher resolution images. So I was experimenting with that. And so I modified the code a bit. Okay, what happens if I use different samplers on the second pass? I was editing the code of Auto. So what happens if I use a different sampler? What happens if I use different settings, a different number of steps? Because back then the high-res fix was very basic, just, so. Yeah.

swyx [00:04:05]: Now there's a whole library of just the upsamplers.

Comfy [00:04:08]: I think they added a bunch of options to the high-res fix since then. But before that it was just so basic. So I wanted to go further. I wanted to try: what happens if I use a different model for the second pass? And then, well, the Auto code base wasn't good enough for that. Like, it would have been harder to implement that in the Auto interface than to create my own interface. So that's when I decided to create my own. And you were doing that mostly on your own when you started, or did you already have kind of like a subgroup of people? No, I was on my own because it was just me experimenting with stuff. So yeah, that was it. So I started writing the code January 1, 2023, and then I released the first version on GitHub, January 16th, 2023. That's how things got started.

Alessio [00:05:11]: And what's the name? ComfyUI right away or? Yeah.

Comfy [00:05:14]: ComfyUI. The reason my name is Comfy is people thought my pictures were comfy, so I just named it, it's my Comfy UI. So yeah, that's...

swyx [00:05:27]: Is there a particular segment of the community that you targeted as users?
Like more intensive workflow artists, you know, compared to the Automatic crowd or, you know...

Comfy [00:05:37]: This was my way of experimenting with new things, like the high-res fix thing I mentioned. In Comfy, the first thing you could easily do was just chain different models together. And then one of the first things, I think the first time it got a bit of popularity, was when I started experimenting with applying prompts to different areas of the image. Yeah. I called it area conditioning, posted it on Reddit and it got a bunch of upvotes. So I think that's when people first learned of ComfyUI.

swyx [00:06:17]: Is that mostly like fixing hands?

Comfy [00:06:19]: No, no, no. That was just, like, let's say, well, it was very, well, it still is kind of difficult to, let's say you want a mountain, you have an image and then, okay, I want the mountain here and I want, like, a fox here.

swyx [00:06:37]: Yeah. So compositing the image. Yeah.

Comfy [00:06:40]: My way was very easy. It was just like, oh, when you run the diffusion process, every step you do one pass: okay, this place of the image with this prompt, this place of the image with the other prompt. And then the entire image with another prompt, and then just average everything together, every step. And that was area composition, which I call it. And then a month later, there was a paper that came out called MultiDiffusion, which was the same thing, but yeah, that's...

Alessio [00:07:20]: Could you do area composition with different models, or because you're averaging out, you kind of need the same model?

Comfy [00:07:26]: Could do it with, but yeah, I hadn't implemented it.
For different models, but you can do it with different models if you want, as long as the models share the same latent space, like, we're supposed to ring a bell every time someone says, yeah. Like, for example, you couldn't use SDXL and SD 1.5, because those have a different latent space, but like, yeah, SD 1.5 models, different ones, you could do that.

swyx [00:07:59]: There's some models that try to work in pixel space, right?

Comfy [00:08:03]: Yeah. They're very slow. Of course. That's the problem. That's the reason why Stable Diffusion actually became popular, it was because of the latent space.

swyx [00:08:14]: Small and yeah. Because it used to be latent diffusion models and then they trained it up.

Comfy [00:08:19]: Yeah. Because pixel diffusion models are just too slow. So. Yeah.

swyx [00:08:25]: Have you ever tried to talk to like Stability, the latent diffusion guys, like, you know, Robin Rombach, that crew? Yeah.

Comfy [00:08:32]: Well, I used to work at Stability.

swyx [00:08:34]: Oh, I actually didn't know. Yeah.

Comfy [00:08:35]: I used to work at Stability. I got hired in June 2023.

swyx [00:08:42]: Ah, that's the part of the story I didn't know about. Okay. Yeah.

Comfy [00:08:46]: So the reason I was hired is because they were doing SDXL at the time. SDXL, I don't know if you remember, was a base model and then a refiner model. Basically they wanted to experiment with chaining them together. And then they saw, oh, we can use this to do that. Well, let's hire that guy.

swyx [00:09:10]: But they didn't pursue it for like SD3. What do you mean? Like the SDXL approach. Yeah.

Comfy [00:09:16]: The reason for that approach was because basically they had two models and they wanted to publish both of them. So they trained one on lower time steps, which was the refiner model.
And the first one was trained normally. And then during their tests, they realized, oh, if we string these models together, our quality increases. So let's publish that. It worked. Yeah. But right now, I don't think many people actually use the refiner anymore, even though it is actually a full diffusion model. Like, you can use it on its own and it's going to generate images. People have mostly forgotten about it. But...

Alessio [00:10:05]: Can we talk about models a little bit? So Stable Diffusion obviously is the most known. I know Flux has gotten a lot of traction. Are there any underrated models that people should use more, or what's the state of the union?

Comfy [00:10:17]: Well, the latest state of the art, at least for images, yeah, there's Flux. There's also SD3.5. SD3.5 is two models. There's a small one, 2.5B, and there's the bigger one, 8B. So it's smaller than Flux. And it's more creative in a way, but Flux, yeah, Flux is the best. People should give SD3.5 a try because it's different. I won't say it's better. Well, it's better for some specific use cases. Right. If you want to make something more creative, maybe SD3.5. If you want to make something more consistent, Flux is probably better.

swyx [00:11:06]: Do you ever consider supporting the closed source model APIs?

Comfy [00:11:10]: Well, we do support them as custom nodes. We actually have some official custom nodes from different... Ideogram.

swyx [00:11:20]: Yeah. I guess DALL-E would have one. Yeah.

Comfy [00:11:23]: It's just not, I'm not the person that handles that. Sure.

swyx [00:11:28]: Sure. Quick question on SD. There's a lot of community discussion about the transition from SD1.5 to SD2 and then SD2 to SD3. People still, you know, very loyal to the previous generations of SDs?

Comfy [00:11:41]: Uh, yeah.
SD1.5 still has a lot of users.

swyx [00:11:46]: The last based model.

Comfy [00:11:49]: Yeah. Then SD2 was mostly ignored. It wasn't a big enough improvement over the previous one. Okay.

swyx [00:11:58]: So SD1.5, SD3, Flux and whatever else. SDXL. SDXL.

Comfy [00:12:03]: That's the main one. Stable Cascade. Stable Cascade. That was a good model. But the problem with that one is, SD3 was announced one week after. Yeah.

swyx [00:12:16]: It was like a weird release. What was it like inside of Stability actually? I mean, statute of limitations. Yeah. The statute of limitations expired. You know, management has moved. So it's easier to talk about now. Yeah.

Comfy [00:12:27]: Inside Stability, actually that model was ready like three months before, but it got stuck in red teaming. So basically, if that model had been released when it was supposed to be released by the authors, then it would probably have gotten very popular, since it's a step up from SDXL. But it got all of its momentum stolen by the SD3 announcement. So people kind of didn't develop anything on top of it, even though it's, yeah. It was a good model, at least, mostly ignored for some reason.

swyx [00:13:07]: I think the naming as well matters. It seemed like a branch off of the main tree of development. Yeah.

Comfy [00:13:15]: Well, it was different researchers that did it. Yeah. Yeah. Very good model. Like, it's the Würstchen authors. I don't know if I'm pronouncing it correctly. Yeah. Yeah. Yeah.

swyx [00:13:28]: I actually met them in Vienna. Yeah.

Comfy [00:13:30]: They worked at Stability for a bit and they left right after the Cascade release.

swyx [00:13:35]: This is Dustin, right? No. Dustin's SD3. Yeah.

Comfy [00:13:38]: Dustin is SD3, SDXL. That's Pablo and Dome. I think I'm pronouncing his name correctly. Yeah. Yeah. Yeah. Yeah.
That's very good.

swyx [00:13:51]: It seems like the community moves very quickly. Yeah. Like when there's a new model out, they just drop whatever the current one is. And they just all move wholesale over. Like they don't really stay to explore the full capabilities. Like if Stable Cascade was that good, they would have A/B tested a bit more. Instead they're like, okay, SD3 is out. Let's go. You know?

Comfy [00:14:11]: Well, I find the opposite actually. The community only jumps on a new model when there's a significant improvement. Like if there's only an incremental improvement, which is what most of these models are going to have, especially if you stay at the same parameter count. Yeah. Like you're not going to get a massive improvement unless there's something big that changes. So. Yeah.

swyx [00:14:41]: And how are they evaluating these improvements? Because it's a whole chain of, you know, Comfy workflows. Yeah. How does one part of the chain actually affect the whole process?

Comfy [00:14:52]: Are you talking on the model side specifically?

swyx [00:14:54]: Model specific, right? But like once you have your whole workflow based on a model, it's very hard to move.

Comfy [00:15:01]: Well, not really. Well, it depends on the specific kind of workflow. Yeah.

swyx [00:15:09]: So I do a lot of like text and image. Yeah.

Comfy [00:15:12]: When you do change, like most workflows are kind of going to be complete. Yeah. It's just like, you might have to completely change your prompt, completely change. Okay.

swyx [00:15:24]: Well, I mean, then maybe the question is really about evals. Like what does the Comfy community do for evals? Just, you know...

Comfy [00:15:31]: Well, they don't really do that. It's more like, oh, I think this image is nice.
So that's...

swyx [00:15:38]: They just subscribe to Fofr AI and just see like, you know, what Fofr is doing. Yeah.

Comfy [00:15:43]: Well, they just generate. Like, I don't see anyone really doing it, at least on the Comfy side. Comfy users, it's more like, oh, generate images and see, oh, this one's nice. It's like, yeah, the more scientific checking, that's more specifically on the model side. But there is a lot of vibes also, because it is artistic. You can create a very good model that doesn't generate nice images. Because most images on the internet are ugly. So if you just, oh, I have the best model at 10th giant, it's super smart, I've trained on just all the images on the internet. The images are not going to look good. So yeah.

Alessio [00:16:42]: Yeah.

Comfy [00:16:43]: They're going to be very consistent. But yeah. It's not going to be the look that people are going to be expecting from a model. So. Yeah.

swyx [00:16:54]: Can we talk about LoRAs? Because we talked about models, then the next step is probably LoRAs. Before, I actually, I'm kind of curious how LoRAs entered the tool set of the image community, because the LoRA paper was 2021. And then there were other methods like textual inversion that were popular at the early SD stage. Yeah.

Comfy [00:17:13]: I can't even explain the difference between that. Yeah. Textual inversions. Basically what you're doing is, well, yeah, Stable Diffusion, you have the diffusion model, you have the text encoder. So basically what you're doing is training a vector that you're going to pass to the text encoder. Basically you're training a new word. Yeah.

swyx [00:17:37]: It's a little bit like representation engineering now.
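
The "training a new word" idea can be sketched as a toy: keep the encoder frozen and run gradient descent on a single embedding vector until the encoder's output matches a target. Everything here is an assumption made for illustration: the 2x2 linear `ENCODER`, the `target` features standing in for "what the example images look like", and the squared-error loss. Real textual inversion optimizes the vector through the diffusion model's denoising loss instead.

```python
# Frozen stand-in for the text encoder: a fixed 2x2 linear map (never updated).
ENCODER = [[2.0, 0.0], [1.0, 1.0]]


def encode(vec):
    """Apply the frozen encoder to an embedding vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in ENCODER]


def loss(vec, target):
    """Squared error between the encoded vector and the target features."""
    return sum((o - t) ** 2 for o, t in zip(encode(vec), target))


def train_new_token(target, steps=500, lr=0.05):
    """Learn one new embedding vector; the encoder weights stay untouched."""
    vec = [0.0, 0.0]  # embedding of the new pseudo-word
    for _ in range(steps):
        err = [o - t for o, t in zip(encode(vec), target)]
        # Gradient of the loss with respect to vec: 2 * ENCODER^T @ err
        grad = [2 * sum(ENCODER[r][c] * err[r] for r in range(2))
                for c in range(2)]
        vec = [v - lr * g for v, g in zip(vec, grad)]
    return vec


target = [4.0, 3.0]  # hypothetical features of the concept to capture
new_word = train_new_token(target)
final_loss = loss(new_word, target)
```

Only `new_word` changes during training, which is why the result is a tiny file that plugs into an unmodified model.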
Comfy [00:17:40]: Yeah. Basically. Yeah. So if you know how the text encoder works, basically you take the words of your prompt, you convert those into tokens with the tokenizer, and those are converted into vectors. Basically, yeah, each token represents a different vector. So each word represents a vector. And depending on your words, that's the list of vectors that gets passed to the text encoder, which is just, yeah, just a stack of attention. Basically it's very close to LLM architecture. Yeah. So basically what you're doing is just training a new vector. We're saying, well, I have all these images and I want to know which word does that represent? And you train this vector, and then when you use this vector, it hopefully generates something similar to your images. Yeah.

swyx [00:18:43]: I would say it's like surprisingly sample efficient in picking up the concept that you're trying to train it on. Yeah.

Comfy [00:18:48]: Well, people have kind of stopped doing that, even though back when I was at Stability, we actually did train internally some textual inversions on T5-XXL, and it actually worked pretty well. But for some reason, yeah, people don't use them. And also they might also work like, yeah, this is something I'd probably have to test, but maybe if you train a textual inversion on T5-XXL, it might also work with all the other models that use T5-XXL. Same thing with the textual inversions that were trained for SD 1.5: they also kind of work on SDXL, because SDXL has two text encoders, and one of them is the same as the SD 1.5 CLIP-L. They don't work as strongly because they're only applied to one of the text encoders. And the same thing for SD3. SD3 has three text encoders. So it works.
You can still use your SD 1.5 textual inversion on SD3, but it's just a lot weaker because now there's three text encoders. So it gets even more diluted. Yeah.

swyx [00:20:05]: Do people experiment a lot on, just on the CLIP side, there's like SigLIP, there's BLIP, like do people experiment a lot on those?

Comfy [00:20:12]: You can't really replace. Yeah.

swyx [00:20:14]: Because they're trained together, right? Yeah.

Comfy [00:20:15]: They're trained together. So you can't, well, what I've seen people experimenting with is Long-CLIP. So basically someone fine tuned the CLIP model to accept longer prompts.

swyx [00:20:27]: Oh, it's kind of like long context fine tuning. Yeah.

Comfy [00:20:31]: So it's actually supported in core Comfy.

swyx [00:20:35]: How long is long?

Comfy [00:20:36]: Regular CLIP is 77 tokens. Yeah. Long-CLIP is 256. Okay. But the hack, if you use Stable Diffusion 1.5, you've probably noticed, oh, it still works if I use long prompts, prompts longer than 77 words. Well, that's because the hack is to just split your whole big prompt up in chunks of 77. Let's say you give it some massive text, like the Bible or something: it would split it up in chunks of 77, then just pass each one through the CLIP, and then just concat everything together at the end. It's not ideal, but it actually works.

swyx [00:21:26]: Like the positioning of the words really, really matters then, right? Like this is why order matters in prompts. Yeah.

Comfy [00:21:33]: Yeah. Like, it works, but it's not ideal, but it's what people expect. Like if someone gives a huge prompt, they expect at least some of the concepts at the end to be present in the image. But usually when they give long prompts, they don't expect, like, detail, I think.
So that's why it works very well.

swyx [00:21:58]: And while we're on this topic, prompt weighting, negative prompting, all sort of a similar part of this layer of the stack. Yeah.

Comfy [00:22:05]: The hack for that, which works on CLIP, basically it's just for SD 1.5. Well, for SD 1.5, the prompt weighting works well because CLIP-L is not a very deep model. So you have a very high correlation between the input token vector at a given index and the output token; the concepts are very closely linked. So that means you can interpolate the vector. The way ComfyUI does it is, okay, you have an empty prompt, so you have a CLIP output for the empty prompt, and then you have the one for your prompt. And then it interpolates from that, depending on your prompt. Yeah.

Comfy [00:23:07]: So that's how it does prompt weighting. But this stops working the deeper your text encoder is. So on T5-XXL, it doesn't work at all. So. Wow.

swyx [00:23:20]: Is that a problem for people? I mean, because I'm used to just moving up numbers. Probably not. Yeah.

Comfy [00:23:25]: Well.

swyx [00:23:26]: So you just use words to describe, right? Because it's a bigger language model. Yeah.

Comfy [00:23:30]: Yeah. So. Yeah. So honestly it might be good, but I haven't seen many complaints on Flux that it's not working. Because I guess people can sort of get around it with language. So. Yeah.

swyx [00:23:46]: Yeah. And then coming back to LoRAs, now the popular way to customize models is LoRAs. And I saw you also support LoCon and LoHa, which I've never heard of before.

Comfy [00:23:56]: There's a bunch of, because what the LoRA is essentially is: okay, you have your model and then you want to fine tune it.
So what you could do is you could fine tune the entire thing, but that's a bit heavy. So to speed things up and make things less heavy, what you can do is just fine tune some smaller weights, basically two low rank matrices that, when you multiply them together, represent the difference between the trained weights and your base weights. So training those two smaller matrices, that's a lot less heavy. Yeah.

Alessio [00:24:45]: And they're portable. So you're going to share them. Yeah. It's like easier. And also smaller.

Comfy [00:24:49]: Yeah. That's how LoRAs work. So basically, when inferencing, you can inference with them pretty efficiently, like how ComfyUI does it. When you use a LoRA, it just applies it straight on the weights, so there's only a small delay before the sampling when it applies the weights, and then it's just the same speed as before. So for inference, it's not that bad. And then basically all the LoRA types like LoHa, LoCon, everything, that's just different ways of representing that. Basically, you can call it kind of like compression, even though it's not really compression. It's just different ways of representing, like, okay, I want to train on the difference on the weights. What's the best way to represent that difference? There's the basic LoRA, which is just, oh, let's multiply these two matrices together. And then there's all the other ones, which are all different algorithms. So. Yeah.

Alessio [00:25:57]: So let's talk about LoRA. Let's talk about what ComfyUI actually is. I think most people have heard of it. Some people might've seen screenshots. I think fewer people have built very complex workflows. So when you started, Automatic was like the super simple way. What were some of the choices that you made?
So the node workflow, is there anything else that stands out as like, this was a unique take on how to do image generation workflows?

Comfy [00:26:22]: Well, I feel like, yeah, back then everyone was trying to make an easy to use interface. Yeah. So I'm like, well, everyone's trying to make an easy to use interface.

swyx [00:26:32]: Let's make a hard to use interface.

Comfy [00:26:37]: Like, I don't need to do that, everyone else is doing it. So let me try something else, let me try to make a powerful interface that's not easy to use. So.

swyx [00:26:52]: So like, yeah, there's a sort of node execution engine. Yeah. Yeah. And it actually has this really good list of features of things you prioritize, right? Like let me see, like sort of re-executing from any parts of the workflow that was changed, asynchronous queue system, smart memory management, like all this seems like a lot of engineering that. Yeah.

Comfy [00:27:12]: There's a lot of engineering in the back end to make things work, because I was always focused on making things work locally very well. That's because I was using it locally. So there's a lot of thought and work in getting everything to run as well as possible. So yeah. ComfyUI is actually more of a back end, at least, well, now the front end's getting a lot more development, but before, I was pretty much only focused on the back end. Yeah.

swyx [00:27:50]: So v0.1 was only August this year. Yeah.

Comfy [00:27:54]: With the new front end. Before there was no versioning. So yeah. Yeah. Yeah.

swyx [00:27:57]: And so what was the big rewrite for the 0.1 and then the 1.0?

Comfy [00:28:02]: Well, that's more on the front end side. Because when I first wrote it, I said, okay, I can do web development, but I don't like doing it. Like what's the easiest way I can slap a node interface on this.
And then I found this library. Yeah. Like a JavaScript library. swyx [00:28:26]: LiteGraph? Comfy [00:28:27]: LiteGraph. swyx [00:28:28]: Usually people will go for like React Flow for like a flow builder. Yeah. Comfy [00:28:31]: But that seems like too complicated. So I didn't really want to spend time like developing the front end. So I'm like, well, oh, LiteGraph. This has the whole node interface. So, okay. Let me just plug that into my backend. swyx [00:28:49]: I feel like if Streamlit or Gradio offered something, you would have used Streamlit or Gradio because it's Python. Yeah. Comfy [00:28:54]: Yeah. Yeah. Yeah. Comfy [00:29:00]: Yeah. Comfy [00:29:14]: Yeah. logic and your backend logic and just sticks them together. swyx [00:29:20]: It's supposed to be easy for you guys. If you're a Python main, you know, I'm a JS main, right? Okay. If you're a Python main, it's supposed to be easy. Comfy [00:29:26]: Yeah, it's easy, but it makes your whole software a huge mess. swyx [00:29:30]: I see, I see. So you're mixing concerns instead of separating concerns? Comfy [00:29:34]: Well, it's because... Like frontend and backend. Frontend and backend should be well separated with a defined API. Like that's how you're supposed to do it. Smart people disagree. It just sticks everything together. It makes it easy to make a huge mess. And also there's a lot of issues with Gradio. Like it's very good if all you want to do is just slap a quick interface on your, like to show off your ML project. Like that's what it's made for. Yeah. Like there's no problem using it. Like, oh, I have my code. I just wanted a quick interface on it. That's perfect. Like use Gradio. But if you want to make something that's like real software that will last a long time and will be easy to maintain, then I would avoid it. Yeah. swyx [00:30:32]: So your criticism is Streamlit and Gradio are the same. 
I mean, those are the same criticisms. Comfy [00:30:37]: Yeah, Streamlit I haven't used as much. Yeah, I just looked a bit. swyx [00:30:43]: Similar philosophy. Comfy [00:30:44]: Yeah, it's similar. It just seems to me like, okay, for quick, like AI demos, it's perfect. swyx [00:30:51]: Yeah. Going back to like the core tech, like asynchronous queues, slow re-execution, smart memory management, you know, anything that you were very proud of or was very hard to figure out? Comfy [00:31:00]: Yeah. The thing that's the biggest pain in the ass is probably the memory management. Yeah. swyx [00:31:05]: Were you just paging models in and out or? Yeah. Comfy [00:31:08]: Before it was just, okay, load the model, completely unload it. Then, okay, that works well when your models are small, but if your models are big, let's say someone has a 4090 and the model size is 10 gigabytes, that can take a few seconds to load and unload, load and unload, so you want to try to keep things in memory, in the GPU memory, as much as possible. What ComfyUI does right now is it tries to estimate: okay, you're going to sample this model, it's going to take probably this amount of memory, so let's remove this amount of memory's worth of models that have been loaded on the GPU and then just execute it. But there's a fine line, because you try to remove the least amount of models that are already loaded. And one other problem is the NVIDIA driver on Windows: by default, because there's an option to disable that feature, but by default, if you start loading, you can overflow your GPU memory and then the driver's going to automatically start paging to RAM. But the problem with that is it makes everything extremely slow. 
So when you see people complaining, oh, this model, it works, but oh, s**t, it starts slowing down a lot, that's probably what's happening. So basically you have to just try to use as much memory as possible, but not too much, or else things start slowing down, or people get out of memory, and then just try to find that line where, oh, like the driver on Windows starts paging and stuff. Yeah. And the problem with PyTorch is it's high level, so you don't have that much fine-grained control over, like, specific memory stuff, so you kind of have to leave, like, the memory freeing to Python and PyTorch, which can be annoying sometimes. swyx [00:33:32]: So, you know, I think one thing is, as a maintainer of this project, like, you're designing for a very wide surface area of compute, like, you even support CPUs. Comfy [00:33:42]: Yeah, well, that's... That's just, PyTorch supports CPUs, so, yeah, that's not hard to support. swyx [00:33:50]: First of all, is there a market share estimate, like, is it, like, 70% NVIDIA, like, 30% AMD, and then, like, miscellaneous on Apple Silicon, or whatever? Comfy [00:33:59]: For Comfy? Yeah. Yeah, and, yeah, I don't know the market share. swyx [00:34:03]: Can you guess? Comfy [00:34:04]: I think it's mostly NVIDIA. Right. Because AMD, the problem, like, AMD works horribly on Windows. Like, on Linux, it works fine. It's slower than the price equivalent NVIDIA GPU, but it works, like, you can use it, you generate images, everything works. On Linux. On Windows, you might have a hard time, so, that's the problem, and I think most people who bought AMD probably use Windows. They probably aren't going to switch to Linux, so... Yeah. 
So, until AMD actually, like, ports their, like, ROCm to Windows properly, and then there's actually PyTorch, I think they're doing that, they're in the process of doing that, but until they get a good, like, PyTorch ROCm build that works on Windows, it's, like, they're going to have a hard time. Yeah. Alessio [00:35:06]: We got to get George on it. Yeah. Well, he's trying to get Lisa Su to do it, but... Let's talk a bit about, like, the node design. So, unlike all the other text-to-image, you have a very, like, deep, so you have, like, a separate node for, like, CLIP encode, you have a separate node for, like, the KSampler, you have, like, all these nodes. Going back to, like, the making it easy versus making it hard, but, like, how much do people actually play with all the settings, you know? Kind of, like, how do you guide people to, like, hey, this is actually going to be very impactful versus this is maybe, like, less impactful, but we still want to expose it to you? Comfy [00:35:40]: Well, I try to expose, like, I try to expose everything, but, yeah, for example, for the samplers, there's, like, yeah, four different sampler nodes, which go from easiest to most advanced. So, yeah, if you go, like, the easy node, the regular sampler node, you have just the basic settings. But if you use, like, the sampler advanced... If you use, like, the custom advanced node, that one you can actually, you'll see you have, like, different nodes. Alessio [00:36:19]: I'm looking it up now. Yeah. What are, like, the most impactful parameters that you use? So, it's, like, you know, you can have more, but, like, which ones, like, really make a difference? Comfy [00:36:30]: Yeah, they all do. They all have their own, like, for example, yeah, steps. Usually you want steps, you want them to be as low as possible. 
But if you're optimizing your workflow, you lower the steps until, like, the images start deteriorating too much. Because that, yeah, that's the number of steps you're running the diffusion process. So, if you want things to be faster, lower is better. But, yeah, CFG, you can kind of see that as the contrast of the image. Like, if your image looks too bursty, then you can lower the CFG. So, yeah, CFG, that's how strongly the, like, the negative versus positive prompt. Because when you sample a diffusion model with a negative prompt, it's basically just, yeah, positive prediction minus negative prediction. swyx [00:37:32]: Contrastive loss. Yeah. Comfy [00:37:34]: It's positive minus negative, and the CFG is the multiplier. Yeah. Yeah. Yeah, so. Alessio [00:37:41]: What are, like, good resources to understand what the parameters do? I think most people start with Automatic, and then they move over, and it's, like, snap, CFG, sampler name, scheduler, denoise. Read it. Comfy [00:37:53]: But, honestly, well, it's something you should, like, try out yourself. I don't know, you don't necessarily need to know how it works to know, like, what it does. Because even if you know, like, CFG, oh, it's, like, positive minus negative prompt. Yeah. So the only thing you need to know about CFG is if it's 1.0, then that means the negative prompt isn't applied. It also means sampling is two times faster. But, yeah. But other than that, it's more, like, you should really just see what it does to the images yourself, and you'll probably get a more intuitive understanding of what these things do. Alessio [00:38:34]: Any other nodes or things you want to shout out? Like, I know AnimateDiff, IP-Adapter. Those are, like, some of the most popular ones. Yeah. What else comes to mind? Comfy [00:38:44]: Not nodes, but there's, like, what I like is when some people, sometimes they make things that use ComfyUI as their backend. 
Like, there's a plugin for Krita that uses ComfyUI as its backend. So you can use, like, all the models that work in Comfy in Krita. And I think I've tried it once. But I know a lot of people use it, and it's probably really nice, so. Alessio [00:39:15]: What's the craziest node that people have built, like, the most complicated? Comfy [00:39:21]: Craziest node? Like, yeah. I know some people have made, like, video games in Comfy with, like, stuff like that. So, like, I remember, like, yeah, I think it was last year, someone made, like, a Wolfenstein 3D in Comfy. Of course. And then one of the inputs was, oh, you can generate a texture, and then it changes the texture in the game. So you can plug it to, like, the workflow. And if you look there, there's a lot of crazy things people do, so. Yeah. Alessio [00:39:59]: And now there's, like, a node registry that people can use to, like, download nodes. Yeah. Comfy [00:40:04]: Like, well, there's always been the, like, the ComfyUI Manager. Yeah. But we're trying to make this more, like, I don't know, official, like, with, yeah, with the node registry. Because before the node registry, like, okay, how did your custom node get into ComfyUI Manager? That's the guy running it who, like, every day he searched GitHub for new custom nodes and added them manually to his custom node manager. So we're trying to make it less effort for him, basically. Yeah. Alessio [00:40:40]: Yeah. But I was looking, I mean, there's, like, a YouTube download node. There's, like, this is almost like, you know, a data pipeline more than, like, an image generation thing at this point. It's, like, you can get data in, you can, like, apply filters to it, you can generate data out. Comfy [00:40:54]: Yeah. You can do a lot of different things. Yeah. So I'm thinking, I think what I did is I made it easy to make custom nodes. So I think that helped a lot. 
I think that helped a lot for, like, the ecosystem because it is very easy to just make a node. So, yeah, a bit too easy sometimes. Then we have the issue where there's a lot of custom node packs which share similar nodes. But, well, that's, yeah, something we're trying to solve by maybe bringing some of the functionality into the core. Yeah. Yeah. Yeah. Alessio [00:41:36]: And then there's, like, video. People can do video generation. Yeah. Comfy [00:41:40]: Video, that's, well, the first video model was, like, Stable Video Diffusion, which was last, yeah, exactly last year, I think. Like, one year ago. But that wasn't a true video model. So it was... swyx [00:41:55]: It was, like, moving images? Yeah. Comfy [00:41:57]: It generated video. What I mean by that is it's, like, it's still 2D latents. That's basically what I'm trying to say. So what they did is they took SD2, and then they added some temporal attention to it, and then trained it on videos and all. So it's kind of, like, AnimateDiff, like, same idea, basically. Why I say it's not a true video model is that you still have, like, the 2D latents. Like, a true video model, like Mochi, for example, would have 3D latents. Mm-hmm. Alessio [00:42:32]: Which means you can, like, move through the space, basically. It's the difference. You're not just kind of, like, reorienting. Yeah. Comfy [00:42:39]: And it's also, well, it's also because you have a temporal VAE. Mm-hmm. Also, like, Mochi has a temporal VAE that compresses on, like, the temporal direction, also. So that's something you don't have with, like, yeah, AnimateDiff and Stable Video Diffusion. They only, like, compress spatially, not temporally. Mm-hmm. Right. So, yeah. That's why I call those, like, true video models. There's, yeah, there's actually a few of them, but the one I've implemented in Comfy is Mochi, because that seems to be the best one so far. Yeah. swyx [00:43:15]: We had AJ come and speak at the Stable Diffusion meetup. 
The other open one I think I've seen is CogVideo. Yeah. Comfy [00:43:21]: CogVideo. Yeah. That one's, yeah, it also seems decent, but, yeah. Chinese, so we don't use it. No, it's fine. It's just, yeah, I could. Yeah. It's just that it's not the only one. There's also a few others, which I. swyx [00:43:36]: The rest are, like, closed source, right? Like, Kling. Yeah. Comfy [00:43:39]: Closed source, there's a bunch of them. But I mean, open. I've seen a few of them. Like, I can't remember their names, but there's CogVideo, the big one. Then there's also a few of them that released at the same time. There's one that released at the same time as SD 3.5, same day, which is why I don't remember the name. swyx [00:44:02]: We should have a release schedule so we don't conflict on each of these things. Yeah. Comfy [00:44:06]: I think SD 3.5 and Mochi released on the same day. So everything else was kind of drowned, completely drowned out. So for some reason, lots of people picked that day to release their stuff. Comfy [00:44:21]: Yeah. Which is, well, a shame for those. And I think OmniGen also released the same day, which also seems interesting. Yeah. Yeah. Alessio [00:44:30]: What's Comfy? So you are Comfy. And then there's like, comfy.org. I know we do a lot of things for, like, Nous Research and those guys also have kind of like a more open source thing going on. How do you work? Like you mentioned, you mostly work on like, the core piece of it. And then what... Comfy [00:44:47]: Maybe I should fill it in because I, yeah, I feel like maybe, yeah, I only explained part of the story. Right. Yeah. Maybe I should explain the rest. So yeah. So yeah. Basically, January 2023, January 16, 2023, that's when Comfy was first released to the public. Then, yeah, did a Reddit post about the area composition thing somewhere in, I don't remember exactly, maybe end of January, beginning of February. 
And then someone, a YouTuber, made a video about it, like Olivio, he made a video about Comfy in March 2023. I think that's when there was a real burst of attention. And by that time, I was continuing to develop it and people were starting to use it more, which unfortunately meant that, I had first written it to do like experiments, but then my time to do experiments went down. It started going down, because people were actually starting to use it then. Like, I had to, and I said, well, yeah, time to add all these features and stuff. Yeah, and then I got hired by Stability in June 2023. Then I made, basically, yeah, they hired me because they wanted the SDXL. So I got the SDXL working very well with the UI, because they were experimenting with ComfyUI in-house. Actually, how the SDXL release worked is, for some reason, they released the code first, but they didn't release the model checkpoint. So they released the code. And then, well, since the research code was released, I released the code in ComfyUI too. And then the checkpoints were basically early access. People had to sign up and they only allowed a lot of people from edu emails. Like if you had an edu email, like they gave you access basically to the SDXL 0.9. And, well, that leaked. Right. Of course, because of course it's going to leak if you do that. Well, the only way people could easily use it was with Comfy. So, yeah, people started using it. And then I fixed a few of the issues people had. So then the big 1.0 release happened. And, well, ComfyUI was the only way a lot of people could actually run it on their computers. Because, like, Automatic was so inefficient and bad that most people couldn't actually, like it just wouldn't work. Like because he did a quick implementation. So people were forced. 
To use ComfyUI, and that's how it became popular because people had no choice. swyx [00:47:55]: The growth hack. Comfy [00:47:56]: Yeah. swyx [00:47:56]: Yeah. Comfy [00:47:57]: Like everywhere, like people who didn't have the 4090, who had just regular GPUs, they didn't have a choice. Alessio [00:48:05]: So yeah, I got a 4070. So think of me. And so today, is there like a core Comfy team or? Comfy [00:48:13]: Uh, yeah, well, right now, um, yeah, we are hiring. Okay. Actually, so right now, um, the core core itself, it's me. Uh, but the reason is that all the focus has been mostly on the front end right now, because that's the thing that's been neglected for a long time. So, uh, most of the focus right now is all on the front end, but we are, uh, yeah, we will soon get more people to like help me with the actual backend stuff. Yeah. So, no, I'm not going to say a hundred percent, because once we have our V1 release, which is going to be the packaged ComfyUI with the nice interface and easy to install on Windows and hopefully Mac. Uh, yeah. Yeah. Once we have that, uh, we're going to have lots of stuff to do on the backend side and also the front end side, but, uh. Alessio [00:49:14]: What's the release? I'm on the wait list. What's the timing? Comfy [00:49:18]: Uh, soon. Uh, soon. Yeah, I don't want to promise a release date. We do have a release date we're targeting, but I'm not sure if it's public. Yeah, and we're still going to continue doing the open source, making ComfyUI the best way to run Stable Diffusion models. At least the open source side, it's going to be the best way to run models locally. But we will have a few things to make money from it, like cloud inference or that type of thing. And maybe some things for some enterprises. swyx [00:50:08]: I mean, a few questions on that. 
How do you feel about the other Comfy startups? Comfy [00:50:11]: I mean, I think it's great. They're using your name. Yeah, well, it's better they use Comfy than they use something else. Yeah, that's true. It's fine. We're going to try not to... We don't want to... We want people to use Comfy. Like I said, it's better that people use Comfy than something else. So as long as they use Comfy, I think it helps the ecosystem. Because more people, even if they don't contribute directly, the fact that they are using Comfy means that people are more likely to join the ecosystem. So, yeah. swyx [00:50:57]: And then would you ever do text? Comfy [00:50:59]: Yeah, well, you can already do text with some custom nodes. So, yeah, it's something I've wanted to eventually add to core, but it's not a very high priority. But a lot of people use text for prompt enhancement and other things like that. So, yeah, it's just that my focus has always been on diffusion models. Yeah, unless some text diffusion model comes out. swyx [00:51:30]: Yeah, David Holz is investing a lot in text diffusion. Comfy [00:51:34]: Yeah, well, if a good one comes out, then we'll probably implement it since it fits with the whole... swyx [00:51:39]: Yeah, I mean, I imagine it's going to be closed source at Midjourney. Yeah. Comfy [00:51:43]: Well, if an open one comes out, then I'll probably implement it. Alessio [00:51:54]: Cool, Comfy. Thanks so much for coming on. This was fun. Bye. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Happy New Year, friends. Thanks for all the love on the Latent Space Live and 100th episode
end-of-year recap. Your support has boosted us 30 places in the podcast charts, and that always
helps us book great guests and organise more industry events for you. We don't say this enough,
but thank you to everyone who has left a review on Apple Podcasts or subscribed to our new YouTube
channel. Last year, we broke new ground when we interviewed our first public company CEO,
Drew Houston, and our first technology cabinet member, Minister Josephine Teo,
and had our first year with full coverage of leading labs across Meta, OpenAI, Anthropic,
Reka, Liquid, and Google DeepMind.
For our 101st episode, we are proud to introduce another first, with our first anonymous guest.
As swyx mentions in the episode, Latent Space was started in the immediate aftermath of Stable Diffusion,
and the uncredentialed software engineers it enabled
set the stage for the LLM wave that was to come with ChatGPT.
The earliest winner of the Stable Diffusion tooling wars
was SD WebUI, a Gradio app by the anonymous young creator
AUTOMATIC1111 that quickly amassed over 100,000 GitHub stars
for how it rapidly shipped plugins and usable interfaces
for the rapidly growing Stable Diffusion ecosystem.
However, these days, the power tool of choice is ComfyUI by today's guest, comfyanonymous,
who is gracing us with his first ever podcast appearance today.
The shift from AUTOMATIC1111 to ComfyUI reflects a shift in the image diffusion space
away from prompting and tweaking settings in 2022 to more complex and parallel workflows
chaining together different models and orchestrating long-running operations that can also
include video processing, visualized on an intuitive canvas instead of long YAML or code blocks.
Because ComfyUI is open source, there are now multiple Y Combinator start-ups built off of a Comfy workflow,
or offering ComfyUI as a service directly.
Interestingly enough, this same workflow tooling has not seemed to take off for other modalities yet,
but perhaps 2025 is the year diffusion tooling diffuses to non-image domains.
In other news, we have just announced the second AI engineer summit in New York City.
We are bringing back the surprisingly successful AI leadership track from World's Fair.
And also the single track AI engineering track is now wholly focused on agents at work.
If you are building agents in 2025, this is the single best conference to attend.
Head to apply.ai.engineer and see you there.
Watch out and take care.
Hey, everyone.
Welcome to the Latent Space podcast.
This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host swyx, founder of Smol AI.
Hey, everyone.
We are in the Chroma studio again, but with our first ever anonymous guest, comfyanonymous, welcome.
I feel like that's your full name.
You just go by Comfy, right?
Yeah, well, a lot of people just call me comfy, even though, even when they know my real name.
Hey, hey, comfy.
swyx is the same.
Not a lot of people call you Shawn.
Yeah, yeah, you have a professional
name, right, that people know you by, and then you have a legal name. Yeah, it's fine. How do I phrase this?
Like, I think people who are in the know, know that Comfy is like the tool for image generation and now other
multimodality stuff. I would say that when I first got started with Stable Diffusion, the star of the
show was AUTOMATIC1111, right? And I actually looked back in my notes from 2022-ish. Like, Comfy was already
getting started back then, but it was kind of like the up-and-comer and like your main feature was
the flowchart. Can you just kind of rewind to that moment,
that year, and how you looked at the landscape there and decided to start Comfy?
Yeah, I discovered Stable Diffusion in 2022, in October 2022.
And, well, I kind of started playing around with it.
And back then I was using Automatic, which was what everyone was using back then.
So I started with that.
Because when I started, I had no idea how diffusion models work, how
any of this works.
Oh, yeah.
What was your prior background as an engineer?
Just a software engineer.
Yeah, boring software engineer.
But like any image stuff, any orchestration, distributed systems, GPUs.
No, I was doing basically nothing interesting.
Crud, web development?
Yeah, well, not web development, just, yeah, maybe some basic, like, automation stuff.
Okay.
Or just, yeah, no big companies or anything.
Yeah, but already some interest in automations, probably a lot of Python.
Yeah, yeah, of course, Python.
But I wasn't actually used to the node graph interface before I started ComfyUI.
It was just, I just thought, oh, like what's the best way to represent the diffusion process in the user interface?
And then I'm like, oh, well, naturally,
this is the best way I found.
And this was with the node interface.
So how I got started was, yeah.
So basically October 2022; I hadn't written a line of PyTorch before that.
So it's completely new.
What happened was I kind of got addicted to generating images.
As we all did.
Yeah.
And then I started experimenting with the hires fix in Auto,
which, for those that don't know, the hires fix is just,
since the diffusion models back then could only generate at low resolution,
what you would do is generate a low-resolution image,
then upscale, then refine it again.
And that was kind of the hack to generate high-resolution images.
I really liked generating higher resolution images, so I was experimenting with that.
And so I modified the code a bit. Okay, what happens if I use different samplers on the second pass?
So I just edited the code of Auto.
So what happens if I use a different sampler?
What happens if I use different settings, a different number of steps?
Because back then the hires fix was very basic. So, yeah. Now there's a whole library of just
the upsamplers. Yeah, I think they added a bunch of options to the hires fix
since then, but before that it was just so basic. So I wanted to go further. I wanted to try:
what happens if I use a different model for the second pass? And then, well,
the Auto code base wasn't good enough for that. It would have been harder to implement that in the Auto interface than to create my own interface.
So that's when I decided to create my own.
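To make the two-pass idea above concrete, here is a minimal sketch of the control flow only: generate small, upscale, then run a second pass that can be swapped for a different sampler, settings, or model. Everything in it is an assumption for illustration: `hires_fix`, `upscale_nearest`, and the two toy pass functions are hypothetical stand-ins, not code from Automatic or ComfyUI, and the arrays stand in for images.

```python
import numpy as np

def upscale_nearest(image: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upscale of an (H, W) array by an integer factor."""
    return image.repeat(factor, axis=0).repeat(factor, axis=1)

def hires_fix(first_pass, second_pass, factor: int = 2) -> np.ndarray:
    """Generate at low resolution, upscale, then refine at the higher resolution.

    first_pass and second_pass are plain callables, so the second pass can
    use a different sampler, different settings, or a different model.
    """
    low_res = first_pass()
    upscaled = upscale_nearest(low_res, factor)
    return second_pass(upscaled)

# Hypothetical stand-ins for the two sampling passes (not real diffusion).
def toy_first_pass() -> np.ndarray:
    return np.arange(4, dtype=np.float32).reshape(2, 2)

def toy_second_pass(image: np.ndarray) -> np.ndarray:
    # Placeholder for "partially re-noise and denoise again".
    return image + 0.5

result = hires_fix(toy_first_pass, toy_second_pass, factor=2)
print(result.shape)
```

Because the second pass is just an argument, trying "a different model for the second pass" means passing a different callable, which is roughly what chaining nodes in a graph gives you.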
And you were doing that mostly on your own when you started or did you already have kind of like a subgroup?
No, I was on my own because it was just me experimenting with stuff.
So, yeah, that was it.
So I started writing the code January 1, 2023.
And then I released the first version on GitHub, January 16, 2023.
That's how things got started.
And was the name ComfyUI right away?
Yeah, ComfyUI.
The reason my name is Comfy is people thought my pictures were comfy.
So I just named it: it's my Comfy
UI. So yeah, that's...
Is there a particular segment of the community that you targeted as users?
Like more intensive workflow artists, you know, compared to the automatic crowd or, you know?
This was my way of like experimenting with new things.
Like the hires fix thing I mentioned: in Comfy,
the first thing you could easily do was just chain different models together.
And then one of the first things, I think the first time it got
a bit of popularity, was when I started experimenting with
applying prompts to different areas of the image.
Yeah, I called it area conditioning.
Posted it on Reddit and it got a bunch of upvotes.
So I think that's when people first learned of ComfyUI.
Is it mostly like fixing hands?
Oh, no, that was just like, let's say, well, it was very, well, it still is
kind of difficult to, like, let's say you want a mountain.
You have an image, and then, okay, I want a mountain here, and I want, like, a fox here.
Yeah, so compositing the image.
Yeah, by way, it was very easy.
It was just, like, when you run the diffusion process, every step you do
one pass through the diffusion model per prompt:
okay, this
place of the image with this prompt, this place of the image with the other prompt,
and then the entire image with another prompt, and then just average everything together,
every step. And that was area composition, which is what I called it. And then a month later,
there was a paper that came out called MultiDiffusion, which was the same thing.
But yeah, that's...
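The averaging step described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not ComfyUI's implementation: `combine_area_predictions` is a hypothetical helper, each "prediction" is a full-size array standing in for one model pass with one prompt, and each mask marks the area that prompt applies to.

```python
import numpy as np

def combine_area_predictions(predictions, masks):
    """Mask-weighted average of per-prompt predictions for one sampling step.

    predictions: list of (H, W) arrays, one model pass per prompt.
    masks: list of (H, W) arrays, 1 where that prompt applies, 0 elsewhere.
    """
    preds = np.stack(predictions).astype(np.float64)
    weights = np.stack(masks).astype(np.float64)
    total = weights.sum(axis=0)
    if np.any(total == 0):
        raise ValueError("every position must be covered by at least one prompt")
    return (preds * weights).sum(axis=0) / total

h, w = 2, 4
left = np.zeros((h, w))
left[:, :2] = 1            # area for the first prompt
right = np.zeros((h, w))
right[:, 2:] = 1           # area for the second prompt
whole = np.ones((h, w))    # prompt applied to the entire image

# Constant arrays stand in for the three model outputs at this step.
pred_left = np.full((h, w), 1.0)
pred_right = np.full((h, w), 3.0)
pred_whole = np.full((h, w), 5.0)

combined = combine_area_predictions(
    [pred_left, pred_right, pred_whole], [left, right, whole]
)
print(combined)
```

In the left half the result is the average of the first prompt and the whole-image prompt; in the right half it is the average of the second prompt and the whole-image prompt. A real sampler would repeat this at every step.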
Could you do area composition with different models? Or because you're
averaging out, you kind of need the same model.
You could do it with, but yeah, I hadn't implemented it for different models, but you can do it with different models if you want, as long as the models share the same latent space.
We're supposed to ring a bell every time someone says latent space.
Yeah, like for example, you couldn't use like SDXL and SD1.5 because those have different latent spaces, but like, yeah, SD1.5 models, different ones, you could do that.
There's some models that try to work in pixel space, right?
Yeah, they're very slow.
Of course.
That's the problem.
That's the reason why Stable Diffusion actually became popular:
it was because of the latent space.
Yeah, small and, yeah, because it used to be latent diffusion models and then they
trained it up.
Yeah, because pixel diffusion models are just too slow.
So, yeah.
Have you ever tried to talk to like Stability, the latent diffusion guys, like, you know, Robin Rombach, that crew?
Yeah, well, I used to work at Stability.
Oh, I actually didn't know.
Yeah, I used to work at Stability. I got hired in June 2023.
Ah, that's the part of the story I didn't know about. Okay.
So the reason I was hired is because they were doing SDXL at the time.
And SDXL, I don't know if you remember, it was a base model and then a refiner model.
Basically, they wanted to experiment, like, chaining them together.
And then they saw, oh, we can use this to do that.
Well, let's hire that guy.
But they didn't pursue it for like SD3.
What do you mean?
Like the SDXL approach.
Yeah.
The reason for that approach was because basically they had two models.
And then they wanted to publish both of them.
So they trained one on lower time steps, which was the refiner model.
And then the first one was trained normally.
And then during their tests, they realized, oh, like if we string these models together, our
quality increases.
So let's publish that.
It worked.
Yeah.
But like right now, I don't think many people actually use the refiner anymore,
even though it is actually a full diffusion model.
You can use it on its own and it's going to generate images.
I don't think anyone, people have mostly forgotten about it.
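As a rough illustration of "stringing these models together", the sketch below splits one sampling schedule so that a base model handles the early steps and a refiner handles the last ones, which is where a model trained on the lower timesteps would apply. The function name `split_schedule` and the switch point of 16 out of 20 steps are assumptions for illustration, not values from the conversation or from Stability's code.

```python
def split_schedule(total_steps: int, switch_step: int):
    """Assign sampling steps to a base model and a refiner model.

    Steps are numbered from 0 (most noise) to total_steps - 1 (least noise).
    The base model runs steps [0, switch_step); the refiner, trained on the
    lower (less noisy) timesteps, runs steps [switch_step, total_steps).
    """
    if not 0 <= switch_step <= total_steps:
        raise ValueError("switch_step must be between 0 and total_steps")
    base_steps = list(range(0, switch_step))
    refiner_steps = list(range(switch_step, total_steps))
    return base_steps, refiner_steps

# Hypothetical split: 20 steps in total, refiner takes over for the last 4.
base_steps, refiner_steps = split_schedule(total_steps=20, switch_step=16)
print(len(base_steps), len(refiner_steps))
```

Setting `switch_step` equal to `total_steps` leaves the refiner with nothing to do, which matches the observation that many people now skip it.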
Can we talk about models a little bit?
So stable diffusion, obviously, is the most known.
I know Flux has gotten a lot of traction.
Are there any underrated models that people should use more?
Or what's the state of the union?
Well, the latest state of the art, at least for images,
yeah, there's Flux.
There's also SD3.5.
SD3.5 is two models.
There's a small one, 2.5B, and there's the bigger one, 8B.
So it's smaller than Flux, and it's more creative in a way.
But Flux, yeah, Flux is the best.
People should give SD3.5 a try because it's different.
I won't say it's better.
Well, it's better for some specific use cases.
If you want to make something more creative, maybe SD3.5.
If you want to make something more consistent, Flux is probably better.
Do you ever consider supporting the closed source model APIs?
Well, we do support them as custom nodes.
We actually have some official custom nodes from different...
Ideogram.
Yeah.
I guess DALL-E would have one.
Yeah, it's just that another person handles that.
Sure, sure.
Quick question on SD, there's a lot of community discussion about the transition from SD1.5 to SD2 and then SD2 to SD3.
People are still, like, you know, very loyal to the previous generations of SD.
Yeah, SD1.5 still has a lot of users.
The last base model.
Yeah.
Then SD2 was mostly ignored.
It wasn't a big enough improvement over the previous one.
Okay, so SD1.5, SD3, Flux, and whatever else.
SDXL.
That's the main one.
Stable Cascade?
Stable Cascade.
That was a good model.
The problem with that one is that SD3 was announced one week after.
Yeah, it was like a weird release.
What was it like inside of Stability, actually?
I mean, statute of limitations expired.
You know, management has moved.
It's easier to talk about now.
Yeah.
Inside Stability, actually, that model was ready like three months before,
but it got stuck in red teaming.
So basically, if that model had been released
when it was supposed to be released by the authors,
then it would probably have gotten very popular, since
it's a step up from SDXL. But it got all of its momentum stolen by the SD3 announcement,
so people kind of didn't develop anything on top of it.
It was a good model that got almost completely ignored for some reason.
It seemed, I think the naming as well matters.
It seemed like a branch off of the main tree of developments.
Yeah, well, it was different researchers that
did it.
Yeah.
Very, like, a good model.
Like, it's the Würstchen authors.
I'm not sure if I'm pronouncing it right.
Würstchen.
Yeah.
Yeah.
I actually met them in Vienna.
Yeah, they worked at stability for a bit, and they left right after the Cascade release.
This is Dustin, right?
No.
Dustin's SD3.
No.
Dustin is SD3.
So that's Pablo and Dome.
I think I'm pronouncing his name
correctly.
Yeah, that's very good.
It seems like the community is very, they move very quickly.
Yeah.
Like when there's a new model out, they just drop whatever the current one is and they
just all move wholesale over.
Like they don't really stay to explore the full capabilities.
Like, if Stable Cascade was that good, they would have A/B tested a bit more.
Instead, they're like, okay, SD3 is out.
Let's go.
Well, I find the opposite, actually.
The community only jumps on a new model when there's a
significant improvement.
If there's only an incremental improvement,
which is what most of these models are going to have,
especially if you stay at the same parameter count...
Yeah.
...you're not going to get a massive improvement
unless there's something big that changes.
So, yeah.
And how are they evaluating these improvements?
Like, because there's a whole chain
of, you know, comfy workflows.
How does one part of the chain actually affect the whole process?
Are you talking about the model side specifically?
Model specific, right?
But once you have your whole workflow based on a model, it's very hard to move.
Well, not really.
It depends on the specific kind of workflow.
Yeah.
So I do a lot of text-to-image.
Yeah.
When you do change models, most workflows are kind of going to be
compatible between different models.
It's just that you might have to completely change your prompt, completely change...
Okay, well, I mean, maybe the question is really about evals.
Like, what does the Comfy community do for evals?
Just, you know.
Well, they don't really do them.
It's more like, oh, I think this image is nice.
They just subscribe to fofr.
Yeah, I just see, like, you know, what fofr is doing.
Yeah, well, they just generate. I don't see anyone really
doing that, at least on the Comfy side. For Comfy users, it's more like, generate images
and see whether this one's nice. It's not like the more
scientific checking. That's more specifically on the model side.
Yeah.
But there is a lot of vibes also, because it is, like, artistic.
You can create a very good model that doesn't generate nice images,
because most images on the Internet are ugly.
So if you just say, oh, I have the best model,
it's super smart,
I trained it on just all the images on the Internet,
the images are not going to look good.
Yeah, yeah.
They're going to be very consistent.
But, yeah.
It's not going to be like the look that people are going to be expecting from a model.
Can we talk about LoRAs?
Because we talked about models.
Then the next step is probably LoRAs.
Actually, I'm kind of curious how LoRAs entered the tool set of the image community,
because the LoRA paper was 2021.
And then there were other methods like textual inversion that were popular at the early SD stage.
Yeah, I can explain the difference between, like,
textual inversions. Basically, well, with Stable Diffusion,
you have the diffusion model.
You have the text encoder.
So basically what you're doing is training a vector that you're going to pass to the text encoder.
Basically, you're training a new word.
Yeah, it's a little bit like representation engineering now.
Yeah, basically. So if you know how the text encoder works,
basically, you take the words of your prompt, you convert those into tokens with the tokenizer, and those are converted into vectors.
Basically, each token represents a different vector. So each word represents a vector. And depending on your words, that's the list of vectors that gets passed to the text encoder, which is just a stack of attention.
Like, basically it's very close to an LLM architecture.
Yeah, so basically what you're doing is just training a new vector.
You're saying, well, I have all these images and I want to know which word represents them.
And you train this vector.
And then when you use this vector, it hopefully generates something similar to your images.
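The "training a new word" idea can be sketched in a few lines of Python. This is a toy illustration only: the `encode` function is a hypothetical stand-in (a plain average) for the frozen text encoder, the target vector stands in for "what the images look like", and all names are invented for the example. Real textual inversion backpropagates the diffusion training loss through the frozen models. What the sketch keeps is the key property: only the new token's vector is updated, and every existing weight stays frozen.

```python
def encode(token_vectors):
    """Toy stand-in for a frozen text encoder: averages the token vectors."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def train_new_token(frozen_table, prompt, new_token, target, steps=200, lr=0.5):
    """Learn one vector for new_token by gradient descent on a squared error.

    frozen_table maps existing tokens to vectors and is never modified.
    """
    new_vector = [0.0] * len(target)
    # Fraction of the averaged output that comes from the new token.
    share = prompt.count(new_token) / len(prompt)
    for _ in range(steps):
        vectors = [new_vector if tok == new_token else frozen_table[tok]
                   for tok in prompt]
        output = encode(vectors)
        # Gradient of sum((output - target)^2) with respect to new_vector.
        grad = [2.0 * (o - t) * share for o, t in zip(output, target)]
        new_vector = [v - lr * g for v, g in zip(new_vector, grad)]
    return new_vector


frozen_table = {"a": [1.0, 0.0], "photo": [0.0, 1.0]}
learned = train_new_token(frozen_table, ["a", "photo", "<new>"], "<new>",
                          target=[1.0, 1.0])
```

The result is a single small vector, which is why textual inversions are tiny compared with a fine-tuned checkpoint.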
Yeah.
I would say it's like surprisingly sample efficient in picking up the concept that you're trying to train it on.
Yeah. Well, people have kind of stopped doing that, even though back when I was at Stability, we actually did train some textual inversions internally on T5XXL.
It actually worked pretty well. But for some reason, yeah, people don't use them. And also, that's just something
I'd probably have to test,
but maybe if you train a textual inversion
on T5XXL,
it might also work with all the other models
that use T5XXL.
Because it's the same thing with
the textual inversions
that were trained for SD1.5:
they also kind of work on SDXL,
because SDXL has
two text encoders,
and one of them is the same as the
SD1.5 CLIP-L.
So they actually work, but
they don't work as strongly, because they're only applied to one of the text encoders.
And the same thing for SD3.
SD3 has three text encoders.
So it works.
You can still use your SD1.5 textual inversion on SD3.
But it's just a lot weaker, because now there are three text encoders.
So it gets even more diluted.
Yeah.
Do people experiment a lot just on the CLIP side?
There's, like, SigLIP, there's BLIP.
Do people experiment a lot on...
You can't really replace it.
Yeah, because they're trained together, right?
Yeah, they're trained together, so you can't...
Like, what I've seen people experimenting with is Long-CLIP.
So basically someone fine-tuned the CLIP model to accept longer prompts.
Oh, it's kind of like long-context fine-tuning.
Yeah, sort of. It's actually supported in core Comfy.
How long is long?
Regular CLIP is 77 tokens.
Long-CLIP is 256.
But the hack, if you use Stable Diffusion 1.5, you've probably noticed:
oh, it still works if I use prompts longer than 77 words.
Well, that's because the hack is to just split it up into chunks of 77,
your whole big prompt.
Let's say you give it, like, a massive text,
the Bible
or something.
It would split it up
into chunks of 77
and then just pass each one
through CLIP
and then just
concatenate everything together
at the end.
It's not ideal,
but it actually works.
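A minimal sketch of the chunking hack described here, assuming the prompt is already tokenized and that a hypothetical `encode_chunk` callable wraps the CLIP text encoder and returns one output vector per token. Both names are invented for illustration. The sketch also ignores the start and end tokens and the padding that real implementations have to fit inside each 77-token window.

```python
def chunk_tokens(token_ids, chunk_size=77):
    """Split a long token list into consecutive chunks of at most chunk_size."""
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]


def encode_long_prompt(token_ids, encode_chunk, chunk_size=77):
    """Encode each chunk independently, then concatenate the outputs in order."""
    outputs = []
    for chunk in chunk_tokens(token_ids, chunk_size):
        outputs.extend(encode_chunk(chunk))
    return outputs


# Stand-in encoder for the demo: one single-element "vector" per token.
fake_encoder = lambda chunk: [[float(tok)] for tok in chunk]
conditioning = encode_long_prompt(list(range(200)), fake_encoder)
```

Because every chunk is encoded on its own, the encoder never sees the words in the other chunks. That is one reason the trick is described as "not ideal".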
So the positioning
of the words really,
really matters then, right?
This is why order matters
in prompts.
Yeah.
Yeah, like, it works,
but it's not ideal.
But it's what people
expect.
Like, if someone gives a huge prompt, they expect at least some of the concepts
at the end to be present in the image.
But usually when they give long prompts, they don't expect, like, detail, I think.
So that's why it works well enough.
And while we're on this topic: prompt weighting, negative prompting,
all sort of a similar part of this layer of the stack.
Yeah, the hack for that, which works on CLIP...
Basically, for SD1.5, the prompt weighting works well because CLIP-L is not a very deep model.
So you have a very high correlation between the input token vector and the output token vector at the same index. The concepts are very closely linked.
So that means you can interpolate the vector. The way ComfyUI does it is: you have an empty prompt,
so you have a CLIP output for the empty prompt, and then you have the one for your prompt.
And then it interpolates between them depending on your prompt weight, the weight of your tokens.
So that's how it does prompt weighting. But this stops working the deeper your text encoder is.
So on T5XXL it doesn't work at all.
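The interpolation described here can be written as a short sketch. It assumes the encoder outputs for the real prompt and for the empty prompt are lists of per-token vectors of the same length, and that each token has one scalar weight. The function name and data layout are hypothetical, and this is a reading of the description above rather than ComfyUI's actual code.

```python
def apply_prompt_weights(prompt_out, empty_out, weights):
    """Move each token's output vector along the line from the empty-prompt
    output to the prompt output.

    weight 0.0 gives the empty-prompt vector, 1.0 gives the unmodified
    prompt vector, and values above 1.0 extrapolate past it.
    """
    weighted = []
    for prompt_vec, empty_vec, weight in zip(prompt_out, empty_out, weights):
        weighted.append([e + weight * (p - e)
                         for p, e in zip(prompt_vec, empty_vec)])
    return weighted


prompt_out = [[2.0, 4.0], [1.0, 1.0]]
empty_out = [[0.0, 2.0], [1.0, 0.0]]
emphasized = apply_prompt_weights(prompt_out, empty_out, [1.5, 1.0])
```

The scheme relies on output vector i mostly reflecting input token i. As noted above, that holds reasonably well for a shallow encoder and breaks down as the encoder gets deeper.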
Wow.
Is that a problem for people?
I mean, because I'm used to just moving up numbers.
Probably not.
Yeah.
Well.
So you just use words to describe, right?
Because it's a bigger language model.
Yeah.
Yeah.
So honestly, it might be good,
but I haven't seen many complaints on Flux that
it's not working.
I guess people can sort of get around it with language.
Yeah.
And then coming back to LoRAs: now the popular way to customize models is LoRAs.
And I saw you also support LoCon and LoHa, which I've never heard of before.
There's a bunch of them. What a LoRA is, essentially: okay,
you have your model, and then you want to fine-tune it.
So what you could do is fine-tune the entire thing.
Yeah, full fine-tune.
But that's a bit heavy.
So to speed things up and make things less heavy, what you can do is just fine-tune some smaller weights.
Basically two low-rank matrices
that, when you multiply them together,
represent the difference between the trained weights and your base weights. So training those two
smaller matrices is a lot less heavy. And they're portable, so you can share them.
Yeah, and also smaller. Yeah. That's how LoRAs work. So when inferencing,
you can inference with them pretty efficiently. The way ComfyUI does it,
when you use a LoRA, it just applies it straight onto the weights,
so there's only a small delay at the beginning, before the sampling,
when it applies the weights, and then it's just the same speed as before.
So for inference, it's not that bad.
And then all the LoRA types, like LoHa, LoCon, everything,
those are just different ways of representing that. Basically you can call it kind of like compression,
even though it's not really compression.
It's just different ways of representing it. Like, okay,
I want to train a difference on the weights.
What's the best way to represent that difference?
There's the basic LoRA, which is just,
oh, let's multiply these two matrices together.
And then there's all the other ones,
which are all different algorithms.
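For the basic LoRA case, the weight update can be sketched with plain Python lists. The names `up`, `down`, and `scale` are illustrative, not taken from any particular file format. The idea is that the trained difference is stored as two thin matrices, and merging them into the base weight once before sampling leaves inference speed unchanged.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(base, up, down, scale=1.0):
    """Return base + scale * (up @ down).

    up is (out x rank) and down is (rank x in), so their product has the
    same shape as the base weight.
    """
    delta = matmul(up, down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]


def lora_param_count(out_features, in_features, rank):
    """Number of values stored by the two low-rank matrices."""
    return rank * (out_features + in_features)


base = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
up = [[1.0], [2.0]]        # 2 x 1
down = [[1.0, 0.0, 2.0]]   # 1 x 3
merged = merge_lora(base, up, down)
```

For a 4096 x 4096 layer at rank 16, `lora_param_count` gives 131,072 values against 16,777,216 for the full matrix, which is where the small file sizes come from.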
So let's talk about what ComfyUI actually is.
I think most people have heard
of it, some people might have seen screenshots.
I think fewer people have built very complex workflows.
So when you started, Automatic was like the super simple way.
What were some of the choices that you made?
So the node workflow, is there anything else that stands out?
Like, this was a unique take on how to do image generation workflows.
Well, I feel like, yeah, back then, everyone was trying to make, like, an easy-to-use interface.
So, well, if everyone's trying to make an easy-to-use interface,
let's make a hard-to-use interface.
Like to say,
I don't need to do that,
everyone else is doing it, so
let me try something else. Let me try
to make a powerful interface
that's not easy to use.
so
So, yeah, there's a sort of node execution engine.
Your README actually
has a really good list of
features, of things you prioritize, right?
Like, let me see: re-executing only the parts of the workflow that changed,
asynchronous queue system, smart memory management. All this seems like a lot of engineering.
Yeah, there's a lot of engineering in the back end to make things work.
Because I was always focused on making things work locally very well, because I was
using it locally.
So there's a lot
of thought and work in getting everything to run as well as possible.
So yeah, ComfyUI is actually more of a back end. Well, now the front end is getting
a lot more development,
but before, I was pretty much only focused on the back end.
Yeah.
So V0.1 was only August this year.
Yeah, before there was a new front end.
So yeah, yeah.
Yeah.
And so what was the big rewrite for the 0.1 and then the 1.0?
Well, that's more on the front end side.
Because before that, it was just, like, the UI.
Because when I first wrote it, I just said, okay,
I can do web development, but I don't like doing it.
Like, what's the easiest way I can slap a node interface on this?
And then I found this library, litegraph, a JavaScript library.
Live graph?
Litegraph.
Usually people will go for, like, React Flow for flow building.
Yeah, but that seemed too complicated.
I didn't really want to spend time developing the front end.
I'm like, oh, litegraph.
This has the whole node interface.
So, okay, let me just plug that into my back end.
I feel like if Streamlit or Gradio offered something,
you would have used Streamlit or Gradio because it's Python.
Streamlit and Gradio... yeah, Gradio I don't like.
That's one of the reasons why Automatic was very bad.
It's Gradio. Because the problem with Gradio is it forces you to, well, not forces you,
but it kind of takes your interface logic and your backend logic and just sticks them together.
It's supposed to be easy for you guys.
If you're a Python main... I'm a JS main, right?
If you're a Python main, it's supposed to be easy.
Yeah, it's easy. Well, it's easy, but it makes your whole software a huge mess.
I see, yeah, I see. So you're mixing concerns instead of separating concerns?
Well, it's because... front end and back end should be well separated with a defined API.
Like, that's how you're supposed to do it.
Smart people disagree.
It just sticks everything together. It makes it easy to make a huge mess.
And also,
there's a lot of issues with Gradio.
Like, it's very good if all you want to do is just slap a quick interface on your,
like, to show off your ML project.
Like, that's what it's made for.
Yeah.
Like, there's no problem using it. Like, oh, I have my code,
I just want a quick interface on it.
That's perfect.
Like, use Gradio.
But if you want to make something that's, like,
real software that will last a long time and will be easy to maintain, then I would avoid it.
Yeah.
So your criticisms of Streamlit and Gradio are the same, I mean, those are the same criticisms.
Yeah, Streamlit I haven't used as much.
Yeah, I just looked at it a bit.
Similar philosophy.
Yeah, it's similar.
It just seems to me like, okay, for quick AI demos, it's perfect.
Yeah.
Going back to the core tech: asynchronous queue,
workflow re-execution, smart memory management.
Anything that you're very proud of
or that was very hard to figure out?
Yeah, the thing that's the biggest pain in the ass
is probably the memory management.
Yeah, were you just paging models in and out?
Yeah, before it was just, okay,
load the model, completely unload it,
load the new model, completely unload it.
That works well when your models are small,
but if your models are big, it takes a while.
Let's say someone has a 4090 and the model size is 10 gigabytes. That can take a few seconds to load and unload, load and unload.
So you want to try to keep things in memory, in the GPU memory, as much as possible.
What ComfyUI does right now is it tries to estimate: okay, you're going to sample this model,
it's going to take probably this amount of memory.
Let's remove this amount of memory's worth of models
that have been loaded on the GPU
and then just execute it.
So there's a fine line, because
you try to remove the least amount of models
that are already loaded.
And another problem is the NVIDIA driver on Windows.
There's an option to disable that feature,
but by default, if you start loading, you can overflow your GPU memory,
and then the driver's going to automatically start paging to RAM.
But the problem with that is it makes everything extremely slow.
So when you see people complaining, oh, this model, it works,
but oh shit, it starts slowing down a lot,
that's probably what's happening.
So basically you have to
use as much memory as possible,
but not too much,
or else things start slowing down
or people get out-of-memory errors.
You just try to find that line where
the driver on Windows starts paging and stuff.
Yeah.
And yeah, the problem with PyTorch is it's high
level. You don't have that much fine-grained control over specific memory stuff.
So I kind of have to leave the memory freeing to Python and PyTorch, which can be
annoying sometimes.
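The "estimate what the next model needs, then unload as little as possible" policy can be sketched as a small planning function. Everything here is hypothetical: the function name, the least-recently-used ordering, and the byte counts are assumptions made for illustration, and this is not ComfyUI's actual memory manager.

```python
def models_to_unload(loaded, free_bytes, needed_bytes):
    """Pick which loaded models to unload so that needed_bytes fits.

    loaded is a list of (name, size_in_bytes) ordered from least to most
    recently used (an assumed policy). Unloading stops as soon as enough
    memory is free, so models that can stay resident are left alone.
    """
    to_unload = []
    for name, size in loaded:
        if free_bytes >= needed_bytes:
            break
        to_unload.append(name)
        free_bytes += size
    return to_unload


GB = 1024 ** 3
loaded = [("vae", 1 * GB), ("text_encoder", 2 * GB), ("diffusion_model", 10 * GB)]
plan = models_to_unload(loaded, free_bytes=3 * GB, needed_bytes=5 * GB)
```

In practice the estimate also has to leave headroom, since overshooting on Windows triggers the driver paging described above instead of a clean out-of-memory error.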
So, you know, I think one thing, as a maintainer of this project:
you're designing for a very wide surface area of compute.
Like, you even support CPUs.
Yeah, well, that's just for fun.
PyTorch,
PyTorch
supports CPU.
Yeah,
it's just,
that's not,
that's not hard to support.
First of all,
is there a market share
estimate?
Like,
is it like 70%
Nvidia and 30%
AMD,
and then
miscellaneous on
Apple
Silicon or whatever?
For comfy?
Yeah.
Yeah,
yeah,
I don't know the market share.
Can you guess?
I think it's mostly
Nvidia.
Right.
Because,
yeah.
Because
the problem is,
like,
AMD works horribly
on Windows.
Like, on Linux,
it works.
It works fine. It's slower than the price-equivalent Nvidia GPU, but it works.
Like, you can use it, generate images, everything works on Linux. On Windows, you might have a hard time.
So that's the problem. And I think most people who bought AMD probably use Windows.
They probably aren't going to switch to Linux. So until
AMD actually ports their ROCm to Windows properly, and then there's actually
PyTorch...
I think they're doing that.
They're in the process of doing that.
But until they get a good PyTorch ROCm build that works on Windows, they're going to have a hard time.
Yeah.
We've got to get George on it.
Yeah.
Well, he's trying to get Lisa Su with his own...
But let's talk a bit about the node design. So unlike all the other text-to-image tools, you have a very
deep design: you have a separate node for CLIP encode, you have a separate node for the
KSampler, you have all these nodes. Going back to making it easy versus making it
hard: how much do people actually play with all the settings? How do you
guide people to, like, hey, this is actually going to be very impactful, versus this is maybe
less impactful but we still want to expose it to you?
Well, I try to expose everything.
But for some things, for example, the samplers,
there's, like, four different sampler nodes,
which go from easiest to most advanced.
So if you use the easy node, the regular sampler node,
you have just the basic settings.
But if you use the sampler
custom advanced node,
that one you can actually,
you'll see you have
different nodes.
I'm looking it up now.
Yeah.
What are the most impactful parameters that you use?
You know,
you can have more,
but which ones really make a difference?
Yeah, they all do.
For example,
yeah, steps.
Usually you want steps
to be as low as possible. If you're optimizing your workflow, you lower the steps until the images start deteriorating too much.
Because, yeah, that's the number of steps you're running the diffusion process for.
So if you want things to be faster, lower is better.
But yeah, CFG, you can kind of see that as the contrast of the
image. If your image looks too burnt out, then you lower the CFG. So, yeah, CFG,
that's how strongly the negative versus positive prompts are applied. When you sample
a diffusion model with a negative prompt, it's basically just positive prediction
minus negative prediction.
Contrastive loss.
Yeah, positive minus negative, and the CFG is the multiplier.
Yeah.
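"Positive minus negative, and the CFG is the multiplier" corresponds to the usual classifier-free guidance combination. The sketch below works on plain lists of numbers standing in for the model's two predictions. The function name is illustrative, and the formula is the standard one, not copied from ComfyUI's source.

```python
def cfg_combine(positive_pred, negative_pred, cfg):
    """Classifier-free guidance: start from the negative prediction and move
    toward the positive one, scaled by cfg.

    result = negative + cfg * (positive - negative)
    """
    return [n + cfg * (p - n) for p, n in zip(positive_pred, negative_pred)]


positive = [1.0, 2.0]
negative = [0.0, 1.0]
unguided = cfg_combine(positive, negative, cfg=1.0)
guided = cfg_combine(positive, negative, cfg=7.0)
```

With `cfg=1.0` the negative term cancels and the result is just the positive prediction, so the negative pass over the model can be skipped entirely.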
What are good resources to understand what the parameters do?
I think most people start with Automatic and then they move over, and it's like: steps, CFG,
sampler name, scheduler, denoise.
Reddit.
Honestly, well, it's more something you should try out yourself.
You don't necessarily need to know how it works to know what it does. Because even if you know, like, CFG, oh,
it's positive minus negative prompt,
yeah,
the only thing you know about CFG is that if it's 1.0, then that means the negative prompt isn't applied.
It also means sampling is two times faster.
But yeah.
But other than that, it's more like you should really just see what it does to the images yourself.
And you'll probably get a more intuitive understanding of what these things do.
Any other nodes or things you want to shout
out? Like, I know AnimateDiff, IP-Adapter, those are some of the most popular ones.
Yeah, what else comes to mind?
Not nodes, but what I like is when some people, sometimes,
they make things that use ComfyUI as their back end. Like, there's a plugin for
Krita that uses ComfyUI as its back end. So you can use all the models that work in
Comfy in Krita.
I think I've tried it once.
But I know a lot of people use it, and it's quite nice.
What's the craziest node that people have built, the most complicated?
Craziest node... yeah, I know some people have made, like, video games in
Comfy, stuff like that.
So, I remember, like, yeah,
I think it was last year, someone made, like, Wolfenstein 3D in ComfyUI.
And then one of the inputs was, oh, you can generate the texture, and then it changes the texture in the game.
So you can plug it into the workflow.
If you look, there's a lot of crazy things people do.
So, yeah.
And now there's a node registry that people can use to download nodes and...
Yeah, well, there's always been
the ComfyUI Manager, but we're trying to make this more, I don't know, official, with the node registry.
Because before the node registry, okay, how did your custom node get into the ComfyUI Manager?
That's the guy running it who, like, every day searched GitHub for new custom nodes and added them manually to his custom node manager.
So we're trying to make it
less effort for him, basically.
Yeah, but I was looking, I mean,
there's, like, a YouTube download node.
This is almost like, you know,
a data pipeline more than an image generation thing at this point.
It's like you can get data in,
you can apply filters to it,
you can generate data out.
Yeah, you can do a lot of different things.
Yeah, I think what I did is I made it easy to make custom nodes.
So I think that helped the ecosystem a lot, because it is very easy to just make a node.
So, yeah, a bit too easy sometimes.
Then we have the issue where there's a lot of custom node packs which share similar nodes.
But, well, that's something we're trying to solve by maybe bringing some of the functionality into core.
Yeah, yeah, yeah.
And then there's like video, people can do video generation.
Yeah, video. Well, the first video model was Stable Video Diffusion,
which was, yeah, exactly last year, I think, like one year ago.
But that wasn't a true video model.
So it was...
It was like moving images.
Yeah, it generated video.
What I mean by that is it still uses 2D latents.
Basically what they did is they took SD2,
and then they added some temporal attention to it,
and then trained it on videos.
So it's kind of like AnimateDiff, the same idea, basically.
Why I say it's not a true video model is
that you still have the 2D latents.
A true video model, like Mochi, for example,
has 3D latents.
So you can move through the space, basically.
That's the difference.
You're not just kind of reorienting.
Yeah.
And it's also because you have a temporal VAE.
Mochi has a temporal VAE that compresses along the temporal direction also.
So that's something you don't have with AnimateDiff and Stable Video Diffusion.
They only compress spatially, not temporally.
So yeah.
So these models,
that's why I call them true video models.
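The difference between spatial-only and spatiotemporal compression shows up directly in the latent shapes. The sketch below computes both for a clip. The compression factors (8x spatial, 6x temporal) are illustrative assumptions rather than values stated in this conversation, the function names are made up, and real temporal VAEs may treat the first frame specially.

```python
def spatial_only_latent_shape(frames, height, width, spatial_factor=8):
    """Per-frame 2D latents: every frame keeps its own latent image."""
    return (frames, height // spatial_factor, width // spatial_factor)


def spatiotemporal_latent_shape(frames, height, width,
                                spatial_factor=8, temporal_factor=6):
    """3D latents: the time axis is compressed as well as the spatial axes."""
    latent_frames = -(-frames // temporal_factor)  # ceiling division
    return (latent_frames, height // spatial_factor, width // spatial_factor)


per_frame = spatial_only_latent_shape(24, 480, 848)
video = spatiotemporal_latent_shape(24, 480, 848)
```

Under these assumed factors a 24-frame clip needs 24 latent frames in the first scheme but only 4 in the second, so each latent position has to encode motion across several frames instead of a single still image.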
There's actually a few of them.
But the one I've implemented in Comfy is Mochi, because that seems to be the best one so far.
We had AJ come and speak at the Stable Diffusion meetup.
The other open one I think I've seen is CogVideo.
Yeah, CogVideo.
Yeah, that one also seems decent.
But yeah.
Chinese, so we don't use it.
No, it's fine.
It's just, yeah, I could.
Yeah, it's just, it's not the only one.
There's also a few others, which I...
The rest are, like, closed source, right?
Like Kling.
Yeah, closed source, there's a bunch of them.
But I mean open.
I've seen a few of them. I can't remember their names,
but there's CogVideo, the big one.
And there's also a few of them that released at the same time.
There's one that released at the same time as SD3.5, same day,
which is why I don't remember the name.
We should have a release schedule so we don't conflict on these things.
Yeah, I think SD3.5 and Mochi released on the same day.
So everything else was kind of drowned, completely drowned out.
For some reason, lots of people picked that day to release their stuff.
Yeah, which is, well, a shame for those guys.
And I think OmniGen also released the same day, which also seems interesting.
Yeah.
Yeah, what's Comfy?
So you are Comfy, and then there's, like, comfy.org.
I know we do a lot of things with, like, Nous Research, and those guys also have kind of a more open-source, anon thing going on.
How do you work?
Like, you mentioned you mostly work on the core piece of it.
And then what?
Maybe I should fill that in, because, yeah, I feel like I only explained part of the story.
Right.
Yeah, yeah.
Yeah, maybe I should explain the rest.
So yeah.
So yeah, basically, January 16, 2023, that's when Comfy was first released to the public.
Then I did a Reddit post about the area composition thing somewhere in, I don't remember exactly, maybe the end of January, beginning of February.
And then a YouTuber made a video about it. Olivio, he made a video about Comfy in March 2023. I think that's when there was a real burst of attention.
And by that time I was continuing to develop it, and people were starting to use it more, which unfortunately meant that, well, I had first written it to do experiments, but then my time to do experiments started going down, because people were actually starting to use it and I had to add all these features and stuff.
And then I got hired by Stability in June 2023. Basically, they hired me because they wanted SDXL, so I got SDXL working very well in the UI, because they were experimenting with it.
Actually, how the SDXL release worked is they released, for some reason,
the code first, but they didn't release the model checkpoint.
So they released the code, and then, well, since the researchers released the code,
I released the code in ComfyUI too.
And then the checkpoints were basically early access.
People had to sign up, and they only allowed in people with edu emails.
If you had an edu email, they gave you access, basically, to SDXL 0.9.
And well, that leaked.
Right.
Of course, because of course it's going to leak if you do that.
Well, the only way people could easily use it was with Comfy.
So, yeah, people started using it.
And then I fixed a few of the issues people had.
So then the big 1.0 release
happened and, well,
ComfyUI was the only
way a lot of people could actually
run it on their computers.
Because, like, Automatic
was so inefficient
and bad that
most people couldn't actually,
like, it just wouldn't work.
Because he did
a quick implementation. So people
were forced to use ComfyUI.
And that's how it became popular:
because people had no choice.
The growth hack.
Yeah.
Yeah, like, everywhere,
people who didn't have a 4090,
who had just a
regular GPU,
they didn't have a choice.
Yeah, I got a 4070,
so think of me.
And so today, is there like a core Comfy team or...?
Yeah, well, right now, yeah, we are hiring, actually.
So right now, the core itself, it's me.
Because all the focus has been mostly on the front end right now, because that's the thing that's been neglected for a long time.
So most of the focus right now is all on the front end, but we will soon get more people to help me with the actual backend stuff.
Because once we have our V1 release, which is going to be the packaged ComfyUI with the nice interface, easy to install on Windows and hopefully Mac...
Yeah.
Once we have that, we're going to have lots of stuff to do on the back end side and also the front end side.
What's the release date?
I'm on the wait list.
What's the timing?
Soonish.
Yeah.
Like, I don't want to promise
a release date.
Yeah, we do have a
release date we're targeting, but
I'm not sure if it's
public.
And we're still going to continue
doing the open source,
making ComfyUI the best way to run
Stable Diffusion models,
at least on the open source side.
It's going to be the best
way to run models
locally. But we will have
a few things to make money from it, like cloud inference or that type
of thing.
And maybe some things for enterprises.
I mean, a few questions on that.
How do you feel about the other comfy startups?
I mean, I think it's great.
They're using your name.
Yeah, well, it's better they use Comfy than they use something else.
Yeah, that's true.
Yeah.
Like, it's fine.
Like, yeah, we want people to use Comfy.
Like I said, it's better that people use Comfy than something else.
So as long as they use Comfy, I think it helps the ecosystem.
Because even if they don't contribute directly, the fact that they are using Comfy means that people are more likely
to join the ecosystem.
So, yeah.
And then would you ever do text?
Yeah, well, you can already do text with some custom nodes.
So, yeah, it's something I've wanted to eventually add to
core, but it's not a very high priority.
Because a lot of people use text for prompt enhancement and other things
like that.
It's just
that my focus has always been
diffusion models.
Unless some text diffusion model
comes out.
Yeah, David
Holz is investing a lot in
text diffusion.
Yeah, well, if a good one comes out,
then I'll probably implement it,
since it fits with the whole...
Yeah. I mean,
I imagine it's going to be closed
source at Midjourney, so...
Yeah, well,
if an open one comes out,
then, yeah,
I'll probably implement it.
Good to know.
Cool, Comfy.
Thanks so much for coming on.
This is fun.
