Software Huddle - Fast Inference with Hassan El Mghari
Episode Date: April 8, 2025. Today we have Hassan back on the show. Hassan was one of our first guests for Huddle when he was working at Vercel, but since then, he's joined Together AI, one of the hottest companies in the world. ...They just raised a massive series B round. Hassan joins us to talk about Together AI, inference optimization and building AI applications. We touch on a bunch of topics like customer uses of AI, best practices for building apps, and what's next for Together AI.
Timestamps
01:42 Opportunity at Together AI
04:26 Together raised a big round
06:06 Vision Behind Together AI
08:32 Problems in running Open Source Models
11:40 Speed For Inference
14:24 Fine Tuning
19:23 One or Two Models or a Combination of them
21:32 Serverless
22:21 Cold Start issues?
27:46 How much data do you need?
30:00 Balancing Reliability and Cost
34:07 How customers are using Together
42:36 Agent Recipes
47:03 Typical Mistakes building AI apps
Transcript
In terms of using an open source model, what is typically the challenges that people run into?
You need a lot of expertise to run open source models on GPUs. There's a lot of LLM serving
frameworks like vLLM or TRT-LLM. You need to look one of those up, make sure they support your open
source model, make sure they support your architecture, get it up and running on the GPU,
test it out, make sure it's working well.
What was the original vision behind the company?
It started off more in crypto, trying to use excess GPUs from people to leverage that extra compute, essentially.
You've built a lot of AI apps over the last, you know, couple years. What are some of the things that
you've learned along the way that are typical sort of
mistakes that you see as people who are just starting to enter,
you know, building some of these types of applications?
Hey, everyone, Sean here. And today we have Hassan El
Mghari back on the show. Hassan was one of our first guests for
Huddle when he was working at Vercel.
But since then, he's joined Together AI, one of the hottest
companies in the world.
They just raised a massive Series B.
I didn't even know that a Series B round could be that big.
Hassan joins me to talk about Together AI, inference
optimization, and building AI applications.
We touch on a bunch of different topics,
like customer uses of AI, best practices for building apps,
and what's next for Together AI.
With that, let's get you over to the show.
And as always, if you have questions or feedback,
feel free to reach out to Alex or myself.
Hassan, welcome back to Software Huddle.
Hi, Sean, thank you.
Appreciate you having me back.
Yeah, well, thanks for coming back.
You were one of our first guests,
and since then, we were chatting beforehand,
and you made a career change.
I think that probably happened within just a short while
after we last did this.
So you're now at Together AI,
which we'll be talking about in depth,
but what led to that decision?
What particularly caught your eye about the opportunity there?
Yeah. I mean, I think like the AI industry
has just been like blowing up.
I've been really into building AI apps
even before joining together,
just building these little like side projects
since like 2023 really.
So yeah, it seemed like a great fit.
I kind of interviewed in a bunch of these different
like AI infra companies.
And I love working at infra companies in general
because a lot of them will just have like a great platform
or APIs for building applications.
And that's really what I'm really, really passionate about
is just like getting to build really cool apps.
And so interviewed at a bunch of different AI
infra companies together really stood out
in terms of like the talent that's on the team, the research focus.
We have a really big research team that's working on optimizing a lot of different parts of our stack, which I can get into later.
And then the growth and revenue and fundraising, like everything just seemed really, really exciting.
So I ended up joining a little over a year ago.
How big was the company when you joined and what is it today?
Yeah, great question. I think we were like 35 when I joined.
And now we're like 150.
That's a lot of growth.
It's a little bit crazy. We're 150 now, and I've grown my team to like four people now
as well.
And yeah, it's been big.
Yeah, so I know we talked a lot about some of your side project
app development work previously.
And I often say that there's very few, basically no one
in the world has 10,000 hours of building
AI applications because it's all so new. But you might be the world leader from my perspective in
terms of the number of hours, because you've built so much stuff and you're kind of like early
on that journey of like experimenting as like a, you know, AI engineer, as some people call it now.
Was part of the motivation for wanting to go to an AI infra company because the token costs
of your side projects was starting to creep up and you needed essentially a place where
you get some free credits?
Oh man, that was part of it.
I'm not going to lie.
But yeah, and like these AI workloads are just like so much more expensive than anything
else and like I was running into some limitations before.
In terms of together, you guys raised a big round recently.
Can you talk a little bit about what was that for?
What are you guys hoping to achieve with that?
Yeah, for sure.
So we raised a $305 million Series B round
at like a 3.3 billion valuation.
And really, there's a lot of reasons
for raising that amount of money.
One is if you're in the AI Infra business,
obviously GPUs are very expensive.
And so we're kind of in the business
of doing
both inference and training.
So for inference, we just have an API for developers
to come to our platform and basically
be able to use AI models really easily through our API.
So we have to put a lot of GPUs around a lot of these open
source models.
And then also people come to us for training.
We have a GPU cluster product where
we can just give people access to,
if you want like 50
H100s for like two months to like train a model or really do whatever you want on those
H100s, we do that for you too. So every part of our business basically requires a lot of
GPUs. And so we're either kind of buying or leasing. And so a big part of the fundraise
is going to be for those GPUs. We've seen a ton of demand, and we just need to expand and get a lot more GPUs.
And then part of it is growth as well. We're continually getting bigger. Like I said, we were like 35 people and now we're 150, basically 5x in about a year, and we're going to grow even more this year. So some of it is personnel, some of it is GPUs.
What was the original vision behind the company?
What was sort of the key problem that the founders
had identified that they wanted to try to solve?
Yeah, so it actually started out in crypto land,
funny enough.
The company also is very, very young.
It's like two and a half years old, under three years old. And it started off more in
crypto, trying to use excess GPUs from people to leverage
that extra compute, essentially, from all the GPUs or really high-end laptops that people had from the Bitcoin mining era.
But we saw more promise in just providing GPUs
to a lot of these customers.
So really, it started off because I
think the world needs more compute.
And I think we realized that a lot of people
were very lost with how to use AI
and how to use specifically open source AI,
how to use open source AI models,
how to train their own models from scratch.
A lot of people needed compute,
and there was a big compute shortage, which
is a big part of
why Together did really, really well, because we had all these GPUs when people really
needed them. But also because we have our research team that goes above and beyond and
writes like kernels for these GPUs to make them run really, really fast. And we have an optimized
inference stack now on top of the GPUs to be able to run inference to run these AI models really,
really fast, really efficient, and with a
lot of throughput. And so yeah, I think the core problem was to
help people build with AI, help people train AI models, help
people fine-tune models on their own data. And with a big
emphasis on open source, so people can own their own data.
You know, if you go to an OpenAI or a company like that
and you fine-tune your model, you don't really own it. OpenAI kind of owns it,
you don't have access to it, you can't download it.
You can't exactly export your weights to a file.
Exactly, exactly. And so you don't really own it. There's a lot of like, you know, things
that can happen there. But you know, if you come to Together, you can fine tune your own models,
you own the weights, you can download it and do whatever you want with it. There's no kind of
vendor lock in in that sense. And so yeah. In terms of using an open source model, what is
typically the challenges that people run into if they don't have essentially the limitation of
like, I don't have GPUs, let's say they have GPUs, they can get that from somewhere,
then what problem do they run into
with just being able to run that model themselves?
Yeah, well, first off with closed source models,
you can't run them on your GPU
because they're closed source, you don't have the weights.
And so you can't self-host open AI, for example, right?
So you run into those problems.
Are you more asking like?
I'm asking about open source.
Like if I want to run like a llama model, for example,
and I have access to a GPU cluster somehow,
what is sort of the problems I run into with running that
myself versus going through a managed service where
this is kind of set up for me?
Yeah, great question.
Great question.
So you need a lot of expertise
to run open source models on GPUs.
There's a lot of like LLM serving frameworks
like vLLM or TRT-LLM.
You'd need to like look one of those up
and make sure they support your open source model,
make sure they support your architecture,
get it up and running on the GPU,
test it out, make sure it's working well. So there's a lot of
steps. And so for a lot of our
customers that we see, especially if you're trying to
run just general chat or vision or audio models,
models that have a very common architecture,
specifically, we'll take care of a lot of that for people.
Whether that's like, you can come in through a serverless
API, which you can go to together.ai, sign up,
get an API key, bam, start calling these open source
models.
But also, we'll work with you to customize it
to your specific needs.
If you need extremely low latency on a very specific model
for a specific use case, we can actually help you with that.
We can give you what we call our dedicated endpoint product,
where we give you a set of GPUs that that model is running on.
And we can play around with the GPUs
to kind of get them to do what you care about most,
whether that's, like I said, latency or throughput
or whatever that is.
And then our GPU cluster.
Can you bring your own model?
Yes.
Yep, you can bring your own model as well
and we'll help you host it on our GPUs.
So a lot of people tend to go for that kind of stuff.
We do have a number of customers that use our GPU cluster
product but decide to run inference by themselves,
decide to run a model by themselves.
But for those, those are usually like very custom models.
Like we have a good amount of video companies
on our platform, like Pica Labs or Hedra, a lot of these like text to video companies.
And they run on our GPU clusters.
And our research team actually works with them.
Together claims to be faster than any other
inference engine, I believe.
So first of all, why is speed for inference so important?
And then, I guess, as a follow-up, how do you actually achieve that
sort of performance?
Yeah.
I mean, I'd say there are like certain companies for specific models that may be faster than us at a certain thing.
I don't think it's fair to say that we're the fastest out there for every single model or for every single use case.
But with that said, yeah, we do pride ourselves on our speed and we are usually at the top of the leaderboard for a lot of different things. A big part again
is our research team, honestly. A big part is we custom built our own inference stack to work
really well for LLM specifically. A lot of our focus is on these chat models and helping them run
really, really fast. And so there's our inference engine that just helps LLMs run a little bit
faster.
We have a whole kernels team that
writes kernels for these GPUs in order to make them run faster.
And we have another spec decoding team.
I don't know if you're familiar, but I can give a quick intro
to speculative decoding: it's the idea
that you have a small model predict outputs for a larger model.
And by incorporating that into your LLMs,
you can get way faster speeds.
Because really, it's like the smaller model answering
a good amount of queries, and the big model just kind of
checks tokens as they come through.
So yeah.
And we have folks like Tri Dao, who created FlashAttention and
these really popular mechanisms to get a lot of this stuff to run as fast as possible.
So kind of putting that all together is how I would say we achieve the performance that we achieve.
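(To make the speculative decoding idea concrete, here's a toy, self-contained sketch with stand-in "models." Real systems verify the whole draft in one batched forward pass and use a rejection-sampling scheme over the two models' probabilities, so treat this purely as an illustration of the propose-then-verify loop.)

```python
import random

# Toy stand-ins for a small "draft" model and a large "target" model.
# Each maps a token prefix to a next-token distribution over a tiny vocabulary.
VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

def _toy_distribution(prefix, seed):
    rng = random.Random(len(prefix) * seed)
    weights = [rng.random() for _ in VOCAB]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def draft_model(prefix):   # cheap, approximate model
    return _toy_distribution(prefix, 13)

def target_model(prefix):  # slow, authoritative model
    return _toy_distribution(prefix, 7919)

def speculative_decode(prompt, num_draft=4, max_new_tokens=12):
    """The draft model proposes a short run of tokens; the target model checks
    them in order, and proposals are kept only up to the first disagreement
    (the target's token is used there instead)."""
    out = list(prompt)
    while len(out) < len(prompt) + max_new_tokens and out[-1] != "<eos>":
        # 1) the draft model cheaply proposes num_draft tokens
        ctx, proposals = list(out), []
        for _ in range(num_draft):
            dist = draft_model(ctx)
            tok = max(dist, key=dist.get)
            proposals.append(tok)
            ctx.append(tok)
        # 2) the target model verifies each proposal; accepted drafts are "free"
        for tok in proposals:
            dist = target_model(out)
            target_tok = max(dist, key=dist.get)
            out.append(tok if tok == target_tok else target_tok)
            if tok != target_tok:
                break  # first disagreement: discard the rest of the draft

    return out

print(" ".join(speculative_decode(["the"])))
```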
What other areas are you guys focused on
in terms of excelling optimization for AI beyond?
Is it solely focused on these inference workloads
or other types of workloads as well?
Yeah, I would say there's inference
and there's training workloads.
And those are the two we're primarily working on.
For the inference workload, yeah, like I said,
the speculative decoding and the inference engine
are the main things.
And then for the training workload,
I think we have a product called Together Kernels Collection,
which is essentially a collection of kernels
that we wrote that just make GPUs run a little bit more
efficiently for training as well.
And then, I would assume that the majority of your customers
are probably using you essentially for inference
or as an inference endpoint.
But do you see a lot of people fine tuning models as well?
Yeah, I'd say fine tuning is definitely less popular
than inference just because it's
not as accessible.
And we're working to make it more accessible.
We actually have a really exciting fine tuning update that's going to come out in a few weeks.
But yeah, we just see less people doing it just because you need a lot more, right?
To fine tune a model, you need data.
And you don't only need data, you need good, high-quality data, labeled in the right way,
in the right format. So there are a lot of prerequisites
before you get to fine-tuning.
But we've seen for the customers that use it,
they do get a lot of value out of it.
So we're seeing it as something that's growing,
and we're trying to drop the barriers a little
bit, trying to just make the whole process easier.
You know, the end vision, hopefully, is that we can
almost automatically fine-tune models for you based on
your traffic. Like, you're using Together,
you're using our inference API on your production app,
you know, people use it and all of a sudden
you have like 10,000 data points from your users
and you can just take that and fine tune a smaller model
to achieve similar performance at a much lower cost.
You know, and a lot of that stuff can just be like
that flywheel can just be automated.
So we're kind of building towards that vision.
Yeah.
What are the typical use cases where
people are doing fine tuning?
So there's a few things.
One is just to get a smaller model
to do what a bigger model can do.
And there's a lot of use cases there.
One is latency.
Let's say you need a really powerful reasoning model,
like DeepSeek R1, to do whatever use case you want to do.
But DeepSeek R1 is a little bit too slow for your use case.
And so you want to train a smaller model
to do the same thing faster.
So we see a lot of those kind of use cases as well,
like just save money and be faster.
But then we see a lot of use cases where you want to change
how an LLM sort of responds in terms of the tone
or try to teach the LLM something a little bit new.
Yeah, there's a lot of small use cases there.
And a lot of people will just do RAG instead of fine tuning.
That's really good for a lot of use cases.
For some use cases, you really don't need fine tuning,
and you can just embed all of your documents
and just do RAG.
Right.
Have you seen in your own application building value
in fine tuning versus just properly contextualizing
the prompt through RAG and other similar techniques?
Yeah, for the type of stuff that I build,
it's mostly demo apps.
You don't really need fine tuning for the stuff that I do.
I think fine tuning is like, yeah,
the way I like to think of it is kind of what you said,
of just like, when you're building an LLM app,
just start super, super basic.
Start with one LLM call and start to do
some prompt engineering, start to break down your problem.
A lot of people try to do too much in one LLM call
and really, if you break your call
into like four different LLM calls,
a lot of people don't realize this:
if you have one big LLM call and you're
asking it to take a piece of text and summarize it
and write some follow-up questions,
if you just separate those into three different calls,
it just performs way better.
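(As a rough illustration of splitting one overloaded prompt into focused calls, here's a minimal sketch assuming the Together Python SDK, a TOGETHER_API_KEY in the environment, and an illustrative model id.)

```python
# pip install together; TOGETHER_API_KEY set in the environment
from together import Together

client = Together()
MODEL = "meta-llama/Llama-3.3-70B-Instruct-Turbo"  # illustrative model id

def ask(instruction: str, text: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
    )
    return resp.choices[0].message.content

article = "..."  # the input text you would otherwise cram into one prompt

# Three focused calls instead of one call asking for everything at once.
summary = ask("Summarize this text in three sentences.", article)
questions = ask("Write three follow-up questions about this text.", article)
title = ask("Suggest a short, descriptive title for this text.", summary)
print(summary, questions, title, sep="\n\n")
```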
Yeah.
I mean, it's a little bit like going and walking up
to someone you work with, and you want them to do something
for you, and you give them a list of 12 things to do at once.
Of course, they might remember the first one
and the last one that you told them.
And then some of that context gets lost.
Models work very similar.
So if you can kind of decompose that
and give it a little bit more specific instructions
and break it apart, of course it ends up performing better.
Exactly.
So just doing that, and then if you
want to embed some knowledge into it, try to do RAG.
If you want to try to get a specific tone,
try to do prompt engineering, try to do a few-shot prompting.
There's so many things I think that you can do
before you get to fine-tuning.
But then there are some use cases where it's like, OK,
we have a lot of data, and we're trying
to do some very, very specific niche thing that LLMs aren't traditionally good at.
We need to do fine tuning for this use case.
But other than that, yeah, I'd recommend
to use that as a last resort in optimizing your problem.
But for a lot of the demo apps that I build,
I don't really need that for the most part.
Do you see, from a customer standpoint,
are people picking one model or one to two models
that they like and building most of their application on that?
Or are they using a combination of models and techniques
to actually build these applications
to deliver to production workloads?
Yeah, that's a really good question.
I think it's definitely a little bit top heavy in the sense that,
out of the now hundreds of open source models that there are,
there is definitely a top 25 or something that's used most of the time.
With that said, a lot of people don't just use one model in their apps.
Like, a lot of the time you'll need...
You'll need variety.
There's a lot of, like,
LLM techniques.
So we actually have something called
Mixture of Agents that we released,
which is the idea that you could send
one query to four different LLMs and then just have a fifth LLM just
aggregate it.
And if you do that, you just start
to get way better responses than even the frontier models.
So if you're implementing a technique like that,
you'd want to use multiple models.
But even if you're not doing something like that,
just the variety of models really helps.
Because generally, you'll need one really good small model.
And then maybe you'll have your flagship model.
And then maybe you'll have a model specifically
for a really specific use case, like summarization.
So yeah, we see folks usually use at least a couple of models
when we're talking about a full AI app or AI agent.
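(A minimal sketch of the Mixture of Agents pattern as described here: fan one question out to several open-source models, then have one more model aggregate the answers. The model ids are illustrative, and this assumes the Together Python SDK.)

```python
from together import Together

client = Together()

REFERENCE_MODELS = [            # illustrative model ids
    "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "Qwen/Qwen2.5-72B-Instruct-Turbo",
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
]
AGGREGATOR = "deepseek-ai/DeepSeek-V3"  # illustrative

def complete(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def mixture_of_agents(question: str) -> str:
    # 1) fan the question out to each reference model
    drafts = [complete(m, question) for m in REFERENCE_MODELS]
    # 2) ask the aggregator model to synthesize the candidate answers
    numbered = "\n\n".join(f"Answer {i+1}:\n{d}" for i, d in enumerate(drafts))
    prompt = (
        "You are given several candidate answers to the same question. "
        "Synthesize them into one accurate, well-written answer.\n\n"
        f"Question: {question}\n\n{numbered}"
    )
    return complete(AGGREGATOR, prompt)

print(mixture_of_agents("Explain speculative decoding in two sentences."))
```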
In terms of the custom work that's been done
around building this inference engine,
what goes into actually building an inference engine?
What does that from an architectural standpoint break down into?
Yeah, that's a really good question that I feel like I'm not
qualified to really do justice.
I'd definitely want to lean on one of our researchers
to come on and speak to that.
OK, no problem.
In terms of with the way your inference endpoints work,
so you have dedicated inference endpoints.
And then is there also a serverless version of this
that is more like a shared compute cluster
that people are hitting?
Exactly.
That's exactly it.
Yep, we have serverless endpoints.
They're definitely shared among different customers.
So if you want the top speed possible,
you probably want something like a dedicated endpoint.
Or if you want complete reliability
and the speed never changing, you
definitely want something like a dedicated endpoint.
But our serverless endpoints are still really, really good.
And we work a lot to optimize them.
And we do a lot of load balancing
to make sure we have enough GPUs on it
when the load gets really high for a certain model,
all of that kind of stuff.
Are there any cold start issues with using a serverless model for this?
Nope, no cold start issues.
We don't scale down to zero.
So we'll always have, at a minimum, one replica running
for every single serverless model that we host.
So no cold starts.
Can you talk a little bit about the RedPajama project?
Yeah, I feel like that's another one where
that was before my time at Together.
And yeah, I actually don't know a ton about that.
OK.
In terms of fine tuning, so you talked a little bit
about how you need high quality data to do that.
But what's that kind of workload look like?
What is it in that process of actually fine-tuning a model?
And what do I get at the end of that?
Yeah, yeah.
Great question.
So for fine-tuning, you just want
me to walk you through the full process of how
fine-tuning works, essentially?
Yeah, absolutely.
Yeah.
OK, from a user perspective.
So yeah, if I'm a user and I want
to come to Together and do fine tuning,
I need a data set for my specific use case,
whatever it is.
I can actually give you an example of one
that I've done myself before, which is LLMs,
especially a while ago, like a year ago,
used to be really bad at like kind of multi-step
word math problems.
Some of them are still not great,
especially the smaller models.
And so what I did is I thought, okay, like,
let me see if I can try to fine tune a really small model,
like a Llama 3 8B model, to get really good at doing these word problems.
And yeah, that's kind of where I started.
So then the first step there is like, I need a data set of word problems.
And so there's kind of two directions to take.
One is, I go try to find something online
and see if it works for me.
Hugging Face has an amazing repository
of a lot of different data sets.
In my case, I checked a bunch of different math data sets,
and I found one that was really good called MathInstruct that
had a list of about 200,000 math problems and their answers.
So I found this data set.
It was labeled, which was great.
So now I found my data set. Otherwise, if I didn't find a data set, I'd want to check if I have my
own data that I want to use. If I don't have my own data that I want to use, the other solution
kind of is trying to generate synthetic data, which a lot of people do. And a lot of people
even take some of the data that they have and realize it's not enough,
and then just augment it with synthetic data generation.
And so at the end of this, whatever method you're doing,
you come up with a data set.
And a lot of the time, you want to also do some filtering.
Ideally, you want to make sure the data set is high
quality.
You want to remove duplicates.
Even the data set that I found online,
I had to remove a bunch of duplicates that I found in it.
So I went and did a little bit of data cleaning.
And then at the end of the day, you
have this data set of essentially it's
like prompt and response.
It's like the math question and the answer to the right.
And we support a lot of different formats,
which is great.
So you can get it into whatever format you want.
But generally, a simple two-column file is what you want.
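(As a concrete illustration of that prompt/response shape, here's a small script that writes a couple of word-problem examples as JSON Lines; the exact field names a given fine-tuning API expects may differ, so treat them as placeholders.)

```python
import json

# Two hypothetical training examples in prompt/response form.
examples = [
    {"prompt": "A train travels 60 miles in 1.5 hours. What is its average speed?",
     "completion": "Average speed = 60 / 1.5 = 40 miles per hour."},
    {"prompt": "Sara has 3 boxes of 12 apples and gives away 7. How many are left?",
     "completion": "3 * 12 = 36 apples; 36 - 7 = 29 apples are left."},
]

# Write one JSON object per line (JSON Lines), a common fine-tuning layout.
with open("math_word_problems.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```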
So after you have that, then you say,
okay, now I want to actually do the fine-tuning process.
So you go to Together, you upload your data set.
We have a one-line CLI command to do that,
or a Python library, or a TypeScript library,
a lot of ways to do it.
But in one line, you upload the data set,
and then in another line, you can trigger a fine-tuning job.
And when triggering a fine-tuning job,
there's, again, a lot of different parameters
you can play around with.
The main ones are probably the model
that you want to fine-tune, the number of epochs.
And then we get into a lot of specifics, like batch size
and a lot of other stuff.
But generally, we have good default settings.
So for the most part, you can just come in, give us your model, give us your number of epochs, and launch a fine-tuning job.
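(A rough sketch of that two-step flow, upload the dataset and then launch the job, using the Together Python SDK. The method names, parameters, and base-model id here are best-effort assumptions for illustration, so check the current Together docs before relying on them.)

```python
from together import Together

client = Together()

# 1) Upload the prompt/response dataset written out earlier (JSON Lines).
uploaded = client.files.upload(file="math_word_problems.jsonl")

# 2) Launch a fine-tuning job against a base model.
job = client.fine_tuning.create(
    training_file=uploaded.id,
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative base model
    n_epochs=3,
)

# Poll this job id until it finishes, then run inference on the resulting model.
print(job.id)
```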
So after a fine-tuning job is done, you can then run inference on your model.
And we actually give you the opportunity for LoRA fine-tunes to either deploy it on a dedicated instance, so you can take your fine-tuned model and deploy it,
or you can download your model and go run it somewhere else, or run it on your own GPUs, or do whatever you want with it.
Or what's actually really cool is you can just
run it right away and pay for it usage-based.
So you don't have to do dedicated endpoint.
You could just run it right away, essentially
very similar to serverless.
And you can run it and see how it goes.
We see some customers will launch multiple fine-tuning jobs with different kind of parameters or different data or all of that.
But yeah, then after I have my fine-tuning job and I can run inference on it, a lot of the time we'll see customers run evals on it, obviously.
So you want to make sure and see how good this model is.
So in my case, you know, the Llama 3 8B model that I took and fine tuned on those 200,000 math problems, I tested it
on 1,000 math problems that the model had never seen.
And I tested Llama 3 8B, Llama 3 70B, and then
some frontier models like GPT-4.
And I was able to get my small Llama 3 8B model
to be better at answering these math questions than even GPT-4.
In terms of, so you had this, you
kind of ended up with a set of 200,000 examples.
But did you experiment at all with what
happens if I give it 100,000 or 50,000?
How much data ultimately do you need?
Yeah, that's a really good question.
Highly dependent on your use case,
highly dependent on the model you use, all of that.
But in general, so there's two types of fine tuning.
There's LoRA fine-tuning, and then there's full fine-tuning.
And in my case, I was doing full fine tuning.
So for something like full fine tuning,
you'll want a good amount of data.
I'd probably say 100,000 plus data points,
probably, for a full fine tune.
But the great part is, with a LoRA fine-tune, that actually
goes from hundreds of thousands to maybe thousands of examples
that you need to actually get a good model.
So yeah, that's what I would say.
I'd say if you don't have a lot of data,
you probably want to just try a LoRA fine-tune
and see how that goes.
Yeah, and the big difference for those that are not familiar
with the concept of LoRA fine-tuning is,
essentially you're restricting the fine-tuning
to small low-rank adapter matrices
versus the full set of weights.
So you're not adjusting all the base model weights,
just essentially a small set of added adapter weights.
So it's less intensive from a compute perspective,
going to take less time to do.
But it's probably harder to do a drastic behavioral change
than something like a full fine tune.
How long does that process take?
I mean, obviously, it depends on the data set and so forth.
But in your case, what was the turnaround time
on that fine tune job?
Yeah, great question.
In my case, depending on the number of epochs and everything,
I think it ranged from two hours to 14 hours.
Yeah.
OK, well, not too bad.
I mean, I remember back in the early days of GPT-3,
you're talking multiple days to do a fine-tune job.
So if we're talking a handful of hours, that's not bad.
Yeah, not bad at all.
But yeah, then if you have millions or tens of millions
of data points, then yeah, you're
definitely looking at a factor of days instead of hours.
Yeah, absolutely.
There's all these different techniques around like when you're building applications
in terms of getting like sort of the reliability
and the performance that you want out of the application.
And a lot of them depend on doing inference multiple times.
Like if I'm doing something where I'm producing an output
and maybe I want another call to judge the output
and I do some sort of inference loop, a series of reflection,
or the example that you talked about earlier, where maybe I'm calling
multiple models and then have another model sort of summarize it in like a
sort of parent-child relationship type of pattern.
Like, how do you, from a cost perspective, like how do you sort of balance
reliability performance without completely you know, completely blowing
apart your token budget?
Yeah, that's yeah, asking some difficult questions over here.
I think that's like that's a very hard problem to solve in general,
right? And generally what I like to tell people is like
start as simple as you can.
Don't try to overcomplicate it before you don't need to.
Sometimes you just use a small LLM and you send it a single thing and you're like, oh
my God, that's perfect.
It's doing exactly what I want it to do.
In that case, do some evals, which I feel like evals are just underrated in general.
If you have a serious LLM app, you need evals.
You absolutely need to.
Yeah, I mean, you can't be vibe checking production apps.
Oh yeah.
What do you see from an eval perspective,
like either in your own projects
or when you're talking to customers and things like that,
what are people doing?
Are they building their own stuff?
Are they using platforms like BrainTrust and others
to do evals?
Yeah, I think a surprising amount of people
do their own stuff, which I found interesting.
And I think it's because different people are testing
very different things.
And sometimes it's very custom based
on what that specific person needs.
Like a good example, I think, is I built this app called
LlamaCoder.
And it's basically like a poor man's
v0 or Lovable or Bolt.
And I'm starting to build evals for it.
I launched it, it started to get popular,
and so I was like, okay, I want to try
and take this a little bit more seriously
and add evals to it.
But evals are really hard to add in that case.
So this is an app where you put in a prompt
and you get an app.
But how do you judge it?
How do you judge that app's output?
And so I started to think about that and think like,
oh, maybe one potential way to do that
would be to write a script that spins up
like a headless browser and types in a prompt
and waits for the app to finish
and then takes a screenshot of the page
and then sends it to a vision model
to try to judge it in some way.
And so anyway, you start to get very, very specific
and I don't think there's any eval company
that's built like a UI for something like that
necessarily. So yeah, we see a lot of people build their own.
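(A rough sketch of that kind of bespoke eval: drive the app with a headless browser, screenshot the result, and ask a vision model to grade it. This assumes Playwright and the Together Python SDK; the URL, selectors, wait time, and model id are illustrative placeholders.)

```python
# pip install playwright together && playwright install chromium
import base64
from playwright.sync_api import sync_playwright
from together import Together

client = Together()
VISION_MODEL = "meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo"  # illustrative

def screenshot_generated_app(prompt: str) -> bytes:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000")    # the app under test (placeholder)
        page.fill("textarea", prompt)          # illustrative selectors
        page.click("button[type=submit]")
        page.wait_for_timeout(60_000)          # crude wait for generation to finish
        shot = page.screenshot(full_page=True)
        browser.close()
        return shot

def grade(prompt: str, png: bytes) -> str:
    data_url = "data:image/png;base64," + base64.b64encode(png).decode()
    resp = client.chat.completions.create(
        model=VISION_MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Rate 1-10 how well this screenshot fulfills the request: {prompt}"},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(grade("a pomodoro timer app", screenshot_generated_app("a pomodoro timer app")))
```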
But also, if you're just testing more
traditional chat kind of applications, where you want to do
LLM-as-a-judge, all that kind of stuff, I think platforms like
Braintrust are great. They're a good partner of ours. And I use
Braintrust for some of my chat apps as well.
Yeah, that's consistent with what I've seen, too, also talking to people in the industry is,
I think ultimately a lot of this stuff ends up being somewhat custom.
You know, even when you're starting to build out, I think, more complicated like workflows or agents,
like a lot of times people end up going away from some of the frameworks that are available
or using very lightweight version of the frameworks
because they want to be able to have more control essentially.
And they end up doing a lot of just like bespoke custom work
just I think in part due to the level of immaturity
of all the technology.
It's kind of just part and parcel
with where the tech stack is currently.
100%.
In terms of what you're seeing, you've been working in the space for what seems like a
long time now, relative to how long a lot of people have, what are some of the
most interesting use cases and uses of Together that you've seen from customers?
Yeah, that's a good question.
Most interesting use cases of Together.
I mean, maybe I could start with some of the more common ones.
A lot of folks use us for things like summarization.
A lot of people use us for chat apps.
We power DuckDuckGo's online chat.
We power this company called Zomato,
which is one of the leading delivery companies in India.
We power their kind of chat bot agent.
So we do a lot of these kind of chat bots, like TripAdvisor, we help them with a lot of stuff.
Yeah, we also power summarization for a lot of these different companies.
And then we just have a lot of these really unique customers. Those are kind of the customers
using our inference engine, using serverless or dedicated.
But then we have a lot of really interesting customers
on the GPU side and people kind of doing all sorts of things.
We have video game companies that are using us.
Latitude has built a bunch of video games,
leveraged our inference service.
Some audio companies like Cartesia.
Cartesia is a really great company
that has a fantastic model.
And we help serve their model.
So we do inference for them, essentially,
on our GPU clusters.
Like I mentioned, some video companies like Pica,
where we help them not only train their model on our GPUs,
but also run inference on it.
Yeah.
You mentioned in terms of use cases,
people doing things like summarization, chatbots.
These are bread-and-butter use cases for generative AI.
And I think these are not necessarily the use cases
that get a lot of headlines, right? These are actual practical use cases.
These are real-world examples that are very, very useful applications of generative
AI. Whereas a lot of the headlines, I think, end up getting dedicated to, I would say, a little bit more out-there types of applications that look very impressive, but are they
really going to see the light of day in a production scenario for a while? Do you think
that we end up focusing a little bit too much on the hype side of AI, and then we lose a
little bit of the practicality of what companies are actually being successful with?
I think so for sure. If you go on Twitter and you see what people are talking about there versus
what enterprise companies are actually shipping in production, a lot of the time it's two very different things.
But I think that's also what's a little bit cool about AI,
is that there's always bleeding edge stuff that comes out
that's interesting to look into.
And we have seen people go a little bit beyond
the traditional chat bot summarization stuff.
I mean, we see a lot of people do text extraction.
We see people do a lot of just image generation
for different use cases, like personalization, a lot of
code generation too, actually, I forgot to mention that one. A lot of code generation.
Yeah, people just do that. And a lot of people are also using, I guess, one of our
newer products. We acquired a company called CodeSandbox, and we have a Code
Sandbox SDK now, where people can spin up these mini VMs.
A lot of the Lovable, Bolt-type companies
use it to be able to run an entire Next.js app, basically,
on these VMs.
So we have folks using us for that, too.
Yeah, I think that when it comes to writing code,
in a lot of ways, software engineering
has become the tip of the spear for a lot of these AI use
cases, which I think maybe is surprising to some.
But I think potentially one of the reasons why
AI has been so successful there is, one,
there's inherently a human in a loop that's
going to be evaluating the outcome.
So if a mistake is made, presumably it
gets caught by a person at some point.
But the other thing is, from tying it back
to what we were talking about earlier,
from an eval test perspective, there's
a lot of existing stuff to test against.
I think one of the challenges with the really general use
of a general model is
what is the eval test case? If someone can ask anything in the world to this thing,
how do you test whether it's able to answer that correctly? And the advantage of historical
predictive AI or purpose-built models is that you have a very good understanding when you have that
model of what the inputs and outputs should be, So it's easier to build like a test set to test and evaluate it.
And when it comes to code, like if I have, you know, some sort of agent that's going to do PR reviews,
well, there is probably thousands of human PR reviews in your organization that become essentially your eval test set for that.
100%. Yeah, that's a really good way to think about it.
What are, for you and your team, what kind of projects
are you guys focused on?
Yeah, so I run the DevRel team here at Together.
So we have a few different focuses.
I think on one side, we really just
care about the developer experience of the product.
And so we're really involved in being customer zero,
trying to test out everything before it goes out,
trying to relay information from the community
back to the product teams, trying to really think about,
OK, new member onboarding: someone signs up
for the website, what really happens?
How do we basically take developers down the happy path and
get the time to first value, which in our case is making their first API call,
as fast as possible after they sign up? So we care a lot about that
and think a lot about that. Our docs as well. We own our documentation and we'll continue
to make that better and ship more guides and stuff like that.
So that's from the DX point of view.
And then from the content point of view, we'll put out a bunch of blog posts and videos
and go to conferences and try to get the word out.
And then the last point is we just try to be customers ourselves.
We try to build apps like our customers do. And that leads to both some great feedback, I think,
for our engineering team, but also is a cool, like,
growth engine.
Like, we'll just release interesting apps and make them
free and fully open source and have all these people come in
and developers love looking at code.
So they'll check them out, they'll clone them, and
obviously to run any of the projects,
you need a Together API key
to call our various open source models.
So yeah, a big part of what we do is just try to build
really cool, impressive demos to highlight
cool use cases from Together.
And if I'm using a model hosted by Together,
then what is sort of the application experience
of programming against that?
Is it similar or comparable to using something like OpenAI
directly, or going through some sort of
GenAI framework, like LangChain or LlamaIndex
or something like that?
Yeah, yeah, I would say it's similar to all of that.
You can actually use our API
through the OpenAI SDK by changing like two lines of code.
And you can use our API with LlamaIndex or LangChain
or LangGraph or CrewAI, or, you know,
we're compatible basically with every single
major framework out there.
And then we also have our own SDKs.
You can npm install or pip install together,
and in like five, six lines of code, you can get up and running.
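(The "change two lines" pattern looks roughly like this: point the OpenAI SDK at Together's OpenAI-compatible endpoint. The base URL and model id shown are assumptions to verify against the docs.)

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],   # line 1: your Together key
    base_url="https://api.together.xyz/v1",   # line 2: Together's endpoint (assumed)
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative model id
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```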
OK.
And then can you talk a little bit about your Agent Recipes
project, which, I forget,
you came up with a little while ago.
I thought it was a really, really cool project
to show how some of these sort of agent design patterns
essentially work.
Yeah, absolutely.
So Agent Recipes, it was actually
inspired by an article that Anthropic did
on how to build agents.
And I thought they did it really well.
Yeah, that's a great article.
It's an amazing article. I highly recommend everybody read it.
I talked to the Anthropic team a few weeks back and they told me they're working on a version 2,
which I'm really excited about, or part 2 to that blog post. But anyway, I was kind of inspired by
that and I liked how they broke down these different agents, or really mostly workflows,
into different recipes
that they've been seeing their customers use.
And so I thought I'd just go a little bit of a step further
and not only take that kind of use case,
but also try to give people a little bit more information
on what are some good use cases for this specific recipe,
and also write code.
So I wrote TypeScript and Python code.
Me and Zan on my team worked on this.
And yeah, we tried to build a really cool resource
for people to check out when building agents
and building LLM apps,
people to go and literally copy paste code
for different patterns that we've seen work well.
And we're actually expanding on it even more
and releasing a few kind of recipes of our own pretty soon.
Oh, cool.
Yeah, I think when you mentioned
they're a little bit more like workflows
than fully agentic.
I think in reality, like, going back
to what we were talking about earlier of putting
some of these things actually into production,
I think when it comes to agents, at least what I'm seeing
is most things that are hitting production
are probably more akin to workflows than they are to,
like, I just give this thing access to AWS,
and it goes and optimizes and fixes everything.
I think we're a long way from that level of autonomy.
And a lot of times, I think a good
application of an agent is like, we have a playbook that some person follows today. And if I have a playbook already defined,
then that's a very good application of essentially doing something agentic, but it might look a little bit more like a workflow. And
that gives you a little bit more determinism over something that is inherently
stochastic in nature in terms of operating
with these probabilistic models.
100%, yeah, that's what we've been seeing as well.
Workflows is really where you can get a lot of value out
of having these LLMs do a lot of cool things.
But also, you're kind of restricting their output
a little bit more and giving them more guardrails.
And as a result, it's just so much more reliable than just,
like you said, giving an agent just random access
to a ton of different tools.
Yeah, and I think, too, to your point earlier about where
you were saying why people make the mistake of trying
to give too many instructions in a prompt,
by thinking about sort of decomposing a problem into a multi-agent,
whether it's a workflow or whatever pattern,
it's a natural framework for helping people shrink
the problem in the prompt that they're getting.
This particular node in this graph or in this workflow,
the only thing I want you to do is
produce a summary of this output
from another part of the process or something like that.
And that just increases your reliability
and also makes your testing easier.
100%.
Yeah, I've experimented with it a lot for this project
that we just launched a couple of days ago called Together
Chat.
And yeah, we ended up going with more of a workflow approach
for that one with a router.
And the way it works is you have this kind of like chat, it's kind of like a chatbot, right?
But you send a message and the router will intelligently try to figure out like
whether it should just answer the question, whether it should use a search API to try to look it up,
or whether it should generate an image because you can also generate images in the chat.
And so I found this kind of approach to work generally really well.
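(A minimal sketch of that router-style workflow: one cheap classification call decides whether to answer directly, search, or generate an image. Model ids are illustrative, and the search/image helpers are stubs standing in for real tool calls.)

```python
from together import Together

client = Together()
ROUTER_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"  # illustrative
ANSWER_MODEL = "meta-llama/Llama-3.3-70B-Instruct-Turbo"      # illustrative

def _chat(model: str, content: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": content}]
    )
    return resp.choices[0].message.content

def route(message: str) -> str:
    # Small, fast model classifies the request into one of three routes.
    label = _chat(
        ROUTER_MODEL,
        "Classify the user message as exactly one of ANSWER, SEARCH, or IMAGE. "
        "Reply with only the label.\n\nMessage: " + message,
    ).strip().upper()
    return label if label in {"ANSWER", "SEARCH", "IMAGE"} else "ANSWER"

def search_and_answer(message: str) -> str:
    return "[stub] would call a search API here, then answer with citations"

def generate_image(message: str) -> str:
    return "[stub] would call an image-generation model here"

def handle(message: str) -> str:
    choice = route(message)
    if choice == "SEARCH":
        return search_and_answer(message)
    if choice == "IMAGE":
        return generate_image(message)
    return _chat(ANSWER_MODEL, message)

print(handle("What's the capital of France?"))
```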
Yeah, I think that's part,
that's sort of where the agentic nature is,
is the decision around what tool or data to access,
not necessarily in the full sort of keys of the kingdom
and the execution of that once I figured out
what tools I'm going to use.
In terms of, you've built a lot of AI apps
over the last couple of years.
What are some of the things that you've learned along the way
that are typical mistakes that you
see as people who are just starting to enter,
building some of these types of applications?
I think most people don't launch fast enough.
I think that's something I've seen over and over and over again,
where people try to overcomplicate problems a little bit.
And I think the great part,
I think for a lot of these apps, what I really like is that it's very quick.
I spend one week, two weeks at most on any of them, and I launch them,
and sometimes I'll get millions of people
to go check them out and say they work great
and all this stuff.
When really, I didn't work on this thing
with a team of five people for 10 months.
This is just a really quick one or two week thing.
And then you could just add evals in and just build up
to where you want to, or just listen to your users
and what they complain about.
And so I feel like for a lot of AI apps,
it makes sense to launch them that way,
is to get them to 90% very quickly,
and launch and see how it goes, or do early access,
or beta, or whatever it is.
So yeah, in terms of the launching,
that's one thing I've seen.
The other thing I've seen is that there's
no shortage of cool
stuff to build in the AI space right now. If you're a builder,
I feel like this is one of the best times to be building cool things.
I keep a Notion doc of ideas that I want to build, and it's at like 70-something right now,
and I have a top 15 right now that I'm working towards trying to get done, to ship all of them this year.
But yeah, it's just a really exciting time, and you can do so much with so little, I think.
So many of these apps that I build are just one API endpoint or two API
endpoints, or sometimes it's a chain of two or three. But I've found it fascinating how much
you can build with not a lot of API calls
and just a very simple app.
Yeah, I mean, I think you're
doing a great service to these models
with a lot of demonstrations
of the art of the possible.
I would love to see businesses do a little bit more
of this themselves.
I think a lot of times, when I talk to businesses,
especially enterprises, they're really hyper-focused
on trying to figure out the killer use case
that's gonna deliver massive ROI to their business,
and then they invest a tremendous amount of time.
But you can make really meaningful impact
by focusing on, you know,
what are my internal knowledge workers doing today
that feels like, you know, kind of a waste of process
or a waste of a person's time around collecting
a bunch of information from different places.
Like every business has these things.
If you just went sort of piece by piece
of automating some of those, you're going to learn a lot
and you're going to get, I think ROI pretty quickly.
And it doesn't require a 50 person engineering team
and a ton of, you know, like massive testing
to get something essentially that's, you know, 90% good.
Yeah, absolutely.
You know, going back to Together, I guess, what's on the horizon for you guys from a product standpoint?
You've raised this large amount of money now.
What's next for you?
Yeah, a lot of things.
Well, one thing, like I said, is we just released Together Chat, which is kind
of like our first official consumer app as part of Together.
So we're kind of, I would say, dabbling a little bit in the consumer space and seeing
how that goes.
We're also working on a lot of things across many different products.
So one thing that we have coming up is called Together Code Interpreter. Like I said before,
one of the big use cases of Together is just generating code from a lot of our LLMs.
And we're going to let people not only generate code, but actually run code and do both with
like the same API key, essentially, or the same SDK. So that's something that's coming that I'm really excited about.
Continuing to just host amazing open source models
as they come up, like better kernels, better inference stack,
just being able to just be faster.
And we're seeing a lot of companies
ask for Blackwells now, like B200s.
So we're kind of moving from Hopper to Blackwell and we're just getting like, you know,
thousands and thousands of Blackwells coming in very soon that I'm excited about
because that'll be an improvement across the board, you know, for our customers that do inference and training,
but also for our own like models to be able to run even faster on this great hardware.
Yeah, we're improving our fine tuning service
to make it a lot easier.
For our GPU cluster service, right now,
you have to talk to our sales team
and try to do things that way.
But we're releasing instant GPU clusters,
which I think is waitlist only right now.
But we'll hopefully be releasing somewhat soon
where you can just sign up for a platform
and get access to GPUs right away,
or very, very quick in a self-serve manner.
So yeah, kind of just trying to push on all fronts
and just trying to be this,
our branding is like we're the AI acceleration cloud
and just be this basically end-to-end AI cloud for whatever you're trying to do, from inference to fine-tuning to training, and try to run your entire stack on Together.
Awesome. Well, Hassan, thanks so much for coming back. This was great.
Yeah, thanks for having me.
Alright, cheers.