Software Huddle - Fast Inference with Hassan El Mghari
Episode Date: April 8, 2025. Today we have Hassan back on the show. Hassan was one of our first guests for Huddle when he was working at Vercel, but since then, he's joined Together AI, one of the hottest companies in the world. ...They just raised a massive series B round. Hassan joins us to talk about Together AI, inference optimization and building AI applications. We touch on a bunch of topics like customer uses of AI, best practices for building apps, and what's next for Together AI.
Timestamps
01:42 Opportunity at Together AI
04:26 Together raised a big round
06:06 Vision Behind Together AI
08:32 Problems in running Open Source Models
11:40 Speed For Inference
14:24 Fine Tuning
19:23 One or Two Models or a Combination of them
21:32 Serverless
22:21 Cold Start issues?
27:46 How much data do you need?
30:00 Balancing Reliability and Cost
34:07 How customers are using Together
42:36 Agent Recipes
47:03 Typical Mistakes building AI apps
Transcript
In terms of using an open source model, what is typically the challenges that people run into?
You need a lot of expertise to run open source models on GPUs. There's a lot of LLM serving
frameworks like vLLM or TRT-LLM. You need to look one of those up, make sure they support your open
source model, make sure they support your architecture, get it up and running on the GPU,
test it out, make sure it's working well.
What was the original vision behind the company?
It started off more in crypto, trying to use excess GPUs from people to leverage that extra compute, essentially.
You've built a lot of AI apps over the last, you know, couple years. What are some of the things that
you've learned along the way that are typical sort of
mistakes that you see as people who are just starting to enter,
you know, building some of these types of applications?
Hey, everyone, Sean here. And today we have Hassan El
Mghari back on the show. Hassan was one of our first guests for
Huddle when he was working at Vercel.
But since then, he's joined Together AI, one of the hottest
companies in the world.
They just raised a massive Series B.
I didn't even know that a Series B round could be that big.
Hassan joins me to talk about Together AI, inference
optimization, and building AI applications.
We touch on a bunch of different topics,
like customer uses of AI, best practices for building apps,
and what's next for Together AI.
With that, let's get you over to the show.
And as always, if you have questions or feedback,
feel free to reach out to Alex or myself.
Hassan, welcome back to Software Huddle.
Hi, Sean, thank you.
Appreciate you having me back.
Yeah, well, thanks for coming back.
You were one of our first guests,
and since then, we were chatting beforehand,
and you made a career change.
I think that probably happened within just a short while
after we last did this.
So you're now at Together AI,
which we'll be talking about in depth,
but what led to that decision?
What particularly caught your eye about the opportunity there?
Yeah. I mean, I think like the AI industry
has just been like blowing up.
I've been really into building AI apps
even before joining together,
just building these little like side projects
since like 2023 really.
So yeah, it seemed like a great fit.
I kind of interviewed in a bunch of these different
like AI infra companies.
And I love working at infra companies in general
because a lot of them will just have like a great platform
or APIs for building applications.
And that's really what I'm really, really passionate about
is just like getting to build really cool apps.
And so interviewed at a bunch of different AI
infra companies together really stood out
in terms of like the talent that's on the team, the research focus.
We have a really big research team that's working on optimizing a lot of different parts of our stack, which I can get into later.
And then the growth and revenue and fundraising, like everything just seemed really, really exciting.
So I ended up joining a little over a year ago.
How big was the company when you joined and what is it today?
Yeah, great question. I think we were like 35 when I joined.
And now we're like 150.
That's a lot of growth.
It's a little bit crazy. We're 150 now, and I've grown my team to like four people now
as well.
And yeah, it's been big.
Yeah, so I know we talked a lot about some of your side project
app development work previously.
And I often say that there's very few, basically no one
in the world has 10,000 hours of building
AI applications because it's all so new. But you might be the world leader from my perspective in
terms of the number of hours, because you've built so much stuff and you're kind of like early
on that journey of like experimenting as like a, you know, AI engineer, as some people call it now.
Was part of the motivation for wanting to go to an AI infra company because the token costs
of your side projects was starting to creep up and you needed essentially a place where
you get some free credits?
Oh man, that was part of it.
I'm not going to lie.
But yeah, and like these AI workloads are just like so much more expensive than anything
else and like I was running into some limitations before.
In terms of together, you guys raised a big round recently.
Can you talk a little bit about what was that for?
What are you guys hoping to achieve with that?
Yeah, for sure.
So we raised a $305 million Series B round
at like a 3.3 billion valuation.
And really, there's a lot of reasons
for raising that amount of money.
One is if you're in the AI Infra business,
obviously GPUs are very expensive.
And so we're kind of in the business
of doing
both inference and training.
So for inference, we just have an API for developers
to come to our platform and basically
be able to use AI models really easily through our API.
So we have to put a lot of GPUs around a lot of these open
source models.
And then also people come to us for training.
We have a GPU cluster product where
we can just give people access to,
if you want like 50
H100s for like two months to like train a model or really do whatever you want on those
H100s, we do that for you too. So every part of our business basically requires a lot of
GPUs. And so we're either kind of buying or leasing. And so a big part of the fundraise
is going to be for those GPUs. We've seen a ton of demand, and we just need to expand and get a lot more GPUs.
And then part of it is growth as well. We're continually getting bigger. Like I said, we were like 35 people and now we're 150, basically 5x in about a year, and we're going to grow even more this year. So some of it is personnel, some of it is GPUs.
What was the original vision behind the company?
What was sort of the key problem that the founders
had identified that they wanted to try to solve?
Yeah, so it actually started out in crypto land,
funny enough.
The company also is very, very young.
It's like two and a half years old, under three years old. And it started off more in
crypto, trying to use excess GPUs from people to leverage
that extra compute, essentially, from all the GPUs or really high-end laptops that people had from the Bitcoin mining era.
But we saw more promise in just providing GPUs
to a lot of these customers.
So really, it started off because I
think the world needs more compute.
And I think we realized that a lot of people
were very lost with how to use AI
and how to use specifically open source AI,
how to use open source AI models,
how to train their own models from scratch.
A lot of people needed compute,
and there was a big compute shortage, which
is a big part of
why Together did really, really well, because we had all these GPUs when people really
needed them. But also because we have our research team that goes above and beyond and
writes like kernels for these GPUs to make them run really, really fast. And we have an optimized
inference stack now on top of the GPUs to be able to run inference to run these AI models really,
really fast, really efficient, and with a
lot of throughput. And so yeah, I think the core problem was to
help people build with AI, help people train AI models, help
people fine-tune models on their own data. And with a big
emphasis on open source, so people can own their own data.
You know, if you go to an OpenAI or a company like that
and you fine-tune your model, you don't really own it. OpenAI kind of owns it,
you don't have access to it, you can't download it.
You can't exactly export your weights to a file.
Exactly, exactly. And so you don't really own it. There's a lot of like, you know, things
that can happen there. But you know, if you come to Together, you can fine tune your own models,
you own the weights, you can download it and do whatever you want with it. There's no kind of
vendor lock in in that sense. And so yeah. In terms of using an open source model, what is
typically the challenges that people run into if they don't have essentially the limitation of
like, I don't have GPUs, let's say they have GPUs, they can get that from somewhere,
then what problem do they run into
with just being able to run that model themselves?
Yeah, well, first off with closed source models,
you can't run them on your GPU
because they're closed source, you don't have the weights.
And so you can't self-host open AI, for example, right?
So you run into those problems.
Are you more asking like?
I'm asking about open source.
Like if I want to run like a llama model, for example,
and I have access to a GPU cluster somehow,
what is sort of the problems I run into with running that
myself versus going through a managed service where
this is kind of set up for me?
Yeah, great question.
Great question.
So you need a lot of expertise
to run open source models on GPUs.
There's a lot of like LLM serving frameworks
like vLLM or TRT-LLM.
You'd need to like look one of those up
and make sure they support your open source model,
make sure they support your architecture,
get it up and running on the GPU,
test it out, make sure it's working well. So there's a lot of
steps. And so for a lot of our
customers that we see, especially if you're trying to
run just general chat or vision or audio models,
models that have a very common architecture,
specifically, we'll take care of a lot of that for people.
Whether that's like, you can come in through a serverless
API, which you can go to together.ai, sign up,
get an API key, bam, start calling these open source
models.
But also, we'll work with you to customize it
to your specific needs.
If you need extremely low latency on a very specific model
for a specific use case, we can actually help you with that.
We can give you what we call our dedicated endpoint product,
where we give you a set of GPUs that that model is running on.
And we can play around with the GPUs
to kind of get them to do what you care about most,
whether that's, like I said, latency or throughput
or whatever that is.
And then our GPU cluster.
Can you bring your own model?
Yes.
Yep, you can bring your own model as well
and we'll help you host it on our GPUs.
So a lot of people tend to go for that kind of stuff.
We do have a number of customers that use our GPU cluster
product but decide to run inference by themselves,
decide to run a model by themselves.
But for those, those are usually like very custom models.
Like we have a good amount of video companies
on our platform, like Pica Labs or Hedra, a lot of these like text to video companies.
And they run on our GPU clusters.
And our research team actually works with them.
Together claims to be faster than any other
inference engine, I believe.
So first of all, why is speed for inference so important?
And then, I guess, as a follow-up, how do you actually achieve that
sort of performance?
Yeah.
I mean, I'd say there are like certain companies for specific models that may be faster than us at a certain thing.
I don't think it's fair to say that we're the fastest out there for every single model or for every single use case.
But with that said, yeah, we do pride ourselves on our speed and we are usually at the top of the leaderboard for a lot of different things. A big part again
is our research team, honestly. A big part is we custom built our own inference stack to work
really well for LLM specifically. A lot of our focus is on these chat models and helping them run
really, really fast. And so there's our inference engine that just helps LLMs run a little bit
faster.
We have a whole kernels team that
writes kernels for these GPUs in order to make them run faster.
And we have another spec decoding team.
I don't know if you're familiar, but I can give a quick intro
to speculative decoding: it's the idea
that you have a small model predict outputs for a larger model.
And by incorporating that into your LLMs,
you can get way faster speeds.
Because really, it's like the smaller model answering
a good amount of queries, and the big model just kind of
checks tokens as they come through.
So yeah.
And we have folks like Tri Dao, who created FlashAttention and
these really popular mechanisms to get a lot of this stuff to run as fast as possible.
So kind of putting that all together is how I would say we achieve the performance that we achieve.
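(To make the speculative decoding idea concrete, here's a toy, self-contained sketch with stand-in "models." Real systems verify the whole draft in one batched forward pass and use a rejection-sampling scheme over the two models' probabilities, so treat this purely as an illustration of the propose-then-verify loop.)

```python
import random

# Toy stand-ins for a small "draft" model and a large "target" model.
# Each maps a token prefix to a next-token distribution over a tiny vocabulary.
VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

def _toy_distribution(prefix, seed):
    rng = random.Random(len(prefix) * seed)
    weights = [rng.random() for _ in VOCAB]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def draft_model(prefix):   # cheap, approximate model
    return _toy_distribution(prefix, 13)

def target_model(prefix):  # slow, authoritative model
    return _toy_distribution(prefix, 7919)

def speculative_decode(prompt, num_draft=4, max_new_tokens=12):
    """The draft model proposes a short run of tokens; the target model checks
    them in order, and proposals are kept only up to the first disagreement
    (the target's token is used there instead)."""
    out = list(prompt)
    while len(out) < len(prompt) + max_new_tokens and out[-1] != "<eos>":
        # 1) the draft model cheaply proposes num_draft tokens
        ctx, proposals = list(out), []
        for _ in range(num_draft):
            dist = draft_model(ctx)
            tok = max(dist, key=dist.get)
            proposals.append(tok)
            ctx.append(tok)
        # 2) the target model verifies each proposal; accepted drafts are "free"
        for tok in proposals:
            dist = target_model(out)
            target_tok = max(dist, key=dist.get)
            out.append(tok if tok == target_tok else target_tok)
            if tok != target_tok:
                break  # first disagreement: discard the rest of the draft

    return out

print(" ".join(speculative_decode(["the"])))
```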
What other areas are you guys focused on
in terms of excelling optimization for AI beyond?
Is it solely focused on these inference workloads
or other types of workloads as well?
Yeah, I would say there's inference
and there's training workloads.
And those are the two we're primarily working on.
For the inference workload, yeah, like I said,
the speculative decoding and the inference engine
are the main things.
And then for the training workload,
I think we have a product called Together Kernels Collection,
which is essentially a collection of kernels
that we wrote that just make GPUs run a little bit more
efficiently for training as well.
And then, I would assume that the majority of your customers
are probably using you essentially for inference
or as an inference endpoint.
But do you see a lot of people fine tuning models as well?
Yeah, I'd say fine tuning is definitely less popular
than inference just because it's
not as accessible.
And we're working to make it more accessible.
We actually have a really exciting fine tuning update that's going to come out in a few weeks.
But yeah, we just see less people doing it just because you need a lot more, right?
To fine tune a model, you need data.
And you don't only need data, you need good, high-quality data, labeled in the right way,
in the right format. So there are a lot of prerequisites
before you get to fine-tuning.
But we've seen for the customers that use it,
they do get a lot of value out of it.
So we're seeing it as something that's growing,
and we're trying to drop the barriers a little
bit, trying to just make the whole process easier.
You know, the end vision, hopefully, is that we can
almost automatically fine-tune models for you based on
your traffic. Like, you're using Together,
you're using our inference API on your production app,
you know, people use it and all of a sudden
you have like 10,000 data points from your users
and you can just take that and fine tune a smaller model
to achieve similar performance at a much lower cost.
You know, and a lot of that stuff can just be like
that flywheel can just be automated.
So we're kind of building towards that vision.
Yeah.
What are the typical use cases where
people are doing fine tuning?
So there's a few things.
One is just to get a smaller model
to do what a bigger model can do.
And there's a lot of use cases there.
One is latency.
Let's say you need a really powerful reasoning model,
like DeepSeek R1, to do whatever use case you want to do.
But DeepSeek R1 is a little bit too slow for your use case.
And so you want to train a smaller model
to do the same thing faster.
So we see a lot of those kind of use cases as well,
like just save money and be faster.
But then we see a lot of use cases where you want to change
how an LLM sort of responds in terms of the tone
or try to teach the LLM something a little bit new.
Yeah, there's a lot of small use cases there.
And a lot of people will just do RAG instead of fine tuning.
That's really good for a lot of use cases.
For some use cases, you really don't need fine tuning,
and you can just embed all of your documents
and just do RAG.
Right.
Have you seen in your own application building value
in fine tuning versus just properly contextualizing
the prompt through RAG and other similar techniques?
Yeah, for the type of stuff that I build,
it's mostly demo apps.
You don't really need fine tuning for the stuff that I do.
I think fine tuning is like, yeah,
the way I like to think of it is kind of what you said,
of just like, when you're building an LLM app,
just start super, super basic.
Start with one LLM call and start to do
some prompt engineering, start to break down your problem.
A lot of people try to do too much in one LLM call
and really, if you break your call
into like four different LLM calls,
a lot of people don't realize this:
if you have one big LLM call and you're
asking it to take a piece of text and summarize it
and write some follow-up questions,
if you just separate those into three different calls,
it just performs way better.
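(As a rough illustration of splitting one overloaded prompt into focused calls, here's a minimal sketch assuming the Together Python SDK, a TOGETHER_API_KEY in the environment, and an illustrative model id.)

```python
# pip install together; TOGETHER_API_KEY set in the environment
from together import Together

client = Together()
MODEL = "meta-llama/Llama-3.3-70B-Instruct-Turbo"  # illustrative model id

def ask(instruction: str, text: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
    )
    return resp.choices[0].message.content

article = "..."  # the input text you would otherwise cram into one prompt

# Three focused calls instead of one call asking for everything at once.
summary = ask("Summarize this text in three sentences.", article)
questions = ask("Write three follow-up questions about this text.", article)
title = ask("Suggest a short, descriptive title for this text.", summary)
print(summary, questions, title, sep="\n\n")
```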
Yeah.
I mean, it's a little bit like going and walking up
to someone you work with, and you want them to do something
for you, and you give them a list of 12 things to do at once.
Of course, they might remember the first one
and the last one that you told them.
And then some of that context gets lost.
Models work very similar.
So if you can kind of decompose that
and give it a little bit more specific instructions
and break it apart, of course it ends up performing better.
Exactly.
So just doing that, and then if you
want to embed some knowledge into it, try to do RAG.
If you want to try to get a specific tone,
try to do prompt engineering, try to do a few-shot prompting.
There's so many things I think that you can do
before you get to fine-tuning.
But then there are some use cases where it's like, OK,
we have a lot of data, and we're trying
to do some very, very specific niche thing that LLMs aren't traditionally good at.
We need to do fine tuning for this use case.
But other than that, yeah, I'd recommend
to use that as a last resort in optimizing your problem.
But for a lot of the demo apps that I build,
I don't really need that for the most part.
Do you see, from a customer standpoint,
are people picking one model or one to two models
that they like and building most of their application on that?
Or are they using a combination of models and techniques
to actually build these applications
to deliver to production workloads?
Yeah, that's a really good question.
I think it's definitely a little bit top heavy in the sense that,
out of the now hundreds of open source models that there are,
there is definitely a top 25 or something that's used most of the time.
With that said, a lot of people don't just use one model in their apps.
Like, a lot of the time you'll need...
You'll need variety.
There's a lot of, like,
LLM techniques.
So we actually have something called
Mixture of Agents that we released,
which is the idea that you could send
one query to four different LLMs and then just have a fifth LLM just
aggregate it.
And if you do that, you just start
to get way better responses than even the frontier models.
So if you're implementing a technique like that,
you'd want to use multiple models.
But even if you're not doing something like that,
just the variety of models really helps.
Because generally, you'll need one really good small model.
And then maybe you'll have your flagship model.
And then maybe you'll have a model specifically
for a really specific use case, like summarization.
So yeah, we see folks usually use at least a couple of models
when we're talking about a full AI app or AI agent.
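(A minimal sketch of the Mixture of Agents pattern as described here: fan one question out to several open-source models, then have one more model aggregate the answers. The model ids are illustrative, and this assumes the Together Python SDK.)

```python
from together import Together

client = Together()

REFERENCE_MODELS = [            # illustrative model ids
    "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "Qwen/Qwen2.5-72B-Instruct-Turbo",
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
]
AGGREGATOR = "deepseek-ai/DeepSeek-V3"  # illustrative

def complete(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def mixture_of_agents(question: str) -> str:
    # 1) fan the question out to each reference model
    drafts = [complete(m, question) for m in REFERENCE_MODELS]
    # 2) ask the aggregator model to synthesize the candidate answers
    numbered = "\n\n".join(f"Answer {i+1}:\n{d}" for i, d in enumerate(drafts))
    prompt = (
        "You are given several candidate answers to the same question. "
        "Synthesize them into one accurate, well-written answer.\n\n"
        f"Question: {question}\n\n{numbered}"
    )
    return complete(AGGREGATOR, prompt)

print(mixture_of_agents("Explain speculative decoding in two sentences."))
```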
In terms of the custom work that's been done
around building this inference engine,
what goes into actually building an inference engine?
What does that from an architectural standpoint break down into?
Yeah, that's a really good question that I feel like I'm not
qualified to really do justice.
I'd definitely want to lean on one of our researchers
to come on and speak to that.
OK, no problem.
In terms of with the way your inference endpoints work,
so you have dedicated inference endpoints.
And then is there also a serverless version of this
that is more like a shared compute cluster
that people are hitting?
Exactly.
That's exactly it.
Yep, we have serverless endpoints.
They're definitely shared among different customers.
So if you want the top speed possible,
you probably want something like a dedicated endpoint.
Or if you want complete reliability
and the speed never changing, you
definitely want something like a dedicated endpoint.
But our serverless endpoints are still really, really good.
And we work a lot to optimize them.
And we do a lot of load balancing
to make sure we have enough GPUs on it
when the load gets really high for a certain model,
all of that kind of stuff.
Are there any cold start issues with using a serverless model for this?
Nope, no cold start issues.
We don't scale down to zero.
So we'll always have, at a minimum, one replica running
for every single serverless model that we host.
So no cold starts.
Can you talk a little bit about the RedPajama project?
Yeah, I feel like that's another one where
that was before my time at Together.
And yeah, I actually don't know a ton about that.
OK.
In terms of fine tuning, so you talked a little bit
about how you need high quality data to do that.
But what's that kind of workload look like?
What is it in that process of actually fine-tuning a model?
And what do I get at the end of that?
Yeah, yeah.
Great question.
So for fine-tuning, you just want
me to walk you through the full process of how
fine-tuning works, essentially?
Yeah, absolutely.
Yeah.
OK, from a user perspective.
So yeah, if I'm a user and I want
to come to Together and do fine tuning,
I need a data set for my specific use case,
whatever it is.
I can actually give you an example of one
that I've done myself before, which is LLMs,
especially a while ago, like a year ago,
used to be really bad at like kind of multi-step
word math problems.
Some of them are still not great,
especially the smaller models.
And so what I did is I thought, okay, like,
let me see if I can try to fine tune a really small model,
like a Llama 3 8B model, to get really good at doing these word problems.
And yeah, that's kind of where I started.
So then the first step there is like, I need a data set of word problems.
And so there's kind of two directions to take.
One is, I go try to find something online
and see if it works for me.
Hugging Face has an amazing repository
of a lot of different data sets.
In my case, I checked a bunch of different math data sets,
and I found one that was really good called MathInstruct that
had a list of about 200,000 math problems and their answers.
So I found this data set.
It was labeled, which was great.
So now I found my data set. Otherwise, if I didn't find a data set, I'd want to check if I have my
own data that I want to use. If I don't have my own data that I want to use, the other solution
kind of is trying to generate synthetic data, which a lot of people do. And a lot of people
even take some of the data that they have and realize it's not enough,
and then just augment it with synthetic data generation.
And so at the end of this, whatever method you're doing,
you come up with a data set.
And a lot of the time, you want to also do some filtering.
Ideally, you want to make sure the data set is high
quality.
You want to remove duplicates.
Even the data set that I found online,
I had to remove a bunch of duplicates that I found in it.
So I went and did a little bit of data cleaning.
And then at the end of the day, you
have this data set of essentially it's
like prompt and response.
It's like the math question and the answer to the right.
And we support a lot of different formats,
which is great.
So you can get it into whatever format you want.
But generally, a simple two-column file is what you want.
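(As a concrete illustration of that prompt/response shape, here's a small script that writes a couple of word-problem examples as JSON Lines; the exact field names a given fine-tuning API expects may differ, so treat them as placeholders.)

```python
import json

# Two hypothetical training examples in prompt/response form.
examples = [
    {"prompt": "A train travels 60 miles in 1.5 hours. What is its average speed?",
     "completion": "Average speed = 60 / 1.5 = 40 miles per hour."},
    {"prompt": "Sara has 3 boxes of 12 apples and gives away 7. How many are left?",
     "completion": "3 * 12 = 36 apples; 36 - 7 = 29 apples are left."},
]

# Write one JSON object per line (JSON Lines), a common fine-tuning layout.
with open("math_word_problems.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```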
So after you have that, then you say,
okay, now I want to actually do the fine-tuning process.
So you go to Together, you upload your data set.
We have a one-line CLI command to do that,
or a Python library, or a TypeScript library,
a lot of ways to do it.
But in one line, you upload the data set,
and then in another line, you can trigger a fine-tuning job.
And when triggering a fine-tuning job,
there's, again, a lot of different parameters
you can play around with.
The main ones are probably the model
that you want to fine-tune, the number of epochs.
And then we get into a lot of specifics, like batch size
and a lot of other stuff.
But generally, we have good default settings.
So for the most part, you can just come in, give us your model, give us your number of epochs, and launch a fine-tuning job.
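(A rough sketch of that two-step flow, upload the dataset and then launch the job, using the Together Python SDK. The method names, parameters, and base-model id here are best-effort assumptions for illustration, so check the current Together docs before relying on them.)

```python
from together import Together

client = Together()

# 1) Upload the prompt/response dataset written out earlier (JSON Lines).
uploaded = client.files.upload(file="math_word_problems.jsonl")

# 2) Launch a fine-tuning job against a base model.
job = client.fine_tuning.create(
    training_file=uploaded.id,
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative base model
    n_epochs=3,
)

# Poll this job id until it finishes, then run inference on the resulting model.
print(job.id)
```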
So after a fine-tuning job is done, you can then run inference on your model.
And we actually give you the opportunity for LoRA fine-tunes to either deploy it on a dedicated instance, so you can take your fine-tuned model and deploy it,
or you can download your model and go run it somewhere else, or run it on your own GPUs, or do whatever you want with it.
Or what's actually really cool is you can just
run it right away and pay for it usage-based.
So you don't have to do dedicated endpoint.
You could just run it right away, essentially
very similar to serverless.
And you can run it and see how it goes.
We see some customers will launch multiple fine-tuning jobs with different kind of parameters or different data or all of that.
But yeah, then after I have my fine-tuning job and I can run inference on it, a lot of the time we'll see customers run evals on it, obviously.
So you want to make sure and see how good this model is.
So in my case, you know, the Llama 3 8B model that I took and fine tuned on those 200,000 math problems, I tested it
on 1,000 math problems that the model had never seen.
And I tested Llama 3 8B, Llama 3 70B, and then
some frontier models like GPT-4.
And I was able to get my small Llama 3 8B model
to be better at answering these math questions than even GPT-4.
In terms of, so you had this, you
kind of ended up with a set of 200,000 examples.
But did you experiment at all with what
happens if I give it 100,000 or 50,000?
How much data ultimately do you need?
Yeah, that's a really good question.
Highly dependent on your use case,
highly dependent on the model you use, all of that.
But in general, so there's two types of fine tuning.
There's LoRA fine-tuning, and then there's full fine-tuning.
And in my case, I was doing full fine tuning.
So for something like full fine tuning,
you'll want a good amount of data.
I'd probably say 100,000 plus data points,
probably, for a full fine tune.
But the great part is, with a LoRA fine-tune, that actually
goes from hundreds of thousands to maybe thousands of examples
that you need to actually get a good model.
So yeah, that's what I would say.
I'd say if you don't have a lot of data,
you probably want to just try a LoRA fine-tune
and see how that goes.
Yeah, and the big difference for those that are not familiar
with the concept of LoRA fine-tuning is,
essentially you're restricting the fine-tuning
to small low-rank adapter matrices
versus the full set of weights.
So you're not adjusting all the base model weights,
just essentially a small set of added adapter weights.
So it's less intensive from a compute perspective,
going to take less time to do.
But it's probably harder to do a drastic behavioral change
than something like a full fine tune.
How long does that process take?
I mean, obviously, it depends on the data set and so forth.
But in your case, what was the turnaround time
on that fine tune job?
Yeah, great question.
In my case, depending on the number of epochs and everything,
I think it ranged from two hours to 14 hours.
Yeah.
OK, well, not too bad.
I mean, I remember back in the early days of GPT-3,
you're talking multiple days to do a fine-tune job.
So if we're talking a handful of hours, that's not bad.
Yeah, not bad at all.
But yeah, then if you have millions or tens of millions
of data points, then yeah, you're
definitely looking at a factor of days instead of hours.
Yeah, absolutely.
There's all these different techniques around like when you're building applications
in terms of getting like sort of the reliability
and the performance that you want out of the application.
And a lot of them depend on doing inference multiple times.
Like if I'm doing something where I'm producing an output
and maybe I want another call to judge the output
and I do some sort of inference loop, a series of reflection,
or the example that you talked about earlier, where maybe I'm calling
multiple models and then have another model sort of summarize it in like a
sort of parent-child relationship type of pattern.
Like, how do you, from a cost perspective, like how do you sort of balance
reliability performance without completely you know, completely blowing
apart your token budget?
Yeah, that's yeah, asking some difficult questions over here.
I think that's like that's a very hard problem to solve in general,
right? And generally what I like to tell people is like
start as simple as you can.
Don't try to overcomplicate it before you don't need to.
Sometimes you just use a small LLM and you send it a single thing and you're like, oh
my God, that's perfect.
It's doing exactly what I want it to do.
In that case, do some evals, which I feel like evals are just underrated in general.
If you have a serious LLM app, you need evals.
You absolutely need to.
Yeah, I mean, you can't be vibe checking production apps.
Oh yeah.
What do you see from an eval perspective,
like either in your own projects
or when you're talking to customers and things like that,
what are people doing?
Are they building their own stuff?
Are they using platforms like BrainTrust and others
to do evals?
Yeah, I think a surprising amount of people
do their own stuff, which I found interesting.
And I think it's because different people are testing
very different things.
And sometimes it's very custom based
on what that specific person needs.
Like a good example, I think, is I built this app called
LlamaCoder.
And it's basically like a poor man's
v0 or Lovable or Bolt.
And I'm starting to build evals for it.
I launched it, it started to get popular,
and so I was like, okay, I want to try
and take this a little bit more seriously
and add evals to it.
But evals are really hard to add in that case.
So this is an app where you put in a prompt
and you get an app.
But how do you judge it?
How do you judge that app's output?
And so I started to think about that and think like,
oh, maybe one potential way to do that
would be to write a script that spins up
like a headless browser and types in a prompt
and waits for the app to finish
and then takes a screenshot of the page
and then sends it to a vision model
to try to judge it in some way.
And so anyway, you start to get very, very specific
and I don't think there's any eval company
that's built like a UI for something like that
necessarily. So yeah, we see a lot of people build their own.
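(A rough sketch of that kind of bespoke eval: drive the app with a headless browser, screenshot the result, and ask a vision model to grade it. This assumes Playwright and the Together Python SDK; the URL, selectors, wait time, and model id are illustrative placeholders.)

```python
# pip install playwright together && playwright install chromium
import base64
from playwright.sync_api import sync_playwright
from together import Together

client = Together()
VISION_MODEL = "meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo"  # illustrative

def screenshot_generated_app(prompt: str) -> bytes:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000")    # the app under test (placeholder)
        page.fill("textarea", prompt)          # illustrative selectors
        page.click("button[type=submit]")
        page.wait_for_timeout(60_000)          # crude wait for generation to finish
        shot = page.screenshot(full_page=True)
        browser.close()
        return shot

def grade(prompt: str, png: bytes) -> str:
    data_url = "data:image/png;base64," + base64.b64encode(png).decode()
    resp = client.chat.completions.create(
        model=VISION_MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Rate 1-10 how well this screenshot fulfills the request: {prompt}"},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(grade("a pomodoro timer app", screenshot_generated_app("a pomodoro timer app")))
```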
But also, if you're just testing more
traditional chat kind of applications, where you want to do
LLM-as-a-judge, all that kind of stuff, I think platforms like
Braintrust are great. They're a good partner of ours. And I use
Braintrust for some of my chat apps as well.
Yeah, that's consistent with what I've seen, too, also talking to people in the industry is,
I think ultimately a lot of this stuff ends up being somewhat custom.
You know, even when you're starting to build out, I think, more complicated like workflows or agents,
like a lot of times people end up going away from some of the frameworks that are available
or using very lightweight version of the frameworks
because they want to be able to have more control essentially.
And they end up doing a lot of just like bespoke custom work
just I think in part due to the level of immaturity
of all the technology.
It's kind of just part and parcel
with where the tech stack is currently.
100%.
In terms of what you're seeing, you've been working in the space for what seems like a
long time now, relative to how long a lot of people have, what are some of the
most interesting use cases and uses of Together that you've seen from customers?
Yeah, that's a good question.
Most interesting use cases of Together.
I mean, maybe I could start with some of the more common ones.
A lot of folks use us for things like summarization.
A lot of people use us for chat apps.
We power DuckDuckGo's online chat.
We power this company called Zomato,
which is one of the leading delivery companies in India.
We power their kind of chat bot agent.
So we do a lot of these kind of chat bots, like TripAdvisor, we help them with a lot of stuff.
Yeah, we also power summarization for a lot of these different companies.
And then we just have a lot of these really unique customers. Those are kind of the customers
using our inference engine, using serverless or dedicated.
But then we have a lot of really interesting customers
on the GPU side and people kind of doing all sorts of things.
We have video game companies that are using us.
Latitude has built a bunch of video games,
leveraged our inference service.
Some audio companies like Cartesia.
Cartesia is a really great company
that has a fantastic model.
And we help serve their model.
So we do inference for them, essentially,
on our GPU clusters.
Like I mentioned, some video companies like Pica,
where we help them not only train their model on our GPUs,
but also run inference on it.
Yeah.
You mentioned in terms of use cases,
people doing things like summarization, chatbots.
These are bread-and-butter use cases for generative AI.
And I think these are not necessarily the use cases
that get a lot of headlines, right? These are actual practical use cases.
These are real-world examples that are very, very useful applications of generative
AI. Whereas a lot of the headlines, I think, end up getting dedicated to, I would say, a little bit more out-there types of applications that look very impressive, but are they
really going to see the light of day in a production scenario for a while? Do you think
that we end up focusing a little bit too much on the hype side of AI, and then we lose a
little bit of the practicality of what companies are actually being successful with?
I think so for sure. If you go on Twitter and you see what people are talking about there versus
what enterprise companies are actually shipping in production, a lot of the time it's two very different things.
But I think that's also what's a little bit cool about AI,
is that there's always bleeding edge stuff that comes out
that's interesting to look into.
And we have seen people go a little bit beyond
the traditional chat bot summarization stuff.
I mean, we see a lot of people do text extraction.
We see people do a lot of just image generation
for different use cases, like personalization, a lot of
code generation too, actually, I forgot to mention that one. A lot of code generation.
Yeah, people just do that. And a lot of people are also using, I guess, one of our
newer products. We acquired a company called CodeSandbox, and we have a Code
Sandbox SDK now, where people can spin up these mini VMs.
A lot of the Lovable, Bolt-type companies
use it to be able to run an entire Next.js app, basically,
on these VMs.
So we have folks using us for that, too.
Yeah, I think that when it comes to writing code,
in a lot of ways, software engineering
has become the tip of the spear for a lot of these AI use
cases, which I think maybe is surprising to some.
But I think potentially one of the reasons why
AI has been so successful there is, one,
there's inherently a human in a loop that's
going to be evaluating the outcome.
So if a mistake is made, presumably it
gets caught by a person at some point.
But the other thing is, from tying it back
to what we were talking about earlier,
from an eval test perspective, there's
a lot of existing stuff to test against.
I think one of the challenges with the really general use
of a general model is
what is the eval test case? If someone can ask anything in the world to this thing,
how do you test whether it's able to answer that correctly? And the advantage of historical
predictive AI or purpose-built models is that you have a very good understanding when you have that
model of what the inputs and outputs should be, So it's easier to build like a test set to test and evaluate it.
And when it comes to code, like if I have, you know, some sort of agent that's going to do PR reviews,
well, there is probably thousands of human PR reviews in your organization that become essentially your eval test set for that.
100%. Yeah, that's a really good way to think about it.
What are, for you and your team, what kind of projects
are you guys focused on?
Yeah, so I run the DevRel team here at Together.
So we have a few different focuses.
I think on one side, we really just
care about the developer experience of the product.
And so we're really involved in being customer zero,
trying to test out everything before it goes out,
trying to relay information from the community
back to the product teams, trying to really think about,
OK, new member onboarding: someone signs up
for the website, what really happens?
How do we basically take developers down the happy path and
get the time to first value, which in our case is making their first API call,
as fast as possible after they sign up? So we care a lot about that
and think a lot about that. Our docs as well. We own our documentation and we'll continue
to make that better and ship more guides and stuff like that.
So that's from the DX point of view.
And then from the content point of view, we'll put out a bunch of blog posts and videos
and go to conferences and try to get the word out.
And then the last point is we just try to be customers ourselves.
We try to build apps like our customers do. And that leads to both some great feedback, I think,
for our engineering team, but also is a cool, like,
growth engine.
Like, we'll just release interesting apps and make them
free and fully open source and have all these people come in
and developers love looking at code.
So they'll check them out, they'll clone them, and
obviously to run any of the projects,
you need a Together API key
to call our various open source models.
So yeah, a big part of what we do is just try to build
really cool, impressive demos to highlight
cool use cases from Together.
And if I'm using a model hosted by Together,
then what is sort of the application experience
of programming against that?
Is it similar or comparable to using something like OpenAI
directly, or going through some sort of
GenAI framework, like LangChain or LlamaIndex
or something like that?
Yeah, yeah, I would say it's similar to all of that.
You can actually use our API
through the OpenAI SDK by changing like two lines of code.
And you can use our API with LlamaIndex or LangChain
or LangGraph or CrewAI, or, you know,
we're compatible basically with every single
major framework out there.
And then we also have our own SDKs.
You can npm install or pip install together,
and in like five, six lines of code, you can get up and running.
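(The "change two lines" pattern looks roughly like this: point the OpenAI SDK at Together's OpenAI-compatible endpoint. The base URL and model id shown are assumptions to verify against the docs.)

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],   # line 1: your Together key
    base_url="https://api.together.xyz/v1",   # line 2: Together's endpoint (assumed)
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative model id
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```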
OK.
And then can you talk a little bit about your Agent Recipes
project, which, I forget,
you came up with a little while ago.
I thought it was a really, really cool project
to show how some of these sort of agent design patterns
essentially work.
Yeah, absolutely.
So Agent Recipes, it was actually
inspired by an article that Anthropic did
on how to build agents.
And I thought they did it really well.
Yeah, that's a great article.
It's an amazing article. I highly recommend everybody read it.
I talked to the Anthropic team a few weeks back and they told me they're working on a version 2,
which I'm really excited about, or part 2 to that blog post. But anyway, I was kind of inspired by
that and I liked how they broke down these different agents, or really mostly workflows,
into different recipes
that they've been seeing their customers use.
And so I thought I'd just go a little bit of a step further
and not only take that kind of use case,
but also try to give people a little bit more information
on what are some good use cases for this specific recipe,
and also write code.
So I wrote TypeScript and Python code.
Me and Zan on my team worked on this.
And yeah, we tried to build a really cool resource
for people to check out when building agents
and building LLM apps,
people to go and literally copy paste code
for different patterns that we've seen work well.
And we're actually expanding on it even more
and releasing a few kind of recipes of our own pretty soon.
Oh, cool.
Yeah, I think when you mentioned
they're a little bit more like workflows
than fully agentic.
I think in reality, like, going back
to what we were talking about earlier of putting
some of these things actually into production,
I think when it comes to agents, at least what I'm seeing
is most things that are hitting production
are probably more akin to workflows than they are to,
like, I just give this thing access to AWS,
and it goes and optimizes and fixes everything.
I think we're a long way from that level of autonomy.
And a lot of times, I think a good
application of an agent is like, we have a playbook that some person follows today. And if I have a playbook already defined,
then that's a very good application of essentially doing something agentic, but it might look a little bit more like a workflow. And
that gives you a little bit more determinism over something that is inherently
stochastic in nature in terms of operating
with these probabilistic models.
100%, yeah, that's what we've been seeing as well.
Workflows is really where you can get a lot of value out
of having these LLMs do a lot of cool things.
But also, you're kind of restricting their output
a little bit more and giving them more guardrails.
And as a result, it's just so much more reliable than just,
like you said, giving an agent just random access
to a ton of different tools.
Yeah, and I think, too, to your point earlier about where
you were saying why people make the mistake of trying
to give too many instructions in a prompt,
by thinking about sort of decomposing a problem into a multi-agent,
whether it's a workflow or whatever pattern,
it's a natural framework for helping people shrink
the problem in the prompt that they're getting.
This particular node in this graph or in this workflow,
the only thing I want you to do is
produce a summary of this output
from another part of the process or something like that.
And that just increases your reliability
and also makes your testing easier.
100%.
Yeah, I've experimented with it a lot for this project
that we just launched a couple of days ago called Together
Chat.
And yeah, we ended up going with more of a workflow approach
for that one with a router.
And the way it works is you have this kind of like chat, it's kind of like a chatbot, right?
But you send a message and the router will intelligently try to figure out like
whether it should just answer the question, whether it should use a search API to try to look it up,
or whether it should generate an image because you can also generate images in the chat.
And so I found this kind of approach to work generally really well.
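(A minimal sketch of that router-style workflow: one cheap classification call decides whether to answer directly, search, or generate an image. Model ids are illustrative, and the search/image helpers are stubs standing in for real tool calls.)

```python
from together import Together

client = Together()
ROUTER_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"  # illustrative
ANSWER_MODEL = "meta-llama/Llama-3.3-70B-Instruct-Turbo"      # illustrative

def _chat(model: str, content: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": content}]
    )
    return resp.choices[0].message.content

def route(message: str) -> str:
    # Small, fast model classifies the request into one of three routes.
    label = _chat(
        ROUTER_MODEL,
        "Classify the user message as exactly one of ANSWER, SEARCH, or IMAGE. "
        "Reply with only the label.\n\nMessage: " + message,
    ).strip().upper()
    return label if label in {"ANSWER", "SEARCH", "IMAGE"} else "ANSWER"

def search_and_answer(message: str) -> str:
    return "[stub] would call a search API here, then answer with citations"

def generate_image(message: str) -> str:
    return "[stub] would call an image-generation model here"

def handle(message: str) -> str:
    choice = route(message)
    if choice == "SEARCH":
        return search_and_answer(message)
    if choice == "IMAGE":
        return generate_image(message)
    return _chat(ANSWER_MODEL, message)

print(handle("What's the capital of France?"))
```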
Yeah, I think that's part,
that's sort of where the agentic nature is,
is the decision around what tool or data to access,
not necessarily in the full sort of keys of the kingdom
and the execution of that once I figured out
what tools I'm going to use.
In terms of, you've built a lot of AI apps
over the last couple of years.
What are some of the things that you've learned along the way
that are typical mistakes that you
see as people who are just starting to enter,
building some of these types of applications?
I think most people don't launch fast enough.
I think that's something I've seen over and over and over again,
where people try to overcomplicate problems a little bit.
And I think the great part,
I think for a lot of these apps, what I really like is that it's very quick.
I spend one week, two weeks at most on any of them, and I launch them,
and sometimes I'll get millions of people
to go check them out and say they work great
and all this stuff.
When really, I didn't work on this thing
with a team of five people for 10 months.
This is just a really quick one or two week thing.
And then you could just add evals in and just build up
to where you want to, or just listen to your users
and what they complain about.
And so I feel like for a lot of AI apps,
it makes sense to launch them that way,
is to get them to 90% very quickly,
and launch and see how it goes, or do early access,
or beta, or whatever it is.
So yeah, in terms of the launching,
that's one thing I've seen.
The other thing I've seen is that there's
no shortage of cool
stuff to build in the AI space right now. If you're a builder,
I feel like this is one of the best times to be building cool things.
I keep a Notion doc of ideas that I want to build, and it's at like 70-something right now,
and I have a top 15 right now that I'm working towards trying to get done, to ship all of them this year.
But yeah, it's just a really exciting time, and you can do so much with so little, I think.
So many of these apps that I build are just one API endpoint or two API
endpoints, or sometimes it's a chain of two or three. But I've found it fascinating how much
you can build with not a lot of API calls
and just a very simple app.
Yeah, I mean, I think you're
doing a great service to these models
with a lot of demonstrations
of the art of the possible.
I would love to see businesses do a little bit more
of this themselves.
I think a lot of times, when I talk to businesses,
especially enterprises, they're really hyper-focused
on trying to figure out the killer use case
that's gonna deliver massive ROI to their business,
and then they invest a tremendous amount of time.
But you can make really meaningful impact
by focusing on, you know,
what are my internal knowledge workers doing today
that feels like, you know, kind of a waste of process
or a waste of a person's time around collecting
a bunch of information from different places.
Like every business has these things.
If you just went sort of piece by piece
of automating some of those, you're going to learn a lot
and you're going to get, I think ROI pretty quickly.
And it doesn't require a 50 person engineering team
and a ton of, you know, like massive testing
to get something essentially that's, you know, 90% good.
Yeah, absolutely.
You know, going back to Together, I guess, what's on the horizon for you guys from a product standpoint?
You've raised this large amount of money now.
What's next for you?
Yeah, a lot of things.
Well, one thing, like I said, is we just released Together Chat, which is kind
of like our first official consumer app as part of Together.
So we're kind of, I would say, dabbling a little bit in the consumer space and seeing
how that goes.
We're also working on a lot of things across many different products.
So one thing that we have coming up is called Together Code Interpreter. Like I said before,
one of the big use cases of Together is just generating code from a lot of our LLMs.
And we're going to let people not only generate code, but actually run code and do both with
like the same API key, essentially, or the same SDK. So that's something that's coming that I'm really excited about.
Continuing to just host amazing open source models
as they come up, like better kernels, better inference stack,
just being able to just be faster.
And we're seeing a lot of companies
ask for Blackwells now, like B200s.
So we're kind of moving from Hopper to Blackwell and we're just getting like, you know,
thousands and thousands of Blackwells coming in very soon that I'm excited about
because that'll be an improvement across the board, you know, for our customers that do inference and training,
but also for our own like models to be able to run even faster on this great hardware.
Yeah, we're improving our fine tuning service
to make it a lot easier.
For our GPU cluster service, right now,
you have to talk to our sales team
and try to do things that way.
But we're releasing instant GPU clusters,
which I think is waitlist only right now.
But we'll hopefully be releasing somewhat soon
where you can just sign up for a platform
and get access to GPUs right away,
or very, very quick in a self-serve manner.
So yeah, kind of just trying to push on all fronts
and just trying to be this,
our branding is like we're the AI acceleration cloud
and just be this basically end-to-end AI cloud for whatever you're trying to do, from inference to fine-tuning to training, and try to run your entire stack on Together.
Awesome. Well, Hassan, thanks so much for coming back. This was great.
Yeah, thanks for having me.
Alright, cheers.