Software Huddle - AI Agents and Long Context Windows with Mark Huang
Episode Date: June 18, 2024
Today we have Mark Huang on the show. Mark has previously held roles in Data Science and ML at companies like Box and Splunk and is now the co-founder and chief architect of Gradient, an enterprise AI platform to build and deploy autonomous assistants. In our chat, we get into some of the stuff he’s seeing around autonomous AI agents and why people are so excited about that space. Mark and his team have also recently been working on a project to extend the Llama 3 context window. They were able to extend the model from 8K tokens all the way to 1 million through a technique called theta scaling. He walks us through the details of this project and how longer context windows will impact the types of use cases we can serve with LLMs. Follow Mark: https://x.com/markatgradient Follow Sean: https://x.com/seanfalconer
Transcript
At Gradient, we are in the business of enterprise AI automation.
When you start to think about AI systems, in particular, like large language models,
there's also a lot to think through from like a privacy security perspective.
What exactly counts as an autonomous agent?
A non-deterministic executed path. It means that you need to have some sort of mechanism or an intelligence that can interpret the intent
of the user and understand the actions and the side effects that occur in the environment to
redirect the new set of execution traces. You're maybe giving some sort of high level instruction
and then it's making decisions along the way based on what's available to determine which path to take that leads to the outcome that you're looking for.
So like the instruction overall is probably some high-level goal, and the agent is tasked
with achieving that, and you will grant it abilities to interact with the environment.
Hey everyone, Sean here, and today we have Mark Huang on the show. Mark previously held roles in data science and machine learning at companies like Box and Splunk,
and is now the co-founder and chief architect at Gradient,
an enterprise AI platform to build and deploy autonomous assistants.
In our chat, we get into some of the stuff he's seeing around autonomous AI agents
and why people are so excited about that space.
Mark and his team have also recently been working on a project to extend the Llama 3 context window. They were able to extend the model from 8,000 tokens all the way to 1 million through a technique called theta scaling. Mark walks me through the details of
this project and how longer context windows will impact the types of use cases that we can serve
with LLMs. All right, as always, if you ever have any suggestions or comments about the show,
please reach out to me or Alex.
And with that said, let's get over to my conversation with Mark.
Mark, welcome to Software Huddle.
Hi.
Nice to meet you, Sean.
It's a pleasure being on here.
And, you know, I'm a big fan of your show.
Awesome.
Well, it's good to have a fan.
You know, sometimes you're like when you're creating a lot of this type of content, you're
like anybody out there listening besides my mom?
So it's good that folks are paying attention to this.
So you're the co-founder and chief architect at Gradient.
So I wanted to start off by just learning, what's sort of the story behind Gradient?
What is it? Why did you start it? Why become a founder?
Yeah, so at Gradient, we are in the business of
enterprise AI automation. And the way that we do that is we leverage agents in order to
provide flexible workflows that enable people to automate almost any sort of operational business task. And with that, we try to really address all the pain points
that you would previously encounter with robotic process automation, if you're familiar with that,
mostly because language models are just so powerful
and we're able to actually break through the common patterns
that existed before.
So you say enterprise AI, like, what does that mean, I guess, as someone who's just learning
about this stuff? Like, how is that different, maybe than if I'm essentially not in the enterprise
space? What are the certain things that you have to focus on? Because this is enterprise versus if
you're doing this for, you know, mid-market or, you know, small businesses?
Yeah, so I distinguish it on a few levels. What I've really noticed about the long tail of developer-facing applications and products out there is that they're really easy to get started with. And they're actually wonderful for the democratization and usage of AI and having people learn about it. But where they really fall short in the enterprise are the aspects of
being able to grow and ensure security and data governance and also with respect to the
interoperability with integrations, and just the reliability, where, I would effectively call it, you're almost beholden to the companies and users using your product there. Right?
Yeah, I mean, there's a ton of stakeholders
involved in enterprise. And I think also, like, when you start to think about AI systems, in
particular, like large language models, there's also a lot to think through from like a privacy security perspective.
In any enterprise,
the sort of requirements generally
for privacy and security,
like the CISO being involved
is going to be probably a higher bar
than you might meet at other types of companies.
So that's also something that you have to think through
if you're building enterprise AI,
whatever, like architecture or applications
and trying to sell them to enterprise.
Are you going to be able to essentially address the concerns of the CISO and the other people involved from a security perspective, as well as some of these other things that
you're talking about around like reliability and, you know, your build, what's your throughput,
how resilient are your systems, and so forth?
Yeah, for sure.
And, you know, this is something I never really realized before I had to start selling to, you know, management. Largely, our sales process and the way that we operate is we have to sell from the top down. So in a way, we're bringing all these stakeholders along with us. We're becoming the medium for people understanding where their
AI strategy should be, and also bringing the execs all to the table to understand like,
where's our value add there? And how do we build this relationship between all the teams together?
Do you see this as being particularly tricky, just because it's such a new space,
and many of the people you're probably talking to are interested, but maybe they're not experts.
They don't know exactly what they even want or need.
Yeah, I'd be lying to you if I told you it was easy.
I think, as with all things, in a way, it's a really noisy market, and it's noisy because there's a lot of disruption happening. But that's also where all the progress is, and a lot of it comes down to being really good as a communicator and having the type of empathy for the stakeholder that you're talking to.
And that requires a lot of intentionality.
And to be honest, like something that I was never used to.
Like, I'm a builder at heart. I'm a technologist, and what I wanted to do is always build incredible product. But to a certain extent, I need to build product that means something to somebody. I need to build a product that is intentional enough for them to understand how they leverage it. So I think that's probably the number one thing I've had to learn
throughout this process. Yeah. And I think that's hard. That's probably one of the hardest things
for, I would think anyway, even speaking from my own perspective, like when I went from essentially
engineer academic to, uh, founding a company was like, how do you learn the business side of the
business and get good at being comfortable
standing up and pitching what you're talking about, but not being scared to do that and also
not coming across as a used car salesman trying to build a product. Those are hard lessons to
learn, but starting your own company is certainly a forcing function for having to learn that stuff, or you're just not going to survive.
Yeah, I 100% agree. And, you know, I'm super glad that you can also empathize with the situation there, because, yeah, you're taking a lot of shots on goal to figure out how to position yourself, too, right? And something I like to tell
a lot of people is like,
you know, VCs, they like to create all these market maps. And it's not really in service of
just categorizing companies, it's actually for people to be able to create an understanding of
how your product fits into the ecosystem and how that actually delivers usefulness or value towards the end user.
And without that, right, it just seems like a bunch of chaos,
because especially in a space where, like, 100 companies are getting started every week, I feel like, it's hard to figure out what someone even does. Right? Like, that's probably the first question you probably heard as a founder all the time, where someone comes over to you and the first question they ask is, like, can you help me understand what you do? And as a person who's head down building the product, understanding it so intimately yourself, having to explain that to someone with no context, like, that's a skill in itself, right? So I think that's also important, too, because it does build a little bit of conviction in yourself when you're the founder trying to grow your company.
The whole, you know, give me the elevator pitch, give me the one-liner, like, that's super, super important.
Yeah, absolutely. I think, for me, I was kind of originally forced to develop some of those skills when I was doing
my PhD, because we used to bring, you know, guests into the lab all the time, and we'd stand up in a line, and our professor would be like, you know, tell them what you're working on in, like, you know, 30 seconds or less. So you're trying to consolidate, like, four years of research and distill it down into, like, 30 seconds. And the people that understood the project they were working on the most were able to distill and condense it in a way that was understandable to anyone the best, because it really comes from getting that intimate familiarity with it, and then figuring out how to translate it to different audiences. And that's a real, you know, skill set that's super, super important for
all people to sort of develop over time.
Yeah, I even tell my employees, to a certain extent, with respect to, like, how do you grow and how do you become, you know, the next iteration in your career and increase your impact.
We always say, bring others along with you.
The people who are absolutely the best at that,
they're able to achieve great things.
And I parallel that almost even to the same respect
as to how, if you ever heard about the reasons
why OpenAI was able to beat Google
in training these large models, it's the fact that, unfortunately, within the organizational structure of Google, it's hard to bring others along with you.
And if you could do that, and you can just raise your focus on helping others understand what you need and do, that's a powerful instrument in itself.
Yeah, absolutely.
All right, so I want to talk agents. So I think
like autonomous agents, I feel like they're like all the rage right now. Like, you know, last year,
maybe it was about RAG. Everyone was talking about RAG. Now we're talking about, you know,
agents and there's a lot of articles about it and how this is sort of the future and so on.
So what exactly counts as an autonomous agent? And what are some of the things someone could actually use an autonomous agent for?
Yeah, so the modern definition of what an autonomous agent is, is actually an executed path, a non-deterministic
executed path. And by that, it means that you need to have some sort of mechanism or
an intelligence that can interpret the intent of the user and understand the actions and the
side effects that occur in the environment to redirect the new set of execution traces.
So it's similar to a graph, right?
Like I've seen a lot of people create that pattern
for creating agents.
And a lot of it comes with being able
to handle out of domain scenarios.
Okay.
So this is in some ways maybe a replacement for something that we might think of
as a finite state machine
where it's a deterministic flow in that case.
We're specifically telling some sort of automation to take this path
under these conditions and stuff like that.
In the case of an autonomous agent, you're maybe giving some sort of high level instruction
and then it's making decisions along the way based on what's available to determine which
path to take that leads to the outcome that you're looking for.
Yeah.
If you parallel that towards self-driving cars to a certain extent, right?
Like, what has been the hardest part about self-driving? It has been the ability to plan, and determining plans just means, as you said,
taking high level, typically vague and ambiguous instructions to satisfy that goal. So like the
instruction overall is probably some high level goal
and the agent is tasked with achieving that
and you will grant it abilities
to interact with the environment.
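To make that contrast with a finite state machine concrete, here's a minimal sketch of the loop being described, with hypothetical helper names (llm_plan_next_step, a tools dict) standing in for a real model call and real integrations; it is an illustration of the idea, not Gradient's implementation:

```python
# Hypothetical sketch of a non-deterministic agent loop. An FSM would
# hard-code the next transition; here a model call chooses it at runtime
# based on what has been observed so far.

def llm_plan_next_step(goal, history, tools):
    """Placeholder for a real LLM call. Would prompt a model with the goal,
    the observation history, and the tool descriptions, and return either
    (tool_name, args) or None when it judges the goal satisfied."""
    raise NotImplementedError

def run_agent(goal, tools, max_steps=10):
    history = []
    for _ in range(max_steps):
        step = llm_plan_next_step(goal, history, tools)
        if step is None:            # model decides the goal is achieved
            break
        tool_name, args = step
        observation = tools[tool_name](**args)          # side effect in the environment
        history.append((tool_name, args, observation))  # feeds the next decision
    return history
```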
So all the different applications of it,
actually, I think, you know, right now,
the really interesting ones are all related
with just software applications. But, you know, there are parallels to self-driving cars, to robotics, and then various systems that are, you know, almost physical to a certain extent.
And, you know, we're still fairly early in this space.
But where do you, besides software systems, like where is some of the initial momentum and excitement around this?
Is it more sort of business use cases
or are you seeing things
that are potentially consumer facing?
Like, hey, I know I want to,
I don't know, book a restaurant on this night
or I want to book a vacation with these parameters.
It needs to be tropical.
And then like, hey, autonomous agent,
go off and take care of this for me.
Yeah, I think on both sides, I'm super excited,
either consumer facing or business facing
for the applications there,
particularly for the business facing aspects of stuff.
It'll completely change the way we interface with AI applications today.
People are really used to chatbots.
I think ChatGPT conceptualized probably one of the most seamless experiences with embedding AI.
But for agents, people will get used to asynchronous workflows where you're not actually waiting for the response back.
And you may not actually care for the latency component there,
but you're sending something out to achieve what you want.
And it would be great if I could just tell something to schedule out my entire day,
to a certain extent, book flights, figure out, you know, day two of my vacation, and all of that, because we're taking away and granting access to a lot of, you know, tedious things that maybe we don't really care to do because they're pretty simple. And I do think it does raise the bar, as, you know, a society, on what are the things that we're going to be doing with all our free time, right?
Right, yeah.
How are you going to work alongside AI that way?
Yeah.
So most of my free time is taken up with a two and a four-year-old.
So I'm not sure.
I'm still trying to figure out how AI will help me there.
But if there's some agent out there that can help me, I will be very,
very interested in and pay top dollar for. But in terms of the fact that latency potentially matters less, you know, if I'm chatting with something like ChatGPT, I'm expecting almost, you know, sort of real-time responses, right? But if I'm saying, like, go figure out my vacation, and even if it took 24 hours and came back with something that was really, really good, I wouldn't really care that it took 24 hours to do that. So how does the fact that latency is less of an issue change the current operating patterns that people have?
It does open up the door for strategies and applications that may require more compute time and handle much more context. Then the focus will shift
much more into those aspects.
And, you know, you won't actually be delving into so much research specifically on cutting
down, you know, the matrix multiplication speeds and maybe focusing on other aspects
of research and development.
Yeah.
So I think that's another area, besides agents, that now is a big topic of conversation: this growing size of context windows.
Like there's been a ton of advancement in the last year.
Like the length that was like bleeding edge a year ago is now sort of table
stakes.
So like,
how did that happen?
Like,
were there particular things that happened in the research, or even at the companies that are leading the charge with LLMs and AI generally?
Something that allowed them to sort of break the barrier on what we can do in terms of the size of a context window?
Yeah, there have been a lot of different technological innovations, and algorithmic ones as well.
To my mind, probably the first things that kicked a lot of this off, I'll separate them between the lossless algorithms and then the lossy algorithms.
So for the lossless algorithms, the first that really came about was flash attention.
Probably everybody's heard of it.
And that's what really pushed the context window longer than what we initially were used to, mostly because they reworked the way the computation of the attention mechanism is done, through different chunks, little batches, so that you only actually grab the set of attention that you need at the time when you need to do the backpropagation, or when you actually have to do it at inference time.
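As a rough illustration of that chunked computation, here is an online-softmax sketch in plain NumPy, for a single query vector and with the usual 1/sqrt(d) scaling omitted. This is only the core recurrence; real FlashAttention also tiles over queries and fuses everything into one GPU kernel:

```python
import numpy as np

def chunked_attention(q, K, V, chunk=128):
    """Attention output for one query without materializing the full score row.

    Streams over K/V in chunks, keeping a running max (m) and running softmax
    denominator, so memory stays O(chunk) instead of O(sequence length)."""
    m, denom, out = -np.inf, 0.0, np.zeros(V.shape[1])
    for i in range(0, len(K), chunk):
        s = K[i:i + chunk] @ q               # scores for this chunk only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)            # rescale earlier accumulators
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        out = out * scale + p @ V[i:i + chunk]
        m = m_new
    return out / denom

# Sanity check against the naive full-matrix computation.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(1024, 64)), rng.normal(size=(1024, 32))
scores = K @ q
w = np.exp(scores - scores.max()); w /= w.sum()
assert np.allclose(chunked_attention(q, K, V), w @ V)
```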
The second sort of class of things that have been happening is just, you know, NVIDIA keeps
on spending money to increase the VRAM on a lot of these chips. So that just gives you what you need on that side to be able to actually support these models.
And then I'll talk maybe briefly on the lossy approaches.
You have quantization, which is really simple,
representing a number from a particular precision
to a smaller unit of precision.
And then that comes coupled though
with model degradation at a certain stage,
but there's research to kind of figure out
how you can ensure sparsity to retain that model quality.
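For intuition, here's a minimal per-tensor int8 example of what moving to a smaller unit of precision means; production schemes (per-channel scales, 4-bit formats, outlier handling) are more sophisticated than this sketch:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 -> int8 plus one scale."""
    scale = np.abs(w).max() / 127.0               # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale           # lossy: rounding error remains

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()      # the degradation Mark mentions
print(f"4x smaller than float32, max round-trip error: {err:.5f}")
```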
And those are sort of the three high-level innovations that have really set forth and
enabled us to increase the context length.
And those continue to be actually the same things that people try to innovate on today.
So there's really no lack of minds trying to work on these problems.
Yeah, so it sounds like there's been both sort of innovation on the software side as well as the hardware side to help make these context windows larger. By having a larger context window, does that help keep models smaller?
by having a larger context window does that help keep models smaller?
You know, that is a good question, because I think there are two things that are happening there.
In terms of keeping models smaller, I think that's more so that you want the smallest model that can achieve the tasks that you intend it to perform. But the main blocker for that
is actually the amount of tokens
that you train it on, in my opinion.
Because if you really think about the scaling laws
or even the Chinchilla paper, like, they're trying to find the compute optimal.
What's the compute optimal training
that you need, right?
At a particular model size.
And then also how many tokens you need to train a model on for that.
People have pushed far beyond those heuristics.
And if you look at Llama 3, it was trained on 15 trillion tokens.
And that has enabled you to have a more powerful model at a smaller size.
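To put numbers on that, here's a back-of-the-envelope comparison, using the common reading of the Chinchilla compute-optimal frontier as roughly 20 training tokens per parameter (a heuristic, not an exact law):

```python
params = 8e9                     # Llama 3 8B
chinchilla_tokens = 20 * params  # ~160B tokens would be "compute optimal"
actual_tokens = 15e12            # Llama 3 was trained on ~15T tokens
print(actual_tokens / chinchilla_tokens)  # ~94x past the heuristic
```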
And then the context window, if you increase it there, now you're unlocking, like, hey, now you have a more powerful model at a smaller size that can also handle a lot of tokens at runtime. And people are just pushing the parameters to be as large as possible, because as you increase the parameters, it's actually more sample efficient with the tokens. So in a way, you're trading off the runtime costs of serving this thing in service of having something that is incredibly powerful.
And then how do you, like, how do people typically evaluate
the size of the context window
and, like, what's actually working?
It's still an area of open research.
My company, Gradient,
we really are trying to, you know, do a lot of this work and collaborate in the open on it.
In particular, like,
NVIDIA has their benchmark
suite that we really like. For the, you know, the uninitiated, the first evaluation is the needle in the haystack evaluation that Greg Kamradt created. And that is benchmarking a model's ability to retrieve. It's basically pass-key retrieval, where you're retrieving a key, or sorry, you're retrieving a value given a key, that's just sent in an ocean of tokens. And then you ablate where that key actually is located. And then the RULER evals are even more comprehensive, where they include
both needle in the haystack and then three other different types of evaluations. One is variable
tracking. So you're tracking state of a variable over all of these large context windows. Another
is aggregation, wherein you want to find a variable and then you want to do an aggregation
on it, like counting the number of occurrences or sums. And then finally, a distraction
evaluation, where it takes SQuAD and then generates a synthetic data set in which you throw the model distracting context and still make it answer, to see if it can properly answer a question.
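A minimal sketch of how a pass-key prompt like that gets constructed (the filler text and phrasing here are invented for illustration; a real harness counts tokens and sweeps the depth systematically):

```python
def needle_prompt(n_filler_sentences, needle, depth):
    """Bury one fact at a relative depth in [0, 1] inside filler text,
    then ask for it back -- the pass-key retrieval setup described above."""
    filler = ["The grass is green.", "The sky is blue."] * (n_filler_sentences // 2)
    pos = int(depth * len(filler))        # the position you ablate across runs
    filler.insert(pos, needle)
    haystack = " ".join(filler)
    return f"{haystack}\n\nWhat is the magic number? Answer with the number only."

prompt = needle_prompt(2000, "The magic number is 74913.", depth=0.35)
# Scoring: does the model's completion contain "74913"?
```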
And these are really comprehensive.
Like, don't get me wrong, but there are even more nuanced aspects that go beyond what I would consider these evals to be looking at, which touch much more on multi-hop reasoning and planning capabilities, because those are still very short context lengths.
Yeah. The needle in a haystack one, I think, I mean, that kind of shows that this works. Like, I remember when Gemini was announced, they showed essentially taking a black and white film and then finding one specific frame that happened in the film. Like, there's a whole collection of jobs in certain verticals where people's job is basically to find a needle in a haystack. Like, if you think about the legal industry, there's a lot of people that just comb through thousands of pages of stuff to try to find certain information, and it feels like that is a place that's going to change drastically in the next few years.
If you can put some of that stuff into these context windows or into like, you know, an
LLM, whether it's RAG or some other training methodology, to be able to look that
stuff up immediately.
Yeah, I think we're already seeing it to a certain extent with a lot of the business use cases
that we're trying to support with these models
and seeing effectively a simplification
of some of the use cases we have deployed
from using RAG or actually using RAG
in addition to the long context.
These models are getting so good, actually, at doing things that are derivatives of the jobs that we, you know, the jobs to be done, right? To a certain extent. And being able to expose those capabilities has really opened up the door to, you know, being able to actually be more productive and efficient.
So I even find myself spending less time, right,
like combing through gobs and gobs of pages
to find the one piece of information that's relevant to me,
which is a search problem to some extent,
but I don't need to do the initial pruning
and combing through of the documents afterwards.
Right. Yeah. I mean, there's a lot of like sort of classic search problems that you might have
used like a search engine for and then point you to a link, you click on the link, and then you do
like essentially an internal page search or something like that to find the answer that
you're looking for. And now you don't even have to take those steps and piece it together yourself. And that becomes even more amplified
when you're talking about potentially reams of data
that exists only in a non-digitized format
in a back office somewhere.
Yeah.
And when you're talking about reams of data there,
it really does bring up the interesting aspects
of multimodality that are, you know, sort of top of mind for me and a lot of our customers, to a certain extent.
You mentioned Gemini looking through frames of a video,
audio modalities, sensory modalities,
like all those things,
if you can really kind of harness and unify the power of large models for that,
if you can really kind of harness and unify the power of large models for that,
you just open up the door
to really interesting aspects too.
And we as human beings are really proficient at figuring those things out. And exposing that capability within models, I think that's just, like, the next thing. You know, we'll be working on that for the next six months at least, because I always say, in language model world or, you know, AI world,
you underestimate what you can achieve within a year these days. So that, you know, we're looking ahead for those aspects
very, very soon. Yeah. So, you know, back to the actual process of extending a context window,
like if you take an existing foundation model, how do you actually go about extending the context
window? What is that process? Yeah, so what we employed personally
was a curriculum learning approach
where you stage different training runs
to iteratively increase the context length
of the underlying base model.
And it all really starts out with tracking the proxy for whether a successful context extension is occurring. So what you do is, we employed theta scaling,
which is a technique for positional interpolation,
because what is happening is you have to get the model
to attend to portions of its context
because they're all tokens and properly leverage that context in the setting when it may not
have seen the new set of tokens coming in.
So when you apply that, what you're doing is you're sort of shrinking down the sine
and the cosine amplitude curves that are occurring.
And then you're overlapping them as if they had already occurred during training.
And then you take these samples and you take some sort of data set and then you synthetically
generate a data set that has the new context length during training.
So you're really tracking the initial perplexity curves,
which are just next token prediction,
to guide you in determining whether you chose the right hyperparameters for your context length extension.
And then on top of that, there's the distributed training aspects
of managing the cluster and figuring out how to train that.
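For a sense of the mechanics: Llama-style models encode position with rotary embeddings (RoPE), where each sine/cosine pair rotates at a frequency set by a base parameter theta, and raising theta slows the rotations, which is the "shrinking the curves" effect described here. A sketch (Llama 3's published base theta is 500,000; the scaled value below is illustrative only, since in practice you choose it empirically by tracking perplexity, as Mark says):

```python
import numpy as np

def rope_angles(position, dim=128, theta=500_000.0):
    """Rotation angle of each RoPE sine/cosine pair at a given position."""
    inv_freq = theta ** (-np.arange(0, dim, 2) / dim)  # one frequency per pair
    return position * inv_freq

# Slowest-rotating pair: its angle at the end of each context window.
native = rope_angles(8_192)[-1]                    # 8K window, native theta
scaled = rope_angles(1_000_000, theta=7e7)[-1]     # 1M window, larger theta (illustrative)
print(native, scaled)  # similar magnitudes: far positions land in the trained angular range
```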
And then you did this with Llama 3, is that right?
Yeah, that's correct.
We used the Llama 3 model because we found the number one complaint about that model was the fact that it only had 8,000 tokens of context, which is definitely on the shorter side of what is expected by users today.
You know, honestly, I still don't quite know why they chose that context length. Like, maybe they're already looking ahead to the larger model that they're going to release, or even the Llama 4 model,
you know, that might be released next year.
But that model in particular was, you know, the most suitable candidate for us at the time.
And we'd already been doing internal benchmarks to extend context length on other models.
So right after the model dropped, it was a golden opportunity for us to contribute back
to the community there.
And then how far were you able to extend the context window beyond the original 8,000 tokens?
Yes. So we did two: the 8 billion parameter model and the 70 billion parameter model, and we hit 1 million tokens, which basically passed all of the needle in the haystack evaluations with flying colors. We also extended the 8 billion to 4 million tokens, actually, just to, you know, be the first to plant the flag in the ground for that.
We're still improving it.
Initially, there is some degradation in the model's evaluations there.
And that's mostly because if you think about theta scaling, you are starting
to reach the limits of floating point precision. Because if you look at our theta parameter,
it's a huge number now. So think about taking a neural network and doing all these multiplications.
And if you've ever had to actually train a deep learning model before, like, understanding the vanishing gradient problem and the exploding gradient problem, all those types of corner cases, they start to arise in these types of scenarios too.
How long does it take to extend the model like that?
With respect to the different stages of training that we used, in terms of GPU hours, I think it was in the hundreds to thousands of hours that we had to use in order to finally get to the 1 million context-length model.
So it's not for the faint of heart, I would say. It certainly requires, you know, really good hardware and benefactors.
We were fortunate enough to get sponsored by Crusoe, which is a GPU provider for that, and get access to a really nice cluster.
And then once we had that, we were off to the races, and we knew we were going to tackle this problem
and deliver something to the open source.
And what is the output?
So is it essentially, are you actually,
you're taking the open source model
and then you're generating essentially like a new version
that you're contributing back?
Yeah, exactly.
We have our model weights up on Hugging Face right now.
I don't know what our ranking is at the moment, but we were the number two and number four models maybe a month ago when we did our release.
And to date, we're at a hundred-and-something thousand downloads.
So it's been a really fruitful contribution back, and people ask all these questions about different things the models can do, even unexpected things that we want to evaluate ourselves. So it was exactly the type of release and collaboration that we wanted. And we hope to do more of those things. And we want to learn
more about people's use cases too.
I mean, what sort of scale was this project? Like, I get the
number of GPU hours that you had to spend like training this, but like how many people were
involved and also how did you know that you'd actually be able to be successful? And maybe
you didn't know, but, like, how do you kind of get started on something like this?
Yeah, so maybe I'll first say the honest truth: we didn't know that we'd be successful. We had a strong conviction that we could do it. But it's similar to the scaling laws to a certain extent: you don't know when the scaling laws will stop. Like, are they going to keep continuing as you throw more compute and more flops at the
problem?
And are you going to get that deterministic improvement in models, right?
With respect to ourselves, you know, we're a small startup.
So it was just a team of four of us working on this problem together for those two weeks. And between the
day that we started to the end of two weeks, we were able to achieve the task at hand that we
wanted, including all the evaluations too. So I'm not going to say that we didn't pull a few
all-nighters in between, because we certainly did, and we worked through the weekend to get it done. But what it really does require is intentionality: understanding how to construct the data set synthetically, and how do you set up and run empirical experiments to figure out what is the optimal network topology to use when you do this training. So if any folks are familiar with multi-node training,
most people just avoid multi-node training because it's a pain in the butt.
But beyond that, it's also an empirical setup where from a system standpoint,
you need to trade off and balance network bandwidth communication with your computational flops. So those were, you know, the iterative experiments we had to do beforehand, on other models, before we really started the project. And then, you know, for those two weeks, it was just hammering away with Llama 3 and making sure all the training runs finished successfully.
And then were there any sort of unexpected challenges that you ran into doing this? Like, was it just smooth sailing from the beginning, or did you, you know, hit some roadblocks along the way?
I don't know if anybody who's trained these large models can ever say it is smooth sailing from the beginning. I think between figuring out how to get the correct data, to how you're evaluating the models properly, and then actually babysitting the training job, there were a lot of things that we learned along the process, as well as things that got validated. So one unexpected aspect that arose there is how robustly
the theta scaling trick works. So like positional interpolation is kind of amazing from that
standpoint, where the models really are set up for success to extend their context length, if you
just provide the correct data.
You know, I'll kind of reference something that Ilya Sutskever had said to Dario Amodei, the founder of Anthropic, where he basically said, like, these models, they just want to learn.
So like get the data right,
get the training right and get out of the way.
And that's how like we kind of treated it.
So yeah, throughout the process,
it's like setting up the ablations
and the experiments correctly
and then figuring out how we need to pivot
and what are the knobs we can turn
in order to make sure we can do it.
What are some of the common use cases
that you're seeing from your customers
or people you're interacting with
around using these long context windows?
Yeah, so the interesting ones that we've really seen are particularly in the finance domain
for table reasoning, where you want to answer specific questions or do an investment analysis
or financial analysis on top of like many, many documents. And maybe it's one document
with many pages, or you have to take the knowledge and link multiple pieces of information across
multiple documents. And with retrieval-augmented generation, the main failure mode you would see there is the model's inability to disambiguate pieces of information and then combine and synthesize them, where the lossiness incurred when you summarize context and then kind of proxy it means it isn't enough to answer fine-grained questions there.
And on the parallel track within the healthcare space, there is an aspect of having models produce responses
that require citations for grounding and preventing hallucination, for plan beneficiary information. So with respect to answering questions of coverage and insurance, and
figuring out whether or not these thousands of pages of
different documents can answer the question at hand for maybe customer service representatives
or payers, has been a common use case as well.
So yeah, we were surprised at how effectively these long context models can work.
But for these use cases,
it's just been really, really interesting
to see how accurate they are.
What do you think needs to happen
for like AI agents to become successful
and more widely used?
I think, you know, at the very, the lowest level,
it's nothing different from typical software.
You got to get your P999s in place and you have to hit the reliability.
Beyond that, from algorithmic and a model standpoint, I do think we need to improve the planning capabilities and the reasoning capabilities of these models so that they can be trusted, which relates to the P99s, for longer-term tasks.
Because there's a little bit of difficulty in orienting the models and aligning them properly
for their goals. Like, we have found a specific set of use cases that they're really well attuned to be useful for, but to be generally useful, there's still a lot of work to be done. And one step beyond that, related as well, are the interpretability aspects of it too. They're a black box.
We can't get around that.
In a way, that's a feature, not a bug.
But we don't want to entirely treat them as a black box, right?
So we want to have some traceability there.
And the research needs to be done a little bit more for that
in order to have fully autonomous agents working across all enterprises.
Right.
I mean, as we start to deploy these things into production and have them potentially
have access to sort of turn the knobs for different systems at their discretion, there's
probably, I mean, there's a lot of potential security issues that we have to work through
if we're using agents in that way. Yeah. And the security aspect too has to be created in line with the advancements in the
models themselves to a certain extent, right? I'm sure you're familiar with the aspect of tool use.
How do you combine tool use with the fact that you have to give these models access to the applications they need, as well as the model, in certain scenarios, needing context that might be, you know, data that could be user-facing, right? Like, you have to have the guardrails for that down pat before, you know, you release them on significant revenue streams.
Right, absolutely. So what's next for Gradient and some of the work that you're doing there?
I think we just continue down the path of understanding how the improvement in the long context length enables a model to learn on the fly.
Right. Like, meta-learning and in-context learning is the essential emergent capability that makes AI today different. And really tie those to all the use cases that exist for AI automation in all the industries that we sort of touch on. You know, most APIs for tool use are still relatively naive, where you just give it a spec, and then you give it the JSON, the function signature, and then the payload coming out. But if you can provide it more context, such as documents and unit tests, that's a much richer ability for the model to understand how to use an application or tool in service of completing its task, which is very interesting. So it doesn't even have to be a code-completion copilot model for it to want to use that tool, right? And attaching that to what I think is the future of these agents, which is, when are we going to have the time when I can just send off an agent for, like, one or two days, and it closes a few Jira tickets, or it closes your ServiceNow tickets, and you don't even have to be bothered, right? Like, you don't have to wait all that time, and you can do the best work that you're facilitated to do. So that combination I'm super excited to see. And I think that's why I'm a builder in this time, because I want to see how these things affect us.
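To make the naive-versus-richer tool spec contrast concrete, here's a hypothetical example; the tool name and fields are invented for illustration, not any particular vendor's schema:

```python
# The "naive" spec: just a function signature the model is handed.
naive_spec = {
    "name": "create_ticket",
    "parameters": {"title": "string", "priority": "string"},
}

# A richer spec: the same signature, plus docs and example calls (the
# documents-and-unit-tests idea) the model can read to infer intended usage.
rich_spec = {
    **naive_spec,
    "docs": "Creates a Jira ticket. Priority is one of P0-P3; "
            "P0 pages the on-call, so reserve it for outages.",
    "examples": [
        {"call": {"title": "Login 500s", "priority": "P0"},
         "expect": "ticket id returned, on-call paged"},
        {"call": {"title": "Typo in footer", "priority": "P3"},
         "expect": "ticket id returned"},
    ],
}
```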
Yeah, just with a prompt, basically: go break up my monolith into a microservices architecture and redeploy on Kubernetes.
Exactly. That's something that I want done today. Yeah, awesome.
Well, let's go quick fire here.
So if you'd master one skill you don't have right now, what would it be?
I'd say public speaking.
It's still really tough for me to do presentations and do public speaking.
I think it requires a lot of preparation on my part, and I want to get better at it.
Awesome.
Yeah, I think it's just a matter of reps. You can get the reps in, in order to get there. What wastes the most time in your day?
Exactly what I was telling you, with respect to, you know, the retrieval aspect of a lot of things. It's not so much, like, I think retrieval to a certain extent has gotten way better, and that's where a lot of research and applications have focused. But
just, like, the verification step of all that tends to be hard, right? Like, in a way, when you hire an employee, once you get to the point of, oh, I can trust this person, that's almost saying, like, I don't need to do the verification step anymore.
If you could invest in one company,
that's not the company you founded, who would it be?
Oh, that's a tough one. You know, to a certain extent, there are ones I could already invest in, which is probably, like, you know, some of the chip makers, because I think that they're really positioned to do well in this environment.
And then from a private markets perspective,
you know, it's really tough
because I'm just, I'm so heads down in our company
and just have so much conviction for us to be able to, you know, conquer this market.
Awesome. All right, and then what tool or technology could you not live without?
Yeah, with respect to that, it's, you know, the LLM services today. Like, I don't know what I would do without ChatGPT or Copilot or Perplexity. Like, these are actually fundamental tools that I use every single day,
that if you took them away from me right now, I would feel, you know, naked to a certain extent doing work. I think that's how you can tell when a particular technology is truly transformative: you know, if you think about, like, the internet or the mobile phone,
it's hard to imagine now that you've gotten used to having those technologies always available.
And now, I think with ChatGPT and some of the co-pilots and LLM-powered applications, it's hard to remember what your work life or even your personal life was like before you had those technologies available to you, because you've become so dependent on and used to having them.
Yeah, yeah, for sure.
Which person influenced you the most in your career?
I mean, I'm going to take a sentimental route from here.
It's probably my father.
Hey, if you would have asked me 10 years ago,
I probably wouldn't have said this,
but perspective, like, you know, maturing in perspective, has played a deep role. The aspect of, like, you know, if I really look at my relentless pursuit and just, like, diligence for my work, it was built up from the way that my dad had always, you know, he came to this country and built up his career. I never saw him
complain or stop, but I just saw him run through walls. And then that today motivates me, like,
even just talking about it right now, I want to, I want to conquer everything, just thinking about
how, you know, he really pursued his career. So I wouldn't be the way I am without that.
That's awesome.
Yeah, I definitely think most people, I feel like, as they get older, develop more of an appreciation for their parents.
Yeah, for sure.
Five years from now,
will there be more people writing code or less?
So I am going to take a middle ground, which is interesting. I think that the language of code will change, actually. To a certain extent, I don't think fewer people will be writing code, because you do need some sort of domain-specific language to interface with the AI and the applications that exist. But I don't think that the current world of what we view code as will necessarily increase. Like, I don't think people will be writing as much Java or C++ or any of that. But will there exist some interface and some lingua franca that you will have to communicate with AI with? Absolutely. There's very little probability that that will not happen.
Awesome. Well, anything else you'd like to share?
Yeah. I mean, you know, as a call to action, I would ask people: if you're looking to evaluate long context models and you have use cases out there in terms of your enterprise, or you even want to chop it up with me and talk about what you really think the most useful applications of AI could be for agents in the enterprise, give me a shout. Visit our website and email us.
I think that we're always interested in that. Like,
I care about the problems that people are facing with these things. And
I want to hear about and work with the individuals that are really excited about it, too.
Awesome. Well, I think that's a great place to leave it, Mark. Thanks so much for joining.
And I really enjoyed this.
Yeah, me too, Sean. It was excellent. Thank you very much for having me on.
Cheers.