Software Huddle - AI Agents and Long Context Windows with Mark Huang
Episode Date: June 18, 2024
Today we have Mark Huang on the show. Mark has previously held roles in Data Science and ML at companies like Box and Splunk and is now the co-founder and chief architect of Gradient, an enterprise AI platform to build and deploy autonomous assistants. In our chat, we get into some of the stuff he’s seeing around autonomous AI agents and why people are so excited about that space. Mark and his team have also recently been working on a project to extend the Llama 3 context window. They were able to extend the model from 8K tokens all the way to 1 million through a technique called theta scaling. He walks us through the details of this project and how longer context windows will impact the types of use cases we can serve with LLMs. Follow Mark: https://x.com/markatgradient Follow Sean: https://x.com/seanfalconer
Transcript
At Gradient, we are in the business of enterprise AI automation.
When you start to think about AI systems, in particular, like large language models,
there's also a lot to think through from like a privacy security perspective.
What exactly counts as an autonomous agent?
A non-deterministic executed path. It means that you need to have some sort of mechanism or an intelligence that can interpret the intent
of the user and understand the actions and the side effects that occur in the environment to
redirect the new set of execution traces. You're maybe giving some sort of high level instruction
and then it's making decisions along the way based on what's available to determine which path to take that leads to the outcome that you're looking for.
So like the instruction overall is probably some high-level goal, and the agent is tasked
with achieving that, and you will grant it abilities to interact with the environment.
Hey everyone, Sean here, and today we have Mark Huang on the show. Mark previously held roles in data science and machine learning at companies like Box and Splunk,
and is now the co-founder and chief architect at Gradient,
an enterprise AI platform to build and deploy autonomous assistants.
In our chat, we get into some of the stuff he's seeing around autonomous AI agents
and why people are so excited about that space.
Mark and his team have also recently been working on a project to extend the Llama 3 context window. They were able to extend the model from 8,000 tokens all the way to 1 million through a technique called theta scaling. Mark walks me through the details of
this project and how longer context windows will impact the types of use cases that we can serve
with LLMs. All right, as always, if you ever have any suggestions or comments about the show,
please reach out to me or Alex.
And with that said, let's get over to my conversation with Mark.
Mark, welcome to Software Huddle.
Hi.
Nice to meet you, Sean.
It's a pleasure being on here.
And, you know, I'm a big fan of your show.
Awesome.
Well, it's good to have a fan.
You know, sometimes you're like when you're creating a lot of this type of content, you're
like anybody out there listening besides my mom?
So it's good that folks are paying attention to this.
So you're the co-founder and chief architect at Gradient.
So I wanted to start off by just learning, what's sort of the story behind Gradient?
What is it? Why did you start it? Why become a founder?
Yeah, so at Gradient, we are in the business of
enterprise AI automation. And the way that we do that is we leverage agents in order to
provide flexible workflows that enable people to automate almost any sort of operational business task. And with that, we try to really address all the pain points
that you would previously encounter with robotic process automation, if you're familiar with that,
mostly because language models are just so powerful
and we're able to actually break through the common patterns
that existed before.
So you say enterprise AI, like, what does that mean, I guess, as someone who's just learning
about this stuff? Like, how is that different, maybe than if I'm essentially not in the enterprise
space? What are the certain things that you have to focus on? Because this is enterprise versus if
you're doing this for, you know, mid-market or, you know, small businesses?
Yeah, so I distinguish it on a few levels. What I've really noticed about the long tail of developer-facing applications and products out there is that they're really easy to get started with. And they're actually wonderful for the democratization and usage of AI and having people learn about it. But where they really fall short in the enterprise are the aspects of
being able to grow and ensure security and data governance and also with respect to the
interoperability with integrations, and just the reliability, where, I would effectively call it, you're almost beholden to the companies and users using your product there. Right?
Yeah, I mean, there's a ton of stakeholders
involved in enterprise. And I think also, like, when you start to think about AI systems, in
particular, like large language models, there's also a lot to think through from like a privacy security perspective.
In any enterprise,
the sort of requirements generally
for privacy and security,
like the CISO being involved
is going to be probably a higher bar
than you might meet at other types of companies.
So that's also something that you have to think through
if you're building enterprise AI,
whatever, like architecture or applications
and trying to sell them to enterprise.
Are you going to be able to essentially address the concerns of the CISO and the other people involved from a security perspective, as well as some of these other things that
you're talking about around like reliability and, you know, your build, what's your throughput,
how resilient are your systems, and so forth?
Yeah, for sure.
And, you know, this is something I never really realized before I had to start selling to, you know, management. Largely, our sales process and the way that we operate is we have to sell from the top down. So in a way, we're bringing all these stakeholders along with us. We're becoming the medium for people understanding where their
AI strategy should be, and also bringing the execs all to the table to understand like,
where's our value add there? And how do we build this relationship between all the teams together?
Do you see this as being particularly tricky, just because it's such a new space,
and many of the people you're probably talking to are interested, but maybe they're not experts.
They don't know exactly what they even want or need.
Yeah, I'd be lying to you if I told you it was easy.
I think, as with all things, in a way, it's a really noisy market, and it's noisy because there's a lot of disruption happening. But that's also where all the progress is, and a lot of it comes down to being really good as a communicator and having the type of empathy for the stakeholder that you're talking to.
And that requires a lot of intentionality.
And to be honest, like something that I was never used to.
Like, I'm a builder at heart. I'm a technologist, and what I wanted to do is always build incredible product. But to a certain extent, I need to build product that means something to somebody. I need to build a product that is intentional enough for them to understand how they leverage it. So I think that's probably the number one thing I've had to learn
throughout this process. Yeah. And I think that's hard. That's probably one of the hardest things
for, I would think anyway, even speaking from my own perspective, like when I went from essentially
engineer academic to, uh, founding a company was like, how do you learn the business side of the
business and get good at being comfortable
standing up and pitching what you're talking about, but not being scared to do that and also
not coming across as a used car salesman trying to build a product. Those are hard lessons to
learn, but starting your own company is certainly a forcing function for having to learn that stuff, or you're just not going to survive.
Yeah, I 100% agree. And, you know, I'm super glad that you can also empathize with the situation there, because, yeah, you're taking a lot of shots on goal to figure out how to position yourself, too, right? And something I like to tell
a lot of people is like,
you know, VCs, they like to create all these market maps. And it's not really in service of
just categorizing companies, it's actually for people to be able to create an understanding of
how your product fits into the ecosystem and how that actually delivers usefulness or value towards the end user.
And without that, right, it just seems like a bunch of chaos,
because especially in a space where, like, 100 companies are getting started every week, I feel like, it's hard to figure out what someone even does. Right? Like, that's probably the first question you probably heard as a founder all the time, where someone comes over to you and the first question they ask is, like, can you help me understand what you do? And as a person who's head down building the product, understanding it so intimately yourself, having to explain that to someone with no context, like, that's a skill in itself, right? So I think that's also important, too, because it does build a little bit of conviction in yourself when you're the founder trying to grow your company.
The whole, you know, give me the elevator pitch, give me the one-liner, like, that's super, super important.
Yeah, absolutely. I think, for me, I was kind of originally forced to develop some of those skills when I was doing
my PhD, because we used to bring, you know, guests into the lab all the time, and we'd stand up in a line, and our professor would be like, you know, tell them what you're working on in, like, you know, 30 seconds or less. So you're trying to consolidate, like, four years of research and distill it down into, like, 30 seconds. And the people that understood the project they were working on the most were able to distill and condense it in a way that was understandable to anyone the best, because it really comes from getting that intimate familiarity with it, and then figuring out how to translate it to different audiences. And that's a real, you know, skill set that's super, super important for
all people to sort of develop over time.
Yeah, I even tell my employees, to a certain extent, with respect to, like, how do you grow and how do you become, you know, the next iteration in your career and increase your impact.
We always say, bring others along with you.
The people who are absolutely the best at that,
they're able to achieve great things.
And I parallel that almost even to the same respect
as to how, if you ever heard about the reasons
why OpenAI was able to beat Google
in training these large models, it's the fact that, unfortunately, within the organizational structure of Google, it's hard to bring others along with you.
And if you could do that, and you can just raise your focus on helping others understand what you need and do, that's a powerful instrument in itself.
Yeah, absolutely.
All right, so I want to talk agents. So I think
like autonomous agents, I feel like they're like all the rage right now. Like, you know, last year,
maybe it was about RAG. Everyone was talking about RAG. Now we're talking about, you know,
agents and there's a lot of articles about it and how this is sort of the future and so on.
So what exactly counts as an autonomous agent? And what are some of the things someone could actually use an autonomous agent for?
Yeah, so the modern definition of what an autonomous agent is, is actually an executed path, a non-deterministic
executed path. And by that, it means that you need to have some sort of mechanism or
an intelligence that can interpret the intent of the user and understand the actions and the
side effects that occur in the environment to redirect the new set of execution traces.
So it's similar to a graph, right?
Like I've seen a lot of people create that pattern
for creating agents.
And a lot of it comes with being able
to handle out of domain scenarios.
Okay.
So this is in some ways maybe a replacement for something that we might think of
as a finite state machine
where it's a deterministic flow in that case.
We're specifically telling some sort of automation to take this path
under these conditions and stuff like that.
In the case of an autonomous agent, you're maybe giving some sort of high level instruction
and then it's making decisions along the way based on what's available to determine which
path to take that leads to the outcome that you're looking for.
Yeah.
If you parallel that towards self-driving cars to a certain extent, right?
Like, what has been the hardest part about self-driving? It has been the ability to plan, and determining plans just means, as you said,
taking high level, typically vague and ambiguous instructions to satisfy that goal. So like the
instruction overall is probably some high level goal
and the agent is tasked with achieving that
and you will grant it abilities
to interact with the environment.
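To make that contrast with a finite state machine concrete, here's a minimal sketch of the loop being described, with hypothetical helper names (llm_plan_next_step, a tools dict) standing in for a real model call and real integrations; it is an illustration of the idea, not Gradient's implementation:

```python
# Hypothetical sketch of a non-deterministic agent loop. An FSM would
# hard-code the next transition; here a model call chooses it at runtime
# based on what has been observed so far.

def llm_plan_next_step(goal, history, tools):
    """Placeholder for a real LLM call. Would prompt a model with the goal,
    the observation history, and the tool descriptions, and return either
    (tool_name, args) or None when it judges the goal satisfied."""
    raise NotImplementedError

def run_agent(goal, tools, max_steps=10):
    history = []
    for _ in range(max_steps):
        step = llm_plan_next_step(goal, history, tools)
        if step is None:            # model decides the goal is achieved
            break
        tool_name, args = step
        observation = tools[tool_name](**args)          # side effect in the environment
        history.append((tool_name, args, observation))  # feeds the next decision
    return history
```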
So all the different applications of it,
actually, I think, you know, right now,
the really interesting ones are all related
with just software applications. But, you know, there are parallels to self-driving cars, to robotics, and then various systems that are, you know, almost physical to a certain extent.
And, you know, we're still fairly early in this space.
But where do you, besides software systems, like where is some of the initial momentum and excitement around this?
Is it more sort of business use cases
or are you seeing things
that are potentially consumer facing?
Like, hey, I know I want to,
I don't know, book a restaurant on this night
or I want to book a vacation with these parameters.
It needs to be tropical.
And then like, hey, autonomous agent,
go off and take care of this for me.
Yeah, I think on both sides, I'm super excited,
either consumer facing or business facing
for the applications there,
particularly for the business facing aspects of stuff.
It'll completely change the way we interface with AI applications today.
People are really used to chatbots.
I think ChatGPT conceptualized probably one of the most seamless experiences with embedding AI.
But for agents, people will get used to asynchronous workflows where you're not actually waiting for the response back.
And you may not actually care for the latency component there,
but you're sending something out to achieve what you want.
And it would be great if I could just tell something to schedule out my entire day,
to a certain extent, book flights, figure out, you know, day two of my vacation, and all of that, because we're taking away and granting access to a lot of, you know, tedious things that maybe we don't really care to do because they're pretty simple. And I do think it does raise the bar, as, you know, a society, on what are the things that we're going to be doing with all our free time, right?
Right, yeah.
How are you going to work alongside AI that way?
Yeah.
So most of my free time is taken up with a two and a four-year-old.
So I'm not sure.
I'm still trying to figure out how AI will help me there.
But if there's some agent out there that can help me, I will be very,
very interested in and pay top dollar for. But in terms of the fact that latency potentially matters less, you know, if I'm chatting with something like ChatGPT, I'm expecting almost, you know, sort of real-time responses, right? But if I'm saying, like, go figure out my vacation, and even if it took 24 hours and came back with something that was really, really good, I wouldn't really care that it took 24 hours to do that. So how does the fact that latency is less of an issue change the current operating patterns that people have?
It does open up the door for strategies and applications that may require more compute time and handle much more context. Then the focus will shift
much more into those aspects.
And, you know, you won't actually be delving into so much research specifically on cutting
down, you know, the matrix multiplication speeds and maybe focusing on other aspects
of research and development.
Yeah.
So I think that's another area, besides agents, that now is a big topic of conversation: this growing size of context windows.
Like there's been a ton of advancement in the last year.
Like the length that was like bleeding edge a year ago is now sort of table
stakes.
So like,
how did that happen?
Like,
were there particular things that happened in the research, or even at the companies that are leading the charge with LLMs and AI generally?
Something that allowed them to sort of break the barrier on what we can do in terms of the size of a context window?
Yeah, there have been a lot of different technological innovations, and algorithmic ones as well.
To my mind, probably the first things that kicked a lot of this off, I'll separate them between the lossless algorithms and then the lossy algorithms.
So for the lossless algorithms, the first that really came about was flash attention.
Probably everybody's heard of it.
And that's what really pushed the context window longer than what we initially were used to, mostly because they reworked the way the computation of the attention mechanism is done, through different chunks, little batches, so that you only actually grab the set of attention that you need at the time when you need to do the backpropagation, or when you actually have to do it at inference time.
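As a rough illustration of that chunked computation, here is an online-softmax sketch in plain NumPy, for a single query vector and with the usual 1/sqrt(d) scaling omitted. This is only the core recurrence; real FlashAttention also tiles over queries and fuses everything into one GPU kernel:

```python
import numpy as np

def chunked_attention(q, K, V, chunk=128):
    """Attention output for one query without materializing the full score row.

    Streams over K/V in chunks, keeping a running max (m) and running softmax
    denominator, so memory stays O(chunk) instead of O(sequence length)."""
    m, denom, out = -np.inf, 0.0, np.zeros(V.shape[1])
    for i in range(0, len(K), chunk):
        s = K[i:i + chunk] @ q               # scores for this chunk only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)            # rescale earlier accumulators
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        out = out * scale + p @ V[i:i + chunk]
        m = m_new
    return out / denom

# Sanity check against the naive full-matrix computation.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(1024, 64)), rng.normal(size=(1024, 32))
scores = K @ q
w = np.exp(scores - scores.max()); w /= w.sum()
assert np.allclose(chunked_attention(q, K, V), w @ V)
```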
The second sort of class of things that have been happening is just, you know, NVIDIA keeps
on spending money to increase the VRAM on a lot of these chips. So that just gives you what you need on that side to be able to actually support these models.
And then I'll talk maybe briefly on the lossy approaches.
You have quantization, which is really simple,
representing a number from a particular precision
to a smaller unit of precision.
And then that comes coupled though
with model degradation at a certain stage,
but there's research to kind of figure out
how you can ensure sparsity to retain that model quality.
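For intuition, here's a minimal per-tensor int8 example of what moving to a smaller unit of precision means; production schemes (per-channel scales, 4-bit formats, outlier handling) are more sophisticated than this sketch:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 -> int8 plus one scale."""
    scale = np.abs(w).max() / 127.0               # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale           # lossy: rounding error remains

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()      # the degradation Mark mentions
print(f"4x smaller than float32, max round-trip error: {err:.5f}")
```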
And those are sort of the three high-level innovations that have really set forth and
enabled us to increase the context length.
And those continue to be actually the same things that people try to innovate on today.
So there's really no lack of minds trying to work on these problems.
Yeah, so it sounds like there's been both sort of innovation on the software side as well as the hardware side to help make these context windows larger. By having a larger context window, does that help keep models smaller?
by having a larger context window does that help keep models smaller?
You know, that is a good question, because I think there are two things that are happening there.
In terms of keeping models smaller, I think that's more so that you want the smallest model that can achieve the tasks that you intend it to perform. But the main blocker for that
is actually the amount of tokens
that you train it on, in my opinion.
Because if you really think about the scaling laws
or even the Chinchilla paper, like, they're trying to find the compute optimal.
What's the compute optimal training
that you need, right?
At a particular model size.
And then also how many tokens you need to train a model on for that.
People have pushed far beyond those heuristics.
And if you look at Llama 3, it was trained on 15 trillion tokens.
And that has enabled you to have a more powerful model at a smaller size.
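To put numbers on that, here's a back-of-the-envelope comparison, using the common reading of the Chinchilla compute-optimal frontier as roughly 20 training tokens per parameter (a heuristic, not an exact law):

```python
params = 8e9                     # Llama 3 8B
chinchilla_tokens = 20 * params  # ~160B tokens would be "compute optimal"
actual_tokens = 15e12            # Llama 3 was trained on ~15T tokens
print(actual_tokens / chinchilla_tokens)  # ~94x past the heuristic
```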
And then the context window, if you increase it there, now you're unlocking, like, hey, now you have a more powerful model at a smaller size that can also handle a lot of tokens at runtime. And people are just pushing the parameters to be as large as possible, because as you increase the parameters, it's actually more sample efficient with the tokens. So in a way, you're trading off the runtime costs of serving this thing in service of having something that is incredibly powerful.
And then how do you, like, how do people typically evaluate
the size of the context window
and, like, what's actually working?
It's still an area of open research.
My company, Gradient,
we really are trying to, you know, do a lot of this work and collaborate in the open on it.
In particular, like,
NVIDIA has their benchmark
suite that we really like. For the, you know, the uninitiated, the first evaluation is the needle in the haystack evaluation that Greg Kamradt created. And that is benchmarking a model's ability to retrieve. It's basically pass-key retrieval, where you're retrieving a key, or sorry, you're retrieving a value given a key, that's just sent in an ocean of tokens. And then you ablate where that key actually is located. And then the RULER evals are even more comprehensive, where they include
both needle in the haystack and then three other different types of evaluations. One is variable
tracking. So you're tracking state of a variable over all of these large context windows. Another
is aggregation, wherein you want to find a variable and then you want to do an aggregation
on it, like counting the number of occurrences or sums. And then finally, a distraction
evaluation, where it takes SQuAD and then generates a synthetic data set in which you throw the model distracting context and still make it answer, to see if it can properly answer a question.
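A minimal sketch of how a pass-key prompt like that gets constructed (the filler text and phrasing here are invented for illustration; a real harness counts tokens and sweeps the depth systematically):

```python
def needle_prompt(n_filler_sentences, needle, depth):
    """Bury one fact at a relative depth in [0, 1] inside filler text,
    then ask for it back -- the pass-key retrieval setup described above."""
    filler = ["The grass is green.", "The sky is blue."] * (n_filler_sentences // 2)
    pos = int(depth * len(filler))        # the position you ablate across runs
    filler.insert(pos, needle)
    haystack = " ".join(filler)
    return f"{haystack}\n\nWhat is the magic number? Answer with the number only."

prompt = needle_prompt(2000, "The magic number is 74913.", depth=0.35)
# Scoring: does the model's completion contain "74913"?
```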
And these are really comprehensive.
Like, don't get me wrong, but there are even more nuanced aspects that go beyond what I would consider these evals to be looking at, which touch much more on multi-hop reasoning and planning capabilities, because those are still very short context lengths.
Yeah. The needle in a haystack one, I think, I mean, that kind of shows that this works. Like, I remember when Gemini was announced, they showed essentially taking a black and white film and then finding one specific frame that happened in the film. Like, there's a whole collection of jobs in certain verticals where people's job is basically to find a needle in a haystack. Like, if you think about the legal industry, there's a lot of people that just comb through thousands of pages of stuff to try to find certain information, and it feels like that is a place that's going to change drastically in the next few years.
If you can put some of that stuff into these context windows or into like, you know, an
LLM, whether it's RAG or some other training methodology, to be able to look that
stuff up immediately.
Yeah, I think we're already seeing it to a certain extent with a lot of the business use cases
that we're trying to support with these models
and seeing effectively a simplification
of some of the use cases we have deployed
from using RAG or actually using RAG
in addition to the long context.
These models are getting so good, actually, at doing things that are derivatives of the jobs that we, you know, the jobs to be done, right? To a certain extent. And being able to expose those capabilities has really opened up the door to, you know, being able to actually be more productive and efficient.
So I even find myself spending less time, right,
like combing through gobs and gobs of pages
to find the one piece of information that's relevant to me,
which is a search problem to some extent,
but I don't need to do the initial pruning
and combing through of the documents afterwards.
Right. Yeah. I mean, there's a lot of like sort of classic search problems that you might have
used like a search engine for and then point you to a link, you click on the link, and then you do
like essentially an internal page search or something like that to find the answer that
you're looking for. And now you don't even have to take those steps and piece it together yourself. And that becomes even more amplified
when you're talking about potentially reams of data
that exists only in a non-digitized format
in a back office somewhere.
Yeah.
And when you're talking about reams of data there,
it really does bring up the interesting aspects
of multimodality that are, you know, sort of top of mind for me and a lot of our customers, to a certain extent.
You mentioned Gemini looking through frames of a video,
audio modalities, sensory modalities,
like all those things,
if you can really kind of harness and unify the power of large models for that,
if you can really kind of harness and unify the power of large models for that,
you just open up the door
to really interesting aspects too.
And we as human beings are really proficient at figuring those things out. And exposing that capability within models, I think that's just, like, the next thing. You know, we'll be working on that for the next six months at least, because I always say, in language model world or, you know, AI world,
you underestimate what you can achieve within a year these days. So that, you know, we're looking ahead for those aspects
very, very soon. Yeah. So, you know, back to the actual process of extending a context window,
like if you take an existing foundation model, how do you actually go about extending the context
window? What is that process? Yeah, so what we employed personally
was a curriculum learning approach
where you stage different training runs
to iteratively increase the context length
of the underlying base model.
And it all really starts out with tracking the proxy for whether a successful context extension is occurring. So what you do is, we employed theta scaling,
which is a technique for positional interpolation,
because what is happening is you have to get the model
to attend to portions of its context
because they're all tokens and properly leverage that context in the setting when it may not
have seen the new set of tokens coming in.
So when you apply that, what you're doing is you're sort of shrinking down the sine
and the cosine amplitude curves that are occurring.
And then you're overlapping them as if they had already occurred during training.
And then you take these samples and you take some sort of data set and then you synthetically
generate a data set that has the new context length during training.
So you're really tracking the initial perplexity curves,
which are just next token prediction,
to guide you in determining whether you chose the right hyperparameters for your context length extension.
And then on top of that, there's the distributed training aspects
of managing the cluster and figuring out how to train that.
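For a sense of the mechanics: Llama-style models encode position with rotary embeddings (RoPE), where each sine/cosine pair rotates at a frequency set by a base parameter theta, and raising theta slows the rotations, which is the "shrinking the curves" effect described here. A sketch (Llama 3's published base theta is 500,000; the scaled value below is illustrative only, since in practice you choose it empirically by tracking perplexity, as Mark says):

```python
import numpy as np

def rope_angles(position, dim=128, theta=500_000.0):
    """Rotation angle of each RoPE sine/cosine pair at a given position."""
    inv_freq = theta ** (-np.arange(0, dim, 2) / dim)  # one frequency per pair
    return position * inv_freq

# Slowest-rotating pair: its angle at the end of each context window.
native = rope_angles(8_192)[-1]                    # 8K window, native theta
scaled = rope_angles(1_000_000, theta=7e7)[-1]     # 1M window, larger theta (illustrative)
print(native, scaled)  # similar magnitudes: far positions land in the trained angular range
```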
And then you did this with Llama 3, is that right?
Yeah, that's correct.
We used the Llama 3 model because we found the number one complaint about that model was the fact that it only had 8,000 tokens of context, which is definitely on the shorter side of what is expected by users today.
You know, honestly, I still don't quite know why they chose that context length. Like, maybe they're already looking ahead to the larger model that they're going to release, or even the Llama 4 model,
you know, that might be released next year.
But that model in particular was, you know, the most suitable candidate for us at the time.
And we'd already been doing internal benchmarks to extend context length on other models.
So right after the model dropped, it was a golden opportunity for us to contribute back
to the community there.
And then how far were you able to extend the context window beyond the original 8,000 tokens?
Yes. So we did two: the 8 billion parameter model and the 70 billion parameter model, and we hit 1 million tokens, which basically passed all of the needle in the haystack evaluations with flying colors. We also extended the 8 billion to 4 million tokens, actually, just to, you know, be the first to plant the flag in the ground for that.
We're still improving it.
Initially, there is some degradation in the model's evaluations there.
And that's mostly because if you think about theta scaling, you are starting
to reach the limits of floating point precision. Because if you look at our theta parameter,
it's a huge number now. So think about taking a neural network and doing all these multiplications.
And if you've ever had to actually train a deep learning model before, like, understanding the vanishing gradient problem and the exploding gradient problem, all those types of corner cases, they start to arise in these types of scenarios too.
How long does it take to extend the model like that?
With respect to the different stages of training that we used, in terms of GPU hours, I think it was in the hundreds to thousands of hours that we had to use in order to finally get to the 1 million context-length model.
So it's not for the faint of heart, I would say. It certainly requires, you know, really good hardware and benefactors.
We were fortunate enough to get sponsored by Crusoe, which is a GPU provider for that, and get access to a really nice cluster.
And then once we had that, we were off to the races, and we knew we were going to tackle this problem
and deliver something to the open source.
And what is the output?
So is it essentially, are you actually,
you're taking the open source model
and then you're generating essentially like a new version
that you're contributing back?
Yeah, exactly.
We have our model weights up on Hugging Face right now.
I don't know what our ranking is at the moment, but we were the number two and number four models maybe a month ago when we did our release.
And to date, we're at a hundred-and-something thousand downloads.
So it's been a really fruitful contribution back, and people ask all these questions about different things the models can do, even unexpected things that we want to evaluate ourselves. So it was exactly the type of release and collaboration that we wanted. And we hope to do more of those things. And we want to learn
more about people's use cases too.
I mean, what sort of scale was this project? Like, I get the
number of GPU hours that you had to spend like training this, but like how many people were
involved and also how did you know that you'd actually be able to be successful? And maybe
you didn't know, but, like, how do you kind of get started on something like this?
Yeah, so maybe I'll first say the honest truth: we didn't know that we'd be successful. We had a strong conviction that we could do it. But it's similar to the scaling laws to a certain extent: you don't know when the scaling laws will stop. Like, are they going to keep continuing as you throw more compute and more flops at the
problem?
And are you going to get that deterministic improvement in models, right?
With respect to ourselves, you know, we're a small startup.
So it was just a team of four of us working on this problem together for those two weeks. And between the
day that we started to the end of two weeks, we were able to achieve the task at hand that we
wanted, including all the evaluations too. So I'm not going to say that we didn't pull a few
all-nighters in between, because we certainly did, and we worked through the weekend to get it done. But what it really does require is intentionality: understanding how to construct the data set synthetically, and how do you set up and run empirical experiments to figure out what is the optimal network topology to use when you do this training. So if any folks are familiar with multi-node training,
most people just avoid multi-node training because it's a pain in the butt.
But beyond that, it's also an empirical setup where from a system standpoint,
you need to trade off and balance network bandwidth communication with your computational flops. So those were, you know, the iterative experiments we had to do beforehand, on other models, before we really started the project. And then, you know, for those two weeks, it was just hammering away with Llama 3 and making sure all the training runs finished successfully.
And then were there any sort of unexpected challenges that you ran into doing this? Like, was it just smooth sailing from the beginning, or did you, you know, hit some roadblocks along the way?
I don't know if anybody who's trained these large models can ever say it is smooth sailing from the beginning. I think between figuring out how to get the correct data, to how you're evaluating the models properly, and then actually babysitting the training job, there were a lot of things that we learned along the process, as well as things that got validated. So one unexpected aspect that arose there is how robustly
the theta scaling trick works. So like positional interpolation is kind of amazing from that
standpoint, where the models really are set up for success to extend their context length, if you
just provide the correct data.
You know, I'll kind of reference something that Ilya Sutskever had said to Dario Amodei, the founder of Anthropic, where he basically said, like, these models, they just want to learn.
So like get the data right,
get the training right and get out of the way.
And that's how like we kind of treated it.
So yeah, throughout the process,
it's like setting up the ablations
and the experiments correctly
and then figuring out how we need to pivot
and what are the knobs we can turn
in order to make sure we can do it.
What are some of the common use cases
that you're seeing from your customers
or people you're interacting with
around using these long context windows?
Yeah, so the interesting ones that we've really seen are particularly in the finance domain
for table reasoning, where you want to answer specific questions or do an investment analysis
or financial analysis on top of like many, many documents. And maybe it's one document
with many pages, or you have to take the knowledge and link multiple pieces of information across
multiple documents. And with retrieval-augmented generation, the main failure mode you would see there is the model's inability to disambiguate pieces of information and then combine and synthesize them, where the lossiness incurred when you summarize context and then kind of proxy it means it isn't enough to answer fine-grained questions there.
And on the parallel track within the healthcare space, there is an aspect of having models produce responses
that require citations for grounding and preventing hallucination, for plan beneficiary information. So with respect to answering questions of coverage and insurance, and
figuring out whether or not these thousands of pages of
different documents can answer the question at hand for maybe customer service representatives
or payers, has been a common use case as well.
So yeah, we were surprised at how effectively these long context models can work.
But for these use cases,
it's just been really, really interesting
to see how accurate they are.
What do you think needs to happen
for like AI agents to become successful
and more widely used?
I think, you know, at the very, the lowest level,
it's nothing different from typical software.
You got to get your P999s in place and you have to hit the reliability.
Beyond that, from algorithmic and a model standpoint, I do think we need to improve the planning capabilities and the reasoning capabilities of these models so that they can be trusted, which relates to the P99s, for longer-term tasks.
Because there's a little bit of difficulty in orienting the models and aligning them properly
for their goals. Like, we have found a specific set of use cases that they're really well attuned to be useful for, but to be generally useful, there's still a lot of work to be done. And one step beyond that, related as well, are the interpretability aspects of it too. They're a black box.
We can't get around that.
In a way, that's a feature, not a bug.
But we don't want to entirely treat them as a black box, right?
So we want to have some traceability there.
And the research needs to be done a little bit more for that
in order to have fully autonomous agents working across all enterprises.
Right.
I mean, as we start to deploy these things into production and have them potentially
have access to sort of turn the knobs for different systems at their discretion, there's
probably, I mean, there's a lot of potential security issues that we have to work through
if we're using agents in that way. Yeah. And the security aspect too has to be created in line with the advancements in the
models themselves to a certain extent, right? I'm sure you're familiar with the aspect of tool use.
How do you combine tool use with the fact that you have to give these models access to the applications they need, as well as the model, in certain scenarios, needing context that might be, you know, data that could be user-facing, right? Like, you have to have the guardrails for that down pat before, you know, you release them on significant revenue streams.
Right, absolutely. So what's next for Gradient and some of the work that you're doing there?
I think we just continue down the path of understanding how the improvement in the long context length enables a model to learn on the fly.
Right. Like, meta-learning and in-context learning is the essential emergent capability that makes AI today different. And really tie those to all the use cases that exist for AI automation in all the industries that we sort of touch on. You know, most APIs for tool use are still relatively naive, where you just give it a spec, and then you give it the JSON, the function signature, and then the payload coming out. But if you can provide it more context, such as documents and unit tests, that's a much richer ability for the model to understand how to use an application or tool in service of completing its task, which is very interesting. So it doesn't even have to be a code-completion copilot model for it to want to use that tool, right? And attaching that to what I think is the future of these agents, which is, when are we going to have the time when I can just send off an agent for, like, one or two days, and it closes a few Jira tickets, or it closes your ServiceNow tickets, and you don't even have to be bothered, right? Like, you don't have to wait all that time, and you can do the best work that you're facilitated to do. So that combination I'm super excited to see. And I think that's why I'm a builder in this time, because I want to see how these things affect us.
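To make the naive-versus-richer tool spec contrast concrete, here's a hypothetical example; the tool name and fields are invented for illustration, not any particular vendor's schema:

```python
# The "naive" spec: just a function signature the model is handed.
naive_spec = {
    "name": "create_ticket",
    "parameters": {"title": "string", "priority": "string"},
}

# A richer spec: the same signature, plus docs and example calls (the
# documents-and-unit-tests idea) the model can read to infer intended usage.
rich_spec = {
    **naive_spec,
    "docs": "Creates a Jira ticket. Priority is one of P0-P3; "
            "P0 pages the on-call, so reserve it for outages.",
    "examples": [
        {"call": {"title": "Login 500s", "priority": "P0"},
         "expect": "ticket id returned, on-call paged"},
        {"call": {"title": "Typo in footer", "priority": "P3"},
         "expect": "ticket id returned"},
    ],
}
```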
Yeah, just with a prompt, basically: go break up my monolith into a microservices architecture and redeploy on Kubernetes.
Exactly. That's something that I want done today. Yeah, awesome.
Well, let's go quick fire here.
So if you'd master one skill you don't have right now, what would it be?
I'd say public speaking.
It's still really tough for me to do presentations and do public speaking.
I think it requires a lot of preparation on my part, and I want to get better at it.
Awesome.
Yeah, I think it's just a matter of reps. You can get the reps in, in order to get there. What wastes the most time in your day?
Exactly what I was telling you, with respect to, you know, the retrieval aspect of a lot of things. It's not so much, like, I think retrieval to a certain extent has gotten way better, and that's where a lot of research and applications have focused. But
just, like, the verification step of all that tends to be hard, right? Like, in a way, when you hire an employee, once you get to the point of, oh, I can trust this person, that's almost saying, like, I don't need to do the verification step anymore.
If you could invest in one company,
that's not the company you founded, who would it be?
Oh, that's a tough one. You know, to a certain extent, there are ones I could already invest in, which is probably, like, you know, some of the chip makers, because I think that they're really positioned to do well in this environment.
And then from a private markets perspective,
you know, it's really tough
because I'm just, I'm so heads down in our company
and just have so much conviction for us to be able to, you know, conquer this market.
Awesome. All right, and then what tool or technology could you not live without?
Yeah, with respect to that, it's, you know, the LLM services today. Like, I don't know what I would do without ChatGPT or Copilot or Perplexity. Like, these are actually fundamental tools that I use every single day,
that if you took them away from me right now, I would feel, you know, naked to a certain extent doing work. I think that's how you can tell when a particular technology is truly transformative: you know, if you think about, like, the internet or the mobile phone,
it's hard to imagine now that you've gotten used to having those technologies always available.
And now, I think with ChatGPT and some of the co-pilots and LLM-powered applications, it's hard to remember what your work life or even your personal life was like before you had those technologies available to you, because you've become so dependent on and used to having them.
Yeah, yeah, for sure.
Which person influenced you the most in your career?
I mean, I'm going to take a sentimental route from here.
It's probably my father.
Hey, if you would have asked me 10 years ago,
I probably wouldn't have said this,
but perspective, like, you know, maturing in perspective, has played a deep role. The aspect of, like, you know, if I really look at my relentless pursuit and just, like, diligence for my work, it was built up from the way that my dad had always, you know, he came to this country and built up his career. I never saw him
complain or stop, but I just saw him run through walls. And then that today motivates me, like,
even just talking about it right now, I want to, I want to conquer everything, just thinking about
how, you know, he really pursued his career. So I wouldn't be the way I am without that.
That's awesome.
Yeah, I definitely think most people, I feel like, as they get older, develop more of an appreciation for their parents.
Yeah, for sure.
Five years from now,
will there be more people writing code or less?
So I am going to take a middle ground, which is interesting. I think that the language of code will change, actually. To a certain extent, I don't think fewer people will be writing code, because you do need some sort of domain-specific language to interface with the AI and the applications that exist. But I don't think that the current world of what we view code as will necessarily increase. Like, I don't think people will be writing as much Java or C++ or any of that. But will there exist some interface and some lingua franca that you will have to communicate with AI with? Absolutely. There's very little probability that that will not happen.
Awesome. Well, anything else you'd like to share?
Yeah. I mean, you know, as a call to action, I would ask people: if you're looking to evaluate long context models and you have use cases out there in terms of your enterprise, or you even want to chop it up with me and talk about what you really think the most useful applications of AI could be for agents in the enterprise, give me a shout. Visit our website and email us.
I think that we're always interested in that. Like,
I care about the problems that people are facing with these things. And
I want to hear about and work with the individuals that are really excited about it, too.
Awesome. Well, I think that's a great place to leave it, Mark. Thanks so much for joining.
And I really enjoyed this.
Yeah, me too, Sean. It was excellent. Thank you very much for having me on.
Cheers.