Latent Space: The AI Engineer Podcast - Production AI Engineering starts with Evals — with Ankur Goyal of Braintrust

Starting point is 00:00:00 Welcome back, listeners. This is Charlie, your AI co-host. The LLM-Ops landscape has been one of the most competitive battlegrounds of the four wars of the AI stack that we have been referencing throughout this year. We started covering this in our Human Loop episode with Raza Habib last year, and this year, a new player has burst on the scene with Brain Trust, led by today's guest, Angkor Goyal. Brain Trust started just over a year ago as an ink,

Starting point is 00:00:32 incubation with Ella Gill, and this week Brain Trust raised a $36 million series A at a $150 million valuation with Martin Casado at Andresen Horowitz. We were honored to do the deepest dive into Anchor's journey running a pre-LM AI company and getting disrupted by Burt, getting acquired by Figma, and choosing to start an AI engineering infrastructure company despite having the perfect resume to start a vector database. As the infra provider of choice to notion, Stripe, Versal and others, Ankur also sees the market share of workloads across many of the top AI product teams, so we could not resist asking questions about the rise of Claude 3 and the surprisingly low adoption of open source models in AI, which also leads to his hot take on fine tuning.

Starting point is 00:01:24 In latent space news, you'll notice that we've cut the Suno intro songs. Many of you love them. Some skip them, so we're dropping them for now. Alessio was out of town for this episode, but is running the Decibel AI Pioneer Summit next week. Swix will be in New York City for Columbus Day, so if you are in the tri-state area, come by the latent space meetup linked in the show notes. Watch out and take care. Uncle Guilla, welcome to Leiton Space. Thanks for having me. Thanks for coming all the way over to our studio. It was a long, long trek. Yeah. You got T-boned, no, by traffic. Yeah.

Starting point is 00:02:05 You were the first VP of, the first VP of V-PV-V-G at Single Store. Yeah. Then you started Impera. I ran it for six years, got acquired into Figma, where you were at for eight months. And you just celebrated your one-year anniversary of Brain Trust. I did, yeah. What a journey. I kind of want to go through each in turn because I have a personal relationship with single store

Starting point is 00:02:24 just because I have been a follower and fan of databases for a while. H-TAP is always a dream of every database guy. It's still the dream. When H-TAP, and single store, I think, is the leading H-Tap. Yeah. What's that journey like? And then maybe we'll cover the rest later. But sounds good.

Starting point is 00:02:39 We can start single-store first. Yeah, yeah. In college, as an Indian, you know, the first-generation Indian kid, I basically had two options. I had already told my parents I wasn't going to be a doctor. They're both doctors. So, you know, only two options left. Do a PhD or work at a big company. And after my sophomore year, I worked at Microsoft, and it just wasn't for me.

Starting point is 00:03:00 I realized that the work I was doing was impactful. Like people, you know, there were millions. I was working on Bing and like the distributed compute infrastructure at Bing, which is actually now part of Azure. And there were hundreds of engineers using the infrastructure that we were working on. But the level of intensity was too low. So it felt like, you know, you got work-life balance and impact, but very little creativity, very little sort of room to do interesting things.

Starting point is 00:03:30 So I was like, okay, let me cross that off the list. The only option left is to do research. I did research the next summer. And I kind of realized, again, no one's like working that hard. Maybe the times have changed. But at that point, you know, there's a lot of creativity. And so you were just bouncing around fun ideas and working on stuff and really great work-life balance. But no one would actually use the stuff that we built. And that was not super energizing for me. And so I had this like existential crisis. And I moved out to San Francisco because I had a friend. who was here and crashed on his couch and was talking to him and just very, very confused. And he said, you should talk to a recruiter, which felt like really weird advice. I'm not even sure I would give that advice to someone nowadays, but I met this really great guy named John. And he introduced me to like 30 different companies. And I realized that there's actually a lot of interesting stuff happening in startups. And maybe I could find this kind of company that let me be very creative and work really hard and have a lot of impact. And I don't give a shit about work life balance. And so I talked to all these companies.

Starting point is 00:04:32 And I remember I met MemSQL when it was three people and interviewed. And I thought I just totally failed the interview. But I had never had so much fun in my life. And I left. I remember I was at 10th in Harrison and I stood at the bus station. And I called my parents and said, I'm sorry, I'm dropping out of school. I thought I wouldn't get the offer. But I just realized that if there's something like this company, then this is where I need to be.

Starting point is 00:04:57 Luckily, things worked out. And I got an offer and I joined as employee number two. And I worked there for almost six years. And it was an incredible experience. I learned a lot about systems, got to work with amazing customers. There are a lot of things that I took for granted that I later learned at Empira that I had taken for granted. And the most exciting thing is I got to run the engineering team, which was a great opportunity to learn about tech, kind of on a larger stage, recruit a lot of great people. And I think for me personally set me up to do. a lot of interesting things after. Yeah. There's so many ways I can take that. The most curious, I think, for general audiences is, is the dream real of single store? Should obviously more people be using it? I think there's a lot of marketing from single store that makes sense, but there's a lot

Starting point is 00:05:46 of doubt in people's minds. What do you think you've seen as the most convincing as to, like, when is it suitable for people to adopt single store and when is it not? Bear in mind that I'm now eight years removed from single store. So they've done a lot of stuff since I left. But maybe like the meta thing I would say or the meta learning for me is that even if you build the most sophisticated or advanced technology in a particular space, it doesn't mean that it's something that everyone can use. And I think one of the tradeoffs with single store specifically is that you have to be willing to invest in hardware and software cost

Starting point is 00:06:21 that achieves the dream. And at least when we were doing it, it was way cheaper than Oracle ExaData or SAP HANA, which were kind of the prevailing alternatives. So not like ultra expensive, but it's not, single store is not the kind of thing that when you're like building a weekend project

Starting point is 00:06:37 that will scale to millions, you would just kind of spin up single store and start using. And I think it's just expensive. It's packaged in a way that is expensive because the size of the market and the type of customer that's able to drive value almost requires the price to work that way. And you can actually see Nikita almost overcompensating for it now with neon and sort of attacking the market from a different angle.

Starting point is 00:07:00 This is Nikita Shamgunov, the actual original founder. Yes. Yeah, yeah, yeah. So now he's like doing the opposite. He's built the world's best free tier and is building like hyper inexpensive Postgres. But because the number of people that can use single store is smaller than the number of people that can use free Postgres. Yet the amount that they're willing to pay for that use case is higher. Single store is packaged in a way that just makes it harder to use. I know I'm not directly answering your question, but for me, that was one of those sort of utopian things. It's the technology analog to like if two people love each other, why can't they be together? You know, like single store in many ways is the best database technology and it's

Starting point is 00:07:40 the best in a number of ways, but it's just really hard to use. I think Snowflake is going through that right now as well. As someone who works in observability, I dearly miss the variant type that I used to use in Snowflake. It is without any question, at least in my experience, the best implementation of semi-structured data and sort of solves the problem of storing it very, very efficiently and querying it efficiently, almost as efficiently as if you specified the schema exactly,

Starting point is 00:08:09 but giving you total flexibility. So it's just a marvel of engineering, but it's packaged behind Snowflake, which means that the minimum query time is quite high. I have to have a Snowflake Enterprise license, right? I can't deploy it on a laptop. I can't deploy it in a customer's premises or whatever. So you're sort of constrained to the packaging

Starting point is 00:08:28 by which one can interface with Snowflake in the first place. And I think every observability product in some sort of platonic ideal would be built on top of Snowflake's variant implementation and have better performance. It would be cheaper. You know, the customer experience would be. better. But, you know, alas, it's just not economically feasible right now for that to be the case.

Starting point is 00:08:51 Do you buy what Honeycomb says about needing to build their own, like, super wide column store? I do, given that they can't use Snowflake, if the variant type were exposed in a way that allowed more people to use it. And by the way, I'm just sort of zeroing in on Snowflake in this case. Redshift has something called Super, which is fairly similar. Clickhouse is also working on something similar, and that might actually be the thing that lets more people use it. DukDB does not. No, DuckDB has a struct type, which is dynamically constructed, but it has all the downsides of traditional structured data types, right? So it's just not, like, for example, if you create, if you infer a bunch of rows with the

Starting point is 00:09:31 struct type, and then you present the N plus first row, and it doesn't have the same schema as the first N rows, then you need to change the schema for all the preceding rows, which is the main problem that the variant type solves. So yeah, I mean, it's possible that on the extreme end, there's something specific to what Honeycomb does that wouldn't directly map to the variant type. And I don't know enough about Honeycomb, and I think they're a fantastic company. So I don't mean to like pick on them or anything. But I would just imagine that if one were starting the next Honeycomb and the variant type were available in a way that they could consume, it might accelerate them dramatically or even be the terminal solution. I think being so early in Single Store also

Starting point is 00:10:11 taught you among all these engineering lessons, you also learned a lot of business lessons that you took with you into Impera. And Impera, you actually, that was your first, maybe, I don't know if it's your exact first experience, but your first AI company. Yeah, it was. Tell that story. There's a bunch of things I learned and a bunch of things I didn't learn. The idea behind Imperial originally was I saw when Alex Net came out that you were suddenly able to do things with data that you could never do before. And I think I was way too early into this observation. When I started Impera, The idea was, what if we make using unstructured data as easy as it is to use structured data? And maybe ML models are the glue that enables that.

Starting point is 00:10:48 And I think deep learning presented the opportunity to do that because you could just kind of throw data at the problem. Now, in practice, it turns out that pre-LMs, I think the models were not powerful enough. And more importantly, people didn't have the ability to capture enough data to make them work well enough for a lot of use cases. So it was tough. However, that was the original idea. And I think some of the things I learned were how to work with really great companies. We worked with a number of top financial services companies. We worked with public enterprises.

Starting point is 00:11:19 And there's a lot of nuance and sophistication that goes into making that successful. I'll tell you the things I didn't learn, though, which I learned the hard way. So one of them is when I was the VP of Engineering, I would go into sales meetings and the customer would be super excited to talk to me. and I was like, oh my God, I must just be the best salesperson ever. And, oh, yeah, after I finished the meeting, the salespeople would just be like, yeah, okay, you know what? It looks like the technical POC succeeded, and we're going to deal with some stuff. It might take some time, but they'll probably be a customer. And then I didn't do anything.

Starting point is 00:11:54 And a few weeks later or a few months later, there were a customer. Money shows up. Exactly. And like, oh, my God, I must have the Midas touch, right? Like, I go into the meeting. I did that guy. Yeah, I just, you know, I sort of speak a little bit. and they become a customer, I had no idea how hard it was to get people to take meetings with you

Starting point is 00:12:12 in the first place. And then once you actually sort of figure that out, the actual mechanics of closing customers at scale, dealing with revenue retention, all this other stuff, it's so freaking hard. I learned a lot about that. And I thought it was just an invaluable experience at Impera to sort of experience that myself firsthand. Did you have a main salesperson or sales advisor? Yes, a few different things. One, I lucked into, it turns out my wife, Alana, who I started dating right as I was starting in Pira. Her father, who are just super close now, is a seasoned, very, very seasoned and successful sales leader. So he's currently the president of Cloudflare. At the time, he was the president of

Starting point is 00:12:52 Polo Alto Networks, and he joined just right before the IPO and was managing a few billion dollars of revenue at the time. And so I would say I learned a lot from him. I also hired someone named Jason, who I worked with that MemSQL. And he's just an excellent. exceptional account executive. So he closed probably like 90 or 95% of our business over our years at Impura. And, you know, he's just exceptionally good. And I think one of the really fun lessons, we were trying to close a deal with Stitch Fix at Impura early on. It was right around my birthday. And so I was hanging out with my father-in-law and talking to him about it. And he was like, look, you're super smart. Impura sounds really exciting. Everything you're talking about, a mediocre account executive can

Starting point is 00:13:32 just do and do like much better than what you're saying, if you're dealing with these kinds of problems, like, you should just find someone who can do this a lot better than you can. And that was one of those, again, very humbling things that you sort of like he's telling you to delegate. I think in this case, you're a mediocre. I think in this case, he's actually saying, yeah, you're like, you're making a bunch of rookie errors in trying to close a contract that any mediocre or better salesperson will be able to do for you or in partnership with you. That was really interesting to learn. But the biggest thing that I learned, which was, I'd see very humbling is that at MemSQL, I worked with customers that were very technical, and I always got

Starting point is 00:14:11 along with the customers. I always found myself motivated when they complained about something to solve the problems. And then most importantly, when they complained about something, I could relate to it personally. At Empira, I took kind of the popular advice, which is that developers are a terrible market. So we sold to line of business. And there are a number of benefits to that. Like, we were able to sell six or seven-figure deals much more easily than we could at single store or now we can at brain trust. However, I learned firsthand that if you don't have a very deep, intuitive understanding of your customer, everything becomes harder. Like, you need to throw product managers at the problem. Your own ability to see around corners is much weaker. And, you know,

Starting point is 00:14:55 depending on who you are, it might actually be very difficult. And for me, it was so difficult that I think it made it challenging for us to, one, like, stay focused on a particular segment, and then two, out-compete or do better than people that maybe had inferior technology that we did, but really deeply understood what the customer needed. So that, I would say, like, if you just asked me, what was the main humbling lesson that I faced with it? It was that. Yeah. Okay. One more question on this market, because I think after Impera, there's a cohort of new Imperus coming out, Data Lab. I don't know if you saw that. I get a phone call about one every week.

Starting point is 00:15:30 Yeah. What do you learn about this, like, you know, unstructured data to structured data market? Like, everyone thinks now you can just throw it now and all of them at it. Obviously, it's going to be better than what you had. Yeah, I mean, I think the fundamental challenge is not a technology problem. It is the fact that if you're a business, let's say you're the CEO of a company that is in the insurance space and you have a number of inefficient processes that would benefit from unstructured to structured data. And you have the opportunity to create a new.

Starting point is 00:15:57 consumer user experience that totally circumvents the unstructured data and is a much better user experience for the end customer. Maybe it's an iPhone app that does the insurance underwriting survey by having a phone conversation with the user and filling out the form or something instead. And the second option potentially unlocked a totally new segment of users and maybe costs you like 10 times as much money. And the first segment is kind of this pain, right? It affects your cogs. It's annoying. There's a solution that works, which is throwing people at the problem, but it could be a lot better. Which one are you going to prioritize? And I think as a technologist, you know, maybe this is the third lesson. You tend to think that if a problem is

Starting point is 00:16:44 technically solvable and you can justify the ROI or whatever, then it's worth solving. And you also tend to not think about how things are outside of your control. But if you empathize with the CEO or a CTO who's sort of considering these two projects, I can tell you straight up, they're going to pick the second project. They're going to prioritize the future. They don't want the unstructured data to exist in the first place. And that is the hardest part. It is very, very hard to motivate a large organization to prioritize the problem. And so you're always going to be a second or third tier priority. And there's revenue in that because it does affect people's day to day lives. and there are some people who care enough to sort of try to solve it.

Starting point is 00:17:28 I would say this in very stark contrast to brain trust, where if you look at the logos on our website, almost all of the CEOs or CTOs or founders are daily active users of the product themselves, right? Like every company that has a software product is trying to incorporate AI in a meaningful way, and it's so meaningful that literally the exec team is using the product every day. Yeah. Just to not bury the lead,

Starting point is 00:17:52 the logos are Instacart. Stripe, Zapier, Airtable, Notion, Replit Brex, Versa, El Coda, and the browser company of New York. I don't want to jump the gun to brain trust. I don't think you've actually told the Impera Acquisition story publicly that I can tell. I have not. It's on the surface when it's like, I think I first met you maybe like slightly before the acquisition. And I was like, what the hell is Figma acquiring this kind of company? You're not a design tool.

Starting point is 00:18:17 Yeah. Any details you can share? Yeah, I would say like the super candid thing that we realized. And this is just for timing context, this, I probably personally realized this during the summer of 2022. And then the acquisition happened in December of 2022. And just for temporal context, chat GPT came out in November of 2022. So at Empira, I think our primary technical advantage was the fact that if you were extracting data from like PDF documents, which ended up being the flavor of unstructured data that we focused on, back then you had to assemble,

Starting point is 00:18:53 thousands of examples of a particular type of document to get a deep neural network to learn how to extract data from it accurately. And we had sort of figured out how to make that really small, like maybe two or three examples through a variety of like old school ML techniques and maybe some fancy deep learning stuff. But we had this like really cool technology that we were proud of. And it was actually primarily computer vision based because at that time, computer vision was a more mature field. And if you think of it, document as like one part visual signals and one part text signals, the visual signals were more readily available to extract information from. And what happened is text, starting with Bert,

Starting point is 00:19:36 and then accelerating through and including chat GPT, just totally cannibalized that. I remember I was in New York and I was playing with Bert on Hugging Face, which had made it like really easy at that point to actually do that. And they have like this little square, you know, in the in the right hand panel of a model. And I just started copy pasting documents into a question answering fine tune of Burt and seeing whether it could extract the invoice number and this other stuff. And I was like somewhat mind boggled by how often it would get it right. And that was really scary. Hang on. This is a vision-based Bert? Nope. So this was raw PDF parsing? Yep. No, no, no, PDF parsing.

Starting point is 00:20:18 taking the PDF, command A, copy paste, yeah. So there's no visual signal, right? And by the way, I know we don't want to talk about brain trust yet, but this is also when some of the seeds were formed because I had a lot of trouble convincing our team that this was real. And part of that, naturally, not to anyone's fault, is just like the pride that you have and what you've done so far. Like there's no way something that's not trained or whatever for our use case

Starting point is 00:20:44 is going to be as good, which is in many ways true. But part of it is just like I had no simple way of proving that it was going to be better. Like there's no tooling. I could just like run something and show people. I remember on the flight, before the flight, I downloaded the weights. And then on the flight, when I didn't have internet, I was like playing around with a bunch of documents. And anecdotally it was like, oh my God, this is amazing. And then that summer, we went deep into layout LM, Microsoft. I personally got super into Hugging Face.

Starting point is 00:21:14 and I think for like two or three months was the top non-employee contributor to Hugging Face, which was a lot of fun. We created like the document model type and like a bunch of stuff. And then we fine-tuned a bunch of stuff and contributed it as well. It was, I love that team. Clim is now an investor in brain trust. So it started forming that relationship. And I realized like, and again, this is all pre-Chad GPT. I realized like, oh my God, this stuff is clearly going to cannibalize all the stuff that we've built. And we quickly retooled Impero's product to use, um, use, layout LM as kind of the base model. And in almost all cases, we didn't have to use our fancy,

Starting point is 00:21:50 but somewhat more complex technology to extract stuff. And then I started playing with GPT3. And that just totally blew my mind. Again, layout LM is visual, right? So almost the same exact exercise. Like I took the PDF contents, past it into chat GPT, no visual structure, and it just destroyed layout LM. And I was like, oh my God, what is stable here? And I even remember going through the psychological justification of like, oh, but GPT3 is so expensive and blah, blah, blah, blah, blah. So nobody would call it in quantity, right? Yeah, exactly. But as I was doing that, because I had literally just gone through that, I was able to kind of zoom out and be like, you're an idiot.

Starting point is 00:22:31 There's a declining cost. Yeah. And so I realized, wow, okay, this stuff is going to change like very, very dramatically. And I looked at our commercial attraction. I looked at our exhaustion level. I looked at the team and I thought a lot about what would be best for the team. And I thought about all this stuff I'd been talking about. How much did I personally enjoy working on this problem?

Starting point is 00:22:52 Is this the problem that I want to raise more capital and work on with a high degree of integrity for the next five, 10, 15 years? And I realized the answer was no. And so we started pursuing, we had some inbound interest already given, you know, now chat GPT, it's like this stuff was starting to pick up. I guess chat GPT still hadn't come out, but like GPT3 was gaining some awareness. And there weren't that many AI teams or ML teams at the time. So we also started to get some inbound.

Starting point is 00:23:21 And I kind of realized like, okay, this is probably a better path. And so we talked to a bunch of companies and ran a process. Elad was insanely helpful. Was he an investor in Imper? He was an investor in Imper. Yeah, I met him at a pizza shop in 2016 or 2017. and then we went on one of those famous, very long walks

Starting point is 00:23:43 the next day. We started near Salesforce Tower and we ended in Noi Valley and Elad walks at like the speed of light. So we, I think it was like 30 or 40. It was crazy. And then he invested, yeah. And then I guess we'll talk more about him in a little bit.

Starting point is 00:23:57 But yeah, I mean, I was talking to him on the phone pretty much every day through that process. And Figma had a number of positive qualities to it. One is that there was a sense of stability because of the acquisition. Figma's acquisition. Another is the problem... By Adobe?

Starting point is 00:24:13 Yeah. Oh, oops. Yeah. The problem domain was not exactly the same as what we were solving, but was actually quite similar in that it is a combination of like textual, like language signal, but it's multimodal. So our team was pretty excited about that problem and had some experience. And then we met the whole team and we just thought these people are great.

Starting point is 00:24:34 And that's true. Like, they're great people. And so we felt really excited about working that. But is there a question of like, would you, because the company was shut down, like, effectively after you're basically kind of letting down your customers? Yeah, yeah. How does that? I mean, and obviously, you don't have to cover this so we can cut this out if it's too comfortable. But like, I think that's a question that people have when they go through acquisition offers.

Starting point is 00:24:57 Yeah, yeah. No, I mean, it was hard. It was really hard. I would say that there's two scenarios. There's one where it doesn't seem hard for a founder. and I think in those scenarios, it ends up being much harder for everyone else. And then in the other scenario, it is devastating for the founder. In that scenario, I think it works out to be less devastating for everyone else.

Starting point is 00:25:20 And I can tell you, it was extremely devastating. I was very, very sad for like three, four months. To be acquired, but also to be shutting down. Yeah, I mean, just winding a lot of things down, winding a lot of things down. I think our customers were very understanding, and we worked with them. To be honest, if we had more traction than we did, then it would have been harder. But there were a lot of document processing solutions. The space is very competitive.

Starting point is 00:25:51 And so I think I'm hoping, although I'm not 100% sure about this, but I'm hoping we didn't leave anyone totally out to pasture. And we did very, very generous refunds and worked quite closely with people. and wrote code to help them where we could. But it's not easy. It's not easy. It's one of those things where I think as an entrepreneur, you sometimes, you sort of resist making what is clearly the right decision because it feels very uncomfortable. And you sort of have to accept that it's your job to make the right decision. And I would say for me, this is one of N formative experiences where viscerally see the gap between what feels like the right decision and what is clearly the right

Starting point is 00:26:33 decision and you have to sort of embrace what is clearly the right decision and then map back and make, you know, fix the feelings along the way. And this was definitely one of those cases. Well, thank you for sharing that. That's something that not many people get to hear. And I'm sure a lot of people are going through that right now, bringing up Klem, like he mentions very publicly that he gets so many inbound, like acquisition offers. I mean, I don't know what you call it. Please buy me offers. Yeah, yeah, yeah. And I think people are kind of doing that math in this AI winter that we're somewhat going through. For sure.

Starting point is 00:27:05 Okay. Maybe we'll spend a little bit on Figma. Figma AI, you know, I watched closely the past two config configs. A lot going on. You're only there for eight months. So what would you say is like interesting going on in Figma, at least from the time that you're there and like whatever you see now as an outsider? Last year was an interesting time for Figma.

Starting point is 00:27:24 One, Figma was going through an acquisition. Two, Figma is trying to think about what is Figma beyond being a design tool. And three, Figma is, Sigma is kind of like Apple, a company that is really optimized around a periodic, like annual release cycle rather than something that's continuous. If you look at some of the really early AI adopters, like Notion, for example, notion is shipping stuff constantly. I mean, they actually have a conference coming up, but it's a new thing. We were consulted on that. Oh, great. Yeah. Because I've been like the World's Fair. Oh, great, great, great. Yeah. I'll be

Starting point is 00:27:58 there if anyone is there. Hit me up. But, you know, very, very iterative company. Like, Ivan and Simon and a couple others, like, hacked the first versions of Notion AI at a retreat. Yeah, exactly. Yeah, yeah, yeah, yep, yep. And so I think with those three pieces of context in mind, it's a little bit challenging for Figma, very high product bar. Probably of the software products that are out there right now, like one of, if not the best, just quality product. Like, it's not janky. You sort of rely on it to work type of products.

Starting point is 00:28:29 It's quite hard to introduce AI into that. And then the other thing I would just add to that is that visual AI is very new and it's very amorphous. Vectors are very difficult because they're a data inefficient representation. So the vector format in something like Figma is choose up like many, many, many, many, many more tokens than HTML and JSX. So it's a very difficult medium to just sort of throw into an LLM compared to writing problems or coding problems. And so it's not trivial for Figma to release like, oh, you know, you. you know, this company has blah, blah, AI and Acme AI and whatever. It's like, it's not super trivial for Figma to do that.

Starting point is 00:29:08 And I think for me personally, I really enjoyed, like, everyone that I worked with and everyone that I met, but I am a creature of shipping. Like, I wake up every morning nowadays to several complaints or questions, you know, from people. And I just like pounding through stuff and shipping stuff and making people happy and and iterating with them. And it was just literally challenging for me to do that in that environment. That's why it ended up not being

Starting point is 00:29:37 the best fit for me personally. But I think it's going to be interesting what they do. And when they do within the framework that they're designed to, as a company, to ship stuff, when they do sort of make that big leap, I think it could be very compelling. Yeah. I think there's a lot of value in being the chosen tool for an industry because then you just get a lot of community patients for figuring stuff out. The unique problem that Figma has is it caters to designers who hate AI right now.

Starting point is 00:30:03 Well, you mentioned AI. They're like, oh, I'm going to. Well, the thing is, in my limited experience and working with designers myself, I think designers do not want AI to design things for them, but there's a lot of things that aren't in the traditional designer toolkit that AI can solve. And I think the biggest one is generating code. So in my mind, there's this very interesting convergence happening between UI engineering and design. And I think Figma can play an incredibly important part in that transformation, which rather than being threatening is empowering to designers and probably helps designers contribute and collaborate with engineers more effectively, which is a little bit different than the focus around actually designing things in the editor.

Starting point is 00:30:49 Yeah, I think everyone's keen on that. Dev mode was, I think, the first segue into that. So we're going to go into Braintrust now, about 20-something minutes into the editor. podcast. So what was your idea for brain trust tell the full origin story? At Empira, while we were having an existential revelation, if you will, we realized that the debates we were having about what model and this and that were really hard to actually approve anything with. So we argued for like two or three months and then prototyped an eval system on top of Snowflake and some scripts and then shipped the new model like two weeks later. And it wasn't perfect.

Starting point is 00:31:28 There were a bunch of things that were less good than what we had before. But in aggregate, it was just way better. And that was a holy shit moment for me. I kind of realized there's this. Sometimes in engineering organizations or maybe organizations more generally, there are what feel like irrational bottlenecks. And it's like, why are we doing this? Why are we talking about this, whatever?

Starting point is 00:31:49 This was one of those obvious irrational bottlenecks. And you articulate the bottleneck again? Was it simply evils or? Yeah, the bottleneck is there's approach A and it has these tradeoffs. And approach B has these other tradeoffs. Which approach should we use? And if people don't, you know, very clearly align on one of the two approaches, then you end up going in circles. You know, this approach, hey, check out this example. It's better at this example. Or I was able to achieve it with this document, but it doesn't work with all of our customer cases, right? And so you end up going in circles. If you introduce e-vals into the mix, then you sort of change the discussion from being hypothetical or, you know, one example and another example into being something

Starting point is 00:32:33 that's extremely straightforward and almost scientific. Like, okay, great, let's get an initial estimate of how good layout LM is compared to our hand-built computer vision model. Oh, it looks like there are these 10 cases, invoices that we've never been able to process that like now we can suddenly process, but we regress ourselves on these three. Let's think about how to engineer a solution to actually improve these three and then measure it and make sure we do. And so it gives you a framework to have that. And I think aside from the fact that it literally lets you run the sort of scientific process of improving an AI application, organizationally, it gives you a clear set of tools, I think, to get people to agree. And I think in the absence of e-vals, what I saw

Starting point is 00:33:15 at Empira, and I see with almost all of our customers before they start using brain trust is this kind of like stalemate between people on which prompt you use or which model to use or which technique to use that once you sort of embrace engineering around e-vels, it just goes away. Yeah. We just did an episode with Hamil Hussein here. And the cynic in that statement would be like, this is not new. All ML engineering deploying models to production always evolves evils. You discovered it and you built your own solution, but everyone has in the industry has their own solution. Why the conviction that there's a company here? I think the fundamental thing is prior to Burt, I was as a traditional software engineer incapable of participating in the sort of what happens

Starting point is 00:34:05 behind the scenes in ML development. And so ignore the sort of CEO or founder title. Just imagine I'm a software engineer who is very empathetic about the product. All of my information about what's going to work and what's not going to work is communicated through the black box of interpretation by ML people. So I'm told that this thing is better than that thing, or it'll take us three months to improve this other thing. What is incredibly empowering about these, I would just maybe say the quality that Transformers bring to the table, and even Burt does this, but, you know, GPT3 and then four, like very emphatically do it, is that software engineers can now participate in this discussion. But all the tools that ML people have built over the years to help them navigate evals and data generally are very hard to use for software engineers.

Starting point is 00:34:56 I remember when I was first acclimating to this problem, I had to learn how to use hugging face and weights and biases. And my friend Yonda was at weights and biases at the time. And I was talking to him about this. And he was like, yeah, well, prior to weights and biases, all data scientists had was software engineering tools. and it felt really uncomfortable to them. And weights and biases kind of brought software engineering to them. And then I think the opposite happened. For software engineers, it's just really hard to use these tools.

Starting point is 00:35:25 And so I was having this really difficult time wrapping my head around what seemingly simple stuff is. And last summer, I was talking to a lot about this. And I think primarily just venting about it. And he was like, well, you're not the only software engineer who's starting to work on AI now. And that is when we realize that the real gap is that software engineers who have a particular way of thinking, a particular set of biases, a particular type of workflow that they run are going to be the ones who are doing AI engineering. And that the tools that were built for ML are fantastic in terms of the scientific inspiration, the metrics they track, the level of quality that they inspire.

Starting point is 00:36:09 But they're just not usable for software engineers. And that's really where the opportunity is. Yeah, I was talking with Sarah Guo at the same time, and that led to the rise to the AI engineer and everything that I've done. So, like, very much similar philosophy there. I think it's just interesting that software engineering and ML engineering should not be that different. Like, it's still engineering at the same. You're still making computers boop.

Starting point is 00:36:29 Like, I don't know. Why? Yeah. Well, I mean, there's a bunch of dualities to this. There's, like, the world of continuous mathematics and discrete mathematics. I think ML people think like continuous mathematicians and software. for engineers, like myself, who are obsessed with algebra. We like to think in terms of discrete math. I often talk to people about is, I feel like there are people for whom Numpai is incredibly intuitive,

Starting point is 00:36:54 and there are people for whom it is incredibly non-intuitive. For me, it is incredibly not intuitive. I was actually talking to Hamel the other day. He was talking about how there's an eval tool that he likes and I should check it out. And I was like, this thing, what, are you freaking kidding me? It's like terrible. He's like, yeah, but it has data frames. I was like, yes, exactly. You know, I, like, it's very, very different. You don't like data frames? Why? It's super hard for me to think about manipulating data frames and extracting a column or a row out of

Starting point is 00:37:22 data frames. And by the way, this is someone who's worked on databases for more than a decade. It's just very, very programmer-wise. It's very non-ergonomic for me to manipulate a data frame. And what's your preference then? Or loops. Ah. Yeah.

Starting point is 00:37:38 Okay. Well, maybe you should capture a statement of, like, what is brain trust today? because there's a little bit of the origin story. And you've had a journey over the past year, and obviously in our Series A, which will like Woohoo, congrats. Put a little intro for the Series A stuff. What is Brain Trust today? Brain Trust is an end-to-end developer platform for building AI products.

Starting point is 00:37:58 And I would say our core belief is that if you embrace evaluation as the sort of core workflow in AI engineering, meaning every time you make a change, you evaluate it and you use that to drive the next set of changes that you make, then you're able to build much, much better AI software. That's kind of our core thesis. And we started probably as no surprise by building, I would say, by far the world's best evaluation product, especially for software engineers and now for product managers and others. I think there's a lot of data scientists now who like brain trust, but I would say early on, a lot of software engineers. lot of like ML and data science people hated brain trust. It felt like really weird to them. Things have changed a little bit, but really like making e-vils something that software engineers,

Starting point is 00:38:49 product managers can immediately do. I think that's where we started. And now people have pulled us into doing more. So the first thing that people said is like, okay, great, I can do e-vals. How do I get the data to do e-vals? And so what we realized, you know, anyone who's spent some time in e-vils knows that one of the biggest pain points is et-ling data from your logs into a dataset format that you can use to do e-vils. And so what we realized is, okay, great, when you're doing e-vals, you have to instrument your code

Starting point is 00:39:16 to capture information about what's happening and then render the e-vel. What if we just capture that information while you're actually running your application? There's a few benefits to that. One, it's in the same familiar trace-and-span format that you use for e-vals. But the other thing is that you've almost like

Starting point is 00:39:31 accidentally solved the ETL problem. And so if you structure your code so that the same function abstraction that you define to evaluate on equals equals the abstraction that you actually use to run your application, then when you log your application itself, you actually log it in exactly the right format to do e-vals. And that turned out to be a killer feature in brain trust. You can just turn on logging, and now you have an instant flywheel of data that you can collect in data sets and use for e-vals. And what's cool is that customers, they might start using us for e-vals,

Starting point is 00:40:06 and then they just reuse all the work that they did and they flip a switch and boom, they have logs. Or they start using us for logging, and then they flip a switch and boom, they have data that they can use and the code already written to do e-vals. The other thing that we realized is that BrainTrust went from being kind of a dashboard

Starting point is 00:40:22 into being more of a debugger, and now it's turning into kind of an IDE. And by that, I mean, at first, you ran an eval and you'd look at our web UI and sort of see a chart or something that tells you how your e-val did. but then you wanted to interrogate that and say, okay, great, 8% better. Is that 8% better on everything, or is that 15% better and 7% worse? And where it's 7% worse, what are the cases that regressed?

Starting point is 00:40:48 How do I look at the individual cases? They might be worse on this metric, are they better on that metric? Let me find the cases that, you know, differ. Let me dig in detail. And that sort of turned us into a debugger. And then people said, okay, great, now I want to take action on that. I want to, like, save the prompt or change the model and then click a button and try it again. And that's kind of pulled us into building this very, very souped up playground. And we started

Starting point is 00:41:10 by calling it the playground. And it started as like, you know, my wish list of things that annoyed me about the open AI playground. First and foremost, it's durable. So every time you type something, it just immediately saves it. If you, you know, lose the browser window, whatever, it's all saved. You can share it and it's collaborative, kind of like Google Docs, Notion, Figma, et cetera. And so you can work on it with colleagues in real time, and that's a lot of fun. It lets you compare multiple prompts and models side by side with data. And now you can actually run e-vals in the playground. You can save the prompts that you create in the playground and deploy them into your code base. And so it's become very, very advanced. And I remember actually we had an

Starting point is 00:41:52 intro call with Brex last year, who's now a customer. And one of the engineers on the call said, he saw the playground, he said, I want this to be my IDE. it's not there yet. Here's a list of like 20 complaints, but I want this to be my ID. I remember when he told me that I had this very strong reaction, like what the F?

Starting point is 00:42:08 You know, we're building an eval observability thing. We're not building an IDE, but I think he turned out to be, you know, right. And that's a lot of what we've done over the past few months and what we're looking to in the future. How literally can you take it?

Starting point is 00:42:21 Can you fork Vs code and the new cursor? It's not, I mean, we're friends with the, we're friends with the cursor people. And now, part of the same portfolio. And sometimes people say, you know,

Starting point is 00:42:33 AI and engineering, are you Cursor, are you competitive? And what I think is like, you know, Cursor is taking AI and making traditional software engineering like insanely good with AI. And we are taking some of the best things about traditional software engineering

Starting point is 00:42:49 and bringing them to building AI software. And so we're almost like yin and yang in some ways with development. But forking VS code and doing crazy stuff is not off the table. It's all ideas that were, you know, cooking at this point. Interesting. I think that when people say analogies,

Starting point is 00:43:08 they should often take it to the extreme and see what that generates in terms of ideas. And when people say IDE, like, literally go there. Yeah. Because I think a lot of people treat their playground and they say figuratively IDE, they don't mean it. Yeah. And they should.

Starting point is 00:43:21 They should mean it. Yeah, yeah. Hello, this is Charlie again. Brain Trust is a UI-centric platform. So, in the spirit of show Don't Tell, we are cutting over to our YouTube here for the next 10 minutes so that you can see the demo of the new evils functionality that Ankur has just shipped together with the series A announcements. Scroll into our show notes for the YouTube link and give us a little like and subscribe for the Algo boost. Why don't you? See you on the other side.

Starting point is 00:43:49 So we've had this playground in the product for a while, and kind of the TLDR of it is that it lets you test prompts. They could be prompts that you save in Brain Trust or prompts that you just type on the fly against a bunch of different models or your own fine-tune models, and you can hook them into the datasets that you create in Brain Trust to do your e-vals. So I've just pulled this like press release data set.

Starting point is 00:44:13 And this is actually one of the first features we built. It's really easy to run stuff. And by the way, we're trying to see if we can build a prompt that summarizes the document well. But what's kind of happened over time is that people have pulled us to make this prompt playground more and more powerful. So I kind of like to think of brain trust as two ends of the spectrum.

Starting point is 00:44:36 If you're writing code, you can create evals with like infinite complexity. You know, like you don't even have to use large language models. You can use any models you want. You can write any scoring functions you want. And you can do that in like the most complicated code bases in the world. And then we have this playground that like dramatically simplifies things. it's so easy to use that non-technical people love to use it. Technical people enjoy using it as well.

Starting point is 00:45:00 And we're sort of converging these things over time. So one of the first things people asked about is if they could run e-vals in the playground. And we've supported running pre-built e-vals for a while. But we actually just added support for creating your own e-vals in the playground. And I'm going to show you some cool stuff. So we'll start by adding the summary quality thing. and if we look at the definition of it, it's just a prompt that maps to a few different choices and each one has a score. We can try it out and make sure that it works.

Starting point is 00:45:35 And then let's run it. So now you can run not just the model itself, but also the summary quality score and see that it's not great. Right. So we have some room to improve it. You're with me so far? Okay, so the next thing you can do is let's try to tweak this prompt. So let's say like in one to two lines. And let's run it again. One thing I notice about the, you're using an LLM as a judge here.

Starting point is 00:46:09 Yeah. That prompt about one to two lines should actually go into the LM as judge input. It is. It is a metric. Oh, okay. Did it, was that, oh, this was generated? No, no, no. This is how I pre-wrote this.

Starting point is 00:46:23 Yeah, okay, okay, all right, right. So you're matching up the prompt to the e-val that you already knew. Exactly. Exactly. So the idea is, like, it's useful to write the e-vail before you actually, like, tweak the prompt so that you can measure the impact of the tweak. So you can see that the impact is pretty clear, right? It goes from 54% to 100% now.

Starting point is 00:46:45 This is a little bit of a toy example, but you kind of get the point. Now, here's an interesting case. If you look at this one, there's something that's obviously wrong with this. What is wrong with this new summary? Has it an intro. Yeah, exactly. So let's actually add another evaluator. And this one is Python code.

Starting point is 00:47:02 It's not a prompt. And it's very simple. It's just checking if the word sentence is here. And this is a really unique thing. As far as I know, we're the only product that does this. But this Python code is running in a sandbox. It's totally dynamic. So, for example, if we change this,

Starting point is 00:47:20 it'll flip the Boolean. Obviously, we don't want to save that. We can also try running it here. And so it's really easy for you to, it's really easy for you to actually go and tweak stuff and play with it and create more interesting scores. So let's save this and then we'll run with this one as well. Awesome.

Starting point is 00:47:42 And then let's try again. So now let's say the last thing I'll show you, and this is a little bit of kind of an allude to what's next, is that the playground experience really powerful for doing this interactive editing, but we're already sort of running at the limits of how much information we can see about the scores themselves and how much information is fitting here. And we actually have a great user experience that until recently, you could only access by writing an e-val in your code. But now you can actually go in here and kick off like full

Starting point is 00:48:14 brain trust experiments from the playground. So we'll, in addition to this, we'll actually add one more. We'll add the embedding similarity score. And we'll say, you know, original summarizer, short summary, and no sentence wording. And then click create, and this is actually going to kick off full experiments. So if we go into one of these things, now we're in the full brain trust UI. And one of the really cool things is that you can actually now not just compare one experiment, but compare multiple experiments. And so you can actually look at all of these experiments together and understand like,

Starting point is 00:48:50 okay, good. I did this thing which said, like, please keep it to one to two sentences. Looks like it improved the summary quality and sentence checker, of course, but it looks like it actually also did better on the similarity score, which is kind of my main score to track how well the summary compares to, like, a reference summary. And you can go in here and then like very granularly look at the diff between, you know, two different versions of the summary and do kind of this whole experience. So this is something that we actually just shipped like a couple weeks ago, and it's already really powerful. But what I wanted to show you is kind of what, like, even the next version or next iteration of this is. And by the time the podcast airs, what I'm about

Starting point is 00:49:29 to show you will be live. So we're almost done shipping it. Excellent. But before I do that, any questions on this stuff? No, this is a really good demo. Okay, cool. So as soon as we showed people this kind of stuff, they said, well, you know, my, this is great. And I wish I could do everything with this experience, right? Like, imagine you could, like, create an agent or create, do rag, like more interesting stuff with this kind of interactivity. And so we were like, huh, it looks like we built support for you to do, you know, to run code. And it looks like we know how to actually run your prompts. I wonder if we can do something more interesting. So we just added support for you to actually define your own tools. Yeah, so here we're just writing like really simple

Starting point is 00:50:09 type script code that wraps the browser base API. And then similarly, really simple type script code that wraps the XA API, and then we give it a type definition. This will get used as the schema for a tool call, and then we give it a little bit of metadata, so BrainTrust knows where to store it and what to name it and stuff. And then you just run a really simple command, NPX, Brain Trust, Push, and then you give it these files, and it will bundle up all the dependencies and push it into BrainTrust, and now you can actually access these things from BrainTrust. So if we go to the search tool, we could say, you know, what is the tallest mountain?

Starting point is 00:50:47 And it'll actually run search via Exa. What I'm very excited to show you is that now you can actually do this stuff in the playground, too. So if we go to the playground, let's try playing with this. So let's create a dataset, one row in here and we'll say, what is the premier conference for AI engineers? Ooh, I wonder what you'll find. Let's plug this in, and let's start without using any tools. I'm not sure I agree with this statement. That was correct, as of his training data.

Starting point is 00:51:22 Okay, so let's add this XA tool in, and let's try running it again. Watch closely over here. So you see it's actually running. Yeah. There we go. Not exactly accurate, but good enough. Yeah, yeah. So I think this is really cool because it's sort of for probably 80 or 90% of the use cases that we see with people doing this like very, very simple, I create a prompt. It calls some tools. I can like very ergonomically write the tools, plug into popular services, et cetera, and then just call them kind of like assistance API style stuff. It covers so many use cases and it's honestly so hard to do. Like if you try to do this by yourself. You have to write a for loop. You have to host it somewhere.

Starting point is 00:52:16 You know, with this thing, you can actually just access it through our rest API. So every prompt gets a rest API endpoint that you can invoke. And so we're very, very excited about this. And I think it kind of represents the future of AI engineering, one where you can spend a lot of time writing English and sort of crafting the use case itself. You can reuse tools across different use cases. And then most importantly, the development process is very nicely and kind of tightly integrated with evaluation. And so you have the ability to score, create your own scores and sort of do all of this very interactively as you actually build stuff. Yeah. I have more questions to ask on that, but we'll keep the demo and the video

Starting point is 00:52:59 tight. I thought about a business in this area. I'll tell you why I didn't do it. And I think that might be generative for insights onto this industry that you would have that I don't. When I interviewed Franthropic, they gave me Cloud in Sheets. And with Cloud in Sheets, I was able to build my own evils. Because I can use sheet formulas. I can use LM. I can use Cloud to evaluate Claude, whatever. And I was like, okay, there will be AI spreadsheets.

Starting point is 00:53:23 They'll all be plugins. Spreadsheets is like the universal business tool of whatever. You can API spreadsheets. I'm sure Airtable, you know, Howie's an investor in you now, but I'm sure Airtable has some kind of... They're a customer too, actually. Yeah. The second thing was that Human Loop. also existed. Human loop being one of the very, very first movers in this field where,

Starting point is 00:53:42 same thing. Durable playground, you can share them, you can save the prompts and call them as APIs. You can also do e-vails and all the other stuff. So there's a lot of tooling, and I think you saw something or you just had the self-belief where I didn't, or you saw something that was missing still, even in that space from DIY, no-code, Google Sheets, to custom tool. They were first movers. Yeah, I mean, I think evels, it's not hard to do an initial eval script. And not to be too cheeky about it, I would say almost all of the products in the space are spreadsheet plus plus, right? Like, you know, here's a script, generates an eval. I look at the cells, you know, whatever, side by side and compare it. And your first demo to me, the main thing I was impressed by

Starting point is 00:54:32 was that you can run all these things in parallel so quickly. Yeah, exactly. So I had built spreadsheet Plus Plus a few times. And there were a couple nuggets that I realized early on. One is that it's very important to have a history of the e-vals that you've run and make it easy to share them and publish in Slack Channel stuff like that because that becomes a reference point for you to have discussions among a team. So at Empira, when we were first ironing out our layout LM usage,

Starting point is 00:55:02 would publish like screenshots of the e-bells in a Slack channel and go back to those screenshots and then like riff on ideas from a week ago that maybe we abandoned. And having the history is just really important for collaboration. And then the other thing is that writing four loops is quite hard. Like writing the right for loop that parallelizes things is durable. Someone doesn't screw up the next time they write it, you know, all this other stuff. It sounds really simple, but it's actually not. And we sort of pioneered this syntax where instead of writing a for loop to do an e-vel. You just create something called e-vall, and you give it an argument which has some data. Then you give it a task function, which is some function that takes some

Starting point is 00:55:44 input and return some output. Presumably, it calls an LLM. Nowadays, it might be an agent. You know, it does whatever you want. And then one or more scoring functions. And then Brain Trust basically takes that specification of an e-vel and then runs it as efficiently and seamlessly as possible. there's a number of benefits to that. The first is that we can make things really fast, and I think speed is a superpower. Early on, we did stuff like cache things really well, parallelized things. Async Python is really hard to use, so we made it easy to use. We made exactly the same interface in TypeScript and Python, so teams that were sort of navigating the two realities could easily move back and forth between them. And now what's become possible,

Starting point is 00:56:27 because this data structure is totally declarative, and Eval is actually, actually not just a, it's not just a code construct, but it's actually a piece of data. So when you run an e-val in brain trust now, you can actually optionally bundle the e-vel and then send it. And as you saw in the demo, you can run code functions and stuff. Well, you can actually do that with the evils that you write in your code. So all the scoring functions become functions in brain trust. The task function becomes something you can actually interactively play with and debug in the UI. And so turning it into this data structure actually makes it a much more powerful thing. And by way you can run an eval in your codebase, save it to BrainTrust, and then hit it with an API,

Starting point is 00:57:06 and just try it a new model, for example. You know, that's like more recent stuff nowadays. But early on, just having the very simple declarative data structure that was just much easier to write than a for loop that you sort of had to cobble together yourself and making it really fast and then having a UI that just very quickly showed you the number of improvements or regressions and filter them, that was kind of like the key thing that worked. I gave a lot of credit to Brian from Zapier, who was our first user and super harsh. I mean, he told me straight up. I know this is a problem. You seem smart, but I'm not convinced of the solution. And almost like, you know, the Mr. Miyagi or something, right? He, like, I'd produce a demo and then he'd send me back

Starting point is 00:57:50 and be like, yeah, it's not good enough for me to show the team. And so we sort of iterated several times until he was pretty excited by the developer experience. That core developer experience was just more helpful enough and comforting enough for people that were new to e-vals that they were willing to try it out. And then we were just very aggressive about iterating with them. So people said, you know, I ran this e-vail. I'd like to be able to like rerun the prompt. So we made that possible. Or I ran this eval. It's really hard for me to group by model and actually see which model did better and why. I ran these evals. One thing is slower than the other. How do I correlate that with token counts? That's actually really hard to do.

Starting point is 00:58:31 It's annoying because you're often doing LLM as a judge and generating tokens by doing that too. And so you need to instrument the code to distinguish the tokens that are used for scoring from the tokens that are used for actually computing the thing. Now we're way out of the realm of what you can do with clod in sheets, right? In our case at least, once we got some very sophisticated early adopters of AI using the product, it was a no-brainer to just keep making the product better and better and better and better. could just see that from like the first week that people were using the product, that there was just a ton of depth here. There is a ton of depth. Sometimes it's not even like the ideas are not worth anything. It's almost just like the persistence and execution that I think you do very well. So whatever, kudos. Thanks. We're about to like

Starting point is 00:59:18 zoom out a little bit to industry observations, but I want to spend time on brain trust. Yeah. Any other area of brain trust or part of the brain trust story that you think is that people should appreciate or which is personally insightful to you that you want to discuss it? There's probably two things I would point to. The first thing, actually there's one silly thing and then two, maybe less silly thing. So when we started, there were a bunch of things that people thought were stupid about brain trust. One of them was this hybrid on-prem model that we have. And it's funny because Databricks has a really famous hybrid on-prem model.

Starting point is 00:59:50 And the CEO and others have a mixed perspective on it. And sometimes you talk to Databricks people and they're like, this is the worst thing ever. But I think Databricks is doing pretty well. And it's hard to know how successful they would have been without doing that. But because of that, and Snowflake was doing really well at the time. Everyone thought this hybrid thing was stupid. But I was talking to customers. And Zapier was our first user.

Starting point is 01:00:15 And then Coda and Airtable quickly followed. And there was just no chance they would be able to use the product unless the data stayed in their cloud. I mean, maybe they could a year from when we started or whatever, but I wanted to work with them now. And so it never felt like a question to me. I remember there's so many VCs that I talked to. Must be sad. Must be cloud. Yeah, exactly.

Starting point is 01:00:36 Like, oh, my God, look, here's a quote from the Databricks CEO. Or here's a quote from this person. You're just clearly wrong. I was like, okay, great. See ya. Luckily, you know, Elad, Alana, Sam. And now Martine were just like, that's stupid. You know, don't worry about that.

Starting point is 01:00:50 But Martin is king of not being religious in cloud stuff. Yeah, yeah, yeah. But yeah, I mean, I think that was just funny because it was something that just felt super obvious to me, and everyone thought I was pretty stupid about it. And maybe I am, but I think it's helped us quite a bit. We had this issue at Temporo, and the solution was like Cloud VPC peering. Yeah, yeah, yeah, yeah.

Starting point is 01:01:13 And what I'm hearing from you is you went further than that. You're bundling up your package software and you're shipping it over and you're charging by seat. You asked about single store and lessons from single store. I was going to go there. I have been through the ringer with on-prem software, and I've learned a lot of lessons. So we know how to do it really well. I think the tricks with brain trust are one that the cloud has changed a lot even since

Starting point is 01:01:37 Databricks came out. And there's a number of things that are easy that used to be very hard. I think serverless is probably one of the most important unlocks for us because it sort of allows us to bound failure into something that doesn't require restarting servers or restarting Linux processes. So even though it has a number of problems, it's made it much easier for us to have this model. And then the other thing is we literally engineered brain trust from day zero to have this model. If you treat it as an opportunity and then engineer a very, very good solution around it,

Starting point is 01:02:09 just like DX or something, right? You can build a really good system. You can test it well, et cetera. So we viewed it as an opportunity rather than a challenge. The second thing is the space was really crowded. I mean, you and I even talked about this. And it doesn't feel very crowded now. I mean, sometimes people literally ask me if we have any competitors. That's great.

Starting point is 01:02:28 We'll go into that industry stuff later. Sounds good. I think what I realized then, my wife, Atlanta, actually told me this when we were working on Impera. She said, you know, based on your personality, I want you to work on something next that is, you know, super competitive. And I kind of realized there's only one of two types of markets and startups. Either it's not crowded or it is crowded, right? And each of those things has a different set of tradeoffs. And I think there are founders that thrive in either environment.

Starting point is 01:02:58 I am someone who enjoys competition. I find it very motivating. And so, you know, just like personally, it's better for me to work in a crowded market than it is to work in an empty market. Again, people are like blah, blah, blah, blah, blah. And I was like, oh, you know what? Actually, this is what I want to be doing. There were a few strategic bets that we made early on at Brain Trust

Starting point is 01:03:18 that I think helped us a lot. So one of them I mentioned is the hybrid on-prem thing. Another thing is we were the original folks who really prioritized TypeScript. Now, I would say every customer and probably north of 75% of the users that are running Evals and Brain Trust are using the TypeScript SDK,

Starting point is 01:03:39 it's an overwhelming majority. And again, at the time, and still, like, AI is sort of, at least nominally dominated by Python, but product building is dominated by TypeScript. And the real opportunity to our discussion earlier is empowering product builders to use AI. And so even if it's not the majority of typists,

Starting point is 01:04:00 using AI stuff writing TypeScript, it worked out to be this magical niche for us that's led to a lot of, I would say, strong product market fit among product builders. And then the third thing that we did is, look, we knew that this LLM ops, or whatever you want to call it, space is going to be more than just evils. But again, early on, people were like evels.

Starting point is 01:04:23 I mean, there's one VC, I won't call them out. You know who you are because assume you're going to be listening to this. But there's one VC who like insisted on meeting us, right? And like, I've known them for a long time, blah, blah, blah. And they're like, you know, actually, after thinking about it, we don't want to invest in brain trust because it reminds me of CICD. And that's a crappy market. And if you were going after logging and observability, that was your main thing, then that's a great market, you know? But of all the things in LMOps or whatever, if you draw a parallel to the previous world of software development,

Starting point is 01:04:56 this is like CICD and CICD is not a great market. And I was like, okay, you know, it's sort of like the hybrid on-prem thing, like go talk to a customer. And you'll realize that this is the, I mean, I was at Figma when we were like we used Datadog and we built our own prompt playground.

Starting point is 01:05:12 It's not super hard to write some code that, you know, Vurcell has like a template that you can use to create your own prompt playground now. But evils were just really hard. And so I knew that the pain around evals was just significantly greater than anything else. And so if we built an insanely good solution around it, the other things would follow.

Starting point is 01:05:30 And lo and behold, of course, that VC came back a few months later and said, oh my God, you guys are doing observability now. Now we're interested. And that was another kind of interesting thing. We're going to tie this off a little bit with some customer motivations and quotes. We already talked about the logos that you have,

Starting point is 01:05:45 which are all very, very impressive. I've seen what Stripe can do. I don't know if it's quotable, but you said you had something from Versel from Malta. Yeah, yeah. Actually, I'll let you read it. It's on our website. I don't want to butcher his language. So Malta says, we deeply appreciate the collaboration.

Starting point is 01:06:00 I've never seen a workflow transformation like the one that incorporates eViles into mainstream engineering processes before it's astonishing. Yeah. I mean, I think that is a perfect encapsulation of our goal. Yeah. And for those who don't know, Malta used to work on Google. search. Yeah, he's super legit. Kind of scary, as are all of the Vursell people. My funniest quote of Amalta is he published this like very, very long guide to SEO, like how SEO works. And people are like, oh, you know, this is not to be trusted. This is not how it works. And literally the guy worked

Starting point is 01:06:34 on the search algorithm. Yeah. So people want to tell you. People don't believe when you are representing a company. Like I think everyone has an angle, right? Like in Silicon Valley, it's like this whole thing where, like, if you don't have skin in the game, like, you're not really in the know, because why would you? Like, you're not an insider. But then once you have skin in the game, you do have a perspective. You have a point of view. And maybe that segues into, like, a little bit of industry talk.

Starting point is 01:06:59 Sounds good. So unless you want to bring up your World's Fair, we can also brief on just, like, what you saw at the World's Fair, you're a speaker. Yeah. And you were one of the few who brought a customer, which is something I think I want to encourage more. Yeah. That, like, you know, I think the DVT conference also does.

Starting point is 01:07:14 Like, their conferences exclusively vendors and customers and then like sharing lessons learned and stuff like that. Maybe talk a little bit about, plug your talk a little bit and people can go watch it. Yeah. First, Omo is an insanely good engineer. He actually worked with Guillermo on Mutools back in the day. This was mafia. Yeah. And I remember when I first met him, speaking of TypeScript, we only had a Python SDK. And he was like, where's the TypeScript SDK? And I was like, you know, here's some curl commands you can use. This was on a Friday, and he was like, okay, and Zapiro is not a customer yet, but they were interested in brain trust. And so I built the TypeScript SDK over the weekend, and then he was the first user of it.

Starting point is 01:07:54 And what better than to have one of the core authors of Moot Tools, like shedding your TypeScript SDK from the beginning, I would give him a lot of credit for how some of the ergonomics of our product have worked out. By the way, another benefit of structuring the talk this way is he actually worked out of our office earlier that week and built the talk. and found a ton of bugs in the product or, like, usability things. And it was so much fun. He sat next to me at the office. He'd find something or complain about something. Then I'd point him to the engineer who works on it, and then he'd go and chat with them.

Starting point is 01:08:25 And we recently had our first off-site, and we were talking about some of, like, people's favorite moments in the company. And multiple engineers were like, that was one of the best weeks to get to interact with a customer that way. You know, a lot of people have embedded engineer. This is embedded customer. Yeah. Yeah, yeah. I mean, we might do more. Yeah, we might do more of it.

Starting point is 01:08:44 Sometimes, just like launches, right? Like sometimes these things are a forcing function for you to improve. Why did you discover it preparing for the talk and not as a user? Because when he was preparing for the talk, he was trying to tell a narrative about how they use brain trust. And when you tell a narrative, you tend to look over a longer period of time. And at that point, although I would say we've improved a lot since, that part of our experience was very, very, very, rough. So for example, now if you are working in our experiments page, which shows you all of your experiments over time, you can dynamically filter things, you can group things, you can create

Starting point is 01:09:23 like a scatter plot, actually, which Hemel sort of helping me work out when we're working on a blog post together. But there's all this analysis you can do. At that time, it was just a line. And so he just ran into all these problems and complained. But the conference was incredible. It is the conference that gets people who are working in this field together. And I won't say which one, but there was a POC, for example, that we had been working on for a while. And it was kind of stuck. And I ran into the guy at the conference and we chatted. And then like a few weeks later, things worked out. And so there's almost nothing better I could ask for or say in a conference than it leading to commercial activity and success for a company like us. And, you know, it's just,

Starting point is 01:10:06 it's true. Yeah, it's marketing. It's sales. It's hiring. And then it's also, honestly, for me as a curator, just I'm trying to get together the state of the art and make a statement on here's where the industry is at this point in time. And 10 years from now, we'll be able to look back at all the videos and go like, you know, how cute, how young, how naive we were. Yeah, yeah, yeah. One thing I fear is getting it wrong. And there's many, many ways for it to get it wrong. But, you know, I think people give me feedback and keep you honest. Yeah, yeah.

Starting point is 01:10:34 I mean, the whole team is super receptive to feedback. But I think honestly, just having the opportunity in space for people to organically connect with each other, that's the most important thing. Yeah, yeah. And you ask for dinners and stuff. We'll do that next year. Excellent. Actually, we're doing a whole syndicated track thing. So, you know, brain trust con or whatever may happen. One thing I think about when organized, like literally when I organize a thing like that or I do my content or whatever, I have to have a map of the world. And something I came to your office to do was this sort of, I call this like, the three-ringed circus or the impossible triangle.

Starting point is 01:11:09 And I think what ties into what your, that VC that rejected you did not see, which is that eventually everyone starts somewhere and they grow into each other's circles. So this is ostensibly, it started off as the sort of AI-LM-OPS market. And then I think we agreed to call it like the AI Infra map, which is ops, frameworks, and databases are sort of a general thing and then gateways and serving. and brain trust has bets and all these things, but started with e-vals. It's kind of like an e-vals framework. And then obviously you extend it into observability, of course, and now is doing more and more things.

Starting point is 01:11:47 How do you see the market? Does that jive with your view of the world? Yeah, for sure. I mean, I think the market is very dynamic. And it's interesting because almost every company cares. It is an existential question and how software is built is totally changing. And honestly, I mean, the last time I saw this, happen. It felt less intense, but it was cloud. Like I still remember I was talking to, I think it was

Starting point is 01:12:12 2012 or something. I was hanging out with one of our engineers at MemSQL or single store, a MSQL at the time. And I was like, is cloud really going to be a thing? Like, it seems like for some use cases, it's economic, but for, I mean, the oil company, whatever, that's running all these analytics and they have this hardware and it's very predictable, is cloud actually going to be, you know, worth it, like security. Yeah, I mean, he was right, but he was like, yeah, I mean, if you assume that the benefits of elasticity and whatnot are actually there, then the cost is going to go down, the security is going to go up, all these things will get solved. But it was for my naive brain at that point, it was just so hard to see. And I think the same thing, to a more intense degree

Starting point is 01:12:53 is happening in AI. And I would sort of, when I talk to AI skeptics, I often rewind myself into the mental state I was in when I was somewhat of a cloud skeptic early on. But it's a very dynamic marketplace. And I think there's benefit to separating these things and having kind of best of breed tools do different things for you. And there's also benefits to some level of vertical integration across the stack. And as a product-driven company that's navigating this, I think we are constantly thinking about how do we make bets that allow us to provide more value to customers and solve more use cases while doing so durably. Germo from for sell, who is also an investor and, you know, very sprightly character to interact with.

Starting point is 01:13:40 What do you say sprightly? I don't know. But anyway, he gave me this really good advice, which was as a startup, you only get to make a few technology bets, and you should be really careful about those bets. Actually, at the time, I was asking him for advice about how to make arbitrary code execution work, because obviously they've solved that problem. And in JavaScript, arbitrary code execution is such itself, such a dynamic thing. Like there's so many different ways of, you know, there's workers and Dino and Node and, you know, firecrackers, all this stuff, right? And ultimately, we built it in a way that just supports Node,

Starting point is 01:14:14 which I think Vrcel has sort of embraced as well. But where I'm kind of trying to go with this is in AI, there are many things that are changing and there are many things that you got to predict whether or not they're going to be durable. And if you predict that something's durable, then you can build depth around it. But if you make the wrong predictions about durability and you build depth, then you're very, very vulnerable because a customer's priorities might change tomorrow

Starting point is 01:14:39 and you've built depth around something that is no longer relevant. And I think what's happening with frameworks right now is a really, really good example of that playing out. We are not in the app framework universe. So we have the luxury of sort of observing it, pun intended, you know, from the side. You kind of, you are a little bit. I captured when you said if you structure your code with the same function extraction, triple equals to run eVals. Sure, yeah. It's a little bit.

Starting point is 01:15:06 But I would argue that that is a, it's kind of like a clever insight. And we, in the kindest way, almost trick you into writing code that doesn't require ETL. But it's not, it's good for you. Yeah, exactly. But you don't have to use, it's kind of like a lesson that is invariant to brain trust itself. Sure. I buy that. Yeah.

Starting point is 01:15:25 There was an obvious part of this market for you to start in. Which is maybe curious worth, we're spending like two seconds on it. You could have been the VectorDB CEO. Right? Yeah, I got a lot of calls about that. You're a database guy. Yeah. Why no vector database?

Starting point is 01:15:39 Oh, man. Like, I was drooling over that problem because it just checks every, like, it's, you know, performance and potentially server. It's just everything I love to type. The problem is that I had a fantastic opportunity to see these things play out at Figma. The problem is that the challenge in deploying VFECMA, Vector Search has very little to do with Vector Search itself and much more to do with the data adjacent to Vector Search.

Starting point is 01:16:07 So, for example, if you are at Figma, the vector search is not actually the hard problem. It is the permissions and who has access to what design files or design system components and blah, blah, blah, blah, blah, blah, blah, blah, all of this stuff that has been beautifully engineered into a variety of systems that serve the product. You think about something like vector search, and you really have two options. One is there's all this complexity around my application, and then there's this new little idea of technology, sort of a pattern or paradigm of technology, which is vector search. Should I kind of like cram vector search into this existing ecosystem?

Starting point is 01:16:46 And then the other is, okay, vector search is this new, exciting thing. Do I kind of rebuild around this new paradigm? And it's just super clear that it's the former. In almost all cases, vector search is not a storage or performance bottleneck. In almost all cases, the vector search involves exactly one query, which is nearest neighbors. The hard part... H&SW and... Yeah, I mean, that's the implementation of it.

Starting point is 01:17:13 But the hard part is how do I join that with the other data? How do I implement R-Back and all this other stuff? And there's a lot of technology that does that, right? So in my observation, database companies tend to succeed when the storage paradigm is closely tied to the execution paradigm. And both of those things need to be rewired to work. I think remember the databases are not just storage, but they're also compilers. And it's the fact that you need to build a compiler that understands how to utilize a particular storage mechanism that makes the N plus first database, something that is unique. If you think about Snowflake, it is separating storage from compute

Starting point is 01:17:58 and the entire sort of compiler pipeline around query execution hides the fact that separating storage from compute is incredibly inefficient, but gives you this really fast query experience. With Databricks, it's the arbitrary code is a first-class citizen, which is a very powerful idea, and it's not possible in other database technologies. But, okay, great, arbitrary code is a first-class citizen, in my database system, how do I make that work incredibly well? And again, that's a problem which sort of spans storage and compute. At least today, the query pattern for vector search is so constrained that it just doesn't have that property.

Starting point is 01:18:37 Yep. I think I fully understand and mostly agree. You want to hear the opposite view. I think yours is now the consensus view. And I want to hear the other. I mean, there's super smart people working on this, right? Yeah, we'll be having a chroma and I think Q-Ju-Jolent on, maybe. VESPA, actually.

Starting point is 01:18:53 Yeah. One other part of the sort of triangle that I drew that you disagree with, and I thought that was very insightful, was fine-tuning. Yeah. So I had all these overlapping circles, and I think you agree with most of them. And I was like, at the center of it all, because you need ops, you need, like, logging from ops, and then you need like a gateway and then you need a database with a framework or whatever, was fine-tuning.

Starting point is 01:19:14 And you were like, fine-tuning is not a thing. Yeah. Oh, this is not a business. Yeah, yeah. So there's two things with fine-tuning. One is like the technical merits or whether fine tuning is a relevant component of a lot of workloads. And I think that's actually quite debatable. The thing I would say is not debatable is whether or not fine tuning is a business outcome or not.

Starting point is 01:19:34 So let's think about the other components of your triangle. Ops slash observability, that is a business thing. Like do I know how much money my app costs? Am I enforcing, or sorry, do I know if it's up or down? Do I know if someone complains, can I like retrieve the information about that? Frameworks, e-vals, databases, do I know if I change my code? Did it break anything? Gateway, can I access this other model?

Starting point is 01:20:01 Can I enforce some cost parameter on it, whatever? Fine-tuning is a very compelling method that achieves an outcome. The outcome is not fine-tuning. It is can I automatically optimize my use case to perform better if I throw data at the problem. And fine-tuning is one of multiple ways to achieve that. I think the DSPY style prompt optimization is another one. Turpentine, you know, just like tweaking prompts with wording and hand-crafting few-shot examples and running evils. That's another, you know. Is it turpentine or framework? No, no, no, no, sorry. That's just a metaphor. Yeah, yeah, yeah. But maybe it should be a framework.

Starting point is 01:20:41 Right now it's a podcast network by Eric Torrenberg. Yes, yes. That's actually why I thought of that word. You know, old school elbow grease is what I'm saying of like, you know, hand tuning prompts. That's another way of achieving that business goal. And there's actually a lot of cases where hand tuning a prompt performs better than fine tuning because you don't accidentally destroy the generality that is built into the sort of world-class models. So in some ways, it's safer, right? But really the goal is automatic optimization. And I think automatic optimization is a really valid goal. but I don't think fine-tuning is the only way to achieve it. And so in my mind, for it to be a business, you need to align with the problem, not the technology. And I think that automatic optimization is a really great business problem to solve.

Starting point is 01:21:26 And I think if you're too fixated on fine-tuning as the solution to that problem, then you're very vulnerable to technological shifts. Like, you know, there's a lot of cases now, especially with large context models, where in-context learning just beats fine-tuning. And the argument is sometimes, well, yes, you can get as good of performance as in context learning, but it's faster or cheaper or whatever, that's a much weaker argument than, oh my God, I can really improve the quality of this use case with fine-tuning. You know, it's somewhat tumultuous. Like, a new model might come out. It might be

Starting point is 01:21:57 good enough that you don't need to use fine-tuning or might not have fine-tuning, or it might be good enough that you don't need to use fine-tuning as the mechanism to achieve automatic optimization with the model. But automatic optimization is a thing. And so that's kind of the semantic thing. which I would say is maybe, at least to me, it feels like more of an absolute. Like, I just don't think fine-tuning is a business outcome. I think it is one of several means to an end, and the end is valuable?

Starting point is 01:22:25 Now, is fine-tuning a technically valid way of doing automatic optimization? I think it's very context-dependent. I will say in my own experience with customers, as of the recording date today, which is September or something, yeah, very few of our customers are currently fine-tuning models,

Starting point is 01:22:42 and I think a very, very small fraction of them are running fine-tuned models in production. More of them were running fine-tuned models in production six months ago than they are right now. And that may change. I think what Open AI is doing with basically making it free and how powerful Lama 3AB is and some other stuff, that may change. Maybe by the time this airs, you know, more of our customers are fine-tuning stuff. But it seems very, it's changing all the time. but all of them want to do automatic optimization. Yeah, it's worth asking a follow-out question on that.

Starting point is 01:23:16 Who's doing that today well that you would call out? Automatic optimization? No one. Wow. DSPI is a step in that direction. Omar has decided to join Databricks and be an academic. And I have actually asked for who's making the DSPI startup. Yeah.

Starting point is 01:23:33 There's a few. Somebody should. There's a few. Yeah. You know, my personal perspective on this, which almost everyone at least hardcore engineers disagree with me about, but I'm okay with that, is if you look at something like DSPY, I think there's two elements to it. One is automatic optimization, and the other is achieving automatic optimization by writing code, in particular in DSPI's case code that looks a lot like

Starting point is 01:23:58 PyTorch code. And I totally recognize that if you were writing only TensorFlow before, then you started writing PiTorch, it's a huge improvement. And, oh, my God, it feels like so much nicer to write code. If you are a TypeScript engineer and you're writing NextJS, writing PyTorch sucks. Why would I ever want to write PyTorch? And so I actually think the most empowering thing that I've seen is engineers and non-engineers alike writing really simple code. And whether it's like simple typescript code that's auto-completed with cursor or it's English, I think that the direction of programming itself is moving towards simplicity. And I haven't seen something yet that really moves programming towards simplicity.

Starting point is 01:24:48 And I am, you know, maybe I'm a romantic at heart. But I think there is a way of doing automatic optimization that still allows us to write, you know, simpler code. Yeah. I think that people working on it. And I think it's a valuable thing to explore. I'll keep a lookout for it and try to report on it through latent space. And we'll integrate with everything.

Starting point is 01:25:09 So please let me know if you're working on this. We'd love to collaborate with you. For ops people in particular, you have a view of the world that a lot of people don't get to see, which is to see workloads and report aggregates, which is insightful to other people. Yeah. Obviously, you don't have them in front of you, but I just want to get rough estimates. You already said one, which is kind of juicy, which is open source models are a very, very small percentage. Do you have a sense, open AI versus anthropic versus co-heaval?

Starting point is 01:25:34 here, market share, at least through the segment that you see? So pre-Claude 3, it was close to 100% open AI, post-Claude 3. And I actually think haiku is slept down a little bit because before 4-0 Mini came out, haiku was a very interesting reprieve for people to have very, very... Is it a sonnet or haiku? Haiku. Sonnet, I mean, everyone knows sonnet, right? Oh my God.

Starting point is 01:25:57 But when Cloud 3 came out, Sonnet was like the middle child. Like, who gives a shit about sonnet? It's neither the super fast thing nor the super smart thing. But really, I think it was haiku that was the most interesting foothold because Anthropic is talented at figuring out either deliberately or not deliberately a value proposition to developers that is not already taken by Open AI and providing it. I think now Sonnet is both cheap and smart and it's quite pleasant to communicate with. But when Haiku came out, it was the smartest, cheapest, fastest, fastest,

Starting point is 01:26:31 model that was a very refreshing. And I think the fact that it supported tool calling was incredibly important. An overwhelming majority of the use cases that we see in production involve tool calling because it allows you to write code that reliably, sorry, it allows you to write prompts that reliably plug in and out of code. And so without tool calling, it was a very steep hill to use a non-open AI model with tool calling, especially because Anthropic embraced JSON schema as a format. So did opening I. I mean, they did it first. Yeah, yeah, I'm saying outside of opening. Yeah, yeah. Open A, I had already done it.

Starting point is 01:27:06 And so, and Athropic was smart, I think, to piggyback on that versus trying to say, hey, you know, do it our way instead. Because they did that, it became now you're in business, right? The switching costs is much lower because you don't need to unwind all the tool calls that you're doing. And you have this value proposition, which is like cheaper, faster, a little bit dumber with Haiku. And so I would say anecdotally now, every new process, that people think about, they do evaluate Open AI and Anthropic. We still see an overwhelming majority of customers using Open AI, but almost everyone is using Anthropic and Sonnet specifically for their side projects, whether it's via cursor or prototypes or whatever that they're doing.

Starting point is 01:27:48 It's actually kind of funny. I made fun of it. Yeah, I mean, I think one of the things that people don't give Open AI enough credit for, I'm not saying Anthropic does a bad job of this, but I just think Open AI does an extremely exceptional job of this is availability, rate limits, and reliability. It's just not practical outside of Open AI to run use cases at scale in a lot of cases. You can do it, but it requires quite a bit of work. And because Open AI is so good at making their models so available, I think they get a lot of credit for the science behind, you know, 01 and wow, it's like an amazing new model. In my opinion, they don't deserve enough credit for the, you know, show.

Starting point is 01:28:26 up every day and keeping the servers running behind one endpoint. You know, you don't need to provision an open AI endpoint or whatever. Just one endpoint. It's there. You need higher rate limits. It's there. You know, it's reliable. That's a huge part of, I think, what they do well.

Starting point is 01:28:42 Yeah, we interviewed Michelle from that team. They do a ton of work, and it's a surprisingly small team. It's really amazing. That actually opens the way to a little bit of something I assume, but you would know, which is I would assume that, like, it's all like small developers like, us use those model lab endpoints directly. Yeah. But the big boys, they all use Amazon for Anthropic, right?

Starting point is 01:29:03 Because they have the special relationship. They all use Azure for Open AI because they have that special relationship. And then Google has Google. Is that not true? It's not true. Isn't that weird? You wouldn't have like all this committed spend on AWS. Then you were like, okay, fine.

Starting point is 01:29:15 I'll use cloud because I already have that. In some cases, it's yes and. It hasn't been a smooth journey for people to get the capacity on public clouds that they're able to get through Open AI directly. I mean, I think a lot of this is changing, catching up, et cetera, but it hasn't been perfectly smooth. And I think there are a lot of caveats, especially around access to the newest models.

Starting point is 01:29:37 And with Azure early on, there's a lot of engineering that you need to do to actually get the equivalent of a single endpoint that you have with OpenAI. And most people built around assuming there's a single endpoint. So it's a non-trivial engineering effort to load balance across endpoints and deal with the credentials.

Starting point is 01:29:56 Every endpoint is a slightly different set of credentials, has a different set of models that are available on it. There are all these problems that you just don't think about when you're using OpenAI, et cetera, that you have to suddenly think about. Now, for us, that turned into some opportunity, right? Like, a lot of people use our proxy as a... This is the gateway.

Starting point is 01:30:14 Exactly. As a load balancing mechanism to sort of have that same user experience with more complicated deployments. But I think that in some ways, maybe a small fish in that in that pond, but I think that the ease of actually a single endpoint is, it sounds obvious or whatever, but it's not. And for people that are constantly, a lot of AI energy is spent on, and inference is spent on R&D, not just stuff that's running in production. And when you're doing R&D, you don't want to spend a lot of time on maybe accessing a slightly older version

Starting point is 01:30:47 of a model or dealing with all these endpoints or, you know, whatever. And so I think the sort of time to value and ease of use of what the model labs themselves have been able to provide, it's actually quite compelling. That's good for them, less good for the public cloud partners to them. I actually think it's good for both, right? It's not a perfect ecosystem, but it is a healthy ecosystem with now with a lot of tradeoffs and a lot of options. And as we're not a model lab, as someone who participates in the ecosystem, I'm happy.

Starting point is 01:31:19 Open AI released O1. I don't think Anthropic and Meta are sleeping on that. I think they're probably invigorated by it, and I think we're going to see exciting stuff happen. And I think everyone has a lot of GPUs now. There's a lot of ways of running Lama. There's a lot of people outside of meta who are economically incentivized for Lama to succeed.

Starting point is 01:31:37 And I think all of that contributes to more reliable endpoints, lower costs, faster speed, and more options for you and me who are just using these models and benefiting from them. It's really funny. We actually interviewed Thomas, from the Lama 3 post-training team. He's great. He actually talks a little bit about Lama 4,

Starting point is 01:31:55 and he was already down that path, even before O-1 came out. I guess it was obvious to anyone in that circle. But for the broader worlds, last week was the first time they heard about it. Yeah, yeah, yeah. I mean, speaking of O-1, I mean, let's go there. How has O-1 changed anything that you perceive?

Starting point is 01:32:11 You're in enough circles that you already knew what was coming. So did it surprise you in any way? Does it change your roadmap in any way? It is long inference. so maybe it changes some assumptions. Yeah, I mean, I talked about how way back, right, like rewining to Empira, if you make assumptions about the capabilities of models and you engineer around them, you're almost like guaranteed to be screwed.

Starting point is 01:32:35 And I got screwed not necessarily bad way, but I sort of felt that by Burt. Yeah, twice in like a short period of time. So I think that sort of shook out of me that temptation as an engineer that you have to say, oh, you know, GPT-40 is good at this, but models will never be good at that. So let me try to build software that works around that. And I think probably you might actually disagree with this. And I wouldn't say that I have a perfectly strong structural argument about this. So I'm open to debate, and I might be totally wrong.

Starting point is 01:33:07 But I think one of the things that was felt obvious to me and somewhat vindicated by 01 is that there's a lot of code and sort of like pads that people, went down with GPT-40 to sort of achieve this idea of more complex reasoning. And I think agenetic frameworks are kind of like a little Cambrian explosion of people trying to work around the fact that GPT-40 has somewhat, or, you know, related models have somewhat limited reasoning capabilities. And, you know, I look at that stuff and, you know, writing graph code that returns like edge in directions and all this, it's like, oh, my God, this is so complicated. It feels very clear to me that this type of logic is going to be built into the model. Anytime there is

Starting point is 01:33:54 control flow complexity or uncertainty complexity, I think the history of AI has been to push more and more into the model. In fact, no one knows whether this is true or whatever, but GPT4 was famously a mixture of experts. Mention on our podcast. Exactly. Yeah, I guess you broke the news, right? There are two breakers as Dylan and us, and ours was, George was the first, like, a loud enough person to make noise about it. Prior to that, a lot of people were building, you know, these like round-robin routers that were like, you know. And you look at that and you're like, okay, I'm pretty sure if you train a model to do this problem and you vertically integrate that into the LLM itself, it's going to be better. And that that happened with GPT4. And I think O1 is going to do that to agenic frameworks as well.

Starting point is 01:34:41 Hey, I think to me it seems very unlikely that the, you know, you and me sort of like sipping an espresso and thinking about how like different personified roles of people should interact with each other and stuff. It seems like that stuff is just going to get pushed into the model. That was the main takeaway for me. I think that you are very perceptive in your mental modeling of me because I do disagree 15, 25%. Obviously, they can do things that we cannot. but you as a business always want more control than OpenEI will ever give you. Yeah, yeah. They're charging you for like thousands of reasoning tokens and you can't see it.

Starting point is 01:35:18 Yeah. That's ridiculous. Come on. Well, it's ridiculous until it's not, right? I mean, it was ridiculous to GPT3 too. Well, GPT3, I mean, all the models had total transparency until now where you're paying for tokens you don't, you can't see. What I'm trying to say is that I agree that this particular flavor of transparency is novel. Where I disagree is that something that feels.

Starting point is 01:35:40 like an overpriced toy. I mean, I viscerally remember playing with GPT3, and it was very silly at the time, which was kind of annoying if you're doing document extraction. But I remember playing with GPT3 and being like, okay, yeah, this is a great,

Starting point is 01:35:53 but I can't deploy it on my own computer and blah, blah, blah, blah, blah, blah, blah. So it's never going to actually work for the real use cases that we're doing. And then that technology became cheap, available, hosted. Now I can run it on my, you know, hardware or whatever.

Starting point is 01:36:10 So I agree with you if that is a permanent problem. I'm relatively optimistic that I don't know if Lama 4 is going to do this, but imagine that meta figures out a way of open sourcing some similar thing, and you actually do have that kind of control on it. Yeah, it remains to be seen. But I do think that people want more control. And this part of like the reasoning step is something where if the model just goes off to do the wrong thing, you probably don't want to iterate in the prompt space.

Starting point is 01:36:39 you probably just want to chain together a bunch of bottle calls to do what you're trying to. Perhaps, yeah. I mean, it's one of those things where I think the answer is very gray, like the real answer is very gray. And I think for the purposes of thinking about our product and the future of the space and just for fun debates with people I enjoy talking to like you, it's useful to pick one extreme of the perspective and just sort of latch onto it. But yeah, it's a fun debate to have.

Starting point is 01:37:08 and maybe I would say more than anything, I'm just grateful to participate in an ecosystem where we can have these debates. Yeah, yeah, yeah, very, very helpful. Your data point on the decline of open source in production is actually very... Decline of fine-tuning in production. I don't think open source is...

Starting point is 01:37:25 I mean, it's been... Can you put a number, like 5%, 10% of your workload? Is open source? Yeah. Because of how we're deployed, I don't have like an exact number for you. Among customers running in production, it's less than 5%.

Starting point is 01:37:37 That's so small. That counters are, you know, the thesis that people want more control, that people want to create IP around their models and all that stuff. Like, it's actually very interesting. I think people want availability. You can engineer availability with open weights. Good luck. Really? Yeah.

Starting point is 01:37:55 You can use together fireworks, all these guys. They are nowhere near as reliable as, I mean, every single time I use any of those products and run a benchmark, I find a bug, text the CEO, and they fix them. something. It's nowhere near where Open AI is. It feels like using Joyant instead of using AWS or something. Like, yeah, great. Joint can build, you know, single click provisioning of instances and whatever. I remember one time I was using, I don't remember if it was joint or something else. I tried to provision an instance and the person was like, BRB, I need to run to Best Buy to go buy the hardware. Yes, anyone can theoretically do what Open AI has done, but they just haven't. I won't mention one thing which I'm trying to figure out.

Starting point is 01:38:37 We obliquely mentioned the GPU inference market. Is anyone making money? Will anyone make money? In the GPU inference market, people are making money today, and they're making money with really high margins. Really? Yeah. Because I calculated like the GROC numbers.

Starting point is 01:38:52 Dylan Patel thinks they're burning cash. I think they're about break-even. It depends on the company. So there are some companies that are software companies, and there are some companies that are hardware bets, right? I don't have any insider information, so I don't know about the hardware companies, but I do know for some of the software companies,

Starting point is 01:39:08 they have high margins and they're making money. I think no one knows how durable that revenue is. But, you know, all else equal. If a company has some traction and they have the opportunity to build relationships with customers, I think independent of whether their margins erode for one particular product offering, they have the opportunity to build higher margin products.

Starting point is 01:39:29 And so, you know, inference is a real problem. And it is something that companies are willing to pay a lot of money to solve. So to me, it feels like there's opportunity. Is the shape of the opportunity inference API? Maybe not. But we'll see. We'll see. Those guys are definitely reporting very high ARR numbers. Yeah. And from all the knowledge I have, the ARR is real. Again, I don't have any insider information. Together's numbers were like leaks or something on like the Kleiner Perkins podcast. And I was like, I don't think that was public, but now it is. So that's kind of interesting.

Starting point is 01:40:04 Okay, any other industry trends you want to discuss? Nothing else that I can think of. I want to hear yours. Okay, no, just generally workload market share. Yeah. You serve like superhuman. They have superhuman AI. They like do title summaries and all that.

Starting point is 01:40:18 I just would really like type of workloads, type of evals. What is Gen. AI being used in production today to do? Yeah, I would say about 50% of the use cases that we see are what I would call like single prompt manipulations. Summaries are often but not always a good example. of that. And I think they're really valuable. One of my favorite Gen AI features is we use linear at BrainTrust. And if a customer finds a bug on Slack, we'll click a button and then file a linear

Starting point is 01:40:47 ticket. And it auto generates a title for the ticket. I've no idea how it's implemented. Honestly, I don't care. Loom has some really similar features, which I just find amazing. So delightful. You record the thing, it titles it properly. Yeah. And even if it doesn't get it all the way proper, it sort of inspires me to maybe tweak it a little bit. It's just, it's so nice. And so I think there is an unbelievable amount of untapped value in single prompt stuff. And the thought exercise I run is like, anytime I use a piece of software, if I think about rebuilding that software as if it were rebuilt today, which parts of it would involve AI, like almost every part of it would involve running a little prompt here or there to have a little bit of delight. By the way,

Starting point is 01:41:33 Before you continue, I have a rule for building Small Talk, which we can talk about separately, but it should be easy to do those EI calls. Because if it's a big lift, if you have to edit five files, you're not going to do it. Right, right, right. But if you can just sprinkle intelligence everywhere, then you're going to do it more. I totally agree. And I would say this probably brings me to the next part of it. I'd say like probably 25% of the remaining usage is what you could call like a simple agent,

Starting point is 01:42:00 which is probably, you know, a prompt plus some tools, at least one or perhaps the only tool is a rag type of tool. And it is kind of like an enhanced, you know, chatbot or whatever that interacts with someone. Then I'd say probably the remaining 25% or what I would say are like advanced agents, which are things that maybe run for a long period of time or have a loop or, you know, do something more than that sort of simple but effective paradigm. And I've seen a huge change in how people write code over the past six months. So when this stuff first started being technically feasible, people created very complex programs that almost reminded me of like being, like studying math again in college. It's like, you know, here, let me like compute, you know,

Starting point is 01:42:47 the shortest path from this knowledge center to that knowledge center and then blah, blah, blah, blah. It's like, oh my God. You know, and you write this crazy continuation passing code. In theory, it's like amazing. It's just very, very, very. very hard to actually debug this stuff and run it. And almost everyone that we work with has gone into this model that I, actually exactly what you said, which is sprinkle intelligence everywhere and make it easy to write dumb code. And I think the prevailing model that is quite exciting for people on the frontier today,

Starting point is 01:43:19 and I dearly hope as a programmer succeeds is one where, like, what is AI code? I don't know. It's not a thing, right? It's just I'm creating an app. NPX create next app or whatever, like fast API, whatever you're doing. And you just start building your app. And some parts of it involves some intelligence, some parts don't.

Starting point is 01:43:41 You do some prompt engineering. Maybe you do some automatic optimization. You do e-vals as part of your sort of CI workflow. You have observability. It's just like, I'm just building software. And it happens to be quite intelligent as I do it because I happen to have these things available to me. And that's what I see more people doing. You know, the sexiest intellectual way of thinking about it is that you design an agent around the user experience that the user actually works with in the application rather than the technical implementation of how the components of an agent interact with each other.

Starting point is 01:44:16 And when you do that, you almost necessarily need to write a lot of little bits of code, especially UI code, between, you know, the LLM calls. And so the code ends up looking kind of dumber along the way because you almost have to write. code that engages the user and sort of crafts the user experience as the LLM is doing its thing. So here are a couple of things that you did not bring up. No one's doing the code interpreter agent, the Voyager agent where the agent writes code and then persists that code and reuses that code in the future. Yeah, so I don't know anyone is doing that. When code interpreter was introduced last year, I was like, this is AGI. There's a lot of people. It should be fairly obvious if you look at our customer list who they are, but I won't call them out specifically that are doing

Starting point is 01:45:01 code gen and running the code that's generated in arbitrary environments. But they have also morphed their code into this dumb pattern that I'm talking about, which is like, I'm going to write some code that calls an LM, it's going to write some code. I might show it to a user or whatever, and then I might just run it. But it's not the, I like the word Voyager that you use it. It's not, I don't know anyone is doing that. I mean, Voyager is in the paper. You understand. Yeah, yeah. Okay, cool. Yeah, so my term for this, if you want to use the term, you can use mine, is code core versus

Starting point is 01:45:33 LM core. Yeah. And this is a direct parallel from systems engineering where you have functional core imperative shell. This is a term that people use. You want your core system to be very well defined and imperative outside to be easy to work with. Yeah. And so the AI engineering equivalent is that you want the core of your system to not be this shog of where you just kind of like chucked. where you just kind of like chuck it into a very complex agent,

Starting point is 01:45:58 you want to sprinkle LLMs into a code base. Yeah, yeah, yeah. Because we know how to scale systems. We don't know how to scale agents that are quite hard to be reliable. Yeah, I mean, just tying that to the previous thing I was saying, I think while in the short term, there may be opportunities to scale agents by doing like silly things, feels super clear to me that in the long term,

Starting point is 01:46:18 anything you might do to work around that limitation of an LLM will be pushed into the LLM. If you build your system in a way that kind of a super, assumes LLMs will get better at reasoning and get better at sort of agentic tasks in the LLM itself, then I think you will build a more durable system. What is one thing you would build if you're not working on brain trust? A vector database. My heart is still with databases a lot.

Starting point is 01:46:43 I mean, sometimes I... Not ironically. Yes. Not a vector database. I'll talk about this in a second. But I think I love the Odyssey. I'm not Odysseus. I don't think I'm cool enough.

Starting point is 01:46:53 but I sort of romanticize going back to the farm, maybe just like Alana and I move to like the woods someday and I just sit in a cabin and write C++ or Rust Code on my MacBook Pro and like build a database or whatever. So that's sort of what I drool and dream about. I think practically speaking, I am very passionate about this variant type issue that we've talked about because I now work in observability where that is a cornerstone to the problem.

Starting point is 01:47:22 And I mean, I've been ranting to Nikita and other people that I like enjoy interacting with in the database universe about this. And my conclusion is that this is a very real problem for a very small number of companies. And that is why Datadogs, Blunk, Honeycomb, etc., built their own database technology, which is, in some ways it's sad because all of the technology is a remix of pieces of snow. Flake and Redshift and Postgres and other things, Redis, you know, whatever, that solve all of the technical problems. And I feel like if you gave me access to all the codebases and locked me in a room for a week or something, I feel like I could remix it into any database technology that would solve any problem. Back to our H-Tap thing, right? It's like kind of the same idea. But because of how

Starting point is 01:48:14 databases are packaged, which is for a specific set of customers that have a particular set of use cases in a particular flavor of wallet, the technology ends up being inaccessible for these use cases like observability that don't fit a template that you can just sell and resell. I think there are a lot of these little opportunities and maybe some of them will be big opportunities. Maybe they'll all be little opportunities forever. But I probably just, there's probably a set of such things, the variant type being the most extreme right now, that are high frustration for me and low value for database companies that are all interesting things for me to work on. Okay. Well, maybe someone listening is, you know, also excited and maybe they can come to you for advice. Anyone who

Starting point is 01:48:57 wants to talk about databases, I'm around. Maybe I need to refine my question. What AI company or product would you work on if you're not working on brain trust? Honestly, I think if I weren't working on brain trust, I would want to be working either independently or as part of a lab and training models. I think with databases and just in general, I've always taken pride in being able to work on like the most leading version of things

Starting point is 01:49:24 and maybe it's a little bit too personal, but one of the things I struggled with post single store is there are a lot of data tooling companies that have been very successful that I looked at and was like, oh my God, this is stupid. You can solve this inside of a database much better.

Starting point is 01:49:40 I don't want to call it any examples because I'm friends with a lot of these people. I probably have worked at some. Yeah, maybe. But what was a really sort of humbling thing for me, and I wouldn't even say I fully accepted it, is that people that maybe don't have the ivory tower experience of someone who worked inside of a relational database, but are very close to the problem, their perspective is at least as valuable in company building and product building as someone who has the ivory tower of like, oh, my God, I know how to make in-memory skip list that's durable. You know, and lock-free. And I feel like with AI stuff, I'm in the opposite scenario. Like, I had the

Starting point is 01:50:19 opportunity to be in the ivory tower and, you know, at open air, whatever, like train a large language model. But I've been using them for a while now, and I felt like an idiot. I kind of feel like I'm in the, I'm one of those people that I never really understood in databases who really understands the problem, but is not all the way in with the technology. And so that's probably what I'd work on. This might be a controversial question, but whatever. If Open AI came to you with an offer today, would you take it? Competitive fair market value, whatever that means for your investors. Yeah, I mean, fair market value, no.

Starting point is 01:50:54 But I think that, you know, I would never say never. But I really... Because then you'd be able to work on their platform. Oh, yeah. Bring your tools to them. And then also talk to the researchers. Yeah, I mean, we are very friendly collaborators with Open AI. And I have never had more fun.

Starting point is 01:51:14 day to day than I do right now. One of the things I've learned is that many of us take that for granted. Now having been through a few things, it's not something I feel comfortable taking for granted again. Independence. I wouldn't even call it independence. I think it's being in an environment that I really enjoy. I think independence is a part of it, but it's not the, I wouldn't say it's the high order bit. I think it's working on a problem that I really care about for customers that I really care about with people that I really enjoy working with. Among other things, I'll give a few shoutouts. I work with my brother.

Starting point is 01:51:49 I see him? No. He answered a few questions. He was sitting right behind. Oh, that was right. Yeah, yeah. And he's my best friend, right? Like, I love working with him. Our head of product, Eden,

Starting point is 01:52:00 he was like the first designer at Airtable and Cruise. And, you know, he is an unbelievably good designer. If you use the product, you should thank him. I mean, if you like the product, he's just so good. and he's such a good engineer as well. He destroyed our programming interviews, which he gave him for fun. But it's just such a joy to work with someone

Starting point is 01:52:20 who's just so good and so good at something that I'm not good at. Albert joined really early on, and he used to work in VC, and he does all the business stuff for us. He has negotiated giant contracts, and I just enjoy working with these people. And I feel like our whole team is just so good.

Starting point is 01:52:40 Yeah, you work. worked really hard to get here. Yeah, I'm just loving the moment. That's something that would be very hard for me to give up. Understood. While we're in the name dropping and doing shoutouts, I think a lot of people in the San Francisco startup see no Alana. Yeah.

Starting point is 01:52:54 And most people won't. What's one thing that you think makes her so effective that other people can learn from or that you learn from? Yeah, I mean, she genuinely cares about people. When I joined Figma, if you just look at my profile, I really don't mean this to sound arrogant. But if you look at my profile, it seems kind of obvious that if I were to start another company, there would be some VC interest.

Starting point is 01:53:18 And like literally there was. Again, I'm not that special, but... No, but you had two great runs. Yeah. So it's just, it seems kind of obvious. I mean, I'm married to Atlanta. So I can't... Of course, we're going to talk.

Starting point is 01:53:30 But like, the only people that really talked to me during that period were Elad and Atlanta. Why? It's a good question. You didn't try hard enough. I mean, it's not like I was trying to talk to VCs. I don't, I'm not, yeah. I mean, so in some sense, while you're talking to Alad is enough, and then Alana can fill in the rest, like, that's it, that's it.

Starting point is 01:53:51 Yeah, so I'm just saying that these are people that genuinely care about another human. There are a lot of things over that period of getting acquired, you know, being at Figma, starting a company, they're just really hard. and what Alana does really, really well is she really, really cares about people. And people are always like, oh, my God, how come she's in this company before I am or whatever? It's like, who actually gives a shit about this person and was getting to know them before they ever sent an email? You know what I mean? Like before they had started this company and 10 other VCs were interested and now you're interested.

Starting point is 01:54:28 Who is actually like talking to this person? Yeah, she does that consistently. Exactly. The question is obviously, how do you scale? that. How do you scale caring about people? Yeah. And do they have a personal CRM? Alana has actually built her entire software stack herself. She studied computer science and was a product manager for a few years,

Starting point is 01:54:49 but she's super technical and really, really good at writing code. For those who don't know, every YC batch, she makes the best of the batch, and she puts it all into one product. Yeah, she's just an amazing hybrid between a product manager, designer and engineer. Every time she runs into an inefficiency, she solves it. Cool. Well, you know, there's more to dig there, but I can talk to her directly. Thank you for all this.

Starting point is 01:55:11 This is a solid two hours of stuff. Any call to action? Yes. One, we are hiring software engineers. We are hiring salespeople. We are hiring a dev rel. And we are hiring one more designer. We are in San Francisco.

Starting point is 01:55:29 So ideally, if you're interested, we'd like you to be in San Francisco. there are some exceptions, so we're not totally close-minded to that, but San Francisco is significantly preferred. We'd love to work with you. If you're building AI software, if you haven't heard of brain trust, please check us out. If you have heard of brain trust and maybe tried us out a while ago or something and want to check back in, let us know or try out the product,

Starting point is 01:55:55 we'd love to talk to you. And I think more than anything, we're very passionate about the problem that we're solving and working with the best people on the problem. And so we love working with great customers and, you know, have some good things in place that have helped us scale that a little bit. So we have a lot of capacity for more. Well, I'm sure there would be a lot of interest, especially when you announce to your Series A. I've had the joy of watching you build this company a little bit. And I think you're one of the top founders I've ever met.

Starting point is 01:56:23 So it's just great to sit down with you and learn a little bit. It's very kind. Thank you. Thanks. That's it. Awesome.

Latent Space: The AI Engineer Podcast - Production AI Engineering starts with Evals — with Ankur Goyal of Braintrust

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.