The Changelog: Software Development, Open Source - Programming with LLMs (Interview)

Episode Date: February 19, 2025

For the past year, David Crawshaw has intentionally sought ways to use LLMs while programming, in order to learn about them. He now regularly uses LLMs while working and considers their benefits a net positive on his productivity. David wrote down his experience, which we found both practical and insightful. Hopefully you will too!

Transcript
Discussion (0)
Starting point is 00:00:00 I'm Jared and you're listening to The Changelog, where each and every week we have conversations with the hackers, the leaders, and the innovators of the software world. We pick their brains, we learn from their mistakes, we get inspired by their accomplishments, and we have a lot of fun along the way. For the past year, David Crawshaw has intentionally sought ways to use LLMs while programming in order to learn about them. He now regularly uses LLMs while working and considers their benefits a net positive on his
Starting point is 00:00:39 productivity. David wrote down his experience, which we found both practical and insightful. Hopefully you will too. But first, a quick mention of our partners at Fly.io, the public cloud built for developers who ship. Check them out at fly.io. Okay, David Crawshaw on The Changelog. Let's do it. Well, friends, before the show, I'm here with my good friend David Hsu over at Retool. Now David, I've known about Retool for a very long time.
Starting point is 00:01:13 You've been working with us for many, many years, and speaking of many, many years, Brex is one of your oldest customers. You've been in business almost seven years. I think they've been a customer for almost all those seven years to my knowledge, but share the story. What do you do for Brex? How does Brex leverage Retool? And why have they stayed with you all these years?
Starting point is 00:01:32 So what's really interesting about Brex is that they are an extremely operationally heavy company. And so for them, the quality of the internal tools is so important, because you can imagine they have to deal with fraud, they have to deal with underwriting, they have to deal with so many problems, basically. They have a giant team internally basically just using internal tools day in and day out, and so they have a very high bar for internal tools.
Starting point is 00:01:54 And when they first started, we were in the same YC batch actually, we were both in Winter '17, and they were, yeah, I think maybe customer number five or something like that for us. I think DoorDash was a little bit before them, but they were pretty early. And the problem they had was they had so many internal tools they needed to go and build, but not enough time or engineers to go build all of them. And even if they did have the time or engineers, they wanted their engineers focused on building external-facing software, because that is what would drive the business forward. The Brex mobile app, for example, is awesome.
Starting point is 00:02:24 The Brex website, for example, is awesome, the Brex expense flow, all really great external-facing software. So they wanted their engineers focused on that as opposed to building internal CRUD UIs. And so that's why they came to us, and it was honestly a wonderful partnership. It has been for seven, eight years now. Today I think Brex has probably around a thousand Retool apps they use in production, I want to say every week, which is awesome. And their whole business effectively runs now on Retool. And we are so, so privileged to be a part
Starting point is 00:02:54 of their journey. And to me, I think what's really cool about all this is that we've managed to allow them to move so fast. So whether it's launching new product lines, whether it's responding to customers faster, whatever it is, if they need an app for that, they can get an app for it in a day, which is a lot better than, you know, six months or a year, for example, having to schlep through spreadsheets, etc. So I'm really, really proud of our partnership with Brex. Okay, Retool is the best way to build, maintain, and deploy
Starting point is 00:03:23 internal software. Seamlessly connect to databases, build with elegant components, and customize with code. Accelerate mundane tasks and free up time for the work that really matters for you and your team. Learn more at Retool.com. Start for free. Book a demo. Again, Retool.com. We are here with David Crawshaw, CTO and co-founder of Tailscale.
Starting point is 00:04:02 David, welcome to the changelog. Yeah. Oh, I'm not actually the CTO anymore. Oh, no. Your LinkedIn is outdated. Oh, does it still say that? I thought I had updated it. Are you masquerading, David? That's real time LinkedIn updates here. We can do it.
Starting point is 00:04:17 Let me check. LinkedIn updates. I read it somewhere. I usually... Me too. Well, snap, I think, as the show starts... I think my LinkedIn, it might be confusing, because it still lists that I was the CTO. I stepped back from the CTO role last year. Okay, so what are you doing now?
Starting point is 00:04:30 I am spending my time exploring sort of new product spaces, things that can be done. So both inside and outside of Tailscale. So. Very cool. Most of my work inside Tailscale is around helping on the sort of customer side, you know, talking to users or potential users about how it can be useful.
Starting point is 00:04:53 And then because I have such an interest in sort of the world of large language models, I've been exploring that. But that is not a particularly good fit for the Tailscale product. You know, I spent quite a long time looking for ways to use this technology inside Tailscale, and like, it doesn't really fit. And I actually think that's a good thing. You know, it's really nice to find clear lines like that, when you find something where it's not particularly useful. And I wouldn't want to try and, you know, a lot of companies are attempting to make things work, even if they don't quite make sense.
Starting point is 00:05:27 And I think it's very sensible of Tailscale to not go in that direction. Do you mean like deploying LLMs inside of the Tailscale product, or how do you mean it wouldn't fit? Well, yeah, so what would Tailscale do with LLMs is the question I was asking from a Tailscale perspective. I think Tailscale is extremely useful for running LLMs is the question I was asking from a Tailsco perspective. I think Tailsco is extremely useful for running LLMs
Starting point is 00:05:48 yourself, for a network backplane. Right. In particular because of the sort of surprising nature of the network traffic associated with LLMs. You can kind of think about working with models from both a training and an inference side. These are sort of two sides of the coin here. And training is very, very data heavy and is usually done on extremely high bandwidth, low latency networks, InfiniBand-style setups on clusters of machines
Starting point is 00:06:27 in a single room. Or if they spread beyond the room, the next room is literally in the building next door. The inference side looks very different. There's very little network traffic involved in doing inference on models in terms of bandwidth. The layout of the network is surprisingly messy. This is because the nature of finding GPUs is tricky even still today, despite the fact
Starting point is 00:06:59 that this has been a thing for years now. If you have... Very tricky. Yeah. I feel I should try and explain it just because it's always worth trying to explain things, but I'm sure you all know this, which is that if you're running a service on a cloud provider that you chose years ago for very good reasons, all the cloud providers are very good at fundamental services, but they all have some subset of
Starting point is 00:07:25 GPUs, and they have them available in some places and not others. And it's never quite what you're looking for. And if you are deciding to run your own model and do inference on it, you might find your GPU is in a region across the country, or it's on a cloud provider that's different than the one you're using. Or your cloud provider can do it, but it's twice the price of another one you can get. And this leads to people ending up far more in sort of multi-cloud environments
Starting point is 00:07:53 than they do in sort of traditional software. And so, Tailsco actually is very useful there. So for users, I think it's a great fit. But what does the product actually need as like new features to support that? And the answer is, it actually is really great as it is today for that. There's no specific AI angle that you can add to the product
Starting point is 00:08:15 and immediately make it more useful. Yeah, I think that's right. I mean, there are, we came up with some proposals, but they're not exciting. Like they would be very much, we'd be with some proposals, but they're not exciting. Like they would be very much, we'd be doing it because corporate at headquarters told us to find an angle for AI or something like that. And like we as a startup have the option
Starting point is 00:08:36 of just not doing that. And so we didn't. Well, we should probably just claim that Tailscale is a past and of course, hopefully a future sponsor of change log And that Adams a huge we're working on a scale. We're more current and brings it up often But this is not a sponsored episode. In fact, well, first of all, we don't do sponsored guest appearances But also I had no idea that you were a co-founder of tailscale when I read your blog post
Starting point is 00:09:01 That made me me either. I found it out afterwards. I was like, oh cool. That's great I didn't know we'd be talking about tailscale at all when I read your blog post that made me doubt you. I found it out afterwards. I was like, oh cool. That's great. I didn't know we'd be talking about tail scale at all when I came here today. So that we're both basically on the same page. Yeah, there we go. I still work there. I'm just kidding.
Starting point is 00:09:14 Real time update. I did double check LinkedIn. You are correct. It says 2019 to 2024 was CTO, but you just see co-founder, Tailscale, and then CTO next to it, and you move on. And that's probably what Adam and I both did. Same.
Starting point is 00:09:30 We didn't realize there was an end date on that particular role. Yeah, the nomenclature or usage on the metadata usage on LinkedIn is the, their UI is like which date, which month did you begin and it says presence. So the assumption was there. I didn't read your byline on the job role. Maybe what I'll do is I'll put something new above it
Starting point is 00:09:50 and that'll make it clearer. But I don't want to mislead anyone. Fair enough, fair enough. I also honestly don't check LinkedIn very often. It's not a big part of my life. And so it. We usually check it right before a show to make sure we get someone's title right,
Starting point is 00:10:04 which is why we're both eating crow right now for getting it wrong. But. Well, to get back into the mood or the groove, whichever you wanna call it. Let's get into both. Well, I'm a tail scale user as you know. I just trimmed some machines
Starting point is 00:10:18 because I've been doing some more home labbing. So I use tail scale really only in my home lab. And thank you so much for this free tier because I Don't want to give you any money Honestly, I'm just kidding with you I think you're amazing but like I got a I gotta put a hundred machines on my tail net before I have to pay you Any money I got 18. There's no way I'm ever gonna pay you money based on your tears. Not mine. That's totally great Which is that's by design. That's totally great. Which is, that's by design.
Starting point is 00:10:46 That's by design. And I think, you know, one, thank you, because it's let me be more network curious and more home lab curious. So you, as a corporation, Tailscale have allowed me and enabled me and so many others to do just that. And that's so cool.
Starting point is 00:11:01 And I think that's, I applaud you all for that choice by design. Well thank you, that's excellent. That being said, I mean, at the same time I gotta dig. And it's not really a dig, it's just really, tail scale's kinda boring in the fact that I don't have to do much to make it work. You know, I put it in, I tail scale up,
Starting point is 00:11:19 and I'm done. Okay, you just work. I never have to worry about you working unless you're not up. You're not gonna crash some more, or what are you looking for? I'm just saying, like it's pretty boring. You know, unless I'm doing. Okay, you just work. I never have to worry about you working unless you're not Oh, or what are you looking for? I'm just saying like it's pretty boring You know unless I'm doing like serves or I'm sharing a disk across the network I'm not doing that kind of stuff But you know this whole
Starting point is 00:11:34 Multi-cloud shared GPU thing is super cool because you can have a tail net on top of a different network and share that GPU access Which I'm assuming is what you meant by that Just so cool. Honestly it is I mean I love boring software And so for me the fact that you're having a boring experience is perfect. Yeah, no surprises. No surprises Yeah, it's a product that's designed to enable you to do more things not for you to spend your days having to configure it It is so smooth. The dev X experience on this thing is bar none. You know, I know my machines, I know where they're at, I know throughout a date. It's pretty easy to do that kind of stuff. And as an avid user and a lover of tailscale, again, not sponsored, just super passionate,
Starting point is 00:12:15 I can't see how an LLM would fit in the other. I just can't see how you would work in AI to make the platform better. I mean, I haven't thought about it deeply besides the, this 20 ish minutes so far in the conversation, but I mean, give me some time and I might. Yeah. If you can come up with anything, let me know. Well, I'm very excited about the idea of it. But software has to be in some sense, true to itself. You have to think about its purpose when you're, when you're working on it and
Starting point is 00:12:43 not step too far outside that. So I similarly wouldn't build a computer game in a tail scale. I don't think that would be a particularly, you know, good fit for a product. And I feel sort of- It's like an Easter egg. As an Easter egg would be great actually.
Starting point is 00:12:56 Like a little quiz game or something built into the terminal. You want to have a hundred machines in your tail net, you get access to, you unlock a secret machine name that's on your tail net by default. Or the Pro plan. Yeah.
Starting point is 00:13:08 There you go. Right, it can ask questions like what is the oldest machine on your tail net? Something like that. Oh yeah. So yeah, that would be a lot of fun actually. There are some questions I would probably ask the tail net. Like there are actually some things
Starting point is 00:13:18 I don't know about my tail net that I could discover via a chat LLM interface. So I mean, there are some things I can see some value in, but I mean, does everybody want that or need that? Maybe, I don't know. Yeah, I don't know either. I very much went looking for something I would use features like that for,
Starting point is 00:13:37 and I didn't come up with anything. If you do come up with anything, again, I'd be very happy to hear about it. Honestly, now that I'm thinking about it in real time, you know, you have a column whenever you're on your admin and you're on your machines dashboard, essentially, you can see last seen or ones that are out of date. And unless you're savvy,
Starting point is 00:13:56 you probably haven't enabled Tails' skills ability to auto update. Maybe you have, maybe you haven't. I forget which machines I've done it on. Like everyone I install again, once I knew that update was there, I do enable that, but sometimes I forget. So I might be like, okay,
Starting point is 00:14:10 are there any of my machines that haven't been seen in a while? Are there any versions that are out? Give me a list of the ones that are out of date that I should probably concern with around security. Because you're probably not emailing me about my security concerns, but my tail net knows which ones are too far out of date
Starting point is 00:14:25 if I have an auto updated. That's true. I think we did actually email customers once about an out of date version where we were concerned about security. I think that has only come up once. Mostly, keeping tail scale up to date is sort of proactive good security practice.
Starting point is 00:14:43 Tell scale up to date is sort of proactive good security practice. The, uh, it is fortunately not been a significant source of issues in part, you know, due to Keft design, you know, a lot of engineers work very hard to make it that way. For sure. And you got a lot of amazing engineers there. Yeah. It's a great team. I guess now I'm thinking about it. I do have some ideas. Nice. I mean, I think this is the idea I have for notion as well. I use notion a lot a lot more Anything where you have a platform where you can create your own things on top of it a tail net
Starting point is 00:15:19 You know one tail net is not the same as the next even though they operate the same the way I use mine I mean that'd be the way you use yours It would be nice to have an interface where I can just ask, tail scale, how to tail scale, basically. Like I have an idea, I wanna create a new group or a whatever, or I can be introduced to new features. It's discovery. And you're essentially, by not having your own, you force people to go into the, you know,
Starting point is 00:15:41 public LLMs essentially, into the chat GPTs, into the clods, into the Olamas, into the deep seeks or whatever you might have out there. And if you can corner the market and own your own, I think you'd be one better, sure. Cause you know your documentation better too. You know, it's the deterministic nature of it is maybe non-deterministic, but you can probably fine tune it a bit more to be more focused on your customer base. I'd probably ask my, my, uh, tail scale LLM more questions if I, if I could. Yeah. So that's,
Starting point is 00:16:13 there's like an interesting sort of meta question there about LLMs around how many models there should be in the world from a sort of a consumer perspective in a sense, because that's almost like, you know, you just, you know, consuming it like where, and this sounds very similar to like the question of how does search work on websites, which you could have asked 10 years ago or 20 years ago. You know, do I use the Wikipedia search or do I go to Google and type in my search and maybe put the word Wiki at the end to bring the Wikipedia links to the top? Both of these are valid strategies for searching Wikipedia.
Starting point is 00:16:57 I honestly don't use the Wikipedia search and haven't in a while, so it may be amazing. But I have as a consumer a general concept that the search systems on individual websites are not terrific and Google is baseline decent. And so as long as I'm searching public data, I would generally prefer the Google search. I guess in a sense that's a less and less true statement every year because the large chunks of websites are just not public data anymore. Like you can't search Facebook with Google, right? I'll search Instagram with it. You can't find a TikTok with it or anything like that. And so, um, the existence
Starting point is 00:17:37 of those, uh, I think they sometimes get called walled gardens says that we should have more fine tuned tools like that. And there's just a lot of similarities there. So should start up the size of tail scale, build customized models for that for its users, I think is a sort of a big open ended question around how the model space will evolve. And I think, you know, my last year of working with models, fine tuning them, training them, carefully prompting them, you can do more and more just with carefully structured prompts and long contexts that you used to have to use
Starting point is 00:18:18 fine tuning to achieve. But all of this, my sort of big takeaway is that they are actually extremely hard to turn into products and to get those details right in a general sense for shipping to users. They're actually quite easy to get going for yourself. And I think if anything, more people should explore running models locally and playing with them because they're a ton of fun and they can be very productive very quickly. But in much the same way that it's really easy to write a Python script that you
Starting point is 00:18:51 run yourself on your desktop, versus a Python script you ship in production for users. LLMs have this huge sort of complexity gap when it comes to trying to build products for others. And so I agree that that sort of tooling would be fun and should exist. I also think where we are today, it's quite hard for a team the size of a startup to ship that as not part of their core product experience. What if it enabled so much deeper and greater usage? Because the one thing you want to do as a startup
Starting point is 00:19:25 or a brand like you are, I would imagine at least from the outside is a deeper customer is better than a shallow customer, right? If I've only got a few machines, well, one, my affinity and my usage is lower. So maybe my value is lower, but if I'm deeply entrenched in it's, it's as a result of great documentation which you have, but docs, they are are good when you have a particular precision thing and you want to read and understand and discover how a feature works.
Starting point is 00:19:56 And they only go so far and sometimes or even out of date, just hypothesize in that whether or not like this, what would be required? One, in terms of a lift, one in engineering power and two, potentially financial power. And then two, what is that costing you by lack of more deep users and, you know, shallow users in comparison? I think that is exactly the right way to frame the question for a business. And I don't know the answer to a lot of those questions. I can talk to some of the more technical costs involved. What the benefits would be to the company
Starting point is 00:20:33 is extremely open-ended to me. Like I don't actually, I can't imagine a way to measure that based on talking to customers of Tailscale who deploy it, thinking about the companies where, and so to go back to something you said earlier about how you use it and you don't pay for it, I think that's great because Tailscale has no intention of making money off individual users.
Starting point is 00:20:56 That's not a major source of revenue for the company. The company's major source of revenue is corporate deployments. And there's a blog post by my co-founder Avery about how the free plan stays free on our website, which sort of explains this, that individual users help bring tailscale to companies who use it for business purposes and they fund the company's existence.
Starting point is 00:21:24 So looking at those business deployments, you do see a tail scale gets rolled out initially at companies for some tiny subset of the things that could be used for. And it often takes quite a while to roll out for more. And even if the company has a really good roadmap and a really good understanding of all of the ways they could use it, it can take a very long time to solve all of their problems with it.
Starting point is 00:21:48 And that's assuming they have a really good understanding of all of the things it can do. And the point you're making, Adam, that people often don't even realize all the great things you can do with it is true. And I'm sure a tool that helps people explore what they could do would have some effect on revenue. In terms of the sort of the technical side of it and the challenges, one of the, there is several challenges in the very broad
Starting point is 00:22:10 sense, the biggest challenge with LLMs is just the enormous amount of what you might call traditional non-model engineering has to happen out the front of them to make them work well. It's surprisingly involved. I can talk to some things I've been working on over the last year to give you a sense of that. Beyond that, the second sort of big technical challenge
Starting point is 00:22:31 is one of sort of Tailsco's core design principles is all of the networking is end-to-end encrypted. And the main thing an LLM needs to give you insight is a source of data. And the major source of data would be what is happening on your network, what talks to what, how does it all work? And that means that any model telling you how you could change your networking layout or give you insight into what you could do would need access to data that we as a company
Starting point is 00:23:00 don't have and don't want. And so we're back to it would have to be a product you run locally and have complete control over, which is absolutely, you know, those sorts of, my favorite sorts of products are that, you know, I like open source software that I can see the source code for, compile myself, run locally. That's how I like all things to be.
Starting point is 00:23:22 But trying to get there with LLMs in the state they are today is actually, I think, pretty tricky. I don't think I've seen an actually shipped product that does that really well for people. There's one. There's a developer tool that I hear a lot of good talk about that I don't. I'm just trying to live search for it for you.
Starting point is 00:23:45 Nope, that's the wrong one. That's magic shell history, which also sounds really cool. I should use that one. Is that A2N? A2N, yeah. That one's awesome. Oh, you've used it? Oh, great.
Starting point is 00:23:55 I'm a daily user, yeah. No LLMs involved on that one. Yeah, I thought that was the LLM. There's another one that is in the sort of agent space for developers as they're writing programs and it helps you. It's like a local Claude effectively. And it's primarily built around helping you construct prompts really carefully for existing open models.
Starting point is 00:24:17 And it's come up several times and I'm sorry, it's falling out of my head. I will look it up later. I'm sorry. But I hear very positive things about it. And that, that's the closest I've seen to sort of a shipped completely local product, uh, uh, that does that sort of thing on which models to use. Uh, I think given the state of models that exist today, open models, the major shipped open models are so amazing that it always makes sense to start with one of those sort of models as a, if nothing else, as a pre-trained base for anything that's happening.
Starting point is 00:24:53 Building a model from scratch is a very significant undertaking. And I don't think it's necessary for most tasks. The available open models are extremely general purpose. And so at worst, you would be fine tuning from one of those to build a product. If you take one of the llamas or I mean, there's a lot of talk about deep seek, which produces terrific results. It's a very large model. It'd be very hard to start with it, though I understand there's some very good distilled work coming from it using other models. Well friends, I am here with a new friend of mine, Scott Deaton, CEO of Augment Code. I'm excited about this. Augment taps into your team's collective knowledge, your code base, your documentation, your dependencies.
Starting point is 00:25:42 It is the most context-aware developer AI, so you won't just code faster, you documentation, your dependencies. It is the most context aware developer AI. So you won't just code faster. You also build smarter. It's an ask me anything for your code. It's your deep thinking buddy. It's your stand flow antidote. Okay, Scott. So for the foreseeable future, AI assisted is here to stay.
Starting point is 00:25:57 It's just a matter of getting the AI to be a better assistant. And in particular, I want help on the thinking part, not necessarily the coding part. Can you speak to the thinking problem versus the coding problem and the potential false dichotomy there? A couple of different points to make. You know, AIs have gotten good at making incremental changes, at least when they understand customer software.
Starting point is 00:26:20 So first and the biggest limitation that these AIs have today, they really don't understand anything about your code base. If you take GitHub Copilot, for example, it's like a fresh college graduate, understands some programming languages and algorithms, but doesn't understand what you're trying to do. And as a result of that,
Starting point is 00:26:36 something like two thirds of the community on average drops off of the product, especially the expert developers. Augment is different. We use retrieval augmented generation to deeply mine the knowledge that's inherent inside your code base. So we are a copilot that is an expert and they can help you navigate the code base, help you find issues and fix them and resolve them over time much more quickly than you
Starting point is 00:27:00 can trying to tutor up a novice on your software. So you're often compared to GitHub Copilot. I gotta imagine that you have a hot take. What's your hot take on GitHub Copilot? I think it was a great 1.0 product, and I think they've done a huge service in promoting AI, but I think the game has changed. We have moved from AIs that are new college graduates
Starting point is 00:27:24 to, in effect, AIs that are now among the best developers in your code base. And that difference is a profound one for software engineering in particular. You know, if you're writing a new application from scratch, you want a web page that'll play tic-tac-toe piece of cake to crank that out. But if you're, you're looking at, you know, a tens of millions of line code base, like many of our customers, Lemonade is one of them. I mean, 10 million line mono repo, as they move engineers inside and around that codebase
Starting point is 00:27:52 and hire new engineers, just the workload on senior developers to mentor people into areas of the codebase they're not familiar with is hugely painful. An AI that knows the answer and is available seven by 24, you don't have to interrupt anybody and can help coach you through whatever you're trying to work on
Starting point is 00:28:10 is hugely empowering to an engineer working on unfamiliar code. Very cool. Well, friends, Augment Code is developer AI that uses deep understanding of your large code base and how you build software to deliver personalized code suggestions and insights. A good next step is co-suggestions and insights. A good next step is to go to augmentcode.com. That's A-U-G-M-E-N-T-C-O-D-E.com. Request a free
Starting point is 00:28:35 trial contact sales or if you're an open source project, Augment is free to you to use. Learn more at augmentcode.com. That's A-U-G-M-E-N-T-C-O-D-E.com. Augmentcode.com. So you've been using those in your day-to-day programming work for the last year and back in early January, you wrote this post, How I program with LLMs, which I found to be refreshingly practical and straightforward, your findings. You said you've been actively trying these things. I feel like I've been passively trying them,
Starting point is 00:29:17 not really trying to optimize my setup, but just like, you know, like a Neanderthal kind of poking at a computer box, you know, like, oh, does this work? No, okay, for the last couple of years. So I do use these things, but I don't think as effectively as most or at least some. And I loved your findings,
Starting point is 00:29:37 and of course you're building something as a result of it, but can you take us on that journey over the last year or so where you started with LLMs and what you found in your day-to-day programming? Yeah, I don't think your experience is unusual, actually. I think almost everyone has your experience. And for most software, I am in the same category.
Starting point is 00:29:56 I try things at a very surface level when they're newish and see if there's any really obvious way they help me, and if they don't, I put them aside and come back later. A great example of that is the Git version control system. It was 10 years before I really sat down and used it. I was using other version control systems. After 10 years, I was like, okay, this thing's probably going to stick around.
Starting point is 00:30:19 I guess I'll get over it's user interface. Fine. Get to this. I was reluctant, but I got there in the end. LLMs really struck me as fascinating. I decided to, you know, I made this active decision to not do that with them. And like set out on a process of trying to actively use them,
Starting point is 00:30:36 which has involved learning just a really appalling amount. Honestly, like it's very reasonable that most engineers haven't done really significant things with LLMs yet because it is It's too much cognitive load Like you know if you if you're writing computer programs, you're trying to solve a problem All right You only have so much of your brain available for the tools you use for solving problems because you have to fit the problem in There as well and the solution you're building and that should be most of what you're thinking about. The tools should take up as little space as possible.
Starting point is 00:31:09 And right now to use LLMs effectively, you need to know too much about them. And that is my sort of big, that was my big takeaway, you know, 11 months ago or so, which is why I started working on tools with some friends to try and figure this out. Because there has to be a way to make this easier. And my main conclusion from all of that is there's an enormous amount of traditional engineering to do in front of LLMs to get there. So the first really effective thing I saw from LLMs
Starting point is 00:31:39 is the same thing I think most engineers saw, which was GitHub Copilot, which is a code completion. Also, actually, GitHub Copilot has which is a code completion, also actually GitHub Copilot has taken on new meanings. It's more than that now, right? Yeah, it's an umbrella brand that means all sorts of products and I actually honestly haven't even used most of those products at this point. The original product is a code completion system built into Visual Studio Code, where as you type, it suggests a completion for the line or a few lines beyond that of where you are,
Starting point is 00:32:11 which is building on a very well-established paradigm for programming editors. Visual Studio 6.0 did this 25 years ago with IntelliSense for completing methods in C++. This is not a new idea. Around the same time, we had eTags for Emacs, or cTags, I should say, which gave us similar things in the Unix world. This is extending that idea by bringing out some of the knowledge of a, of a large language model in the process of completing.
Starting point is 00:32:51 And I, I'm really enamored with the entire model. Like, you know, co-pilots original experience, uh, but it came out was magical, like it was just like, there was nothing like this before. It was really, I think, jumpstarted a lot of interest in the space from people who hadn't been working on it, which was almost all of us. And from my perspective, the thing that really struck me was, wow, this works really well. And wow, it makes really obvious silly mistakes. It uses both sides of this.
Starting point is 00:33:19 It would suggest things that just don't compile in ways that are really obvious to anyone who takes a moment to read it. And it would also make really impressive cognitive leaps where it would suggest things that, yes, that is the direction I was heading and it would have taken me several minutes to explain it to someone and it got there immediately. And so I spent quite a lot of time working on code completion systems with the goal of improving them by focusing on a particular programming language. And we've made some good progress there.
Starting point is 00:33:47 We actually hope to demonstrate some of that publicly soon, like in the next few weeks, probably in this sketch.dev thing that we've been building. We'll integrate it so that people can see it and give it a try. But so those models are interesting because they're not the LLM experience that most users have. Like when everyone talks about AI today, they talk about chat GPT or Claude, or these chat-based systems. And the thing I really, really like about the original copilot code completion model is it's not a chat system. It's a different user interface experience for the knowledge and the knowledge. And that's really a lot of fun. And in fact, the technology is
Starting point is 00:34:23 in an LLAM and that's really a lot of fun. In fact, the technology is a little bit different too. There's a concept in the model space called fill in the middle where a model is taught a few extra tokens that don't exist in the standard chat model. With fill in the middle, which is a lot of fun, a model is taught a few extra tokens. It's taught a prefix token, a suffix token, and a middle token. What you do is you feed in as a prompt to the model the file you're in. All the characters before where the cursor is get fed right after a prefix token. You feed in prefix, all the characters of the file,
Starting point is 00:35:06 then you feed in suffix, and you feed in all the tokens after the cursor, and then you feed in the middle token, and then you feed in whatever goes into the middle to complete it. And that's the prompt structure for one of these models. And then the model keeps completing the thing that you fed in, it writes the next characters. And you train the model by taking existing code out there.
Starting point is 00:35:35 There's a few papers on how these models are trained because Meta published one of these models and Google published one of these models under the Gemma brand. There's a few others out there. There's one from Quinn and some other derived ones. And you take existing code files. You choose some section of code. You mark everything before it as the prefix, everything after it as the suffix.
Starting point is 00:35:59 And you fill in everything after it as the middle. And that's your training data. You generate a lot of that by taking existing code and breaking it up into these files randomly, by randomly inserting a cursor. Then you've taught a model how to use these extra characters and how to complete them. And so it's not a chat model at all.
Starting point is 00:36:17 It's a sort of a sequence to sequence model. It's a ton of fun. And the advantage of these systems is they're very fast compared to chat models. And that's the key to the whole code completion product experience is you want your code completion within a couple hundred milliseconds of you typing a character. Whereas if you actually time Claude or you time one of the open AI models, they're very slow. Like they take a good minute to give you a result.
Starting point is 00:36:47 And there's a lot of UI tricks in hiding that minute. They move the text around on the screen, then they stream it in. Streaming, yeah, it's just very clever. Because you're reading it word by word as it comes out, but it's like, it's basically stalling. You're like, come on, just give me the answer already. That's right.
Starting point is 00:37:03 Yeah, exactly. You can really feel it with the new reasoning models. Oh, one of these things, because this is this is pause at the beginning. It's a thinking phase. I'm like, come on, I don't. And it tells you what it's thinking, which is cool. But it's like, think faster. I don't care what you're thinking. Give me the answer.
Starting point is 00:37:20 I think it's actually kind of cool when you see that. You can come up. I mean, you get to see like, I feel like this is the closest we've glimpsed into the future Then we've ever been able to by watching the reasoning in in real time You see you see the the act of reasoning that it's happening explains the reasoning users ask me this I'm gonna think about that. Okay, I thought about that which causes this and it's like this step process and it reminds me of how I think too So I'm like that's pretty dang cool But it's also a great trick. I agree. Yeah, it's a it is a ton of fun to watch
Starting point is 00:37:53 I agree and it is a lot of insight into how the models work too because the the insides of the models are a large number of floating point numbers holding intermediate state and It's very hard to get insight into those. But the words, you can read them. You can make some sense of them. Right. So code completion is, I think, extremely useful to programmers. It varies a lot depending on what you're writing
Starting point is 00:38:15 and how experienced models are with it and just how sort of out on the edge of programming you are. If you're really out in the weeds, the models can get less useful. I used a model for writing a big chunk of AVX assembly a few months ago. And the model was both very good at it and very bad at it simultaneously. And it was very different from the typical asking a model to help with programming experience. It would constantly get the order operations wrong, um, or over complicate things or misunderstand while it was a,
Starting point is 00:38:54 it's a very different experience than, uh, than typical programming. What model was this? How did you find it? I used all of them for that. Okay. And this is what I meant by, I'm spending a lot of time actively exploring the space. Yeah. I'm putting far too much work into exploring the model space as I do work. It makes sense that there are specific models that are good for autocomplete versus search versus chat.
Starting point is 00:39:19 But have you found the correct one for each particular subtask? Or what's your advice there? Is it like use them all or just stick with this, you'll be good? I can't advise people to use them all. You know, that's too much work. Use a bunch of them. Yeah, and this I think is the big problem.
Starting point is 00:39:37 And you mentioned that most programmers are probably using this. As far as we can tell, not one fifth of programmers are using these tools today. How can you tell that? Through surveys. A couple of people have done surveys of programmers. And it seems to come back that most people are not
Starting point is 00:39:53 using these tools yet. Which is both shocking to me, because they're so useful, and also makes a lot of sense. Because it's a lot of work figuring out how to use them. I have a an analogy that I'd like to share. If you're a runner, you probably wear running shoes, right? You're probably not going to run barefoot. I think it's like admitting to running barefoot.
Starting point is 00:40:20 Like you wouldn't do that. You would run a marathon with rocks on the road and debris and things like that. Barefoot versus running shoes designed to age you in the process of running to make it more speedy, comfortable, agile, etc. I feel like that's where we're at. Like I've I've changed my tune, let's just say, because I feel like it's not going to go away. And to hear that one fifth I'm I haven't dug into these surveys but that's surprising one fifth is using it and it seems like I guess then the the four of the five are saying no or for the time for the moment denying it I mean do either of you disagree with that analogy is it way
Starting point is 00:41:01 off or is it Jared that's why you kind of like shake your head a little bit what your thoughts on that? Now I mean I don't think that these tools have come to the Do you disagree with that analogy? Is it way off or is it, Jared, that's why you kinda like shake your head a little bit. What's your thoughts on that? No? I mean, I don't think that these tools have come to the place that the running shoe has. I also think there's probably plenty of world-class runners who run shoeless and would never run with a shoe on,
Starting point is 00:41:17 because that's for fools. But I'm not gonna go there. Well, would you run a New York City marathon with no shoes? I wouldn't be so foolish. That's a tri- New York City marathon with no shoes? I wouldn't be so foolish as to try a New York City marathon. Okay. You have to admit though that having the shoes on is probably better for you than worse for you. Well, at what point has the shoe proven itself to be useless?
Starting point is 00:41:37 Because these tools routinely prove themselves to not just be wrong, but dramatically wrong in ways that if you follow them, you will be like Michael Scott who drives directly into the pond. No, no, no, no, no, no, look. It means go up to the right, bear right over the bridge and hook up with 307. Make a right turn.
Starting point is 00:41:56 Maybe it's a shortcut Dwight, it said go to the right. It can't mean that, there's a lake there. I think he knows where it is going. This is the lake. The machine knows. This is the lake. Stop yelling at me. No, it's not the lake. Stop yelling at me. No, it's the lake.
Starting point is 00:42:05 Stop yelling. There's no road here. Oh, yes. Because his voice assistant or his GPS tells him to keep going straight and he just keeps going straight. I concur on that one too. I concur. So I can see why you could get frustrated and throw up your hands and say, I'm going
Starting point is 00:42:19 to come back to this in a year or two years, but I'm going to let all the frontiers people like David figure all the stuff out. In like David, figure all this stuff out. In the meantime, I got code to write. I can see a lot of people saying that. I'm not, I could see myself saying that. I haven't because I am curious and I don't wanna fall behind. But I still don't feel like this is a must have
Starting point is 00:42:41 for everybody today. But there are moments where I'm like, that was amazing. So, for sure. I'm actively trying to fit it in into everything I do, is I guess my perspective. I'm actively, if it's Home Lab, it's that. If it's contracts, agreements, proposals, if it's thinking, if it's exploration, if it's coding,
Starting point is 00:43:00 if it's you pick it's your if it's kind of thing. I'm trying to fit it in and I'm just, so I'm sitting down on my bench, I got my socks on and I'm trying to put the shoe on, let's just say. You know, to kind of extend my analogy. You're gonna wear it. Yeah, I believe that, you know, I'm gonna put this shoe on, I'm gonna wear it for every scenario that makes sense
Starting point is 00:43:19 because I can tell you I move faster, I think differently when I'm in those modes. Are they wrong? Do I always check it? Of course. But I know that it's coming for almost everything we do. Every task we do that's productive, coding, thinking, writing, whatever,
Starting point is 00:43:34 it's coming for it in a positive way. I mean, I totally agree that it is coming for it. I also think it's very early days and a great reason to not learn this technology today is that it's changing so fast. Yeah. And that you can spend a very long time figuring out how to make it work and then that can all all of that accumulated skill can be sort of made useless tomorrow. Right. By some new product. If you remember stable diffusion first dropped probably two years ago now.
Starting point is 00:44:05 And we were enamored with prompt engineering. And what was that artist's name that you'd always, if you added it to your stable diffusion prompt, it automatically get awesome. And then he got mad because everyone's using him to like make better pictures. Like that whole technology, you know, that magical incantation is just completely moot
Starting point is 00:44:23 at this point. Like this probably, it's easier now to get better pictures without being such a wizard. And whatever name you're invoking in the past is just that name doesn't do what it did on the last version of stable diffusion, just as one instance, like prompt engineering has changed dramatically. And anybody who is ignoring it all and just listening to us talk about on the change log and just like staying with Their regular life like they've saved themselves a lot of time
Starting point is 00:44:48 Then those of us who dove in and decided they were gonna memorize all the magic words Yeah, absolutely like a year ago a common technique with open models that existed was to offer them money to solve problems You start every prompt by saying I'll give you $200 if you do this. And it greatly improved outcomes. Yeah, or let's take this step by step. Like that phrase was one of those magical things that made it better. All of those techniques are gone now. If you try bribing a model, it doesn't help.
Starting point is 00:45:17 There was a great example I saw of that where someone would kept saying, I'll give you $200 if you do this. And they did it in a single prompt several times. and they got to the nth case and it said but you haven't paid me for the previous ones. Yeah We had a deal No No money means no, there you go. All right, they're very funny Well, so I spent a long time believing and I, I still believe this in the longterm, that chat is currently our primary user interface
Starting point is 00:45:49 on the models, and it's not the best interface for most things. The way to get the most value out of models today when you program is to have conversations with models about what you're writing. And that's, I think, it's quite the mode shift to do that. It's quite taxing to do that. And it feels like a user interface problem that
Starting point is 00:46:12 hasn't been solved yet. And so I've been working a lot with Josh Bleacher-Snyder on these things. And we spent a long time looking for how can we avoid the chat paradigm and make use of models. That's why code completion initially was so interesting because it's an example of using models without chat and it's very effective. We spent a long time exploring this to give you another example of something we built
Starting point is 00:46:38 in this space because we've just been trying to build things to see what's actually useful. We built something called Merd, merd.ai, which I think we put up a few weeks ago. And it does merge commits for you. So if you try and push a git commit or do a rebase, and you get a merge conflict, you can actually use LLMs to generate sophisticated merge commits for you. It turns out that's a much harder problem than it looks.
Starting point is 00:47:08 Like you would think you just paste in all of the files to the prompt and you ask it to generate the correct files for you. Even the frontier models are all really bad at this. You almost never get a good merge commit out of them. But with a whole stack of really mundane engineering out the front, mundane is not the right word because a lot of it's actually really very sophisticated, but it's not, it doesn't involve the LLM itself.
Starting point is 00:47:34 It's about carefully constructing traditional is a much better word. Yeah. It's a, you can actually get very good merge commits out of it. And that user experience seems much better for programmers to me that you could imagine that being integrated into your workflows to the point where you send a PR, there's a merge conflict, it proposes a fix right on the PR for you. And in fact, we attempted a version of that
Starting point is 00:47:58 where there's a little Gitbot that you can at mention on a PR and it sort of generates another PR based on it that fixes the merge conflict for you. And that sort of experience doesn't require the chat interface to be exposed to the programmer to make use of the intelligence in the model. And that is where I dream of developer tools getting so that everyone can use them without having to learn a lot about them. You shouldn't have to learn all the tricks for convincing a model to write a merge commit for you.
Starting point is 00:48:31 It should be a button, or not even a button. It should just do it when GitHub says there's a merge conflict. And so it's actually, it works pretty well. We've seen it generate some very sophisticated merge commits for us. I'd love to see more people give it a try and let us know what the state of that is. But so just because that is such a hard state to get to, we built Sketch, which exposes the traditional chat interface in the process of writing code. Because we're just not,
Starting point is 00:49:04 we don't think the models are at a point yet where we can completely get away from chat being part of the developer's workflow. So at what level of granularity is Sketch working at? And do you imagine it moving up eventually, wherever it is, because the panacea, right, the silver bullet is what some folks are trying to do with Devon, for instance,
Starting point is 00:49:30 where it's like, you describe at a very high level a system, and it goes and builds that system. V0 from Vercell is another one that's doing these things. And they're very much at, in my opinion, the prototype slash demo level of quality, not the production level of quality, in their output. And it seems like they're very difficult. In my limited experience with these things,
Starting point is 00:49:57 they're very difficult to actually mold or, what do you do? I'm losing a word here. A sculpt, I don't know, like a sculpture. To actually like sculpt what they come out with and change it into something that you actually would write or like. But those are like the very high level of like,
Starting point is 00:50:14 well it should have a contact form that submits to this thing. But maybe you're looking down more where the, where I use them currently which is like, yo write me a function that does this particular thing. And at that level, it seems a lot easier to even chat to if I have to. I would rather not chat to it,
Starting point is 00:50:33 but spit out code that I could copy, paste, and modify, versus being like, I'm gonna have to throw this away and rewrite it. Right, yes, I think that lines up really well with the way Josh and I think about these things, where today if you open up a model, a cloud provider's frontier model or a local deep seek or even a Llama 70B, you can ask it to write a Python script that does something. It could be a Python script to go to the GitHub API and grab some data and present it neatly for you.
Starting point is 00:51:06 And it will do a great job. These great models can basically do this in a single shot where you write a sentence and the outcomes of Python script that solves a problem. And like that's an astonishing technical achievement. I really, it's amazing how quickly I've got used to that as a thing, but. Yeah, you're not even impressing me right now. I know. Yes, it can do that. Exactly. It is amazing how quickly I've got used to that as a thing. Yeah, you're not even impressing me right now.
Starting point is 00:51:25 I know. Yes, it can do that. We all know. Exactly. But it is amazing. Exactly. Like five years ago, if you told me that, I would struggle to believe it. And yet now I just take it for granted.
Starting point is 00:51:38 Yes. And so that works. We've got that. We've got a thing that can write really basic Python scripts for us. Similarly, these systems, at least the frontier models, are good at writing a small React component for you. You can give almost any of them like a... You need more than a single sentence.
Starting point is 00:51:55 You need just a few sentences to structure the React component, but out comes some HTML and some JavaScript in the React syntax, the JSX syntax, or the TSX syntax. And it's pretty close. It might need some tweaking. You might have some back and forths to get there, but you can get about that out of it. And clearly, models are going to improve. There's no evidence to suggest we're at the limit here as the models keep improving every month at this rate.
Starting point is 00:52:24 And part of what we're interested in Sketch is getting beyond helping you write a function, which I also use today, right? I get for Frontier Models to write functions for me, to sort of, how can we sort of climb the complexity ladder there? And so the point we chose is a point that, you know, is comfortable for us and what is helpful for us
Starting point is 00:52:44 is the Go package. How can we get a model to help us build a Go package to solve a problem? And there's an implicit assumption here in that the shape of Go packages looks slightly different at the end of this. Packages are a little bit smaller and you have a few more of them than you would in a sort of traditional go program you wrote by hand. But I don't think that is necessarily a bad thing. Honestly, my own programming, as a go program, I tend to write larger packages because there's a lot of extra work involved in me breaking it into smaller packages.
Starting point is 00:53:20 And there's often this thought process going on in my mind of like, oh, in the future, this would be more maintainable as more packages. But it's more work for me to get there today. So I'll combine it all now and maybe refactor it another day. And switching to trying to have LLMs write significant chunks of packages for you makes you do that work up front. That's not necessarily a bad thing. It's perhaps more the way we'd like our code to end up.
Starting point is 00:53:47 And so Sketch is about taking an LLM and plugging a lot of the tooling for Go into the process of using the LLM, to help it. So an example is, I asked it the other day to write some middleware to brotli-compress HTTP responses under certain circumstances, because Chrome can handle brotli encoding and it's very efficient. It's not in the standard library, at least it wasn't the last time I looked. And the first thing it did was it included a third party package that Andy had written
Starting point is 00:54:22 that has a brotli encoder in it. And so Sketch go-gets that in the background in a little container as you're working, and has a little go.mod there that it modifies, so that as you're editing the code, you get all the code completions from that module, just like you would in a programming environment. And more importantly, we can take that information and feed it into the model as it's working. If we run the Go build system as part of it, and if a build error
Starting point is 00:54:51 appears, we can take the build error, feed it into the model. It's like, here's the error. And we can let it ask questions about the third-party package it included, which helps with some of the classic problems you see when you ask Claude to write you some Go code, where it includes a package and then makes up a method in there that doesn't exist, that you really wish existed because it would solve your problem. And so this sort of automated tool feedback is doing a lot of the work I have to do manually
Starting point is 00:55:17 when I use a frontier model. And so I'm trying to cut out some of those intermediate steps where I said, that doesn't exist, could you do it this way? Anything like that you can automate saves me time; it means I have to chat less. And so the goal is to slightly climb the complexity ladder in the piece of software
Starting point is 00:55:33 we get out of a frontier model and to chat less in the process. Are you achieving that by having a system prompt or are you actually fine tuning? Like how are you as the sketch.dev creators taking a foundation model and doing something to get here? Today it is almost entirely prompt driven. There's actually more than one model in use under the hood
Starting point is 00:55:58 as we try different things. For example, we use a different model for solving the problem of, if we want to go get a package, what module do we get to do that? Which sounds like a mechanical process, but it actually isn't. There's a couple of steps there. So a model helps out with that. There are very different sorts of prompts you use for trying to come up with the name of a sketch than there are for answering questions. But at the moment, it's entirely prompt driven in the sense that a large context window and a lot of careful context construction
Starting point is 00:56:33 can handle this, can improve things. And that can include a lot of tool use. Tool use is a very fun feature of models where you can instruct. So to back up and give you a sense of how the models work, an LLM generates the next token based on all the tokens that come before it. When you're in chat mode and you're chatting with a model, you can at any point stop and have the model generate the next token.
Starting point is 00:57:02 It could be part of the thing you're asking it or its response. That meta information about who is talking is sort of built into just a stream of tokens. So similarly, you can define a tool that a model can call. You can say, here's a function that you can call and it will have a result. And the model can output the specialized token that says call this function, give it a name, write some parameters. And then instead of the model generating the next token,
Starting point is 00:57:35 you pause the stream, you the caller go and run some code. You go and run that function call that it defined, paste the result of that function call in as the next set of tokens and then ask the model to generate the token after it. So that technique is a great way to have automated feedback into the model. So a classic example is a weather function. And so you define a function which says current weather.
Starting point is 00:58:01 The model, then you can ask the model, hey, what's the weather? And the model can say, call function: current weather. Your software that's printing out the tokens pauses, calls current weather, it says sunny, you paste sunny in there. And then the model generates the next set of tokens, which is the chat response saying, oh, it's currently sunny. And that's the sort of easy way to plug external systems into a model. This is going on under the hood of the user interfaces you use on top of frontier models. So this is happening in ChatGPT and Claude, all these systems.
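A minimal sketch of that pause-run-resume loop, with the model stubbed out. The Turn type and generate function here are invented stand-ins, since every provider's tool-call API differs, but the control flow is the part being described:

```go
// Sketch of the tool-call loop described above. The model client is a
// hypothetical stand-in; real provider APIs differ, but the shape is the same:
// generate until the model requests a tool, run it, append the result, resume.
package main

import "fmt"

// Turn is one entry in the conversation the model sees.
type Turn struct {
	Role     string // "user", "assistant", or "tool"
	Content  string
	ToolCall string // non-empty when the model asks us to run a tool
}

// generate is a stub standing in for a real LLM call. On first call it
// requests the current_weather tool; once a tool result is present, it answers.
func generate(history []Turn) Turn {
	for _, t := range history {
		if t.Role == "tool" {
			return Turn{Role: "assistant", Content: "It's currently " + t.Content + "."}
		}
	}
	return Turn{Role: "assistant", ToolCall: "current_weather"}
}

func currentWeather() string { return "sunny" } // the real version would hit a weather API

func main() {
	history := []Turn{{Role: "user", Content: "Hey, what's the weather?"}}
	for {
		turn := generate(history)
		history = append(history, turn)
		if turn.ToolCall == "" {
			fmt.Println(turn.Content) // "It's currently sunny."
			return
		}
		// The model paused to call a tool: run it ourselves and paste the
		// result back into the stream as the next set of tokens.
		history = append(history, Turn{Role: "tool", Content: currentWeather()})
	}
}
```

Real APIs wrap this in structured messages, but the pause, run, resume shape is the same.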
Starting point is 00:58:43 Sometimes they show it to you happening, which is how you know. You see it less now, but about six months ago, you could see in the GPT-4 model, you would ask it questions and it would generate Python programs and run them, and then use the output of the Python program in its answer. I had a really fun one where I asked it how many transistors fit on the head of a pin. And it started producing an answer, and it said, like, well, transistors are about this big, pins are about this big. And then this little magic emoji appeared that said this many transistors fit on the head of a pin, some very large number.
Starting point is 00:59:19 And if you click on the emoji, it shows you the Python program it generated to do the arithmetic. It executed that as a function call, came back with the result. And that saved it the trouble of trying to do the arithmetic itself, which all of these LLMs notoriously struggle with. Doing arithmetic is a great thing to outsource to a program.
Starting point is 00:59:38 And so- It's a funny workaround, because, you know, if you're a calculator for words, you're not necessarily a calculator for numbers. Yeah, they're much better. And if you can't do those reliably, then you could just write a program that does it and returns the same thing every time.
Starting point is 00:59:51 Yes, they're very good at writing programs to do the arithmetic, very bad at doing the arithmetic. So it's a great compromise. The thing we do with Sketch is try to give the underlying model access to information about the environment it's writing code in using function calls. So a lot of our work is not fine-tuning the model.
Starting point is 01:00:11 It's about letting it ask questions about not just the standard library, but the other libraries it's trying to use, so that it can get better answers. It can look up the Go doc for a method if it thinks it wants to call it, and use that as part of its decision-making process about the code it generates. Can you describe "let it ask"?
Starting point is 01:00:30 I mean, you've said that a couple of times and I've been curious about this. When you say let it ask, what does that mean? Like, decompress that compressed definition. So at the beginning, in your system prompt or something like your system prompt, it depends on the API and exactly how the model works, you say there is a function call which is get method docs
Starting point is 01:00:51 and it has a parameter which is the name of the method. And then you can construct a question to an LLM that says, generate a program that does this, with the system prompt which explains that there's a tool call there. And so as your LLM is generating that program, it can pause and make a function call, a tool call, that says,
Starting point is 01:01:16 get me the docs for this. And so the LLM decides that it wants to know something about that method call. And then you go and run a program, which gets the result, gets the documentation for that method from the actual source of truth. You paste it into the prompt.
Starting point is 01:01:34 And then the LLM continues writing the program, using that documentation as now part of its prompt. And so this is the model driving the questions about what it wants to know about. And just blocks and waits for that to come back. Yes. Effectively. Yeah.
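On the host side, a tool like that can be a thin wrapper over existing Go tooling. A minimal sketch, assuming a hypothetical get_method_docs tool that shells out to go doc; shelling out to the toolchain is what makes the answer come from the actual source of truth rather than the model's memory:

```go
// Host-side handler for a hypothetical get_method_docs tool. When the model
// emits a tool call naming a symbol, we shell out to `go doc`, which reads
// the real source, and paste its output back into the prompt.
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// getMethodDocs returns the documentation for a symbol such as
// "net/http.HandlerFunc" straight from the toolchain.
func getMethodDocs(symbol string) (string, error) {
	out, err := exec.Command("go", "doc", symbol).CombinedOutput()
	if err != nil {
		// "Symbol not found" and build failures come back this way, and that
		// error text is itself useful to feed to the model instead of letting
		// it guess at an API that doesn't exist.
		return "", fmt.Errorf("go doc %s: %v\n%s", symbol, err, out)
	}
	return string(out), nil
}

func main() {
	docs, err := getMethodDocs("net/http.HandlerFunc")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(docs) // this text becomes the next tokens in the model's context
}
```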
Starting point is 01:01:55 Yeah, so it's like an embed. If you step back to, like, running llama.cpp yourself or something like this, you can sort of oversimplify one of these models as: every time you want to generate a token, you hand the entire history of the conversation you've had, or whatever the text is before it, to the GPU to build the state of the model.
Starting point is 01:02:21 And then it generates the next token. It actually generates a probability value for every token in its token set. And then the CPU picks the next token, attaches it to the full set of tokens, and then does that whole process again of sending over the entire conversation and then generating the next token. And so if you think about that very long, big giant for loop around the outside, every time there's a new token, the token is chosen from the set of probabilities
Starting point is 01:02:54 that comes back, is added to the set, and then a new set of probabilities is generated for the next token. You can imagine, in the middle of that for loop, having some very traditional code in there that inserts a stack of tokens that wasn't actually decided by the LLM, but then becomes part of the history that the LLM is generating the next token from.
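As a sketch, that outer loop looks something like the following; the forward pass and sampler are toy stubs standing in for a real model, and the interesting part is the splice point where ordinary code appends tokens the model never chose:

```go
// The "giant for loop" around token generation, with the splice point where
// traditional code can insert tokens the model didn't decide on. The forward
// pass and sampler here are toy stubs standing in for a real LLM.
package main

import "fmt"

// forward stands in for handing the whole history to the GPU and getting a
// probability for every token in the vocabulary back.
func forward(history []string) map[string]float64 {
	if len(history) > 4 {
		return map[string]float64{"<end>": 1.0}
	}
	return map[string]float64{"token": 0.9, "<end>": 0.1}
}

// sample picks the next token from the distribution (greedy, for simplicity).
func sample(probs map[string]float64) string {
	best, bestP := "", -1.0
	for tok, p := range probs {
		if p > bestP {
			best, bestP = tok, p
		}
	}
	return best
}

func main() {
	history := []string{"user:", "hello"}
	for {
		next := sample(forward(history)) // whole history in, one token out
		if next == "<end>" {
			break
		}
		history = append(history, next)

		// Splice point: ordinary code can inspect the history here and append
		// tokens the model never chose (say, a tool result), which then become
		// part of what the model conditions on for the next token.
		if len(history) == 4 {
			history = append(history, "[tool result: sunny]")
		}
	}
	fmt.Println(history)
}
```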
Starting point is 01:03:14 And so that's how those embeds work. You can effectively have the LLM communicate with the outside world in the middle there by it driving that, or you don't even have to have it drive it. You could have software outside the LLM that looks at the token set as it's appeared and then inserts more tokens for it. So this is all the fun stuff you can do by running your models yourself. Yeah, I know. That's so fun. Well, friends, I'm here with Samar Abbas, co-founder and CEO of Temporal.
Starting point is 01:03:47 Temporal is the platform developers use to build invincible applications. But what exactly is Temporal? Samar, how do you describe what Temporal does? I would say to explain Temporal is one of the hardest challenges of my life. It's a developer platform and it's a paradigm shift. I've been doing this technology for almost like 15 years. The way I typically describe it, imagine like all of us when we were writing documents in the 90s,
Starting point is 01:04:13 I used to use Microsoft Word. I love the entire experience and everything, but still the thing that I hated the most is how many documents or how many edits I have lost because I forgot to save or like something bad happened and I lost my document. You get in the habit when you are writing up a document back in the 90s to do control S. Literally every sentence you write. But in the 2000s, Google Doc doesn't even have a save button.
Starting point is 01:04:37 So I believe software developers are still living in the 90s era, where the majority of the code they are writing has some state which needs to live beyond multiple request-responses. The majority of the development is: load that state, apply an event, and then take some actions and store it back. 80% of software development is this constant load and save. So that's exactly what Temporal does. It gives you a platform where you write a function, and if during the execution of the function a failure happens,
Starting point is 01:05:07 we will resurrect that function on a different host and continue executing where you left off without you as a developer writing a single line of code for it. Okay, if you're ready to leave the nineties and build like it's 2025 and you're ready to learn why companies like Netflix, DoorDash and Stripe trust Temporal as their secure,
Starting point is 01:05:25 scalable way to build invincible applications. Go to temporal.io, once again temporal.io. You can try their cloud for free or get started with open source. It all starts at temporal.io. Is Go particularly well suited for this kind of tooling because of the nature of the language or is it just your favorite or why Go? Yeah, that's a really good question.
Starting point is 01:05:51 The best programming language for LLMs today is Python and I believe that is a historical artifact of the fact that all of the researchers working on generative models work in Python. And so they spend the most time testing it with Python and judging a model's results by Python output. There was a great example of this in one of the open benchmarks I looked at, and I believe this has all been corrected since then. This is all about a year old. There was a multi-language benchmark that tested how good a model is across multiple languages. I opened up the source set for it and looked at some of the Go code, because I'm a Go programmer, and
Starting point is 01:06:42 it had been machine translated from Python, so that all of the variable names in this Go code used underscores instead of camel case. And the models were getting a certain percentage success rate generating these results. So Josh went through, actually, and made these more idiomatic, in the Go style of using camel case and putting everything
Starting point is 01:07:08 in the right place. And the model gave much better results on this benchmark. And so that's an example of where languages beyond the basic ones that the developers of the models care about are not being paid as much attention as you would like. And things are getting a lot better there. The models are much more sophisticated. The teams building them are much larger. They care about a larger set of languages.
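To picture what that translation damage looks like, compare a mechanically translated function with an idiomatic rewrite. The names below are invented for illustration, not taken from the actual benchmark:

```go
// Invented illustration of the benchmark problem described above: the same
// function as a mechanical Python translation, then as idiomatic Go.
package bench

type Issue struct{ Open bool }

// Machine-translated style: Python snake_case names survive the translation,
// which is not what the models mostly saw in real Go training data.
func count_open_issues(issue_list []Issue) int {
	open_count := 0
	for _, current_issue := range issue_list {
		if current_issue.Open {
			open_count++
		}
	}
	return open_count
}

// Idiomatic style: camelCase and conventional short names, the form the
// corrected benchmark used and the form the models scored better on.
func countOpenIssues(issues []Issue) int {
	n := 0
	for _, issue := range issues {
		if issue.Open {
			n++
		}
	}
	return n
}
```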
Starting point is 01:07:34 And so I don't think it's all as Python centric as it used to be. But that is still very much the first and most important of the languages. As for how well Go works, it seems to work pretty well. Models are good at it by our benchmarks. Like we said, if we took the benchmarks and made them more Go-like, the models actually got better results. They have a real tendency to understand the language. We think it's a pretty good fit. There are definitely times when models struggle, but it's a garbage collected language,
Starting point is 01:08:06 which helps, because just as garbage collection reduces the cognitive load for programmers as they're writing programs, it reduces the load on the LLM in just the same way. They don't have to track the state of memory and when to free it. So they have a bit more thinking time to worry about solving your problem. So in that way, it's a good language. It's not too syntax heavy, but it also doesn't have ambiguities that humans struggle with.
Starting point is 01:08:35 Yeah, it seems to work well. Pretty small. Yeah. There aren't a lot of... I haven't seen much research into what is the best language for an LLM. It does seem like an eminently testable thing. Like, there's some interesting... in fact, it may end up influencing programming language
Starting point is 01:08:52 design, in the sense of: imagine you are building a new programming language, and you develop a training set that's automatically generated by translating some existing programs into your language, and you train models for it. You could imagine tweaking the syntax of your new language, regenerating the training set, and then seeing if your benchmarks improve or not. So you can imagine, yeah, you can imagine driving readability of programming languages based on your ability to train an LLM to write the language.
Starting point is 01:09:27 So there's lots of really fun things that will happen long term; I don't think anyone has started on work like that yet. Right, so the level that you all are working at with Sketch, with Go in particular, the prompting you're doing and the contexting and everything else that you're building, is it at a layer of abstraction where you could replace Go relatively easily with insert-general-programming-language? Or is it like, well, that would be a new product
Starting point is 01:09:51 that we would build? Like, how hard is that? Yeah, it's a good question. All of the techniques we're applying are general, but each technique requires a lot of Go specific implementation. So it's in much the same way that a lot of the techniques inside a language server for a programming language, these are the systems inside VS Code for generating information
Starting point is 01:10:15 about programming languages. The techniques are general: what methods are available on those objects is very similar in Go as it would be in Java, for example. But the specifics of implementing them for both languages are radically different. And I think it's a lot like that for Sketch. The tricks we're using for Sketch are very Go specific. And if we wanted to build one for Ruby, we would have to build something very, very different. Okay.
Starting point is 01:10:41 So yes, I consider it very much a Go product right now, and I really like the focus that that gives us. Because Go is a big enough problem on its own, let alone all of programming. Yeah, yeah, yeah. I'm just asking that because I wonder how valuable and important tooling like this would be for each language community to either provide or fund or hope that somebody builds, because if the LLM related
Starting point is 01:11:09 tooling for Go, because of Sketch, just hypothetically, becomes orders of magnitude more useful than just talking to ChatGPT about my Elixir code, for instance, well, that's a real advantage for Go and the Go community. I mean, it's great for productivity for gophers. And going back to maybe the original question about, you know, should Tailscale have its own little chat bot built into it? Like, does each community need to take up this mantle and say, we need better tooling, or is it like VS Code should just do it for everybody?
Starting point is 01:11:44 or is it like VS Code should just do it for everybody? I mean that's a really good question. Good job, Drew. Yeah, so to, you know, I very much admire VS Code. I use it, which I don't actually have to admire a program to use it. That's better than Ed Meiering is using, I think. Yeah, that's right, but I actually, I do both. Like I both admire it and use it.
Starting point is 01:12:04 Okay, fair. But to look at the inside of VS Code, which I've been doing a bunch of recently, VS Code didn't actually solve language servers for all programming languages. They built JavaScript and TypeScript, JSON, and I think they maintained the C-Sharp plugin. They started the Go plugin, I think, and then it got taken over by the Go team at Google, who now maintain the Go support in VS Code. I don't think the Microsoft team built the Ruby support in VS Code.
Starting point is 01:12:40 I don't know who did the Python implementation. But a lot of the machinery in VS Code is actually community maintained for these various programming languages. And so I'm not sure there is another option than imagining a world where each of these communities supports the tooling in some form. I don't know if each programming language needs to go out and build their own Sketch. Maybe there is some generalizable intermediate layer, some equivalent of a language server, that can be written to feed underlying models.
Starting point is 01:13:15 Given our... We're just starting to explore this space. Sketch is very new. We basically started it some time near the end of November, so there's not much to it yet. Yeah, but so far what we've found is it's far more than the sort of language server environment that you get with VS Code.
Starting point is 01:13:36 More machinery is needed to really give the LLM all the tooling it needs. The language server is very useful. We actually use the Go language server in Sketch. Gopls is a big part of our infrastructure. It's really wonderful software. But there's far more to it than that. To the point where we need to maintain an entire Linux VM
Starting point is 01:13:56 to support the tooling behind feeding the model. So what each community needs to provide, I think that's the research in progress, is figuring that out. Yeah. It's an interesting question and one that I think will be open for a while. I do not wanna see a world where Python continues
Starting point is 01:14:17 to proliferate merely because of its previous position. I do see, with tooling like Devin and Bolt and v0, these are very front-end JavaScript-y companies that are producing these things, which is fine. But it's like, if you are just going to go use that, it's going to produce for you a React and Next.js front-end with a Prisma based back-end. It's all very much like, these are the tools it uses. And that's all well and good, but that's gonna proliferate more and more of that one thing.
Starting point is 01:14:52 Whereas I'd love to see a diversity where it's like, yeah, is there a specific thing for Rails people? Is there one for people who like Zig, moving outside of the world of web development? But you know what I'm saying, and I think your answer might be right, which is like, well, every community's gonna have to provide some sort of greasing of the skids
Starting point is 01:15:15 for whatever editor is popular or used in order to make their tooling work well inside of these LLM-based helpers beyond just being like ChatGPT knows about you, which is kind of like what people are at right now, is like, does ChatGPT know about me? It's the new, am I Googleable? It's the new SEO at this point.
Starting point is 01:15:40 I've heard people talk about that. A startup founder, who you wouldn't know, mentioned that they were busy retooling their product so that the foundation models under things like v0 and Bolt would be more likely to npm install their package to solve a problem. That's super smart to do that right now. I agree.
Starting point is 01:15:58 Did they divulge any of the how? Like, what are the mechanical steps to do that? I was actually really happy that they said that their plan was to make it really easy to npm install their package and not require a separate signup flow to actually get started. Oh, that's nice. Yeah, I thought it was wonderful.
Starting point is 01:16:18 Like, their solution to make their product more ChatGPT-able, I guess you might say, is just to make their product better. Which, you know, if that's. How avant-garde of them. Yeah. Yeah. I'm sure one day we'll end up in the search engine optimization world
Starting point is 01:16:35 of frontier models, but today. It's definitely gonna be some black magic for sale. You know, here's how you really do it. Yeah. I don't see why a frontier model couldn't run an ad auction for deciding what fine tuning set to bring in. I have to, again, talk about experiences.
Starting point is 01:16:56 I was using one of the voice models and talking to it as I was walking down the street. And I asked it some question about WD-40 because I had a squeaky door. And I think I described in my question WD-40 as a lubricant. And it turns out I just didn't understand that it's not a lubricant, it's a solvent. And the purpose of it is to remove grease. It took me years to realize that. I think someone finally told me because I've been using it as a lube all these years.
Starting point is 01:17:24 Yeah. Oh my gosh. Why do you got to keep reapplying it? You know, it's not very good lube. Well, I just had your experience, but it was an LLM that told me. Oh, hilarious. And it mentioned in passing, it's like, yeah, you could also, you know, you could use WD-40 and then use a lubricant like, and then it listed some brand name.
Starting point is 01:17:41 The moment I heard the brand name, I was like, oh, I see, a frontier model could run an ad auction on fine tuning which brand name to inject there. That would be a really- 100%. Yeah, it wouldn't require baking it into the pre-training months ahead of time. You could do that sort of on an hour by hour basis.
Starting point is 01:18:00 So that world is coming, and then once there's a world of ads, there's a world of SEO and all the rest of it. Well the more paramount they become, and Adam you can probably speak to this because you're injecting it into every aspect of your life. Like if the answer includes a product right there, like you're just gonna be like, all right I gotta get that.
Starting point is 01:18:17 Sometimes you don't even realize Kleenex is a product. You think that that's a category, but no, that's a product. Yeah, absolutely. Hard to tell, honestly. Kleenex is an easy one for me because we don't have Kleenex in Australia, where I'm from. So I came here and started calling tissues Kleenex, and it was a bit of a surprise to me. It's like Coke or Coca-Cola, something like that, you know. Yeah, right? Exactly. Yeah, you know, I don't know if
Starting point is 01:18:43 I've gotten some hallucinations, let's just say, on products, and even limited information on what's the true good option when it comes to product search. I haven't done a ton of it, mainly on, like, the motherboard search. I want to do something that has the option for either an AMD Ryzen or a Threadripper with, you know, more of a workstation enterprise-class CPU. And I want to maximize some PCIe lanes. So I'm just trying to figure out what's out there.
Starting point is 01:19:16 I'd prefer the chat interface to find things versus the Google interface, which is search by nature to find things. But thus far, it hasn't been super fruitful. I think eventually it'd be cool, but it's not there yet. I imagine in the YouTube video of this, a little Intel Xeon banner will appear just as you say. That's right. Yeah. On the ad. So yeah.
Starting point is 01:19:36 Yeah, exactly. So I'm a fan of Intel Xeons too. I got the Intel Xeon 4210. Well, now it's really popping up. There you go. Bing, bing, bing. It's like in Silicon Valley when Dinesh was talking out loud and they had the, I think they had AI in the VR and it was, yeah it was doing some cool stuff. It was pulling up ads real time.
Starting point is 01:19:54 It was cool, it was cool. But yeah, Intel's cool, AMD's cool, but PCIe lanes are even cooler, you know? Give me the max, x16, you know? David, maybe we close with this. For those who aren't gophers out there, of course, brand new, hot off the press, still in development, three months old: sketch.dev. Check it out if you're into Go and those kind of things.
Starting point is 01:20:17 and those kind of things. But let's imagine you're just a Ruby programmer out there and you came across your blog post about what you've been doing, these three methods of working with AIs. You have autocomplete, you've got chat, you've got search. Where should folks get started if they haven't yet? First of all, is today the day?
Starting point is 01:20:37 Like is it worth it now? Or should I wait? And then secondly, if I am gonna dive in and I just wanna use it in my local environment to like just code better today, what would you suggest? Yeah, good question, especially for non-Gophers. I would suggest trying out the code completion engines because they take a little bit of getting used to,
Starting point is 01:20:58 but not a lot. And depending, if you're writing the sorts of programs they're good at, they're extremely helpful. They save a lot of typing. And it turns out, I was surprised to learn this, but what I learned from code completion engines is a lot of my programming is fundamentally typing limited. There's only so much my hands can do every day. And they're extremely helpful there. The state of code completion engines is they're pretty good at all languages, with the caveat that they're probably not very good at COBOL or Fortran, but all the sort of
Starting point is 01:21:34 general languages, especially like Ruby, I'd expect them to be decent at. I suspect the world of code completion engines will get better at specific languages as people go deeper on the technology. It's a thing I continue to work on, and so I feel confident that it can be improved. The other place that I think most programmers could get value today, if they're not a Go programmer, is writing small isolated pieces of code in a chat interface. So you could try out a ChatGPT or a Claude,
Starting point is 01:22:06 or if you really want to have some fun, run a local model and ask it to solve problems. Like, try llama.cpp, try Ollama, try these various local products, grab one of the really fun models. It's especially easy to try on a Mac with their unified memory. If you're on a PC, you might have to find a model that fits in your GPU.
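If you go the local route, the mechanics are small. A sketch assuming Ollama is running on its default port with a model already pulled; the model name is a placeholder, and the endpoint is Ollama's generate API:

```go
// Minimal local-model call, assuming Ollama is running on its default port
// and a model has been pulled (the model name below is a placeholder).
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	reqBody, _ := json.Marshal(map[string]any{
		"model":  "llama3.1",
		"prompt": "Write me a Ruby function that takes a list of prices and returns the total with 10% tax.",
		"stream": false, // one JSON blob back instead of a token stream
	})
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(reqBody))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var out struct {
		Response string `json:"response"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	fmt.Println(out.Response)
}
```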
Starting point is 01:22:29 But it's a ton of fun, and use it to, say, write me a Ruby function that takes these parameters and produces this result. And I suspect the model will give you a pretty good result. So those are the places I would start because those require the least amount of learning how to hold the model correctly and you'll get the most benefit quickly. Love it. Good answer. I'm just wondering out loud and feel free not to know but when it comes to prompting I know we're past the age of magical incantations but as you guys have been building
Starting point is 01:23:06 out a product, which is basically sophisticated prompting, are there guides that are useful or are there like, I remember finding a site, I can't remember right now, there's like, people are just sharing their system prompts for certain things they do, like, maybe there's like a Ruby prompting guide, which makes it a little bit easier to get quality results out faster. Does either of you guys know?
Starting point is 01:23:28 I've seen people write guides like that. I would say the guides I've read are now out of date. Like we were saying earlier, guides go out of date. The thing I find most useful is to think of the model I'm talking to as someone who just joined the company. Sometimes I think of them as an intern, though every now and again the models produce much better code than I can. But interns have done that too. That happens. And then as you're writing the question for it, imagine it as like, here's a, you know, I'm talking to a smart person who knows nothing about what
Starting point is 01:24:11 I'm doing and they need some background. And that gets me really far with the current frontier models. And so that would be my general piece of advice that I think applies to any programming. I agree with that, too. The random just-give-me-X approach, you'll get a result, but you'll have to massage it further, or it'll ask you more. I will often give it context. Like you said, this intern, the smart person that's new, they don't have the background awareness that you want somebody to have.
Starting point is 01:24:38 And I'll often, like, give it a lot of that, have a particular request, but then also say: is there anything else I can give you, or any other information you need, to give me a successful result? Just some version of, like, be successful with our goal. And it's strange even talking like that, too, as I even say it out loud, like, our goal, as if it's, you know, human.
Starting point is 01:25:01 as I even say it out loud, like our goal, as if it's, you know, human. And Jared, you know where we stand on this, but Jared and I have some history, so I've been very kind, please and thank you. He's very nice, he talks to me like we're on the same team and stuff. If it gives me a great result, I say fantastic, you know,
Starting point is 01:25:16 High fives. Why would I, we have high fives, why would it be any different? You offer it money. No, I haven't done that yet. I gotta try that. I gotta try that, honestly. But I will ask it, like, could you be more successful?
Starting point is 01:25:30 Is there anything else I can give you? Any more information I can give you to get us to our goal, you know? And I've found that that's it, like, it's context. It's a full circumference of the problem set, as much as you can, that makes sense. And then you will have a more fruitful interaction. I will also say that I'm more exclusively using ChatGPT.
Starting point is 01:25:53 And so the o1 model, while it's expensive, let's just say I don't have the expensive plan. I still don't feel like I can be the person that spends 200 bucks a month on this thing. I would much rather buy a GPU than spend 200 bucks a month. Somehow that math makes more sense to me. But then, like, o1's been pretty successful with thinking and iterating and being more precise, whereas 4o was a bit more brute. But that's just my personal take.
Starting point is 01:26:18 I think you might be onto something with being nice to models. I caught myself being pretty curt with models a few months back, and discussing this a lot with Josh, the conclusion we came to was, one of the challenges of not being nice to models
Starting point is 01:26:41 is it sort of trains you to not be nice to people. Yeah. Because you're using all of the same tools. And so it might just be good for you not being nice to models is it sort of trains you to not be nice to people. Yeah. You're using all of the same tools. And so it might just be good for you to be nice to models. I mean, I just don't, if it's humanistic even, you know, similar to a human, why not just be kind? You know, why not? I hate to break it to you, Adam, but this is not similar to a human.
Starting point is 01:27:02 It's the iteration is, it is, it certainly is. If I were collaborating, if that was a human over there giving me the answers back, it would be very human, volleyball iterative. If. If it was. Right. I get that it's not, but I'm also like, like David, why not?
Starting point is 01:27:22 I think I'm somewhere between your two positions because I do think it's just a machine and it's just a tool. I don't think it's human. Don't let me think that. Well, you kind of just said that you think it is. Did I? I mean, is that what, is I interpreted it? I just meant that if it,
Starting point is 01:27:39 maybe what I mean by that, just to be more clear, is kind of keying off of David's, which is like being kind. Just why not? I don't know. I'm like overly like, thank you so much, you're amazing. I think it's just, I'm a kind person. When it got you a result in the search engine,
Starting point is 01:27:57 like you just. Well this, it is not a prompt where there's an ebb and a flow or a back and a forth. You know, it's just simply return and answer. Yeah. Well, how are we doing? You know, ask you at the end, how are we doing? You know, I'm not, I, that being said,
Starting point is 01:28:12 I'm not like, thank you very much. I'm just, I'm just antagonizing him at this point. Yeah, I know you are. You're really digging into me, but I, I can catch myself saying like, that's awesome or great job or I agree with that. Things, isms like that's awesome or great job or yeah I agree with that things isms like that like you would say to another person if that makes you feel more like a nice human Then you should just keep on doing that but I don't think it's doing anything for the computer
Starting point is 01:28:34 I don't think it is either it's actually costing resources It doesn't help the computer Thank you to the models now so that I remember to say please and thank you to humans You know, I don't want to get into the habit of training yourself. Exactly. It's all about training. That's fair. Self training. Yeah, I don't feel like I have to. I feel like it's just a natural atomism. How I do things, how I operate yourself. It's who I am in my core. I'm a kind person.
Starting point is 01:28:54 Well, David, thanks so much for joining us, man. Thanks for sharing all your knowledge. You've learned a lot and I've learned a lot from you. So I'm excited to be here. I'm excited to be here. I'm excited to be here. I'm excited to be here. I'm excited to be here. I'm excited to be here. I'm excited to be here. I'm excited to be here. I. Don't be yourself. It's who I am in my core. I'm a kind person. Well, David, thanks so much for joining us, man.
Starting point is 01:29:06 Thanks for sharing all your knowledge. You've learned a lot and I've learned a lot from you. So we appreciate your time. This was a ton of fun. Thanks for having me. Okay, so hopefully hearing David's experience has helped you on your programming with LLM's journey. I'm sure you have thoughts on the matter.
Starting point is 01:29:25 Let us hear them in the comments. Yes, there is a Zulip topic dedicated to this episode and I'm sure there's lots of insightful things being posted by ChangeLog community members right after they listen. There's a link in your show notes so you can see what's up and join for $0 at changelog.com slash community.
Starting point is 01:29:43 Let's give one more shout out to our sponsors of this awesome conversation. Thank you to Fly.io, to Retool, retool.com slash changelog, to Temporal, find them at temporal.io, and to Augment Code. Head to augmentcode.com to get started. And thanks of course to our Beat Freak in residence, the one, the only, Breakmaster Cylinder.
Starting point is 01:30:05 Yeah, I like beats. Oh, and we have a little bonus for our favorite kind of listener. That's a ChangeLog++ listener, of course. Stay tuned, friends. We go even one layer deeper on what a potential tail scale AI might look like. If you aren't a Plus Plus member,
Starting point is 01:30:21 head to changelog.com slash plus plus today. Right now even, you're free right now,com slash plus plus today. Right now even. You're free right now, aren't you? Okay, that's it. This one's done, but we'll talk to you again on Change Login Friends on Friday. Bye y'all. So I'm out.
