The Changelog: Software Development, Open Source - ANTHOLOGY — Open source AI (Interview)

Episode Date: May 24, 2023

This week on The Changelog we're taking you to the hallway track of The Linux Foundation's Open Source Summit North America 2023 in Vancouver, Canada. Today's anthology episode features: Beyang Liu (Co-founder and CTO at Sourcegraph), Denny Lee (Developer Advocate at Databricks), and Stella Biderman (Executive Director and Head of Research at EleutherAI). Special thanks to our friends at GitHub for sponsoring us to attend this conference as part of Maintainer Month.

Transcript
Starting point is 00:00:00 Welcome back, friends. This week on The Changelog we're taking you to the hallway track of The Linux Foundation's Open Source Summit North America 2023 in Vancouver, Canada. This episode is part of our Maintainer Month celebration along with GitHub and many others. Check it out at maintainermonth.github.com. Today's anthology episode features Beyang Liu, co-founder and CTO at Sourcegraph, Denny Lee, developer advocate at Databricks, and Stella Biderman, executive director
Starting point is 00:00:38 and head of research at EleutherAI. The common denominator of these conversations is open source AI. Beyang Liu and his team at Sourcegraph are focused on enabling more developers to understand code, and their approach to a completely open source, model-agnostic coding assistant called Cody
Starting point is 00:00:56 has significant interest from us. Denny Lee and the team at Databricks recently released Dolly 2.0. This is the first open source instruction-following LLM that has been fine-tuned on a human-generated instruction dataset and is licensed for research and commercial use. And Stella Biderman gave the keynote address on generative AI and works at the base layer doing open source research,
Starting point is 00:01:23 model training, and AI ethics. She trained the EleutherAI Pythia model family that Databricks used to create Dolly 2.0. A massive thank you to our friends at DevCycle. Here's DevCycle CTO and co-founder Jonathan Norris. So Jonathan, my main question, I guess: if I'm handing off my feature flags to you all, is my uptime dependent on your uptime? Like, if you're down, am I down? We've designed for that in all the SDKs and all the APIs. APIs fail, right? That's a cardinal rule of the internet. So all the SDKs have been designed with kind of defaults and caching
Starting point is 00:02:18 mechanisms and all that stuff in place so that, yeah, if our CDN is down or APIs are down, it'll sort of fall back to those defaults or those cache values in those SDKs. So that handles for those blips pretty easily. And then we rely on Cloudflare as our sort of main high load edge provider. So all of our edge APIs are through Cloudflare and they're also operating as our CDN for assets. So obviously relying on a large provider like that, that runs such a large percentage of the internet, means that, yeah, you're not relying on our ability to keep AWS instances running properly. You're relying on sort of Cloudflare
Starting point is 00:02:53 and ability to sort of make sure the internet still works as they control such a large percentage of it. So yeah, we've architected it in a way that it doesn't sort of rely on our APIs to be up all the time and our databases to be up all the time to have that good reliability. Well, that's good news. Okay, so how do you accomplish that?
Starting point is 00:03:11 One of the core sort of architectural decisions we made with our platform when we designed it was trying to move the decisioning logic of your feature flags as close to the end user and end device as possible. So we did that with those local bucketing server SDKs that are using sort of a shared WebAssembly core. And then we have edge-based APIs that are also powered by WebAssembly to serve sort of those client SDK usages.
Starting point is 00:03:36 So things like web and mobile apps. So that's one of our core principles is to try to get that decisioning logic as close to the end device as possible. And this is probably one of the only use cases where performance really matters because you want your feature flags to load really, really quickly so you can render your website or you can render your mobile app really quickly. And so, yeah, we definitely understand that your feature flagging tool needs to be fast
Starting point is 00:03:58 and needs to be really, really performant. So if you want a fast feature flagging tool that's performant and is not going to impact your uptime, check out our friends at DevCycle. That's devcycle.com slash changelogpod. And for those curious, they have a free forever tier that you can try out and prove to yourself and your team that this is going to work for you. So check it out, devcycle.com slash changelogpod and tell them we sent you. So, Cody. Yeah, Cody. Yeah, Cody.
Starting point is 00:04:47 Is CODY. Is this a big deal? We think it is. Seems like it. Wasn't it Sourcegraph 4.0 last year was relaunched as the intelligence platform? Yep. Is that right? Because before, not just, but just code search, which was cool, but hard to really map out the ecosystem.
Starting point is 00:05:05 And you want all the space in there, but there was a limit to code search. And you had to expand and the insights and the intelligence. And now, obviously, code is just like one more layer on top of insights. Yeah, totally. So, as you know, Sourcegraph historically has been focused on the problem of code understanding. So, heavily inspired by tools like code search inside Google
Starting point is 00:05:24 or TBGS inside Facebook. Right. These kind of systems that indexed your company-wide code base as well as your open source dependencies and made that easy to search and navigate. And that's what's been powering the business for the past 10 years. This is actually the 10th year of building Sourcegraph. Congratulations. Thank you.
Starting point is 00:05:43 I was just wondering about that. Yeah. When we first met you, it had to be about building Sourcegraph. Congratulations. I was just wondering about that. Wow. Because when we first met you, it had to be about a decade ago. I think Sourcegraph just either didn't exist or just had existed. Sourcegraph existed when we met. This was like GopherCon. I think it was like 2014. The first or second GopherCon. GopherCon. Yeah. And you had
Starting point is 00:06:00 this vision of Sourcegraph. And I'm wondering 10 years later, have you achieved that vision? Has the vision of, you know, source graph. And I'm wondering 10 years later, like, have you achieved that vision? Has the vision changed, et cetera? You know, our mission was always to enable everyone to code. And we actually took a look at our seed deck recently. Was it quaint? It was very quaint.
Starting point is 00:06:23 We were very bad at PowerPoint. You're probably a lot better at it now. Not really. Better at the pitch, maybe. Maybe. You were fine at your pitch. Largely, I could deliver that pitch today off that deck. It's basically the same.
Starting point is 00:06:39 It's just the pitch of Sourcegraph, which is there's never been more code in the world. Most of your job as an engineer or software creator is understanding all the code that already exists in your organization. Yeah. Because that is all upstream
Starting point is 00:06:52 of figuring out what code you want to write. And then once we actually figure out what you need to build, like, that's almost the easy part. It's also the fun part, right? Because you're building new things and shipping stuff. But we help you get to that point of,
Starting point is 00:07:02 you know, creation and enjoyment by helping you pick up all that context. Right. Traditionally, that's been like search, right? Just like Google's been web search, but then these large language models have now come on the scene. Yeah.
Starting point is 00:07:15 In some ways, they're disruptive to kind of like search engines, but in other ways, they're highly complimentary. So anyone who's used ChatDBT- I'm still Googling. I just, less. It's just less right it's more like
Starting point is 00:07:26 the last thing you do when you can't get the answer elsewhere right I guess I'll go Google it yeah although technically
Starting point is 00:07:32 Google's a weird thing because I will search a product and they think I want to buy it not research it right it's like
Starting point is 00:07:39 I want to learn about the thing and those who are teaching about the thing and how it integrates other things not where can I buy it and for how much. Yeah.
Starting point is 00:07:46 So there's like zero context there. Like they're incentivized, it seems, to point you to places that you can purchase it, not learn how to use it. Yeah, yeah. I mean, I think there's an interesting discussion. Which is the opposite of ChatGPT. Yeah. So there's kind of like pluses and minuses to both, right?
Starting point is 00:08:03 Like with Google, you get results to actual web pages and you can kind of judge them based on the domain. It's kind of like more primary source material, which is useful. It's also live. You know, you get results from 2023 rather than 2021. Sure. Whereas ChatGPT... That'll change.
Starting point is 00:08:20 That's a temporary thing, right? I mean, the delay will be temporary. Eventually, it'll catch up. Well, I mean, GPT-4 is still... It came out recently. thing, right? I mean, the delay will be temporary. Eventually, it'll catch up. Well, I mean, GPT-4 is still, it came out recently. It's still 2021. Right, but isn't the plug-ins and all that stuff where it's like, okay, the model is old, but it has access to new data. So the plug-ins is actually where it gets interesting
Starting point is 00:08:37 because that's where things get really powerful, in my opinion. Because if you ask ChatGPT with the plug-ins enabled, it can go and browse the web on your behalf. So it's not just the base model trying to answer your question from memory anymore. It's actually going and essentially Googling for things. It has access to what you would do.
Starting point is 00:08:58 Behind the scenes. Exactly. So it's the best of both worlds. And essentially, we're doing that with Kodi, but in your editor for developers. So basically combining large language models like GPD4 or Anthropix CLOD model, and then combining that power with the most advanced code search engine in the world. So it's the best of all worlds.
Starting point is 00:09:21 It gives you highly context-aware and specific answers about your code, and it can also generate code that's kind of tuned to the specific patterns in your code base, not just the kind of like median stack overflow or open source code. How did you get there? How did you think, wow, I mean, obviously, LLMs are a big deal, right? This new wave of intelligence that we have access to. How far back is this in the making? Has this been years, or has it been like, wow, Chet GPT is crazy.
Starting point is 00:09:50 November. Chet, GPT-3 is in November. It's like, okay, we've got to move. How far back does this go? Yeah, good question. Yeah, so for me personally, it's kind of a bit of a homecoming. So my first interest in computer science, actually, was machine learning and artificial intelligence. That's what I did a lot of my undergrad doing. I was actually
Starting point is 00:10:08 part of the Stanford AI lab doing vision research in those days under Professor Daphne Kohler. She's my advisor. And so I did a lot of work there. It was super interesting and I felt really passionate about it. There's just a lot of elegant math that goes into things and it feels like you're kind of like poking at some of the hidden truths of the universe a little bit. But the technology at that point was just, it was nowhere near commercializable. And so I decided to pursue my other passion, which is developer productivity and dev tools,
Starting point is 00:10:38 and kind of like stayed on top of the research as it was coming along. And I think one of the inflection points for us was the release of GPT-3 because that was kind of a step function increase in the quality of the language models. And we started to see some potential applications to developer tools and code.
Starting point is 00:10:56 And we really started in earnest maybe a little over a year ago, maybe 12 to 18 months ago, experimenting with the kind of like internal representations of language models as a way to enhance code search. So we actually put out an experiment called CodeSearch.ai that uses embeddings to enhance the quality of code search results that you get. And that was pretty successful as an experiment.
Starting point is 00:11:23 I think we released that probably middle of last year, so about a year ago. And that kind of started us down the road, and then of course when ChatGPT came out, that was also another big inflection point, and that's when we started to think very seriously about kind of like a chat-based
Starting point is 00:11:39 interaction that could happen in your editor, have all the advantages of ChatGPT, but know about the specific context of your code. And so for Kodi specifically, I think first commit was December 1 or something like that. And by February, we basically had a version that we're having users and customers try. And then March was when we rolled out to our first enterprise customer. So it's just been like this whirlwind of development activity.
Starting point is 00:12:05 And I don't know, I cannot remember a time where I've been more excited and just eager to build stuff because we're living through interesting times right now. It is. This is the eureka moment that we've all been waiting for, basically, right? I mean, this is the invention of the Internet all over again, potentially the iPhone level invention. I think it's a dramatic paradigm shift in how we think as engineers and software developers.
Starting point is 00:12:32 Like, how do we learn? How do we leverage? How do we augment? You know, it's just insane what is available to somebody who doesn't have an understanding to quickly get understanding and then be, you know, performing in a certain task or whatever, because of the LLMs that are available and how it works. It's so crazy. The chat interface is pretty simple though, right? Like the simple, the simplicity of a chat interface. Did you expect this Eureka moment to be simply chat? Like as you've been, I mean, like what I mean? Like, it's a web app.
Starting point is 00:13:05 Yeah. It's not something else. It's a web interface. It's a chat interface. I think, so, you know, I'm a programmer by background, so I've been, like, pushing, I've been trying to spread
Starting point is 00:13:16 the gospel of textual-based input for, you know, as long as I can remember. Obviously, it's mostly fallen on deaf ears because, you know, the non-programming world is like, you know, command line. That's we in, like the 1980s? But I actually think
Starting point is 00:13:32 philosophically, like textual input, the reason I like it is because if you think about just like the IO, like bit rate of human-computer interaction, it's like we live in a time where we have 4K screens running at 60 or 120 hertz. The sheer amount of data that computers can feed into us through our eyeballs is huge. Whereas in kind of like the point-and-click mouse world, it's like how many bits per second can you really feed into the computer as a human?
Starting point is 00:14:04 And now textual input doesn't get us all the way there to 4K times 60 hertz, but it does basically 10x's or more like the input bit rate of what we can do to instruct machines. I think it's a great win for human agency. We want to be programming computers, not the other way around. And I think a lot of the technology that has emerged over the past 10, 15 years has been computers programming us as humans a little bit in terms of
Starting point is 00:14:31 all the stuff that we consume. And so, yeah, I'm super excited for textual-based inputs. I think chat is kind of like a subset of that. The way we think about Kodi evolving is really it's going to evolve in the direction of just like this rich REPL. So it's not necessarily
Starting point is 00:14:49 going to be like, oh, it's a human-like thing that you talk with conversationally. It's more like if you want to do a search, you type something that looks like a search query, it knows that you want to do a search, shows you search results. If you ask a high-level question, it knows that you're asking a high-level question, it gives you an answer that integrates to the context of your code base. If you want to ask-level question, it knows you're asking a high-level question. It gives you an answer that integrates the context of your code base. If you want to ask a question about your production logs,
Starting point is 00:15:09 or maybe something about something someone said in chat, or an issue or a code review, it should pull context from those sources and integrate that and both synthesize an answer to your specific question, but also refer you back to the primary sources
Starting point is 00:15:25 so that you can go and dig deeper and understand more fully how it got to its answer. So we think chat is just the starting point. It's really just this rich REPL that's going to integrate all sorts of contexts, like whatever piece of information is relevant to you creating software. This is kind of like the thing that focuses that
Starting point is 00:15:43 and pulls it all in. It really seems like that, at least as an interface, you're seeing that as the future of what Sourcegraph is, isn't it? Or is there more to Sourcegraph than that in the future? So the way we think about it is like we spent the past 10 years building the world's most advanced code understanding
Starting point is 00:15:58 tool. So we have the best code search, we have the best code graph, so the global reference graph across all the different languages in the world. We have a large scale code modification, refactoring system, and a system to track high level insights. So there's all these back end capabilities that are really, really powerful. And what language models have done is given us a really, really nice beginner friendly
Starting point is 00:16:22 interface to all that power. And I think you're going to see this across all kinds of software. It's like historically, building power user tools has been difficult because the on-ramp to getting full, taking full advantage of those tools has been a little steep.
Starting point is 00:16:37 Requires education, yeah. Yeah, and so if you're worried about the on-ramp, maybe you end up constraining your product a little bit just to make it simpler, dumb it down for the beginning user, but you lose out on the power. I think that trade-off is no longer going to be as severe now with language models. And so at Sourcegraph, we're basically rethinking the user interaction of the entire experience. The underlying capabilities and underlying tech is not changing.
Starting point is 00:17:04 That's still, if anything, that's gotten more valuable now because you can feed it into the language model and instantly get value out of it. But the entire user interaction layer I think needs to be rethought. And Cody, as your AI editor assistant, is kind of like the first iteration of that
Starting point is 00:17:20 thought process. How did you iterate to the interface you're at now, and is it a constant evolution? Yeah, I mean it's pretty much like, hmm, I think that would be a good idea, let me go hack it together and see how it plays. And you play around with it and then you kind of experience it yourself and you build conviction in your own mind and then you maybe share it with one or two other teammates and see if they have the same wow moment and if they do, that's usually a pretty good sign that
Starting point is 00:17:43 you're on to something and there might be more details to hammer out to make it more accessible to everyone, but if you can convince yourself and at least two or three other smart people out there that there's something worth investigating, I think that's typically a pretty good sign that you're onto something. How do you get access to coding?
Starting point is 00:18:00 Not so much get access, but how do you use it in the SourceCraft world? How does it appear? How do you conjure it? Yeah, so it's just an editor extension. You can download it from the VS Code Marketplace. It's available now and it's free to use. And we have other editors on the way. IntelliJ is very high priority for us. Also NeoVim and of course my editor of choice, Emacs. Of course. And we're developing it completely in the open as well. So Kodi itself is completely open source and Apache licensed.
Starting point is 00:18:37 And to get access to it, to start using it, you just install the extension into your editor and start using it. It opens up in a sidebar. You can chat with it. We also do inline completions. So as you're typing, we can complete code. Again, taking advantage of the kind of, like, baked-in knowledge of the language model plus the context of your specific code base. So generating, like, very high-quality completions.
Starting point is 00:18:59 And, yeah, it's generally just as simple as installing the extension, and then you're off to the races. Probably a Sourcegraph account first, right? Yeah, so you do have to auth through Sourcegraph because that's how we... I mean, we wouldn't be able to provide it for free if you didn't auth through Sourcegraph
Starting point is 00:19:15 because on the back end, we're calling out to different language model providers, and we're also running a couple of our own. Okay, so accessible then, not having to install Sourcegraph, have it scan my repository, like the traditional way you provide intelligence, which is to leverage literally Sourcegraph on my repo.
Starting point is 00:19:34 I can just simply auth through Sourcegraph, have an extension in my VS Coder in the future Emacs. Exactly. Then potentially. They're kind of loosely coupled. We don't believe in strong coupling just for the sake of selling you more software. And I think with Kodi, the design philosophy was like,
Starting point is 00:19:53 look, if you connected to Sourcegraph, it does get a lot better. It's like if you gave a really smart person access to Google, they're going to be a lot smarter about answering your questions. But if you don't give them Google, they're still a smart person, and so Kodi will still fetch context
Starting point is 00:20:07 from kind of like your local code using non-source graph mechanisms if you're just running it standalone. Yeah. How does it get this intelligence as an extension? Like, how does that... Explain how that works. Like, I've got it on my local repo.
Starting point is 00:20:20 It's an extension. How does it get the intelligence from my code base? Yeah, so it's basically... I mean, think of the way that you would understand or build a mental model of what's going on in a code base as a human. You might search for some pieces of functionality,
Starting point is 00:20:35 you might read through the readme, click on a couple search results. It does all that. It's reading my readme right away? Yeah, basically. So when you ask a question, Cody will ping Sourcegraph for, hey, what are the most relevant pieces of documentation or source code in your code base?
Starting point is 00:20:52 And then essentially, quote unquote, read them as a language model and use that as context for answering a question. So if you ask a general purpose question, it'll typically read the readme. If you ask a more targeted question, like, oh, how do you do this, this one specific thing,
Starting point is 00:21:04 like read a PDF or whatever, it'll go find the places in source code where you're, you know, it processes PDFs and read that in and then interpret that through the lens of answering the question. In real time, yeah. Yeah, yeah. Is there a latency to the question, to the gathering, and, like, what's the speed? If I said that example, how does my application, you know, compile a PDF from a Markdown file, for example? Yeah, so it typically gets back to you within like one or two seconds. And most of the latency is actually
Starting point is 00:21:34 just the language model latency. So it depends on what language model you're choosing to use underneath the hood. All the Sourcegraph stuff is super fast because that's just, I mean, there's no like, yeah, Sourcegraph is fast. We've spent the past 10 years making it very fast and
Starting point is 00:21:47 there's no like billions of linear algebra operations happening with Sourcegraph. Sourcegraph is just classical CPU-based code and text. What about privacy? Yeah, so privacy is extremely important to us, both
Starting point is 00:22:03 in terms of individual developers and our enterprise customers. The last thing they want to do is have their private code be used as training data into some general purpose model that's going to leak their sensitive IP to the rest of the world. So we basically negotiated zero retention policies with all our proprietary language model providers, which means that your data is never going to get used as training data for a model. And not only that, the language model providers will forget your data as soon as the request is complete. So there is no persistence in terms of remembering the code that you sent over to complete a request.
Starting point is 00:22:41 That just gets forgotten as soon as the language model generates a request for Kodi. And then for the rest of it, I mean, Sourcegraph has always taken user privacy and code privacy very seriously. It's why we've been able to serve the sorts of enterprise customers that we do. For sure. I know why that's important, but spell it out. Why is that important, this zero attention policy?
Starting point is 00:23:04 What's the real breakdown of that privacy? Why is it important to the main users? So from a company's point of view, it's important because you don't want to leak portions of your code base or have them persist in the logs of some third-party data provider. As an individual developer, I think it's just important to give you control over your own data. And I think that's going to be an especially important thing
Starting point is 00:23:26 in this new world that we're living in, where before private data was valuable, it carries value, it tells you things about a certain person or the way they work, and it can be used for purposes both good and bad. Search history. It's like search history, right? Yeah, exactly.
Starting point is 00:23:46 You can tell a lot about a person by their search history, their watch history, their like history. Totally. But now it's used for a whole other reason, right? Yeah, and I think it's important to grant our users and customers control and ownership over that data because it is your data. And I think with language models,
Starting point is 00:24:00 like language models just, they're like 10x the value and the sensitivity of that data. Because now, instead of, you know, just, like, feeding it into, like, a Gen 1 AI model or exposing it to some other human, you can feed it into one of these large language models that can, you know, kind of, like, memorize everything about you as a person or a programmer. And, you know, in some ways, maybe that's good. Like, if you're open to that, if you're willing to share your data, we could potentially train language models that, you know, emulate some of the best and brightest programmers in
Starting point is 00:24:32 existence. But ultimately, we think that should be your personal... Opt-in. How explicit is that in the sign-up or the acceptance of the Kodi license or the, you know, this GA to now, you know, widespread usage? How do you... How explicit are you with a new sign-up that says, I want to the Kodi license or the, you know, this GA to now, you know, widespread use. How do you, how explicit are you with a new signup that says, I want to use Kodi?
Starting point is 00:24:50 Do you say privacy and all these things you just said, basically? How clear is that? So when you first install it, there is kind of like a terms of use that pops up and you cannot use Kodi unless you read through and accept it. How many words is in that TOS? It fits on like basically one page without scrolling. Okay, so 1,000 words maybe. 500. 250.
Starting point is 00:25:14 Maybe not 250. I think it's probably 250 to 500. I had to go back and check specifically. Digestible in a minute. Yeah, we're not trying to be one of those companies that tries to hide stuff. What I mean by that is, let's try to say are you hiding it, but more how clear are you being? In a minute. Yeah. We're not trying to be one of those companies that tries to hide stuff. Well, what I mean by that is, let's try to say, are you hiding it? But more, how clear are you being?
Starting point is 00:25:31 Because it seems like you care to be clear. Yeah. So is that like a paramount thing for you all to be so clear that you say, hey, privacy matters. Yes. We don't collect. There's zero retention. It's spelled out really clear. It's a bullet list saying, basically saying exactly what you said. Privacy matters.
Starting point is 00:25:44 We don't collect data. I wrote it for you. We're not using. Yeah. Basically. Well, Tammy, our wonderful legal counsel. I didn't write it. I'm just kidding around.
Starting point is 00:25:52 We all know ChatGPT wrote it, okay? Let's be serious here. Actually, that's a great use case for ChatGPT. If you're asked to accept one of these lengthy end users. Paste it in there and summarize it for me. Paste it in there and summarize it. Tell. Paste it in there and summarize it. Tell me if there's anything fishy. Yes.
Starting point is 00:26:08 That would be cool for sure. That's the best. I cannot wait, honestly, for that to come out. What are the loopholes in this contract? I have nefarious action on the other side. What are my loopholes to get out? Right. You know what I mean?
Starting point is 00:26:20 Yep. For bad or good. I guess you could use that in the bad side or the good side. GPT for X, where X is literally everything, is going to be there. There's going to be one specifically trained for lawyering. Yeah, yeah. I think language models will be a huge democratizing force in many domains. It's democratizing understanding of legal concepts,
Starting point is 00:26:42 democratizing access to software creation. I think it's going to be a huge expansion of the percentage of people that's going to be able to access those knowledge domains. So let's say I'm a happy GitHub co-pilot user.
Starting point is 00:26:58 Would I install Kodi alongside this and be happier? Would I be less happy? Is this a zero-sum game? Do I need to go all in on Cody? What are your thoughts on that? I think it's the exact opposite of a zero-sum game. I think there's so much left to build that the market is huge
Starting point is 00:27:16 and vastly growing. We do have features that Copilot doesn't have. So currently, they don't have kind of like a chat-based textual input to ask high-level questions about the code. I think that's coming in Copilot X to some extent. Yeah, I think they announced that, but it's not out yet.
Starting point is 00:27:35 It's not out yet. If you look at the video, the kind of context fetching they're doing, it's basically like you're currently open file, explain that. And Kodi is already doing much, much more than that. It's reading, even if you ask it a question about the current file, it'll actually go and read other files in your code base that it thinks are related and use that
Starting point is 00:27:51 to inform your answer. So we think the power of Sourcegraph gives us a bit of a competitive edge there with the kind of high-level questions and onboarding and kind of like rubber ducking use case. And then for completions, you know, I think Copilot is great. But for completions, we're essentially doing the same thing.
Starting point is 00:28:10 So like the completions that code generates, it takes into account that same context when it's completing code. So that means it's better able to kind of mimic or emulate the patterns and best practices in your specific code base. And again, because we're kind of open source and model agnostic, we are just integrating all the best language models as they come online.
Starting point is 00:28:35 So I think Anthropic, I don't know when this episode's going out, but Anthropic today just- Pretty quick. Okay, pretty quick? The 24th. Yeah, so Anthropic just announced today that they have a new version of Cloud that has an incredible 100,000 token context window. It's just like
Starting point is 00:28:48 I think like orders of magnitude more than what was previously available. And that should be, by the time this episode goes online, it should be available in Kodi. Whereas Copilot, I think they're maybe someone from GitHub can correct me if I'm wrong,
Starting point is 00:29:06 but I think they're still using the Codex model, which was released in like 2021 or something. And so it's a much smaller model that only has around like 2000 tokens of context window and much more basic context fetching. It's already incredibly useful, but I think we're kind of taking it to the next level a little bit.
Starting point is 00:29:24 So open source and model agnostic. Open source, model agnostic. We're not locking you in to a vertical proprietary platform. Proxy friendly. Proxy friendly. Also enterprise friendly. Sourcegraph, we've made ourselves easy to use in both cloud and on-premises environments. So we're just trying to do the best thing for our customers
Starting point is 00:29:46 and for developers at large. So because you're model agnostic, does that mean that you're not doing any of the training of the base layer models? So do you also sidestep legal concerns? Because I know with Codex and Copilot, there's at least one high-profile lawsuit that's pending. There's legal things happening.
Starting point is 00:30:08 There's going to be things litigated. I'm wondering if you're in the target for that now with Kodi, or if you're just not because there's other people's models. No, we're very mindful of that. And we actually integrate models in a couple different ways. So we do it for kind of like the chat-based autocomplete. There's a separate model we use for code completions, and there's another model that we use for embeddings-based code search and information retrieval.
Starting point is 00:30:32 And it's kind of like a mix and match. Sometimes we'll use a proprietary off-the-shelf model. Other times we'll use a model that we fine-tuned. But for the models that we do rely on external service providers for, we're very mindful of the kind the evolving legal and IP landscape. And so one of the things that we're currently building is basically copyright code or copied code detection. And if you think about it, Sourcegraph as a code search engine
Starting point is 00:30:58 is kind of in a great position to build this feature. It's like if you emit a line of code or you write a line of code that is verbatim copied from somewhere else in open source or even in your own proprietary code base, you might be worried about just code duplication. We can flag that for you because we've been building code search for the past 10 years.
Starting point is 00:31:22 Cool stuff, man. So moving fast, what comes next? When are you going to drop Kodi 2? It's probably like a week from now, right? Yeah, that's a great question. I mean, we are just kind of like firing on all cylinders here. We have a lot of interesting directions to explore. One direction or one dimension that we're expanding in
Starting point is 00:31:44 is just integrating more pieces of context. So one of the reasons why we wanted to open source Kodi is because we just want to be able to integrate context from wherever it is and not be limited by a single code host or a single platform. There's so much institutional knowledge that's in many different systems. It might be in Slack.
Starting point is 00:32:03 It might be in GitHub issues. It might be in your code review tool. It might be in Slack. It might be in GitHub issues. It might be in your code review tool. It might be in your production logs. And so we want to build integrations into Kodi that just pull in all this context. And I think the best way to do that is just to make this kind of like platform, this orchestrator of sorts,
Starting point is 00:32:19 like open source and accessible to everyone. The other dimension that is very exciting to us is going deeper into the model layer. So we've already started to do this for the embeddings-based code retrieval. But I think we're exploring some models that are related to code generation and potentially even the chat-based completions
Starting point is 00:32:38 at some point. And that's going to be interesting because it's going to allow us to incorporate pieces of source graph into the actual training training process and there's been some research there that shows that uh incorporating like search engines into training uh language models actually you know yields very nice uh properties in terms of like lower latency but uh higher quality um and it's also important to a lot of our customers because a lot of them are you know large corporations they deploy on premises,
Starting point is 00:33:05 and even the zero retention policy where the code is forgotten as soon as it's sent back over is not good enough for some of our customers. So they want to completely be able to self-host this and we plan to serve them as well. How high up the stack, the conceptual stack, do you think Cody can get or maybe any AI tooling with CodeGen with regards to how I instruct it as a developer? Yeah. You know, because right now we're very much like, okay, it's autocomplete.
Starting point is 00:33:34 There's a function here, right? I can tell it, write me a thing that connects to an API and parses the JSON or whatever. And I can spit that out. But like, how high up the stack can I get? Can I say, you know, write me a Facebook for dogs? And be done?
Starting point is 00:33:49 For instance. Or like user stories? Can I write some user stories and go from there? What do you think? That's a great question. I mean, we've all seen the Twitter demos by now where someone is like, you know, GPD4. Like, build me an app. And, you know, it creates a working app. I think if you've actually gone through and tried that in practice yourself you soon realize like hey you can get to like a working
Starting point is 00:34:10 app pretty quickly just through like instructing it using english or natural language but then you get a little bit further down that path and you're like oh i wanted to do this i wanted to do that can you add this bell muscle there's kind of this like commentorial complexity that emerges as you add like different features and you're kind of diverging from like the common path. And then, and then it falls apart. Like I actually tried this myself. Like I tried to write a complete app. Uh, it was actually a prototype for, for the next version of Cody. Um, I tried to do it by not writing a single line of code just by writing English. And I got like 80% of the way there in like 30 minutes. And I was like, this is amazing. Like this is the future. Like I'm
Starting point is 00:34:49 never going to code again. And then the remaining 20% literally took like four hours and I was banging my head against the wall because I asked it to do one thing and then it did, did it. But then it kind of like screwed up this other thing and it became kind of like this like whack-a-mole problem. So we're not all the way there yet, but I think, I think the way we think about it is like Cody right now is at the point where if you ask it, uh, this is another thing I tried the other day. Like I wanted to add a new feature to Cody. Uh, Cody has these things called recipes, which are kind of like templated interactions with, uh, Cody. So like write a unit test or generate a doc string or, you know, smell my code, you know, give me some feedback. Yeah. I wanted to add a new recipe and I basically asked Cody, Hey, I want to add a new
Starting point is 00:35:28 recipe to Cody. Uh, what parts of the coach I modify. And it basically showed me all the parts of the code that were relevant. And then it generated the code for the new recipe using the existing recipes as like a reference point. Uh, and I basically got it done like five minutes and it was amazing. So like, I was still obviously in the hot seat there. I was still calling the shots, but it turned something that probably would have been at least 30 minutes, maybe an hour, if I got frustrated or distracted
Starting point is 00:35:54 into something that was like five minutes. And that was actually the interview question we were using for interviewing on the AI team. So after that, we had to go back and revamp that. It's like, this is too easy. Too easy now. Everything just got easier. Yeah.
Starting point is 00:36:07 Do you think this is like a step change in what we can do and then we're going to plateau right here for a while and like refine and, you know, do more stuff
Starting point is 00:36:17 but kind of like stay at this level of quote unquote intelligence or do you think it's like just the sky is the limit from here on out? Like, which, I mean, obviously, just conjecture at this point. Challenging to predict. I mean, it's like just the sky is the limit from here on out? Like, which, I mean, obviously it's just conjecture at this point.
Starting point is 00:36:26 Challenging to predict. I mean, it's, it's very challenging to predict. Uh, you know, I might be eating my words, um, in, in another six months, but like, uh, you know, on the spectrum of, you know, oh, it's just like glorified auto-complete and it doesn't really know anything to all, all the way to like, you know, AGI Doomer, you know, let's, let's nuke the GPU data centers. Oh my gosh. Where do you fall? Yeah.
Starting point is 00:36:49 Don't give them ideas. Cancel, cancel, cancel. Honestly, I think a lot of the discourse on that end of the spectrum has just gotten kind of crazy. Yeah. Like, the way I view it is this is a really powerful tool. It's an amazing new technology. And, you know, it can be used for
Starting point is 00:37:05 evil, certainly, as any technology can. But I'm a techno-optimist and I think this will largely be positively impactful for the world. And I don't really see it replacing programmers. It might change the way we think about programming or software creation. There's certainly going to be a lot more people that are going to be empowered to create software now. And I think there'll be kind of a spectrum of people from those who write software just by describing it in natural language all the way to the people who are kind of like
Starting point is 00:37:38 building the core kernels of kind of like the operating systems of the future that form like the solid foundation that pack in the really important data structures and algorithms and core architecture around which everyone else can throw their ideas and stuff. So there'll be like a huge spectrum. I think we'll almost think of it
Starting point is 00:38:00 in terms of like the way we think of like reading and writing now where like you have many different forms of reading and writing. People just like reading, writing stuff on Twitter, that's one form of writing. And then there's other people who write long books that span many years of intense research. And I think the future of code looks something like that. It's the ultimate flattener. You see that book, The World is Flat? It's like that. It's the ultimate flattener. You see that book, The World is Flat? Yeah. Yeah. It's like that. Like for a while there, it was outsourcing and now it's sort of like just accessibility to
Starting point is 00:38:30 everybody. Now, you know, people who don't know much about code can learn about code and level up pretty quickly. And so the access, the catered access to have a patient, whether person or not, like I have conversations with ChatGPT and I swear, I'm like, I tell my wife, I'm like, I'm literally talking to a machine and I get it, but we 30, 40 rounds back and forth through whatever it might be. And it's very much like a conversation I have with Jared. If you would give me the time and patience and if you wouldn't get frustrated, you know what I mean? And so it's a better, very patient. Yeah. Well, not necessarily, but you would give me the time and patience, and if you wouldn't get frustrated, you know what I mean? And so I have this very patient, well, not necessarily,
Starting point is 00:39:08 but the world now has access to a patient sidecar that's quite intelligent, that will get even more intelligent, whether you call it artificial intelligence or not. It has intelligence behind it, some knowledge, and it's accessible right now. I agree. Humans are still necessary, thank the Lord. But wow, it's super flat now. And a lot more people have access to what could be and what might be because of this. And that's a fantastic thing. I think of, you know, there's that Steve Jobs quote where he said, computers are amazing because they're like a
Starting point is 00:39:42 bicycle for the human mind. They allow a much more i think he's drawing comparisons to like you know how different animals get around and like a human walking is like very inefficient but a human on a bicycle is like more efficient than like the the fastest cheetah or whatever right i think like what what language models um are are capable of doing is instead of like a bicycle, now we each have like a race car or a rocket ship. Now we're still in the driver's seat, right? Like we're still steering it and telling it where to go, but it's just, it's way more leverage for any given individual. So great thing if, you know, you love being creative, you love dreaming up, you know, new ideas and ways to solve problems. One more question on the business side of things.
Starting point is 00:40:26 How has growth been because of Cody? That's a great question. Cody is, you almost would not believe it if I described it to you, but Cody is literally like the most magical thing to happen to the source graph go-to-market or sales motion since basically when we started the company ever basically. I've been paying attention for a while. So I asked that question. You've had trouble getting growth because you got to install a server or go cloud and you got to examine the code base. Then you got to learn how to search the code,
Starting point is 00:41:00 which is all like friction points. So one of the, like transparently, one of the challenges that we had as a business is, you know, we had a couple of subsets of the programmer population that were very eager to adopt Sourcegraph. It's basically if you use a tool like Sourcegraph before, you want to use it again. So if you're an ex-Googler, ex-Facebooker, ex-Dropboxer, or, you know, ex-Microsofter
Starting point is 00:41:23 in a couple of teams, you kind of got it immediately. and then everyone else is like, oh, is it like grep or is it like control F? And we would lose a lot of people along the way. I think with Kodi, it's at the point where not only does any programmer get it right away, they're like, oh, holy shit you just asked to explain this very complex code in English and gave me like really good explanation. Um, even like non-technical
Starting point is 00:41:50 stakeholders. So like as we sell to larger and larger companies, a lot of times, you know, in the room is, is someone with like a, uh, I don't know, a CEO or like board of directors or, uh, you know, non-technical someone who who's pretty distant from the code, traditionally speaking. And they get it too because, you know, we were in a pitch meeting the other week where it was like a large kind of Fortune 500 energy company, and there was not a program in the room. It was just kind of like, you know, high-level business owners
Starting point is 00:42:21 who were all very skeptical until we got to Cody. We opened up, you know, high level business owners, um, who are all very skeptical until we got to Cody, we opened up, you know, one of their open source libraries and asked Cody to explain what was going and one person leaned in and they were like, you know, I'm, I haven't coded in like 30 years and even I would get value out of this. So yeah, it's, it's just absolutely incredible. Your total adjustable market got a lot bigger. Yeah. Yeah. Yeah. Cause like what is an engineer now? Um, I think it's like in, in a couple of years, uh, almost every human in the world will be empowered to create software and in some, some fashion. You said before that Cody leverages all that source graph is today, the intelligence. Yep. Will that always be true? I
Starting point is 00:43:03 guess is maybe the more basic way to answer that or ask that question. Because at some point, if this is the largest arc in your hockey stick growth and all the up from here is not so much Kodi related, but Kodi driven really, does what Sourcegraph do at large now eventually become less and less important. And the primary interface really is this natural language coding interface that explains my code. That's a great question. It's like, you know, does AI just like swallow all of programming at some point? Like at some point, do we cease to write kind of like old traditional like systems oriented software in the Von Neumann tradition? You hand wrote thatote that code?
Starting point is 00:43:46 What? You wrote a for loop instead of just asking it nicely to repeat something? Forget code search. I don't even read code. Why are you reading code? Let alone searching.
Starting point is 00:43:58 Right. Yeah. This is still very early days, so it's very difficult to predict, but the way I think about it, I think about it in terms of, like, maybe we have, there are different types of computers that can exist in the world. Like a traditional, you know, like PC, that's one type of computer. You could maybe say, like, the human brain is another type of computer. And then these language models, I think they're a new type of computer.
Starting point is 00:44:26 And they do some things a lot better than the PC type of computer did. And then some things much worse. Like they're far less precise. I think I saw a tweet the other day where someone repeatedly asked GPD-4 whether four was greater than one. And then at some point,
Starting point is 00:44:44 GPD-4 got unsure of itself and said, oh, no, actually, I was mistaken. you know, four was greater than one. And then at some point, GPT-4 got unsure of itself and said, oh, no, actually, I was mistaken. You know, one is greater than four. I apologize. Yeah, exactly. Exactly. Yeah, I apologize. So I think these two types of computers are actually very complementary. And so like the most powerful systems are going to be the ones that combine both and feed the inputs of one and the outputs of the other
Starting point is 00:45:05 and synthesize them in a way that's truly powerful. And we're already seeing early examples of this. Like Kodi is one. We use kind of like the Chomsky style like code understanding tech with the more Norvig style language models. Bing search is another where they're using chat GPT for the
Starting point is 00:45:26 AI part of it, but they're still relying on traditional Bing web search. And so I think we'll see a lot of hybrid systems emerge that combine the best of both worlds. Exciting times. Thanks for talking to us. Yeah, thanks for having me on. Good seeing you again. Good talking. Pleasure chatting with you. Oh, that was fun.
Starting point is 00:45:42 You guys are good at this. I'm excited for you. Yeah, that was fun. You guys are good at this. I'm excited for you. So in the sponsor of Minisoad, here in the breaks, I'm here with Tom Hu, dev advocate at Sentry on the CodeCov team. So Tom, tell me about Sentry's acquisition of CodeCov. And in particular, how is this improving the Sentry platform? When I think about the acquisition, when I think about how does Sentry use CodeCov, or conversely, how does CodeCov use Sentry? I think of CodeCov and I think of the time of deploy. When you're a software developer, you have your lexical, you write your code, you test your code, you deploy, and then your code goes into production and then you sort of fix the bugs. And I sort of think of
Starting point is 00:46:33 that split in time as like when you actually do that deploy. Now, where CodeCup is really useful is before deploy time. It's when you are developing your code. It's when you're saying, hey, like I want to make sure this is going to work. I want to make sure that I have as few bugs as possible. I want to make sure that I've thought of all the errors and all the edge cases and whatnot. And Sentry is the flip side of that. It says, hey, what happens when you hit production, right? When you have a bug and you need to understand what's happening in that bug, you need to understand the context around it. You need to understand where it's happening, what the stack trace looks like, what other local variables exist at that time so that you can debug that and hopefully you don't see that error case
Starting point is 00:47:09 again. When I think of like, oh, what can Sentry do with CodeCover? What can CodeCover do with Sentry? It's sort of taking that entire spectrum of the developer lifecycle of, hey, what can we do to make sure that you ship the least buggy code that you can. And when you do come to a bug that is unexpected, you can fix it as quickly as possible, right? Because, you know, as developers, we want to write good code. We want to make sure that people can use the code that we've written.
Starting point is 00:47:35 We want to make sure that they're happy with the product, they're happy with the software, and it works the way that we expect it to. If we can build a product, you know, the Century Plus Code Cup thing to make sure that you are de-risking your code changes and de-risking your software, then, you know, we've hopefully done the developer community a service. So, Tom, you say bring your tests and you'll handle the rest. Break it down for me.
Starting point is 00:47:57 How does a team get started with CodeCov? You know, what you bring to the table is like testing and you bring your coverage reports. And what CodeCov does is we say, hey, give us your coverage reports, give us access to your code base so that we can, you know, overlay code coverage on top of it and give us access to your CICD. And so with those things, what we do and what CodeCov is really powerful at is that it's not just, hey, like, this is your code coverage number. It's, hey, here's a code coverage number. And your viewer also knows and other parts of your organization know as well. So it's not just you dealing with code coverage and saying, I don't really know what to do with this. Because we take your code coverage, we analyze it, and we throw it back to you into your developer workflow. And by developer workflow, I mean your pull request, your merge request. And we give it to you as a comment so that you can see, oh, great, this was my code coverage
Starting point is 00:48:47 change. But not only do you see this sort of information, but your viewer also sees it and they can tell, oh, great, you've tested your code or you haven't tested your code. And we also give you a status check, which says, hey, like you've met whatever your team's decision on what your code coverage should be, or you haven't met that goal, whatever it happens to be. And so CodeCov is particularly powerful in making sure that code coverage is not just a thing that you're doing on your own island as a developer, but that your entire team can get
Starting point is 00:49:13 involved with and can make decisions. Very cool. Thank you, Tom. So hey, listeners, head to Sentry and check them out, Sentry.io and use our code changelog. So the cool thing is, is our listeners get the team plan for free for three months, not one month, not two months, three months. Yes. The team plan for free for three months. Use the code changelog again. Sentry.io. That's S-E-N-T-R-Y.io andO. And use the code changelog. Also check out our friends over at CodeCove. That's CodeCove dot I-O.
Starting point is 00:49:51 Like code coverage, but just shortened to CodeCove. CodeCove dot I-O. Enjoy. so now we're now we're fine-tuned here we're ready to go okay i see what you did there swine tune i think see what you did there. Swine-tuned, I think, is what you were trying to say. Well, no, I think it was a Dolly reference, fine-tuned. So, yeah. It was a pun. It was a pun.
Starting point is 00:50:33 Work with us, Jared. I mean, Adam and I are already on the same page. What the heck, man? Adam's puns are on point always. He never misses with a pun. All right. Thank you. All right. So, we have Denny Lee from Databricks or Databricks.
Starting point is 00:50:46 Databricks. Databricks. Is that the official stance? It's not a Canadian or American thing. It's just Databricks. It's just Databricks. Here to talk about Dolly 2. But first, I hear you're a just-in-time conference presenter.
Starting point is 00:50:58 Tell us what this means. Well, I think the context was that you were asking me, hey, what's your presentation? That's what you asked me first. I did. And I was actually responding, I don't remember the name, nor do I remember. I do remember the concepts. At least I do have that part. But I don't remember the name.
Starting point is 00:51:15 Nor. Nor are the slides done yet. And this is. Normal. And it starts in 30 minutes. No, no, no, no, no, no, no. Tomorrow. No, no, tomorrow.
Starting point is 00:51:22 Tomorrow. Okay. I'm just simply saying that it is common for me to go ahead and not do a thing until 30 minutes before the actual presentation to create the slides. So you're a procrastinator. Yes. I'm a very good one. No, that's not procrastination. No, efficiency. That's optimization.
Starting point is 00:51:38 Efficiency. Pure efficiency. Why sweat over the details until you have to? Exactly. Exactly. Because what if you start 30 minutes before, but you realize the details required 45 minutes? So I had this one time where actually a buddy of mine, Thomas Kejser, he and I went ahead and did a presentation where he, so he's from Denmark. I'm from Seattle. We're both in, I don't know where, some other city to do the presentation.
Starting point is 00:52:00 Somewhere in the world. Somewhere in the world. So we actually got together, but we realized we actually hadn't done squat on the slides until 30 minutes before the actual session. And guess what? 30 minutes before, put together the slides, bam, we're good to go. Has it ever bit you? I'm sure. Tomorrow.
Starting point is 00:52:18 I'm sure at some point it will bite me. I guess the context is I've gotten away with it so far. So I'm going to go with it. And enough times that you have full confidence. Yes. Fair enough. Yes. Or at least I know how to fake it. So what would you like to know about Dolly? About Dolly 1, how we came about with Dolly 1.0 or Dolly 2.0? Let's start with why.
Starting point is 00:52:38 And then how. All right, so let's go backwards a little bit. That's when. No, you're talking when. All the way back three weeks ago. Okay? Roughly. No, sorry. In the days of yore.
Starting point is 00:52:49 Yeah, in the days of yore four weeks ago. All right? Yes. So, one of the things that... And I want to give credit where credit's due. Mike Conover is the guy who actually figured it out. Okay. Now, we were using a much older particular model, and we're going like, eh, would this work?
Starting point is 00:53:04 Right? And what it boiled down to is that there's a supposition that could you take an older model, fine-tune it with good data, and still actually end up getting good results, with the key point being that, hey, we're only going to pay $30 to actually train the data as opposed to, oh, the tens of millions of dollars that you'd have to do.
Starting point is 00:53:23 And could you do it? That was the supposition for Dolly 1.0. And sure enough, we were right. Basically, it was about $30 worth of training time on what is not considered public data. So that's why it's Dolly 1.0. So we could give you the weights, we could give you the model, but we couldn't give you the data
Starting point is 00:53:40 because the data itself was actually not public. But you owned it. No, no, no. In fact, I believe it was the same data that ChatGPT was using. So we could give you the weights. Again, that's open source, but we can't do the data, because the data is actually from ChatGPT. Gotcha. All right.
Starting point is 00:53:54 And so then we're going, wait, we actually used only a tiny amount of data and it still came out with some pretty decent results. Okay, let's go ahead and say, why don't we generate our own data? So again, take credit where credit is due. Our founders went ahead and said, hey, why don't we just get, we have about 5,000 employees at Databricks now. This is my favorite part. Yeah. Let's just go ahead and generate our own
Starting point is 00:54:16 data. So for two weeks, that's literally all we did. We had basically a bunch of employees dumping in data in a Q&A style format. We had seven different categories. It's all listed out there, so I don't remember all those details anymore. I worked on the t-shirts, so at least I was helpful on that part. Love the t-shirt.
Starting point is 00:54:32 That's a good one. No one's seen this right now, but it is a podcast. That's right. Draw a word picture, Adam. Dude, a sheep. Come on, man. It's a sheep. Dolly.
Starting point is 00:54:41 Dolly. Dolly. Oh, my goodness. I knew you thought he was on point. Okay. So Dolly, the sheep, a clone, right? It's a clone, right? So that's the whole context. Yes. So we go ahead and actually get that up and running.
Starting point is 00:54:53 And then we're like, hey, now we've got 15,000-plus of Q&A-style new information, all brand new, and we're publicly giving it away, right? So the actual data set, if you go to Hugging Face, or databrickslabs/dolly, or whatever the GitHub site is, basically all that data is there. Okay.
Starting point is 00:55:14 All 15,000 lines. Oh, sorry, not lines, 15,000 Q&As. Okay. And then we trained that data set, again using the same old model from two years ago. Okay? Okay. And we ran that. And then basically what was really cool about this is that it cost us about $100 worth of training. But it's pretty good. And if you ask some pointed questions on this stuff, it actually responds really, really well.
Starting point is 00:55:38 For example, I've got some examples where I'm actually asking coffee questions. And the coffee question answers are, okay, I'll give ChatGPT-4 a lot of credit: it is much more verbose than what Dolly 2.0 can provide. But in terms of correctness, it is correct. They both are at the same level of correctness, Dolly 2.0 and ChatGPT-4. I actually have it on my own GitHub somewhere, like a review where I actually explain all that.
Starting point is 00:56:04 Mainly because I was actually running it on an M1 Mac, too, because I was goofing off. Which is fine. Well, that's amazing right there. Yeah. Let me first just say, as a daily user of ChatGPT, sometimes verbose is not desirable. I'm like, dude, I actually will tell it to be brief, or in one sentence, because I'm so sick of the word salad it spits out. I'm like, I just want the answer. The answers are useful. But sometimes you're waiting for it to tell you the whole history of the thing. No. Well, don't you want to know the retrospective while you're at it?
Starting point is 00:56:34 I'm being very sarcastic about it, yes. People can't tell it's a podcast, but we're all eye-rolling each other on that one. We are. That was major eye-rolls. So using it. Let's say I've never used anything but ChatGPT's web UI, but I'm a developer. Sure. And I want my own, I want Dolly to answer my questions.
Starting point is 00:56:55 Yes. What does that process look like for folks? Okay, so you've got two choices, or no, no, I should rephrase it slightly. You've got many choices, in fact. But the most common choices are we have a Databricks notebook that's in the Dolly GitHub that you can just download for free and run it. Now then you're going to tell me, but Denny, I don't want to use Databricks. That's fair. I would prefer you to, but I understand if you don't.
Starting point is 00:57:17 That's fine. Go to Hugging Face. The instructions are all right there on how to use it. In fact, like I was saying, I was actually playing with it so that way I could optimize for an M1 Mac, and so that the answers could come back faster. My only problem was that when I started testing it, there was an obvious bug in PyTorch. Because basically
Starting point is 00:57:36 when we told it to go ahead and use the M1, it was giving us back garbage answers. It wasn't even actual answers. It was literally nonsensical characters. And when we used CPU mode, it worked perfectly fine. But then just as I was about to
Starting point is 00:57:52 create a new issue on PyTorch, they fixed it. No, that's a good thing. I know, but I also had the fix. Oh, you had the fix. Okay, that's it. I get you. You were about to have a contribution. They wasted your time. Damn it.
Starting point is 00:58:06 But it's fun. But basically the idea is that, obviously, I shouldn't say obviously, you probably don't want to train on an M1, but you can definitely do inference on an M1. So the Q&A, you've got your data. How do you collect that data, and how do you format it so that Dolly can understand it? I'm assuming you're saying, don't just use Databricks' data.
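For the curious, a minimal sketch of that Hugging Face route, following the pipeline usage documented on the databricks/dolly-v2 model cards; the model size, prompt, and device handling here are illustrative assumptions:

```python
# Sketch: local inference with a Dolly 2.0 checkpoint from Hugging Face.
# Assumes `pip install transformers accelerate torch`.
import torch
from transformers import pipeline

generate_text = pipeline(
    model="databricks/dolly-v2-3b",   # smallest published variant
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,           # Dolly ships a custom instruction pipeline
    device_map="auto",                # picks a GPU/MPS device if available, else CPU
)

result = generate_text("What are the particular features of great espresso?")
print(result[0]["generated_text"])
```

If an accelerator backend misbehaves the way the M1 bug did, forcing CPU (for example, passing device=-1 instead of device_map) is the blunt workaround.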
Starting point is 00:58:22 You could do the same thing like you did with the Q&A. Yes, absolutely. Literally, when we asked people to fill it out, it was a Google Form. Okay. That's literally it. And what were the questions?
Starting point is 00:58:32 Oh, no, no. They would produce the questions and then the answers. They would ask a question, and then provide a detailed answer for it. I see. So that way, Dolly can train on it. So, how do you make an espresso? Since you use coffee.
Starting point is 00:58:48 It wouldn't even be how do you make an espresso? For example, let's be very specific, okay? It would say, what are the particular features of great espresso? Okay. And then we would talk about, okay, you're required to have a fine grind, you're required to use a conical burr grinder. There's a religious war between flat burr grinders and conical burr grinders. I put in conical burr grinders, so yeah, I'm sure the flat burr grinder folks are pissed off that that's not the answer that they're going to get from Dolly. That's bias. You're putting bias into this. Yes, absolutely. There's absolutely
Starting point is 00:59:15 100% bias. Let's not pretend there isn't, okay? Okay. So, it also requires you to actually have coffee beans roasted in a particular way. It also requires you to have the espresso water boiled at a particular temperature. Okay. So you put all of those details down. That's the idea.
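Assembled, each of those contributions ends up looking something like one record of the released databricks-dolly-15k dataset. The field names below match the published dataset; the espresso text itself is just an illustration:

```python
# Sketch of a single instruction-following record, dolly-15k style.
record = {
    "instruction": "What are the particular features of great espresso?",
    "context": "",  # optional supporting text; many records leave it empty
    "response": (
        "Great espresso requires a fine grind from a conical burr grinder, "
        "beans roasted for espresso, and brew water held at a particular "
        "temperature."
    ),
    "category": "general_qa",  # contributors picked from a fixed set of categories
}
```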
Starting point is 00:59:31 So in other words, it's not just like, okay, hi, what's great espresso? You buy it from Espresso Vivace in Seattle. I mean, while that's true, and I'm basically, I don't own any stock in them, by the way, but they are easily the best coffee.
Starting point is 00:59:44 Who's the brand again? Espresso Vivace in Seattle. Espresso Vivace. Yeah. David Schomer is a magician when it comes to espresso. Okay. But the context is like, well, as much as I want to just provide an answer like that, the reality is no. Obviously, we can't train on something that bad.
Starting point is 00:59:57 We actually need to have verbosity to provide context, provide proof, if you want to put it that way. Because there's going to be other people putting other answers, too. So, for example, in this case, I'm just going to call out a buddy of mine, Rob Reed. He's a fellow cyclist. He's also a fellow coffee addict. I know he also put some coffee answers inside there as well. Okay. So, between everybody that put coffee answers in there, you're literally getting data from myself,
Starting point is 01:00:24 from Rob, and a few other folks from, well, Databricks. Right. And how many instructions are in there that you guys put in? The 5,000 employees? 5,000 employees put 15,000. 15,000. So it's remarkable. If you think about it, that's remarkably small.
Starting point is 01:00:39 We were always under the impression when we started this process that we would require hundreds of thousands, or like millions. How does it know you gave it coffee instructions? Yeah. No, that's something totally different. Like I said, Dolly 1.0 shocked us. Like, it really shocked us, because we thought we would need to put in a lot more data. We thought we would need to do a lot more training. And then we were like, wow, this is not bad. I mean, it's not perfect, but it's not bad, actually. Right? And so from a business perspective, what ends up happening is if you have your own business,
Starting point is 01:01:09 now your data, you don't need a million things. You've got 15,000 pieces of information. Now, the great thing, and I'm not telling you to use Dolly, by the way. I mean, obviously, go use it if you want to, but I'm saying use any open source model. I don't care which one. That way, you get to go ahead and keep it and have your data as your IP.
Starting point is 01:01:29 So you as a business end up using the data actually in a good way. Right. Where you actually make it advantageous for you, yet also keeping the privacy for the users that make up that data at the exact same time. So the move is you have these, I don't know if this is technically what a foundational model is, or you have these models that are large enough language models. Right. Right. And then each company or each org or each use case says, okay, now we're going to fine tune it. I don't know if that's the right language or not. And apply it to us. Right. And there's all sorts of models out there. There are already, like, a lot of people were asking me originally, like,
Starting point is 01:02:05 hey, okay, well, then, you need to use Dolly. I'm like, no, no, no, no. Dolly was just us proving that it can be done. That's all it was. So there are a lot of really good companies, whether it's Hugging Face or anybody else, that produce solid, open-source, large language models. Use those, too.
Starting point is 01:02:23 Because the whole point is that you can use it yourself, run it with smaller amounts of data, have really good answers, and you're paying a hundred bucks. At least in our case, we did. A hundred bucks to train it. Right. So we're like, okay, that's actually worth your business. You're protecting the privacy of your users. You're going ahead and actually having relatively solid answers.
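As a sketch of the recipe Denny keeps describing, take an open base model and feed it a small instruction dataset, here is roughly what that looks like with off-the-shelf Hugging Face tooling. This is not Dolly's actual training code (that lives in the databrickslabs/dolly repo); the base model, prompt template, and hyperparameters are illustrative assumptions:

```python
# Sketch: fine-tune an open base model on a small instruction dataset.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "EleutherAI/pythia-2.8b"  # the kind of older, open base model being described
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

ds = load_dataset("databricks/databricks-dolly-15k", split="train")

def format_and_tokenize(rec):
    # Fold instruction/context/response into one training string.
    ctx = f"\n{rec['context']}" if rec["context"] else ""
    text = f"Instruction: {rec['instruction']}{ctx}\nResponse: {rec['response']}"
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = ds.map(format_and_tokenize, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dolly-style-ft",
                           per_device_train_batch_size=1,
                           num_train_epochs=1,
                           bf16=True),  # assumes a GPU with bfloat16 support
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

On rented GPUs, a run in this spirit is the hundred-bucks class of job, not the tens-of-millions class.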
Starting point is 01:02:41 And you're not basically giving your data away to another service. Because that's the key thing about when you use a service. Right. That you're basically giving away your data so they can go train against it, too. Right. Right? Now, I know Microsoft and OpenAI, for example, you're calling those two out in a positive way, not a negative. Usually, I'm a former Microsoft employee, so I'm allowed to be negative if I want to, but this is actually me being positive.
Starting point is 01:03:03 They actually have introduced concepts saying you can pay more to train and that they'll never actually use your data. But I don't remember the cost, but it is definitely paying more. Yeah. Yeah. Well, it's not as valuable to them, so it makes sense as a transaction. So that becomes more of a transaction that way. Exactly.
Starting point is 01:03:23 Right. So have you seen the Googler's leaked memo about we have no moat? Yeah, everybody talks about that memo. And what's interesting about that whole concept is that, I know it sounds sideways, but I was about to actually give you another context. And this is actually, again, Mike Conover,
Starting point is 01:03:38 I want to give credit attribution to the guy who actually said it. What's really interesting about this whole thing, when they talk about moats and talk about everything else, is that more fundamentally, we could have done this two years ago. We could have taken this concept of basically saying, small amount of data, foundational model, fine-tune it, and actually have good results. So all of us were focusing on, I need a bigger model. I need to bump more data.
Starting point is 01:04:05 I need to scrape the entire freaking Internet and chuck it all into the gigantic model. Spend tens of millions of dollars, warp every single GPU until Azure basically melts in order to go ahead and train this thing. Until the heat death of the universe. Right, exactly. And then meanwhile, it's like, or we literally could have taken a foundational model that was okay to good, $100, and bam, we get something good. So when they talk about there's no moat and all this other stuff between open source and not, literally my attitude toward this whole thing is like, no, just step backwards for a second.
Starting point is 01:04:39 The reality is we could have done this. We all got attracted to the idea, the shiny thing of, ooh, bigger, more, bigger, more, larger, more. That's all we got attracted to. And so in the end, I'm going, I don't care. These companies, the ones that, quote unquote, are trying to build a moat
Starting point is 01:04:58 around themselves, what they're doing is they're trying to make sure that they have a service in which you will give them your data. And then by definition, you will give away your competitive advantage. Right. Simple as that. For the folks that don't want to do that, which I think is the vast majority, then my attitude is quite simple.
Starting point is 01:05:17 Then don't do that and build your own model. Now, how about if I'm the general consumer? I just want to pump out a good blog template for me to work with. Yeah, absolutely. Why not? Seriously, I'm not trying to say these services aren't worthwhile. Quite the opposite. ChatGPT is fun.
Starting point is 01:05:35 Very valuable. Oh, yeah. It's extremely valuable. In fact, I've already had it pumping out code for me just for shits and giggles. So my Rust is- It's going to pump out some slides for you here soon, for tomorrow. Oh, that's a good idea. I should test out that.
Starting point is 01:05:47 Yeah, yeah. Take that 30 minutes, turn it into 12. Oh, yeah. That'd be perfect. Yeah, yeah. But see, you get my drift. Yeah, totally. Yeah, so my Rust code is Rusty.
Starting point is 01:05:56 And so basically, I was using ChatGPT to basically pump out a bunch of Rust code for me. I'm like, hey, this is great boilerplate. Now I've got something to work with, and boom, now I can start writing again. Right. So what is Databricks' play in this chess game? Like, what's your guys' angle? Our angle's quite simple. You've got a ton of data.
Starting point is 01:06:14 You need to ETL it, process it in the first place. Then you need to have a platform to run machine learning or data science or AI or whatever frickin' wording you want to use. Whether it's LLMs today, deep learning yesterday or tomorrow, image optical
Starting point is 01:06:33 resolutions, object recognition, I don't care. The point is that you have a ton of data. You need to be able to process it. You need to be able to access every single open source system or service. Databricks play is quite simple.
Starting point is 01:06:50 We just make it easy for you to do any of it. That's it. That's our only play. Let's make it easy. Are you for, I guess, then people owning their own data? It seems that that's your... So here's the thing. I'm absolutely for both from a Databricks perspective,
Starting point is 01:07:06 but also from an open source perspective, right? So I'm an open source contributor. I contributed to Apache Spark and MLflow, and I'm also a maintainer for Delta Lake, okay? And so, yeah, by definition, I'm always going to lean toward open source, which means you should own your data. Data should be a competitive advantage.
Starting point is 01:07:24 Everything else should be open source, basically, for all intents and purposes. I'm even for things like differential privacy and privacy-preserving histograms to basically protect your data. And I can go on a diatribe on that, so let's not do that. But the context is, I'm not saying, though, these services like OpenAI or whatever else aren't worthwhile. They are. They're cheap. They're helpful. In fact, training other systems isn't necessarily a bad thing either. For me, it's not about don't do it. It's about knowing what you're doing. Right. That's it. Yeah, transparency. Exactly. That's it. That's my take on it. If you want to use OpenAI within the Databricks platform, we make it easy. For crying out loud, we added SQL syntax directly,
Starting point is 01:08:09 so you can literally write Spark SQL, which at this point is basically ANSI SQL compliant. You literally write SQL to go ahead and access OpenAI to run an LLM directly against your data. So literally, party hardy, have fun. So it's not, our attitude isn't so much like, don't use one versus the other. Our attitude is very much, no, no, just know what you're doing.
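To make that concrete, here is a hypothetical sketch of the SQL route, written as it might appear in a Databricks notebook where spark is predefined. The ai_generate_text() function follows Databricks' announced AI Functions, but treat the exact signature, the endpoint name, and the table as assumptions, not verbatim documentation:

```python
# Hypothetical: call an LLM from Spark SQL via Databricks AI Functions.
# `support_tickets` and `my-openai-endpoint` are made up for illustration.
summaries = spark.sql("""
    SELECT
      ticket_id,
      ai_generate_text(
        CONCAT('Summarize this support ticket in one sentence: ', ticket_text),
        'my-openai-endpoint'
      ) AS summary
    FROM support_tickets
""")
summaries.show(truncate=False)
```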
Starting point is 01:08:33 Understand when you're using something like a service. Understand when it makes sense for you to build your own model. And we also make it easy for you to build, maintain, train, infer against that model. That's it. So I mentioned we have our transcripts as open source, right? Yeah. Everything we're saying here, when it hits the podcast, it's going to be transcribed into words.
Starting point is 01:08:52 How are ways we can use Dolly 2.0, this open model that you're talking about, this direction? How can we leverage these transcripts for our personal betterment as a podcast company? For example, as a podcast company, one of the first things, in fact, I'm actually already doing this technically for Delta Lake, is that we also have podcasts ourselves. So what are we doing, though? I'm spending time and effort to generate blogs based off of the podcasts. Why? Because it's better for Google SEO search.
Starting point is 01:09:21 It's not like I'm trying to just repeat the same thing. I'm just trying to summarize because we talked about, we talked about barbecue in the beginning, right? We talked about coffee. We probably don't need all of those details inside the transcript of the podcast of our blog. You want people to go ahead and actually understand what they're talking about when it comes to Dolly. Cool. We generate a blog based off of this conversation. It can summarize it, get to the key points. Boom, there you go. It simplifies the whole process,
Starting point is 01:09:49 so that way you're not spending exorbitant hours trying to figure out how to basically synthesize the key points out of our conversation right now. So it's still time for you to review and look to make sure the model isn't giving you garbage. There's still time for a producer or for any other person who is knowledgeable in this field to validate the statements. Maybe I'm full of BS of all I know, right? And then so you get an extra.
Starting point is 01:10:15 Sometimes. Oh, yeah, yeah. I don't know. Denny's full of it. Forget it. It most likely would be the conical versus flat burr grinder. But, again, that's a whole other story. The whole summary will just be Adam and I talking in color.
Starting point is 01:10:24 I'm Conical. I'm on your team. Conical is me. I'm Conical. Team Conical. There you go. Perfect. See? But the context is that we can go ahead and actually use these systems to simplify it. Would it be cheaper and easier if we just went ahead and did it with ChatGPT?
Starting point is 01:10:38 Yeah. Go for it. Would it be worthwhile to do it in your own Dolly model? Absolutely. Because you have your own style, right? So if you have your own style, if Dolly, or any other open source model, again, I want to be very clear here, goes ahead and is trained against your transcripts, it will then be able to start writing blogs based off of your style, right?
Starting point is 01:11:03 That's the cool thing about it. Is it cool to actually chain like that? Or is it better to go with a foundational model and then just our stuff? Or to be like, well, start with Dolly because it has instructions, and then add our style, and then maybe add something else?
Starting point is 01:11:16 Literally, my answer is all of the above, because we don't know. Just whatever you want. We don't know, because that's the whole point. Different foundational models will be better at different things. As simple as that.
Starting point is 01:11:27 Some models will be better at, for example, conversations. Some models will be better for writing purposes. Nat.dev. I'm forgetting the guy's name. Nat Friedman. Thank you. Oh, my God. I don't believe I spaced out on that.
Starting point is 01:11:43 He's a nobody. He's a small guy. Okay, so Nat Friedman, former CEO of GitHub. So slightly important guy. Nat.dev is an awesome playground, for example, where you can test out a lot of different models already. And you're literally just chucking, like, hey, let me try with GPT-3.
Starting point is 01:12:00 Let me try with Vicuna. Whatever else. And literally you will see with the same question, especially if we do the compare playground section, different answers from the different models. So, yeah, like literally you got to play a little bit to figure out which model makes sense for you. Yeah. Interesting.
Starting point is 01:12:18 Love it. Well, thanks for talking with us, Denny. Glad to, always. Aside from your opinions on coffee and whatnot, you're pretty good. Pretty solid dude, yeah. You know, those are fighting words. I just want to say that, okay? Those are fighting words.
Starting point is 01:12:32 Oh, that's good. All right. Gentlemen, thank you very much. Yes, thank you. All right. All right. Hey, friends. This episode is brought to you by CIQ, the founding sponsor and partner of Rocky Linux, Enterprise Linux, the open source community way. And I'm here with Gregory Kurtzer, the founder and CEO of CIQ and the creator of Rocky Linux.
Starting point is 01:13:07 So Greg, I know that a lot of people are still sort of catching up to some degree with what went down with CentOS, the Red Hat acquisition, and just the massive shift that required everyone using CentOS to do. Can you give me a glimpse into what happened there? We've seen a number of cases in the open source community where projects were pivoted due to business agenda or commercial needs. We saw that happen with CentOS. CentOS was one of the primary, one of the biggest enterprise operating systems ever. People were using it all over the place. Enterprise organizations and professional IT teams were all leveraging CentOS. For CentOS to be stripped away from the community and removed as a suitable option to meet their needs created a massive pain point and a gap within the industry. As one of the founders of
Starting point is 01:13:58 CentOS, I really took this to heart and I wanted to ensure that this does not happen again. And that is what we created with Rocky Linux and the RESF. Okay. You mentioned the RESF. What is that? And what is its relationship to Rocky Linux? The RESF is the Rocky Enterprise Software Foundation. And it is an organization that we created to hold ourselves accountable to what we've promised we're going to do with the community. It is community run. It is community led. We have a board of directors, which is comprised of a number of people that have a huge amount of experience, both with Linux as well as open source and community. And from this organization, we solidify the governance of how we are to manage Rocky Linux and any other projects that come and join in this vision. Sounds good, Greg.
Starting point is 01:14:54 I love it. So Enterprise Linux, the open source way, the community way, has a home at Rocky Linux and the RESF. Check it out and learn more at RockyLinux.org slash changelog. Again, RockyLinux.org slash changelog. All right, Stella Biderman. Yeah.
Starting point is 01:15:32 And you're with, I'm going to also butcher the name of the org. EleutherAI. Eleuther. EleutherAI. Yes. Okay, what is this? What is EleutherAI? Y'all were just talking with Databricks about Dolly.
Starting point is 01:15:43 This is right. Yes, correct. So that was built on top of an open source language model. Okay, yes. I trained that. Okay, so you're underneath Dolly. Yes. Okay. So you personally trained
Starting point is 01:15:58 it. Yes. Okay. What's the model? It's called Pythia. Pythia. It's a suite of language models, actually, that we put out a couple months ago. Okay. But in general, EleutherAI has trained several of the largest open source language models in the world in the past three years. Okay. Very nice.
Starting point is 01:16:18 So what do you want to tell the world then? What do I want to tell the world? Honestly, didn't think that far in advance. Okay. All right. Well, what should the world know? What should the world know? About what you do in terms of training models that Databricks uses, that's open source, etc.
Starting point is 01:16:33 Honestly, especially like the open source world should really know that the AI world really needs help from the open source community at large. That's actually, broadly speaking, why I'm here at the Linux Open Source Summit. Okay. You know, we're struggling with a lot of issues
Starting point is 01:16:50 about maintainability, issues about licensing, issues about regulation, issues about building sustainable ecosystems that the open source community writ large has been working on for years, if not decades. Yeah. And a lot of people in the AI world
Starting point is 01:17:08 are a little too proud to ask for help from non-AI people, which is definitely a real systemic problem. But there's, I think, a lot of... If people are excited about foundation models, large language models, whatever you want to call them, and want to get involved and don't know, or want to help and don't know that much about AI, there's a ton of open source work that needs to be done that we need help with to build a robust and enduring ecosystem.
Starting point is 01:17:40 Where is the money coming from? Where's the money coming from? Great question. So at EleutherAI, we recently formed a nonprofit. And we have donations from a number of companies, most prominently Google, Stability AI, and Hugging Face. Okay. And CoreWeave are among our biggest sponsors. We have also been applying for grants from mostly the U.S. government to pay for our forthcoming research and work. In terms of computing resources,
Starting point is 01:18:14 it's actually like training these really large language models is not that expensive, which is like... Is that a secret? I don't know if it's a secret or what, but I think that the CS world kind of got used to the idea that anything can be done on a personal laptop, and that that's kind of what constitutes
Starting point is 01:18:39 a reasonable amount of money to spend on a paper. And that's great. There's a huge accessibility boon for doing that. But training these large language models, it is pricey. It's not something that anyone can do on their own. But it's not ruinously expensive. There are thousands of companies
Starting point is 01:18:58 around the world that can afford to do this. There are dozens of universities that can afford to do this. And by and large, they just haven't been. Okay. So there's a model that you trained. Yeah. How much did that cost? So we trained, so it's part of a suite of models that had like 28 in it total. But altogether, that was like less than $800,000. The largest model, one training run would probably be like $200,000. Not bad.
Starting point is 01:19:28 That's more than a laptop. Which is more than a laptop. But it's less than. It's not like a mind-boggling amount of money. It's less than a Super Bowl commercial. It's true. Yeah. So right now, the largest open source,
Starting point is 01:19:42 well, okay, the second largest open source English language model in the world is called GPT-NeoX. We trained that, I trained that, my organization. And that cost us about $350,000. Or would have, if we weren't given the compute for free. But like $350,000
Starting point is 01:19:59 for the second largest open source language model in the world. And at the time we released it, it was the largest. Later, someone else trained a bigger model with sponsorship from the Russian government. But anyway. So GPT-3 came out in 2020. And for about two years, almost nobody was training and open-sourcing language models.
Starting point is 01:20:24 Google was doing it with similar models, but not like the same kinds of models that GPT-3 is. And we were doing it. It was really not that expensive. We got into it on compute that we got for free through a Google research computing program called the TensorFlow Research Cloud. And with that, we trained a 6 billion parameter language model,
Starting point is 01:20:47 the one that underpins the first version of Dolly that he was talking about. That's been extremely widely used, deployed in a whole bunch of different industry and research contexts, and been hugely successful. And it was literally just, like, Google gave it to us for free. It ran preemptibly on their research cluster. Basically, the idea of TRC is that they have
Starting point is 01:21:08 a research cluster that they don't always use all of. And so other researchers, independent researchers, academics, non-profits, can apply to be able to run preemptible jobs on their research cluster and just use the compute that they're not using at the time. And using that, we trained this model in like two and a half months. And it was a really big deal when it came out. It was the largest model of its type in the world by a sizable margin.
Starting point is 01:21:35 It was about three times the size of the, four? Four times the size of the largest open source model of its type in the world. Yeah. And the Pythia models we trained on like 120, 800 GPUs for a couple of weeks, which is certainly a lot of computing resources, but it's not like mind boggling amounts of compute. There are lots and lots and lots of companies that have that, that could, you know, it's less about it actually being too expensive and more about kind of having the political will to actually go do it. Yeah. Are you focused on training open source models? Is that your focus? So our focus is on open source AI research in general. Our kind of area of expertise is large scale AI.
Starting point is 01:22:18 And most of what we do is language models. But we've also worked on training and releasing other kinds of large-scale AI models. We were part of the OpenFold project. DeepMind created an algorithm for modeling protein interactions called AlphaFold. That was a really big deal. We helped some
Starting point is 01:22:39 academics scale up their research and replicate that and release it open source. We've done some stuff in the text-to-image space, both on our own, and some of our staff have kind of gone on and worked at Stability on some of their language, sorry, image models. And we are a big proponent of open source research in general. So the reason we decided to start training these large language models was, back in the summer of 2020, we thought this GPT-3 thing is going to be a major player in the future of AI. And it's going to be really essential, if you want to be doing something meaningful in AI,
Starting point is 01:23:21 you probably want to know how these things work. You want to be able to experiment with them. You want to have access to them. And back then, you couldn't even pay OpenAI to let you use the model. They announced that they had it, and that was it. And so we said, well, what the hell? Let's try to train a model like that. We'll learn something along the way. And so we started building an open source infrastructure for training large language models. We created a data set called the Pile, which is now kind of the de facto standard for training large language models. We created an
Starting point is 01:23:50 evaluation suite for consistently evaluating language models, because everyone runs their evaluations a little differently and there's huge reproducibility issues. So we built a framework that we could release open source and run on our own
Starting point is 01:24:05 models, run on other people's models, and actually have kind of meaningful apples-to-apples comparisons. And we started training large language models. We trained a 2.7 billion parameter model, which is like a little bit bigger than GPT-2 was at the time. And then we started training larger models. 6 billion parameters was the largest open source GPT-3-style language model in the world. 20 billion parameters was the largest language model of any sort to be released open source in the world. Since then, there's been a lot more investment and willingness to train and release models. There's several companies that are now doing it. So Mosaic
Starting point is 01:24:42 is a company that released a, nine-something I want to say, large language model that seems really excellent, like last week. There is Meta, which has been training and, sort of, releasing models. They'll tell you that they're open source releasing models, but that's just not actually correct. They're under non-commercial licenses
Starting point is 01:25:05 and they're not open source, despite their rhetoric to the contrary. But there's a whole bunch of companies. Stability AI is training large language models. So now there's a lot more people in this space doing it and releasing it. And honestly, from my point of view, we got into training large language models
Starting point is 01:25:21 mostly because we wanted to study them. We wanted to enable people to do essential research on interpretability, ethics, alignment, understanding how these models work, why these models work, and what they're doing, so that we can design better models and so that we can know what appropriate and inappropriate deployment contexts for them are. And so now that there's a lot more people working in kind of this open source training space, we're moving more towards, you know, doing that kind of scientific research that we've always wanted to do. So in the past six
Starting point is 01:25:50 months, we've been doing a lot of work in interpreting language models and kind of understanding why they behave the way they do. My personal kind of area of focus is tracing the behavior of language models back to their actual training data. So the model that Dolly 2.0 is trained on, the Pythia suite, what kind of makes that special is that most language model suites are constructed very ad hoc. I'm calling them suites because you have several models that are similar, of different sizes. So like the OPT suite by Meta, for example, ranges from 125 million parameters to 175 billion parameters. But they're not actually very consistent between them. Some of them even have different architectures.
Starting point is 01:26:34 They have different data order. There's a lot of stuff that kind of limits your ability to understand, to do controlled experiments on these models. And so we sat down and we said, if we wanted to design from the ground up a suite of large language models that was designed to enable scientific research, what would it look like? What kinds of properties would it have? What kinds of experiments do we think people are going to want to do that we're going to need to enable? And we built this list of requirements and then created a model suite that satisfies that. So it was trained on entirely publicly available data.
Starting point is 01:27:06 All of the training, it was trained on the same data. Every model in the suite was trained on the same data in the same order. And we have a whole lot of intermediate checkpoints that are saved. So if you want to know, you know, after 10 billion tokens, how each model in the suite is performing, you can go and grab those checkpoints after 10 billion tokens. And then you can say, okay, what's the next data point it saw during training after 10 billion tokens? What was the 10-billion-and-first token? And you can actually use some stuff we've uploaded
Starting point is 01:27:32 to the internet to actually load that data in the same order it's seen by the models. You can study kind of how being exposed to particular training data influences model behavior. So we've been using this right now primarily to study memorization, understanding, because language models have a propensity for reproducing long exact sequences from their training corpora. And we're interested in understanding what causes memorization, why certain strings get memorized and others don't. Right now, I'm wrapping up our kind of
Starting point is 01:28:01 first paper on that. We have some more research in the works, trying to understand, you know, looking at the actual models throughout the course of training and looking at kind of the training data points that they see and trying to reverse engineer what that actual interaction between the model and the data is. And yeah, this is something I'm personally really high on.
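Concretely, the checkpoint-grabbing workflow Stella describes looks something like this, following the revision naming documented in EleutherAI's Pythia repository; the model size and step number here are illustrative:

```python
# Sketch: load an intermediate Pythia checkpoint from Hugging Face.
# Checkpoints are published as git revisions named like "step1000", "step10000", ...
from transformers import AutoTokenizer, GPTNeoXForCausalLM

name = "EleutherAI/pythia-160m"
revision = "step10000"  # the model's weights partway through training

model = GPTNeoXForCausalLM.from_pretrained(name, revision=revision)
tokenizer = AutoTokenizer.from_pretrained(name, revision=revision)
```

Running the same prompt across several revisions is exactly the kind of controlled, apples-to-apples experiment the suite was built to enable.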
Starting point is 01:28:19 Most interpretability research right now is kind of focused on final trained models as like pre-existing artifacts. So you have this trained model and you want to understand what behaviors it has. But, you know, my perspective as someone who trains these models is much more focused on kind of where they come from and what, especially like my overarching goal is to kind of, you know, if I as a person who trains a large language model have a particular desire for a property the model has, a property the model doesn't have, what decisions can I make to actually influence that and to make the model have the properties I want it to have? So if there's data I don't want it to memorize, is there a way that I can know ahead of time what's going to be memorized?
Starting point is 01:28:58 That's the paper that we actually just released on arXiv about forecasting what is going to be memorized before you actually train the model. Is that to make it less black box? More like, you deploy it and you don't know what it can do, so that you can sort of understand, okay, here's the data, here's how it's trained, to sort of have more clarity of what the box actually contains versus this black box.
Starting point is 01:29:18 Is that why that's important? That is what the field of interpretability is about in general. Okay. And I would say, kind of building on that, what my research is about in particular is not just opening up that black box and looking inside and understanding
Starting point is 01:29:30 what the model is actually doing, but understanding where it came from and how we can build boxes that are more transparent from the ground up. Predictable maybe even? Yeah. Yeah? Because I mean, that's one of the fears is,
Starting point is 01:29:44 you know, especially with, like, Bing. When they put that out there, I think when it threatened that person, like, there was some sort of, like, threat on humanity, essentially. And it's like you deploy this thing out into the world and you don't understand what it can actually do. Is that to be more predictable, more controlled to some degree? Sorry? And even designable, like, say, well, forget these things, remember these things. Yeah.
Starting point is 01:30:07 Designability is a really big component, I think, that's going to become huge in the future. Right. And really, it hasn't been studied primarily because people haven't had the tools. Very few model suites have intermediate checkpoints at all. A lot of publicly released models weren't trained on publicly released data sets.
Starting point is 01:30:23 Or if they were trained on publicly released data sets, they didn't tell you what order it was trained on. And it turns out that matters a lot. What it saw early in training, what it saw late in training. And so there's really a huge reproducibility issue in terms of, if you want to dig in and really understand how, data point by data point, the model is learning to behave, you need to be able to basically fully reproduce the training. Not actually, because you're not going to spend a couple hundred thousand dollars, but at least in principle, you need to be able to inspect individual data points, know when it's going to get loaded, understand kind of how it works.
Starting point is 01:30:57 And this is something that we've put a huge amount of resources into, both on the training side as well as kind of on the engineering side. It was not easy, but you can actually reproduce our model training exactly. So if you take the code base that we used to train these Pythia models and you pick a checkpoint and you load that checkpoint and you resume training from that checkpoint, you will end up with the same fully trained model that we did. Exactly.
Starting point is 01:31:23 That's important. That is really important. It's important because if you want to understand how to design models, you need to understand how they're changing over the course of training. And that is really persnickety and really sensitive to a lot of implementation specific details that tend to not get released. How far in the future do you think, since you're at the training level, you're like the ground level of if this is the eureka moment for humanity. Yeah. Right? How far in the future do you think and do you have fear, trepidation, hope?
Starting point is 01:31:54 I really don't know. My kind of attitude is that the recent, like there was a really big paradigm shift in 2020 with the release of GPT-3 and the aggressive focus on scaling. And people really changed their attitudes towards how to design language models and how they can be used and what they can be used for. In a sense, we got really lucky, because it wasn't that dangerous. There were a lot of fears about what GPT-3 could do.
Starting point is 01:32:24 And by and large, it turned out to be pretty safe. There wasn't all that much harm done, and a lot of the fears turned out not to come to fruition. And looking forward, I think the really important thing to think about is, we obviously can't predict the next paradigm shift. But building tools that allow us to hopefully more readily adapt and respond to future paradigm shifts in large scale AI. So that one day there probably will be something that gets developed that is dangerous, and we want to be able to be, I guess, ready for that. Yeah.
Starting point is 01:32:57 Yeah. Cool. Well, what are some touch points, people who are interested in what you're up to, want to help out, want to give money, want to read more, where can people connect with you? So the best place to connect with us is our Discord server. We are a research institute, but we actually operate basically entirely in the public view. We're distributed all over the world,
Starting point is 01:33:19 and we do our research in a public Discord. And anyone can join, anyone can drop in, read about what we're getting up to, hang out with us, chat with us about AI. So our Discord server is discord.gg slash EleutherAI. There's also a link on our website, which is eleuther.ai. Shockingly. We'll link it up in the show notes for sure. Yeah. And, yeah, we're always happy to take on more volunteers.
Starting point is 01:33:44 We have a small professional staff and a large number of volunteers that help out as well. How small is small? Like 10 full-time employees. Okay. And if they go to the Discord server, what can they do there? What can they expect from the Discord server? Like, you're there, others are there. Yeah, so you can chat about AI.
Starting point is 01:34:01 We have a bunch of discussion channels where people talk about kind of cutting edge trends in artificial intelligence. Honestly, I don't really follow AI publication news anymore because I just follow my Discord server and everything that's important shows up for me. There you go. Which is a really nice place to be. You can talk with us. You can talk with other researchers.
Starting point is 01:34:19 We have a large number of researchers at the cutting edge of AI. I can't count the number of times that someone's posted a paper and been like, hey, this is really cool. Like, does anyone know anything about this? And someone just, like, tags the guy who wrote the paper. That happens all the time. We have people from OpenAI, Anthropic, Meta, DeepMind, like all the major labs, who come in and chat about language models, give advice, give, you know, perspectives on research, and talk about kind of how things are going. You can also get involved with ongoing research projects. So we have a dozen-ish ongoing research projects, ranging from figuring out how to train better language models
Starting point is 01:34:56 to training language models in other languages. So if you look at like the list of the hundred largest language models in the world, basically all of them are English or Chinese. Yeah. And so if you want to spread the benefits of this technology and the ability to kind of use and understand this technology to the world at large, like not everyone speaks English and Chinese, and even the people who do often also speak other languages that they care about.
Starting point is 01:35:22 So we're training, we've trained and released several Korean language models. We're currently training with the plan of releasing some Indic language models, as well as some Romance language models. So yeah, on the developing new model side, we do research like that. On the interpretability side, we do a lot of different stuff, understanding training dynamics, understanding how to evaluate language models, understanding how to kind of extract the best information from them.
Starting point is 01:35:50 We recently started up some work on kind of red teaming them and trying to understand, you know, there's a lot of stuff out there right now about prompt hacking, about how people are trying to put filters on language models and they're kind of not really very successful, and trying to understand what the dynamics of that are, whether you can build meaningful safeguards around these things or whether it's always going to be subverted. We do a lot of work like that as well. Very cool. Well, thanks for coming on the show, Stella. It was awesome having this deep dive with you. I love that. Thank you. Great to meet you guys. Yeah. So if you'd have told me a few years ago that I'd be going to an open source summit
Starting point is 01:36:28 and talking about AI in open source at this level, from Cody, a coding assistant, to Databricks and training models on small data sets, to Stella's and EleutherAI's work on open AI research, and all these things that'd be real, that'd be touchable, that'd be usable today to transform my work, to transform your work, to transform the world around me. I would not have believed it, but it's true. We're here and this show was awesome.
Starting point is 01:37:02 So hope you enjoyed it. Once again, a big thank you to our friends at GitHub for sponsoring us to go to this conference as part of maintainer month. There is a small bonus for our plus plus subscribers. So stick around for that. If you're not a plus plus subscriber, it's too easy. changelog.com slash plus plus.
Starting point is 01:37:23 We drop the ads. We obviously give you bonus content. We slash plus plus. We drop the ads. We obviously give you bonus content. We bring you a little closer to the metal and the best part, you directly support us. 10 bucks a month, 100 bucks a year. changelog.com slash plus plus. That's it. The show's done.
Starting point is 01:37:40 Thanks for tuning in. We will see you on Friday.
