The AI Daily Brief: Artificial Intelligence News and Analysis - GPT Engineer Generates An Entire Codebase Based On A Prompt

Episode Date: June 19, 2023

GPT-Engineer is the latest AI project to have Github developers going nuts. The project promises to create entire code bases with a single prompt, reigniting excitement around the potential for more p...owerful AI agents. The AI Breakdown helps you understand the most important news and discussions in AI. The AI Breakdown helps you understand the most important news and discussions in AI.  Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/

Transcript
Discussion (0)
Starting point is 00:00:00 Today on the AI breakdown, we're looking at GPT Engineer, a new project that promises to generate entire codebases from just a prompt. Before that on the brief, the Grammys announces AI rules, LLMs outperform humans on data labeling, and much more. The AI breakdown is a daily podcast and video about the most important news and discussions in AI. Like, subscribe and share, and go to breakdown.network for more information. Welcome back to the AI breakdown brief, all the AI headline news that you need in five minutes or less. We start today with a study that I think will surprise no one, but is still hugely important. Refuel is an AI company that just raised around $5 million, whose focus is on cleaning and labeling data for all sorts of different types of business applications. Now, they recently did a study
Starting point is 00:00:45 across a wide variety of data sets in which they've found that LLMs are able to label data as well as humans in terms of accuracy, but for much cheaper and much faster. In fact, they say LLMs can label text data sets 20 times faster and seven times cheaper than human counter. parts. Now, when it comes to accuracy, GPT4 perform best out of the box with 88.4% agreement with Ground Truth as compared to 86% for skilled human annotators, so actually slightly better than human. The types of data sets that were part of this study included online banking queries, USSEC filings, toxicity detection and public user comments, product data from Walmart and Amazon, company descriptions from Wikipedia, science exam questions, and more. So again, a really
Starting point is 00:01:25 wide-ranging set of data. Open AI, Anthropic, Hugging Face, and Google all had LLMs that were considered as part of the study. Now, while GPT4 was the most performant of these models, numerous other models also achieved strong performance of above 80% agreement with Groundtruth at one-tenth the cost of the GPT4 API. I don't think this is all that surprising, but it is a great example of the type of rote work that is likely to be automated away. For people who are paying attention to this, some will see in this the destruction of an entire category of jobs, but others will see a freeing up of human capacity for other types of higher order thinking. Next up in news that surprises no one 92% of developers, according to a new GitHub survey, are using AI in the workplace.
Starting point is 00:02:07 Not only does this cohort represent early adopters in general, it's also one of the areas where there has been the fastest development of new tooling to open up these possibilities. Remember, before Windows started rolling out copilot across the entire platform, Microsoft had co-pilot in GitHub. As Chris Kastanova puts it, AI isn't programming's future, it's its present. Speaking of AI and programming, our main AI breakdown today is about something called gpte engineer. This is exploding on GitHub right now with tons of developers piling on, and it's basically an AI agent that can write an entire codebase with a prompt. We'll discuss where it came
Starting point is 00:02:41 from, what people are using it for, and what the implications are a little later. For now, let's move over to the policy side of AI. Last week, of course, the European Parliament passed a draft of the AI Act. Now, there are still a number of bureaucratic processes through which the AI Act could change in some details, but by and large it represents the most comprehensive. legislative legislation we have on AI from a global power to date. As we discussed last week, a lot of the EU AI Act was written in a previous time, a pre-chat GPT era. And so much of it has to do with how government, such as law enforcement, uses AI, as opposed
Starting point is 00:03:14 to regulating generative AI specifically. But given the rise of generative AI, they did graft on a few different provisions, and those are some of the most controversial so far. One of them is a rule that foundational models, whether it's closed source or open source, will have to create a comprehensive accounting of the data that they're trained on. Some developers say that this isn't practical or is unworkable and would bring existing LLMs into noncompliance almost immediately. In response, Stanford University researchers have looked into whether the existing foundation
Starting point is 00:03:40 models from companies like OpenAI and Google would actually pass muster. They organized the AI requirements into a number of categories, including data sources, data, copyrighted data, compute, energy, capabilities, and limitations, risks and mitigation, evaluations, testing, machine-generated content, member states, and downstream documentation and then gave each of the models a score out of four for each of those different categories. Hugging Face's Blue Model would be the most compliant right now with 36 out of 48 possible points with this Stanford scoring system. And of the other big guys, Google scored a 27 out of 48, OpenAI had a 25 out of 48, Meta had a 21 out of 48, stability AI had a 22 out of 48,
Starting point is 00:04:19 and pulling up the rear was Anthropics Claude with just a 7 out of 48. Now the researchers note that a lot of places where companies lost points wasn't necessarily because they were out of compliance with the EU regulations, but because there wasn't documentation available to show that they actually were in compliance. They write, our work indicates where each foundation model provider can improve. Our work indicates where each foundation model provider can improve. We highlight many steps that are low-hanging fruit, such as improving the documentation made available to downstream developers that build on foundation models. Overall, they say we find that foundation model providers unevenly comply with the stated requirements of the draft EUAI Act.
Starting point is 00:04:54 We believe that all providers can feasibly improve their conduct independent of where they fall on this spectrum. Meanwhile, over in the UK, Rishi-Soonax government is taking its own steps towards better AI alignment and AI policy, allocating 100 million pounds for a new UK AI task force. Over the weekend, the UK announced that tech entrepreneur Ian Hogarth, who was also a prolific AI investor and frequent commenter on AI risk and AI safety issues, would be leading up that task force. Ian co-founded Songkick and sold that company to Warner Music in 2017,
Starting point is 00:05:24 and for the last five years has authored an annual state of AI report. Ian concludes his announcement thread by saying, I am fundamentally optimistic about the potential for science and technology to transform our lives for the better. The opportunities for AI to be a force for good are truly remarkable, but we need to do it safely. Now, staying on the theme of how AI might change things, the Grammys have come forward with their own policies around AI use and music. On Friday, the Recording Academy announced that artists who use AI in their songs can still submit them for awards. There just has to be what they call, quote, meaningful human authorship.
Starting point is 00:05:55 However, songs that are fully generated by AI are not eligible and will be banned for the purposes of the Grammys. Between the Grammys, then, soft approving AI and Paul McCartney using AI to help the Beatles release their quote-unquote last record, it seems like the trajectory for AI music is firmly focused in one and one direction only. Leslie, this week, the annual Cannes Lions advertising festival happens in France, and so it's only appropriate that we close on an advertising campaign for the real world. Camera maker Nikon, who of course faces some pretty significant headwinds when it comes to their core business model, has just released a new campaign called Don't Give Up on the Real World.
Starting point is 00:06:30 It's actually a really cool campaign. They're taking what is clearly meant to be a mid-jurney prompt and overlaying it on a photo that was shot on a Nikon camera in the real world, but that looks like it was generated by AI. The tagline of the campaign is Don't Give Up on the Real World. Now outside, just a really beautiful reminder of how amazing the world that we live in is, It's notable to me that we're already at the point where companies and industries are now feeling like they have to actively lobby against artificial intelligence in order to secure their future destiny. Fascinating times we live in for sure. That's it for today's AI Breakdown Brief.
Starting point is 00:07:03 If you're enjoying the AI breakdown, please like, subscribe and share. And I'll be back soon with the main AI breakdown. GPT engineer is a new white-hot AI agent that can create entire codebases and which has the attention of basically everyone on Gets. GitHub right now. You may remember Emmett Homm from my show that I did with him about the Andrew Tate chat bot that he had created. Emmett had actually gone from non-coding to building this bot inside the scope of just a couple months, and we talked about that journey. Now, along that journey, Emmett had come to some conclusions about where one can learn about the latest advances in AI. On May 5th, he tweeted, there's serious alpha in just learning about new AI tools before they go
Starting point is 00:07:43 mainstream. YouTube is a lagging indicator by two to three weeks, Twitter by about one week, GitHub, help forums, and tool-specific discords are at the edge of innovation. Now, given that Emmett had said that, I was particularly intrigued to notice another tweet from him on Friday where he wrote, GPT Engineer is blowing up on GitHub right now, prompt an AI agent to write an entire codebase, plus 2,000 stars today alone. I'll be experimenting with it this weekend to improve my workflow. Now, to me, GPT Engineer represents two converging trends.
Starting point is 00:08:15 The first has to do with developers being on the front lines of how workflows are being reimagined and reorganized on the basis of new AI tools. Just this morning on the brief, I talked about a new survey from GitHub that suggested that nine out of ten developers, actually 92 percent, to be precise, were already using AI coding tools at work. This matches anecdotal evidence and basically the presumption that anyone at this point is making. Part of the strategy of companies like Google to catch up to competitors like OpenAI has been
Starting point is 00:08:43 to jump out ahead of them in terms of tools. help people code. And in general, there is just a massive discussion about the increases in productivity that AI-supported coding could unleash. So this is trend number one, AI as a tool for coding. Trend number two is autonomous AI agents. At the beginning of April, Auto-GPT and Baby AGI really started to smash their way into public consciousness. These were tools that before ChatGBTGPT and Bard had access to the internet could go out and search and gather information. They had long in short-term memory management, and the goal was for them to be able to, when given a task, figure out the set of steps that were necessary to complete that task, and then go out and actually
Starting point is 00:09:23 do it. Now, that included the potential of spinning up other AI agents to accomplish specific tasks, and that's really what got people excited. It was the idea that rather than just using chat GPT to help them plan out the set of steps that they themselves needed to take, all they needed to do was prompt using natural language and some autonomous AI agent would be able to figure out how to accomplish a task and then go do it. To understand just how excited people were about this, all one needs to do is look at their chart of GitHub stars. Stars are simply a way the developers can flag that they want to be able to get back to a repository later to interact with it or learn from it in some way. And by the beginning of May, just one month after this project launch, AutoGPT had more than
Starting point is 00:10:04 100,000 stars. Today, that number is over 140,000. What's more, lots of people tried no-code implementations of Auto-GPT-like tools. There was God Mode, for example, as well as Agent GPT. But it wasn't long before the hype wore off, and some people started to think that maybe Auto-GPT wasn't all it was cracked up to be, at least not yet. Now, I've talked before about why I think this has more to do with our expectations and the speed with which we assume that things are going to happen, as opposed to any failure of auto-GPT, but what makes Antonosica's GPT engineer interesting to me is that it takes that same impulse to try to get autonomous AI agents to do more, but puts it within a specific domain in this case coding. It felt to me for some time that the way that we're going to see
Starting point is 00:10:49 AI agents actually come to practice and be useful is by having them focused in on specific types of tasks. The GPT4 engineer Read Me says, specify what you want to build, the AI asks or clarification and then builds it. GPT Engineer is made to be easy to adapt, extend, and make your agent learn how you want your code to look. It generates an entire code base based on a prompt. Anton, who's the founder and CTO at Depict, wrote on Twitter, introducing GPT Engineer. One prompt generates a code base, asks clarifying questions, generates technical spec,
Starting point is 00:11:21 writes all necessary code, easy to add your own reasoning steps, modify, and experiment. Let's you finish a coding project in minutes. The example in the demo that he gave on Twitter is multiplayer snake in the browser. Use a Python backend with MVC components. The view needs to stream the state to all connected players. Please implement also the HTML and JavaScript necessary to run the game with only the code you generate. Now the next step, and this is part of what makes GPT engineer interesting, is that there's a process where it asks for needed clarification. So in this case, one, game rules and mechanics.
Starting point is 00:11:51 How exactly does the snake move grow and interact with other players? Are there any power-ups or special game elements? two player connections. How many players can join a game? Is there a lobby system or matchmaking? And then there's a number of other questions, including game state updates, user interface, game controls, game and conditions, code structures, etc. Anton then answers this set of questions to give GPT engineer the information it needs. And then it's off to the races. What Anton's left with is a complete code base, ready to go. Now, there are two very different categories of responses so far. The first, and definitely the most common that I've seen, is excitement. This is another one
Starting point is 00:12:23 of those projects that to people really shows the possibilities of what generative AI could be. But in this particular domain, there is also some amount of skepticism. Mr. Cadd writes, this is great. I think people are going to be skeptical about it working due to the failure of previous projects like this. But if you look at it as a time saver to producing projects that require multiple files, it makes more sense. Benjamin DeCracker writes, interesting but very skeptical. Even directly supervised GPT4 cannot successfully build things currently, like dynamic websites, interactive apps and games. It can build components which sometimes work and sometimes don't,
Starting point is 00:12:56 but it goes off the rails fast. We'd like to see tests. Tom Gensler wrote a LinkedIn post about this called Conversational Code, an exploration of GPT Engineer. Tom writes, Imagine a future where creating a software project is as easy as a friendly chat. Envision sharing your needs and watching them transform into a well-crafted software project without writing a line of code.
Starting point is 00:13:16 GBT Engineer, Tom says, is more than just a project. It's a glimpse into a future where large language model like OpenAI's GPT play a pivotal role in shaping requirements in orchestrating software development. Though not yet fully featured, it foreshadows a time when software creation is a dynamic dialogue involving human creativity and machine intelligence. Tom clarifies the process of interaction between GPT engineer and the user at the beginning. First, the user supplies a text file with the software requirements. GPT engineer places an initial message to OpenAI's GPT to identify clarifying questions. The GPT engineer system responds prompting the user with those clarifying questions.
Starting point is 00:13:51 and then this loops until all of the relevant questions are clarified to GPT's satisfaction. From there, the refined requirements are packaged up as system prompts, and also GPT engineer adds an additional set of instructions around what it wants to see as an output. From there, the GPT engineer system receives a response from GPT4, and then creates the source code files for the software project that the user provided instructions for. Tom shares a number of potential improvements for how this could go to the next level. One, he suggests is iterative development. quote, any project like GPT Engineer that relies on LLMs is still subject to misinterpretation of a user's
Starting point is 00:14:25 requirements and intent. This would be mitigated if there was the ability to iterate after the initial generation of code. GPT Engineer currently does not have the capability to iterate development, but it's easily imagined how it could. Tom also recommends some changes around the workspace project structure to allow for a multi-level file directory organization and points out that GBT engineer is currently limited by the context size limits of the input for the LLM it's working with. Now, Matthew Berman, who's another great AI YouTuber and who focuses often on technical implementations of new tools, got GPT engineer to work and was really impressed by the first outputs. A new AI coding partner on the block, and it is absolutely incredible.
Starting point is 00:15:02 Let me show you what I've been able to do the first time. No edits, no bug fixes, no nothing. This is the game snake. I have not been able to get any other large language model to create this on the first pass. And this new project called GPT Engineer is absolutely able to do this the first time. I'm going to show you how to install this. I'm going to show you how to use it. And you're going to have your mind blown as well.
Starting point is 00:15:27 Let's go. Now, this is still an incredibly nascent project with a ton of work to be done. And every day, we're getting updates from Anton about new improvements. Anton shared a chart of GitHub stars yesterday afternoon, showing that over the last three days he had gone from around 500 to 12,000, which is already a remarkable jump up. But as of recording this afternoon on Monday, that number was already up to nearly 17,000 stars. As people share what they've actually been able to build with it, I will be sure to share that here as well.
Starting point is 00:15:55 All right, guys, that is it for today's AI breakdown. If you are enjoying this, please like, subscribe, and share. Check out the podcast and the newsletter version. And until next time, peace.

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.