The Good Tech Companies - This One Practice Makes LLMs Easier to Build, Test, and Scale
Episode Date: April 7, 2025. This story was originally published on HackerNoon at: https://hackernoon.com/this-one-practice-makes-llms-easier-to-build-test-and-scale. LLM prompt modularization allows you to safely introduce changes to your system over time. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #ai, #ai-prompt-optimization, #modular-prompt-engineering, #reduce-llm-costs, #reliable-prompt-design, #debug-llm-outputs, #llm-production-issues, #good-company, and more. This story was written by: @andrewproton. Learn more about this writer by checking @andrewproton's about page, and for more stories, please visit hackernoon.com. LLM prompt modularization allows you to safely introduce changes to your system over time. How and when to do it is described below.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
This one practice makes LLMs easier to build, test, and scale, by Andrew Prosikin.
This is part of an ongoing series; see the first and second posts.
Principle 3. Modularize the prompts.
A hideous monstrosity: every experienced engineer has seen one. Code that is so vast,
high-risk, and difficult to understand that no one dares to touch it.
There are no unit tests, and every change is cause for a minor heart attack.
The only ones who venture near it are the old-timers,
those who were around when the monster was built, and even they only come close when there
is no alternative.
It's stale and unmodularized, and the dependencies are out of date.
The component is too dangerous to seriously alter.
I remember the first monstrosity I encountered.
A 5,000 line function that was central to the operations of a business worth hundreds
of millions of dollars.
Barely anybody had the confidence to touch it.
When it broke, whole teams were woken up in the middle of the night.
All development in the company was slowed down because of a dependency on this key component.
Millions of dollars were spent trying to manage the monster.
What does all of this have to do with LLM prompts?
They can become monstrosities too.
So scary to change, that no one touches them.
Or, conversely, teams try fixing them and cause an avalanche of incidents.
What customers need. Customers don't want to pay for software that works correctly only on Tuesdays and Thursdays; they demand constant reliability and a stream of new features.
When building long-term, high-reliability systems, it's essential to enable the application to evolve while constantly keeping the lights on.
This applies to gen-AI-powered applications as much as to traditional software.
So how do you get a healthy AI-powered application and not a monstrosity?
There are over a dozen approaches all covered in this series.
They all start with one principle.
Instead of one ginormous prompt, you want multiple smaller focused prompts that each
aim to solve a single problem.
What is modularization?
Modularization is the practice of breaking down a complex system into smaller, self-contained, and reusable components.
In traditional software engineering, this means writing functions, classes, and services that each handle a specific task.
In the context of prompt engineering for LLMs, modularization means splitting a large, monolithic prompt into smaller, focused prompts, each designed
to perform a single, well-defined job.
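To make that concrete, here is a minimal sketch of what a couple of focused prompts might look like in code. The prompt wording, the helper names, and the generic llm callable are illustrative assumptions, not taken from the original article.

```python
# A minimal sketch: each prompt does one well-defined job and returns one
# output. The prompt wording and the generic `llm` callable are illustrative.

CLASSIFY_PROMPT = (
    "Classify the following support message into exactly one category from: "
    "billing, technical, account, other.\n\nMessage: {message}\n\nCategory:"
)

TRANSLATE_PROMPT = (
    "Translate the following message into English. "
    "Return only the translation.\n\nMessage: {message}"
)

def classify(llm, message: str) -> str:
    """One focused prompt, one job, one output to validate."""
    return llm(CLASSIFY_PROMPT.format(message=message)).strip()

def translate(llm, message: str) -> str:
    return llm(TRANSLATE_PROMPT.format(message=message)).strip()
```

Here llm is assumed to be any callable that takes a prompt string and returns the model's text.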
Benefits of modularization. Modularization allows you to safely introduce changes to your system over time.
Its importance grows when: the length of time the application will be maintained increases;
the number and complexity of features expected to be added increases;
and the reliability requirements on the system get stricter.
All of these dimensions need to be understood when planning out the system.
But how specifically does modularization help maintain the system?
The main benefits are described below.
Risk reduction. LLM prompt performance is inherently unstable.
The nature of prompts is such that any change can affect output in unpredictable ways.
You can manage this risk by breaking big prompts into components, where a change can only affect
the performance of a part of the system.
Even if one prompt is broken, the rest of the system will operate as before the change.
But what if prompts operate as a chain?
Wouldn't breaking one component still break the chain?
Yes, it would,
but the damage is still reduced in this scenario. An erroneous output in a prompt chain can
supply the downstream prompts with faulty inputs, but each component would still operate
as before the change on the set of valid inputs. Contrast this with altering a giant prompt:
the change can, and will, affect every bit of logic encoded in that prompt.
You didn't break one aspect of the system, you potentially broke every part of it.
Operating chains of prompts safely is a future chapter in the series.
You need to plan for various types of failures and have contingency plans.
But this is beyond the scope here.
Improved testability. Anyone who has written unit tests knows that a simple function that does a single thing is way easier to test than a complex function that tries to do many different things.
The same applies to prompts. A small, focused prompt can be tested much more thoroughly, both manually and in a fully automated manner.
Better performance. A wide body of evidence shows that shorter prompts tend to outperform longer ones (1, 2, 3).
Research on the effects of multitasking on prompt performance is more mixed (4, 5).
A perfectly optimized prompt can, under the right circumstances, multitask.
In practice though, it is much easier to optimize focused prompts, where you can track performance
along a single main dimension.
You should aim for more focused prompts wherever possible.
Ease of knowledge sharing. Explaining the intricacies of a 3,000-word super prompt to a new team member is a journey.
And no matter how much you explain, the only ones who have a feel for this beast will be its contributing authors.
A system of prompts, with each part being relatively simple, can be onboarded to much faster, and engineers will start being productive sooner.
Cost optimization. By using different models in different parts of the system, you can achieve significant cost and latency savings without affecting response quality.
For example, a prompt that determines the input language doesn't have to be particularly smart; it doesn't require your latest and most expensive model. On the other hand, the prompt that generates the reply based on documentation could benefit from the built-in chain-of-thought reasoning embedded in high-end models.
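A hedged sketch of that routing idea follows; the model names are placeholders, and call_model stands in for whatever client function your stack actually provides.

```python
from typing import Callable

# Route each prompt to an appropriately sized model; names are placeholders.
MODEL_FOR_PROMPT = {
    "detect_language": "small-cheap-model",      # trivial task: cheap and fast
    "generate_reply":  "large-reasoning-model",  # benefits from built-in reasoning
}

def run_prompt(call_model: Callable[[str, str], str],
               prompt_name: str, prompt_text: str) -> str:
    model = MODEL_FOR_PROMPT[prompt_name]
    return call_model(model, prompt_text)
```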
When not to modularize. Most software-powered applications require safely adding features over extended periods of time.
There is, however, an exception. Prototype applications are not intended
to be maintained for long, they won't get new features, and are not meant for high
reliability. So don't waste time with modularization when building prototypes.
In fact, most of the patterns in this series do not apply to prototype
applications. When building a prototype, go quick, verify the critical unknowns, and
then throw the code away.
Another consideration is knowing when to stop modularizing.
There is overhead to managing extra prompts, and if the benefits of further modularization
are low, you should stop breaking the system up further.
Infrastructure for modularization. If modularizing prompts were trivial, everybody would be doing it.
To manage many prompts in a system, you need to invest in infrastructure; without it, you will get chaos.
Here are the minimal requirements for the LLM prompt infrastructure:
- The ability to add prompts quickly and painlessly, in a standardized way. This is particularly important when prompts are loaded from outside the codebase (see Principle 2, Load prompts safely, if you really have to).
- The ability to deploy prompts in an automated way.
- The ability to log and monitor the inputs and outputs of individual prompts.
- The ability to add automated tests that cover prompts.
- A way to easily track token and dollar spend on various prompts.
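To make those requirements concrete, here is one minimal, illustrative way they could be wired together. The PromptRegistry class, its word-count token accounting, and the generic llm callable are assumptions, not a prescribed design.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("prompts")

class PromptRegistry:
    """Minimal sketch: standardized registration, logging, and spend tracking."""

    def __init__(self):
        self.prompts = {}      # name -> template, added in one standardized way
        self.token_spend = {}  # name -> rough running token count

    def register(self, name: str, template: str):
        self.prompts[name] = template

    def run(self, llm, name: str, **inputs) -> str:
        prompt = self.prompts[name].format(**inputs)
        start = time.time()
        output = llm(prompt)
        # Log inputs and outputs of each individual prompt for monitoring.
        log.info("prompt=%s latency=%.2fs inputs=%r output=%r",
                 name, time.time() - start, inputs, output)
        # Track approximate token spend per prompt (word count as a stand-in).
        self.token_spend[name] = (self.token_spend.get(name, 0)
                                  + len(prompt.split()) + len(output.split()))
        return output
```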
Case study. Let's see how building a gen-AI-powered system plays out in practice, with and without modularization.
No modularization. You are building a tech support app and are determined to implement it with a single prompt.
In the simplest version, you can imagine a monolith prompt that generates responses while loading relevant documentation through RAG.
Looks nice and easy, right? But as you add features, problems with this architecture emerge.
You want to respond to messages in a fixed list of languages, but not handle others.
To achieve this, you add prompt instructions to only respond in certain languages and get the LLM to return a language field for reporting purposes.
You want all conversations classified. Add a label field to the prompt output.
When the user is unhappy, escalate the case to human support. Add an escalate_to_human output variable along with instructions in the prompt.
Need a translation of all messages sent, for internal audit. Return a translated field with the message in English.
Need protection to make sure that the app never asks users about their location or who they voted for in the last election. Add prompt instructions and test it out manually.
Need a summary for every conversation? Add a summary field to every output.
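To see the shape this takes, here is roughly what the single monolithic prompt is now expected to return on every call. Field names other than those mentioned above are illustrative.

```python
# What the one giant prompt is now expected to return on every call.
MONOLITH_OUTPUT_EXAMPLE = {
    "response":          "...",      # the reply, grounded in RAG documentation
    "language":          "en",       # detected language, from a fixed allow-list
    "label":             "billing",  # conversation classification
    "escalate_to_human": False,      # unhappy-user escalation flag
    "translated":        "...",      # English translation for internal audit
    "summary":           "...",      # per-conversation summary
}
# ...plus the guardrail instructions (never ask about location or voting),
# all entangled in the same prompt text.
```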
Perhaps you are beginning to see the problem. This prompt now has 6 outputs. Testing it
will be a nightmare. You add support for another language, and suddenly your app begins to return the summary in Spanish instead of English.
Why? Who knows, LLM outputs are unstable,
so changing the prompt has unpredictable results.
Congratulations, you've created a monster.
Over time it will grow and cause even more pain.
With modularization. Both a prompt chain and an entirely separate classification prompt are used.
The original large prompt is modularized as much as practical. One prompt detects the language,
one provides translation, one determines if the user is upset and escalates to humans.
A response prompt generates the response, and a guardrail prompt verifies the compliance of the response.
Outputs of one prompt are chained to be the inputs of the next.
Traditional code can operate between these prompts to, for example, check language eligibility,
without involving LLMs.
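Here is a hedged sketch of that chain, reusing the registry idea from earlier. The prompt names, the language allow-list, and the escalation hook are all illustrative assumptions rather than the article's exact design.

```python
SUPPORTED_LANGUAGES = {"en", "es", "fr"}  # example allow-list, not from the article

def escalate_to_human(message: str) -> str:
    # Placeholder for whatever ticketing or hand-off mechanism you use.
    return "Your request has been passed to a human agent."

def handle_message(llm, registry, message: str) -> str:
    language = registry.run(llm, "detect_language", message=message)

    # Traditional code between prompts: check language eligibility without an LLM.
    if language not in SUPPORTED_LANGUAGES:
        return "Sorry, this language is not supported."

    english_text = registry.run(llm, "translate", message=message)

    # A dedicated prompt decides whether the user is upset enough to escalate.
    if registry.run(llm, "detect_upset", message=english_text) == "yes":
        return escalate_to_human(message)

    reply = registry.run(llm, "generate_reply", message=english_text)

    # Guardrail prompt verifies compliance of the reply before it is sent.
    if registry.run(llm, "guardrail_check", reply=reply) != "pass":
        return escalate_to_human(message)
    return reply
```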
A change can still break a given prompt, but risks are greatly reduced because a change
to one part doesn't risk breaking every part of the application logic.
Testing is much easier and the odds of catching failure early are high. Each prompt is relatively simple, so it's easier to understand and
you are less likely to do damage with a change. Changes are easier to review. You get all
the benefits of general AI, but the risks are greatly reduced. Plus, you can use cheaper
models for some components to save money.
Conclusion. Modularization allows you to isolate errors, improve maintainability, and build
a more reliable system. Even moderately sized applications will have dozens, if not hundreds,
of component prompts. Break up prompts until they each perform a single task, and until
the benefits of further modularization are outweighed by added operational complexity.
Modularizing your prompts is a necessity if your AI-driven applications are to remain reliable
and continue to add features over the long run. There are plenty of 'monster'
systems around already. Take care not to create new ones.
If you've enjoyed this series, subscribe for more posts.
Thank you for listening to this Hacker Noon story, read by Artificial Intelligence.
Visit HackerNoon.com to read, write, learn and publish.