The Good Tech Companies - This One Practice Makes LLMs Easier to Build, Test, and Scale
Episode Date: April 7, 2025. This story was originally published on HackerNoon at: https://hackernoon.com/this-one-practice-makes-llms-easier-to-build-test-and-scale. LLM prompt modularization allows you to safely introduce changes to your system over time. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #ai, #ai-prompt-optimization, #modular-prompt-engineering, #reduce-llm-costs, #reliable-prompt-design, #debug-llm-outputs, #llm-production-issues, #good-company, and more. This story was written by: @andrewproton. Learn more about this writer by checking @andrewproton's about page, and for more stories, please visit hackernoon.com. LLM prompt modularization allows you to safely introduce changes to your system over time. How and when to do it is described below.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
This one practice makes LLMs easier to build, test, and scale, by Andrew Prosikin.
This is part of an ongoing series; see the first and second posts.
Principle 3. Modularize the prompts.
A hideous monstrosity: every experienced engineer has seen one. Code that is so vast,
high-risk, and difficult to understand that no one dares to touch it.
There are no unit tests, and every change is cause for a minor heart attack.
The only ones who venture near it are the old-timers,
those who were around when the monster was built, and even they only come close when there
is no alternative.
It's stale and unmodularized, and the dependencies are out of date.
The component is too dangerous to seriously alter.
I remember the first monstrosity I encountered.
A 5,000 line function that was central to the operations of a business worth hundreds
of millions of dollars.
Barely anybody had the confidence to touch it.
When it broke, whole teams were woken up in the middle of the night.
All development in the company was slowed down because of a dependency on this key component.
Millions of dollars were spent trying to manage the monster.
What does all of this have to do with LLM prompts?
They can become monstrosities too.
So scary to change, that no one touches them.
Or, conversely, teams try fixing them and cause an avalanche of incidents.
What customers need. Customers don't want to pay for software that works correctly only on Tuesdays and Thursdays; they demand constant reliability and a stream of new features.
When building long-term, high-reliability systems, it's essential to enable the application to evolve while constantly keeping the lights on.
This applies to gen-AI-powered applications as much as to traditional software.
So how do you get a healthy AI-powered application and not a monstrosity?
There are over a dozen approaches all covered in this series.
They all start with one principle.
Instead of one ginormous prompt, you want multiple smaller focused prompts that each
aim to solve a single problem.
What is modularization?
Modularization is the practice of breaking down a complex system into smaller, self-contained, and reusable components.
In traditional software engineering, this means writing functions, classes, and services that each handle a specific task.
In the context of prompt engineering for LLMs, modularization means splitting a large, monolithic prompt into smaller, focused prompts, each designed
to perform a single, well-defined job.
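To make that concrete, here is a minimal sketch of what a couple of focused prompts might look like in code. The prompt wording, the helper names, and the generic llm callable are illustrative assumptions, not taken from the original article.

```python
# A minimal sketch: each prompt does one well-defined job and returns one
# output. The prompt wording and the generic `llm` callable are illustrative.

CLASSIFY_PROMPT = (
    "Classify the following support message into exactly one category from: "
    "billing, technical, account, other.\n\nMessage: {message}\n\nCategory:"
)

TRANSLATE_PROMPT = (
    "Translate the following message into English. "
    "Return only the translation.\n\nMessage: {message}"
)

def classify(llm, message: str) -> str:
    """One focused prompt, one job, one output to validate."""
    return llm(CLASSIFY_PROMPT.format(message=message)).strip()

def translate(llm, message: str) -> str:
    return llm(TRANSLATE_PROMPT.format(message=message)).strip()
```

Here llm is assumed to be any callable that takes a prompt string and returns the model's text.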
Benefits of modularization. Modularization allows you to safely introduce changes to your system over time.
Its importance grows when: the length of time the application will be maintained increases;
the number and complexity of features expected to be added increases;
and the reliability requirements on the system get stricter.
All of these dimensions need to be understood when planning out the system.
But how specifically does modularization help maintain the system?
The main benefits are described below.
Risk reduction. LLM prompt performance is inherently unstable.
The nature of prompts is such that any change can affect output in unpredictable ways.
You can manage this risk by breaking big prompts into components, where a change can only affect
the performance of a part of the system.
Even if one prompt is broken, the rest of the system will operate as before the change.
But what if prompts operate as a chain?
Wouldn't breaking one component still break the chain?
Yes, it would,
but the damage is still reduced in this scenario. An erroneous output in a prompt chain can
supply the downstream prompts with faulty inputs, but each component would still operate
as before the change on the set of valid inputs. Contrast this with altering a giant prompt:
the change can, and will, affect every bit of logic encoded in that prompt.
You didn't break one aspect of the system, you potentially broke every part of it.
Operating chains of prompts safely is a future chapter in the series.
You need to plan for various types of failures and have contingency plans.
But this is beyond the scope here.
Improved testability. Anyone who has written unit tests knows that a simple function that does a single thing is way easier to test than a complex function that tries to do many different things.
The same applies to prompts. A small, focused prompt can be tested much more thoroughly, both manually and in a fully automated manner.
Better performance. A wide body of evidence shows that shorter prompts tend to outperform longer ones (1, 2, 3).
Research on the effects of multitasking on prompt performance is more mixed (4, 5).
A perfectly optimized prompt can, under the right circumstances, multitask.
In practice though, it is much easier to optimize focused prompts, where you can track performance
along a single main dimension.
You should aim for more focused prompts wherever possible.
Ease of knowledge sharing. Explaining the intricacies of a 3,000-word super prompt to a new team member is a journey.
And no matter how much you explain, the only ones who have a feel for this beast will be its contributing authors.
A system of prompts, with each part being relatively simple, can be onboarded to much faster, and engineers will start being productive sooner.
Cost optimization. By using different models in different parts of the system, you can achieve significant cost and latency savings without affecting response quality.
For example, a prompt that determines the input language doesn't have to be particularly smart; it doesn't require your latest and most expensive model. On the other hand, the prompt that generates the reply based on documentation could benefit from the built-in chain-of-thought reasoning embedded in high-end models.
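A hedged sketch of that routing idea follows; the model names are placeholders, and call_model stands in for whatever client function your stack actually provides.

```python
from typing import Callable

# Route each prompt to an appropriately sized model; names are placeholders.
MODEL_FOR_PROMPT = {
    "detect_language": "small-cheap-model",      # trivial task: cheap and fast
    "generate_reply":  "large-reasoning-model",  # benefits from built-in reasoning
}

def run_prompt(call_model: Callable[[str, str], str],
               prompt_name: str, prompt_text: str) -> str:
    model = MODEL_FOR_PROMPT[prompt_name]
    return call_model(model, prompt_text)
```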
When not to modularize. Most software-powered applications require safely adding features over extended periods of time.
There is, however, an exception. Prototype applications are not intended
to be maintained for long, they won't get new features, and are not meant for high
reliability. So don't waste time with modularization when building prototypes.
In fact, most of the patterns in this series do not apply to prototype
applications. When building a prototype, go quick, verify the critical unknowns, and
then throw the code away.
Another consideration is knowing when to stop modularizing.
There is overhead to managing extra prompts, and if the benefits of further modularization
are low, you should stop breaking the system up further.
Infrastructure for modularization. If modularizing prompts were trivial, everybody would be doing it.
To manage many prompts in a system, you need to invest in infrastructure; without it, you will get chaos.
Here are the minimal requirements for the LLM prompt infrastructure:
- The ability to add prompts quickly and painlessly, in a standardized way. This is particularly important when prompts are loaded from outside the codebase (see Principle 2, Load prompts safely, if you really have to).
- The ability to deploy prompts in an automated way.
- The ability to log and monitor the inputs and outputs of individual prompts.
- The ability to add automated tests that cover prompts.
- A way to easily track token and dollar spend on various prompts.
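To make those requirements concrete, here is one minimal, illustrative way they could be wired together. The PromptRegistry class, its word-count token accounting, and the generic llm callable are assumptions, not a prescribed design.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("prompts")

class PromptRegistry:
    """Minimal sketch: standardized registration, logging, and spend tracking."""

    def __init__(self):
        self.prompts = {}      # name -> template, added in one standardized way
        self.token_spend = {}  # name -> rough running token count

    def register(self, name: str, template: str):
        self.prompts[name] = template

    def run(self, llm, name: str, **inputs) -> str:
        prompt = self.prompts[name].format(**inputs)
        start = time.time()
        output = llm(prompt)
        # Log inputs and outputs of each individual prompt for monitoring.
        log.info("prompt=%s latency=%.2fs inputs=%r output=%r",
                 name, time.time() - start, inputs, output)
        # Track approximate token spend per prompt (word count as a stand-in).
        self.token_spend[name] = (self.token_spend.get(name, 0)
                                  + len(prompt.split()) + len(output.split()))
        return output
```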
Case study. Let's see how building a gen-AI-powered system plays out in practice, with and without modularization.
No modularization. You are building a tech support app and are determined to implement it with a single prompt.
In the simplest version, you can imagine a monolith prompt that generates responses while loading relevant documentation through RAG.
Looks nice and easy, right? But as you add features, problems with this architecture emerge.
You want to respond to messages in a fixed list of languages, but not handle others.
To achieve this, you add prompt instructions to only respond in certain languages and get the LLM to return a language field for reporting purposes.
You want all conversations classified. Add a label field to the prompt output.
When the user is unhappy, escalate the case to human support. Add an escalate_to_human output variable along with instructions in the prompt.
Need a translation of all messages sent, for internal audit. Return a translated field with the message in English.
Need protection to make sure that the app never asks users about their location or who they voted for in the last election. Add prompt instructions and test it out manually.
Need a summary for every conversation? Add a summary field to every output.
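To see the shape this takes, here is roughly what the single monolithic prompt is now expected to return on every call. Field names other than those mentioned above are illustrative.

```python
# What the one giant prompt is now expected to return on every call.
MONOLITH_OUTPUT_EXAMPLE = {
    "response":          "...",      # the reply, grounded in RAG documentation
    "language":          "en",       # detected language, from a fixed allow-list
    "label":             "billing",  # conversation classification
    "escalate_to_human": False,      # unhappy-user escalation flag
    "translated":        "...",      # English translation for internal audit
    "summary":           "...",      # per-conversation summary
}
# ...plus the guardrail instructions (never ask about location or voting),
# all entangled in the same prompt text.
```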
Perhaps you are beginning to see the problem. This prompt now has 6 outputs. Testing it
will be a nightmare. You add support for another language, and suddenly your app begins to return the summary in Spanish instead of English.
Why? Who knows, LLM outputs are unstable,
so changing the prompt has unpredictable results.
Congratulations, you've created a monster.
Over time it will grow and cause even more pain.
With modularization. Both a prompt chain and an entirely separate classification prompt are used.
The original large prompt is modularized as much as practical. One prompt detects the language,
one provides translation, one determines if the user is upset and escalates to humans.
A response prompt generates the response, and a guardrail prompt verifies the compliance of the response.
Outputs of one prompt are chained to be the inputs of the next.
Traditional code can operate between these prompts to, for example, check language eligibility,
without involving LLMs.
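Here is a hedged sketch of that chain, reusing the registry idea from earlier. The prompt names, the language allow-list, and the escalation hook are all illustrative assumptions rather than the article's exact design.

```python
SUPPORTED_LANGUAGES = {"en", "es", "fr"}  # example allow-list, not from the article

def escalate_to_human(message: str) -> str:
    # Placeholder for whatever ticketing or hand-off mechanism you use.
    return "Your request has been passed to a human agent."

def handle_message(llm, registry, message: str) -> str:
    language = registry.run(llm, "detect_language", message=message)

    # Traditional code between prompts: check language eligibility without an LLM.
    if language not in SUPPORTED_LANGUAGES:
        return "Sorry, this language is not supported."

    english_text = registry.run(llm, "translate", message=message)

    # A dedicated prompt decides whether the user is upset enough to escalate.
    if registry.run(llm, "detect_upset", message=english_text) == "yes":
        return escalate_to_human(message)

    reply = registry.run(llm, "generate_reply", message=english_text)

    # Guardrail prompt verifies compliance of the reply before it is sent.
    if registry.run(llm, "guardrail_check", reply=reply) != "pass":
        return escalate_to_human(message)
    return reply
```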
A change can still break a given prompt, but risks are greatly reduced because a change
to one part doesn't risk breaking every part of the application logic.
Testing is much easier and the odds of catching failure early are high. Each prompt is relatively simple, so it's easier to understand and
you are less likely to do damage with a change. Changes are easier to review. You get all
the benefits of general AI, but the risks are greatly reduced. Plus, you can use cheaper
models for some components to save money.
Conclusion. Modularization allows you to isolate errors, improve maintainability, and build
a more reliable system. Even moderately sized applications will have dozens, if not hundreds,
of component prompts. Break up prompts until they each perform a single task, and until
the benefits of further modularization are outweighed by added operational complexity.
Modularizing your prompts is a necessity if your AI-driven applications are to remain reliable
and continue to add features over the long run. There are plenty of 'monster'
systems around already. Take care not to create new ones.
If you've enjoyed this series, subscribe for more posts.
Thank you for listening to this Hacker Noon story, read by Artificial Intelligence.
Visit HackerNoon.com to read, write, learn and publish.