Designing LLM apps
Designing LLM apps requires an architecture whose components ground the model in your own data, keep outputs reliable, and keep costs under control.

The meteoric rise of foundation models like GPT-3, PaLM, and Stable Diffusion has ushered in a new era for AI generation. These powerful large language models (LLMs) can now synthesize remarkably human-like text and images when prompted. But simply inserting an LLM into your product via an API is unlikely to succeed. Companies that want to leverage LLMs in their applications need to grapple with major challenges first:
- Crippling inference costs as adoption grows
- Unpredictable and nonsensical LLM outputs
- Responses unrelated to your data and use cases
Without solutions for these problems, LLMs threaten to become overpriced, unreliable black boxes trapped in the lab. Thankfully, techniques and architectures are emerging to properly ground these models in real-world applications.
The Gold Rush of Generative AI
Generative AI dominated headlines through 2022 and into 2023, from awe-inspiring demos like Anthropic’s Claude assisting at a conference to concerning issues like LLM bias and misinformation. Silicon Valley responded with an AI arms race, as companies scrambled to release products powered by the latest LLM research.
Microsoft invested billions in OpenAI while debuting AI-assisted Bing search features. Google rushed out its conversational Bard LLM to compete. AI startups proliferated, fueled by eager VC funding. Clearly, no one wants to miss this wave.
But viable business models remain elusive, due to the towering costs of running these models. As AI researcher Luis Ceze described in a VentureBeat article, the inference costs can easily spiral out of control. For example, an interior design app using DALL-E to generate images for users could suddenly face a $5 million yearly bill.
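As a back-of-the-envelope sketch of how such a bill accumulates, the arithmetic below multiplies an assumed per-image price by assumed usage; every number is a hypothetical illustration, not a figure from the article or any provider’s price list:

```python
# Rough, illustrative inference-cost estimate for an image-generation feature.
# Every number here is a hypothetical assumption, not a quoted price.
price_per_image = 0.02         # assumed cost per generated image, in USD
images_per_request = 4         # assumed images generated per user request
requests_per_user_per_day = 5  # assumed activity of one daily active user
active_users = 50_000          # assumed daily active users

daily_cost = price_per_image * images_per_request * requests_per_user_per_day * active_users
yearly_cost = daily_cost * 365
print(f"${daily_cost:,.0f} per day, roughly ${yearly_cost:,.0f} per year")
# With these assumptions: $20,000 per day, roughly $7,300,000 per year
```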
This highlights the need for cost-effective deployment options. Ceze explains that while proprietary models from Big Tech companies have progressed quickly, open-source models are reaching impressive capabilities too. With proper tuning, models like Meta’s LLaMA can run efficiently on cheap hardware, opening access to cash-strapped startups.
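As an illustration of that point, a quantized LLaMA-family model can be run locally with an open-source runtime such as llama-cpp-python. The sketch below assumes you have already downloaded a quantized model file; the file name and generation settings are placeholders:

```python
# Minimal sketch: run a quantized LLaMA-family model on modest hardware
# using llama-cpp-python. The model file path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="./llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

output = llm(
    "Suggest three beginner-friendly bodyweight exercises.",
    max_tokens=128,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```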
Taming Unruly Models
However, even if you solve the cost problem, LLMs remain stubbornly difficult to integrate into real applications. If you simply pass user inputs into a model and return its output, you’ll often get nonsensical or irrelevant responses. Essentially, you need to build an architecture that can “tame” or constrain the LLM to fit your use case.
In a detailed Medium post, AI researcher Simon Attard provides an excellent overview of techniques and components for robust LLM apps. These include prompt engineering, semantic search over vectorized data, iterative fine-tuning, and more. Together, they ground the model in your specific domain and mitigate unreliable outputs.
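To make the semantic-search piece concrete, here is a minimal sketch using sentence-transformers embeddings and cosine similarity; the model name and the sample documents are placeholders, not taken from Attard’s post:

```python
# Minimal semantic-search sketch: embed documents once, then retrieve the
# most similar ones for a query. Model name and documents are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "12-week strength program for beginners, 3 sessions per week.",
    "Low-impact cardio plan for knee rehabilitation.",
    "High-protein meal guide for muscle gain.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

def search(query: str, top_k: int = 2) -> list[str]:
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector  # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in best]

print(search("I want to build strength as a beginner"))
```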

Attard gives the hypothetical example of a personalized fitness app. To start, proprietary workout plans and health data are encoded into vectors and indexed for fast retrieval. When a user requests a new fitness program, relevant context data is pulled from this store and supplied in the prompt to the LLM.
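Sketched in code, that retrieve-then-prompt flow might look like the following; retrieve_context and complete are hypothetical stand-ins for a vector-store lookup (like the search sketch above) and an LLM call, not functions from Attard’s example:

```python
# Sketch: pull relevant context for the user's request and seed the prompt.
# retrieve_context() and complete() are hypothetical placeholders.
def retrieve_context(user_request: str, top_k: int = 3) -> list[str]:
    # Placeholder: in a real app this would query the vector index built from
    # proprietary workout plans and health data.
    return ["12-week strength program for beginners, 3 sessions per week."]

def complete(prompt: str) -> str:
    # Placeholder for a call to whichever LLM the app actually uses.
    return "Week 1: ..."

def generate_fitness_program(user_request: str) -> str:
    context = "\n".join(retrieve_context(user_request))
    prompt = (
        "You are a fitness coach. Base the program ONLY on the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Request: {user_request}\n"
    )
    return complete(prompt)
```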
This “seeds” the model with application-specific information to shape its response. The LLM develops a workout routine grounded in the provided data. An orchestrator component can break goals down into subtasks, each handled by the best-suited LLM. A response manager checks for hallucinations unrelated to the given context.
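Such a check can be as simple as a heuristic that flags items in the answer that never appear in the supplied context. The function below is an illustrative sketch, not Attard’s implementation:

```python
# Sketch of a response-manager check: flag exercises in the model's answer
# that never appear in the supplied context. Purely heuristic and illustrative.
KNOWN_EXERCISES = {"squat", "push-up", "plank", "deadlift", "lunge"}

def ungrounded_exercises(response: str, context: str) -> set[str]:
    mentioned = {e for e in KNOWN_EXERCISES if e in response.lower()}
    grounded = {e for e in KNOWN_EXERCISES if e in context.lower()}
    return mentioned - grounded  # exercises the model may have hallucinated

response = "Day 1: squats and deadlifts. Day 2: planks."
context = "Beginner plan: squats, push-ups and planks only."
print(ungrounded_exercises(response, context))  # {'deadlift'}
```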
Throughout a user session, the context and memory stores are updated to continually refine the model’s behavior. As Attard emphasizes, thoughtfully integrating LLMs in this way enables reliable, personalized, and cost-efficient experiences.
Choosing Models for the Task
Another key decision is choosing which types of models to leverage. As Ceze points out, open-source and proprietary models can complement each other when used properly. Smaller models may excel at specific tasks like semantic search, while larger models handle complex reasoning.
Encoder-based models are ideal for analysis, while decoder models generate text and code. You can start with an inexpensive model as a baseline, then escalate to a costlier API when necessary. An LLM-provider layer in the architecture encapsulates these options, so the rest of the application doesn’t depend on any single model.
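One way to realize that escalation is a simple cascade: try the cheap model first and fall back to the expensive one only when a quality check fails. The wrappers and the quality check below are hypothetical placeholders, not tied to any particular provider:

```python
# Sketch of tiered model selection: cheap model first, costly model as fallback.
# cheap_model(), premium_model() and is_good_enough() are hypothetical stubs.
def cheap_model(prompt: str) -> str:
    return "..."  # e.g. a small local open-source model

def premium_model(prompt: str) -> str:
    return "..."  # e.g. a larger hosted model behind a paid API

def is_good_enough(answer: str) -> bool:
    # Placeholder quality gate: length, grounding, format validation, etc.
    return len(answer.strip()) > 50

def answer(prompt: str) -> str:
    first_try = cheap_model(prompt)
    if is_good_enough(first_try):
        return first_try
    return premium_model(prompt)  # escalate only when the cheap model falls short
```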
Striking the Right Balance
In closing, effectively incorporating LLMs into applications requires finding the right balance. Semantic search, prompt engineering and iterative fine-tuning ground models in reality. Multi-model orchestration, response validation and model selection optimize cost and reliability.
With the right architecture and techniques, companies can overcome LLMs’ current limitations and unleash their power. What creative applications can you envision building upon these still-evolving models? I welcome your thoughts and experiences. Please connect with me on LinkedIn or Twitter to continue the conversation on the future of AI generation.
Resources
- Overview of open-source LLaMA 2 model
- Privacy by design engineering
- Run LLaMA on a Raspberry Pi – MakeUseOf
- Leveraging Large Language Models in your Software Applications – Simon Attard
- AutoGPT on GitHub for task decomposition
- How to leverage large language models without breaking the bank | VentureBeat