Context Engineering: The Real Work Behind Production AI Systems
Your model isn't broken. Your context is. Let's architect the world the model lives and breathes in.
I’ve spent the last several years building agentic systems — synthetic data pipelines, RAG architectures, multi-agent orchestration layers — and if there’s one lesson I’d tattoo on the forehead of every engineer entering the AI space, it’s this:
The intelligence of your system is not determined by the model you pick. It’s determined by the context you feed it.
This is what the industry is now calling Context Engineering, and after working through a recently published framework by one of the leading vector database teams in the space, I want to distill the key insights, add my own battle scars, and give you a practical playbook for building AI systems that actually work in production.
What Is Context Engineering, Really?
Let me save you from the buzzword soup. Context engineering is the discipline of designing the architecture that feeds an LLM the right information at the right time. It’s not prompt engineering (though that’s a piece of it). It’s not RAG (though that’s a piece too). It’s the entire system — the retrieval pipelines, the memory layers, the tool orchestration, the query augmentation — all working in concert.
Think of it this way: an LLM is a brilliant surgeon, but it’s standing in an empty operating room. Context engineering is building the hospital around it — the instrument trays, the patient records system, the diagnostic equipment, the nursing staff that hands tools at exactly the right moment.
The framework I’m drawing from organizes this around six pillars: Agents, Query Augmentation, Retrieval (Chunking), Prompting Techniques, Memory, and Tools. I’m going to walk through each with what I think matters most for practitioners.
Pillar 1: Agents — The Orchestration Brain
What They Are
An AI agent is a system that can make dynamic decisions about information flow, maintain state across interactions, use tools adaptively, and modify its approach based on results. They’re the coordinators in your context engineering stack — deciding when to retrieve, what to retrieve, and how to synthesize.
The Architecture Choice That Matters Early
You’ll face the single-agent vs. multi-agent decision early. Here’s my rule of thumb:
Start single-agent. A single agent handling all tasks works well for moderately complex workflows. Multi-agent systems distribute work across specialized agents, which sounds elegant in architecture diagrams but introduces coordination overhead that will eat your lunch in production.
Only go multi-agent when you have genuinely independent sub-tasks with different tool requirements and you’ve exhausted what a well-structured single agent can do.
The Key Insight: Context Hygiene
This is the concept I want you to internalize. Agents don’t just need memory and tools — they need to actively monitor and manage the quality of their own context. As your context window fills up, four failure modes emerge:
Context Poisoning: Incorrect or hallucinated information enters the context. Because agents reuse and build upon that context, these errors persist and compound. This is the most dangerous failure mode — it’s a cascading fault.
Context Distraction: The agent drowns in too much past information — history, tool outputs, summaries — and starts repeating past behavior instead of reasoning fresh.
Context Confusion: Irrelevant tools or documents crowd the window, causing the model to pick the wrong tool or follow the wrong instructions.
Context Clash: Contradictory information within the context leaves the agent stuck between conflicting assumptions.
What to do: Implement context pruning (removing irrelevant/outdated context), context summarization (compressing accumulated history periodically), and context offloading (storing details externally and retrieving only when needed). Don’t keep everything in the active window.
What NOT to do: Assume bigger context windows solve this problem. They don’t. Performance degrades well before you hit the token limit. I’ve watched agents with 200K token windows hallucinate more than agents with 32K windows that had well-curated context. A 1M token window is not a license to be lazy about information management.
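To make pruning and offloading concrete, here is a minimal sketch. It assumes you already have some relevance score per context item (how you compute it is up to you) and a token count for each. The names ContextItem and prune_context are hypothetical, not from any library, and summarization is left out because it needs an LLM call.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ContextItem:
    text: str
    relevance: float  # assumed score in [0, 1] from your own relevance check
    tokens: int       # token count from whatever tokenizer you use


def prune_context(
    items: List[ContextItem],
    budget_tokens: int,
    min_relevance: float = 0.3,
) -> Tuple[List[ContextItem], List[ContextItem]]:
    """Split items into (kept, offloaded) under a token budget."""
    # Consider the most relevant items first.
    ranked = sorted(range(len(items)), key=lambda i: items[i].relevance, reverse=True)
    kept_idx, used = set(), 0
    for i in ranked:
        item = items[i]
        if item.relevance >= min_relevance and used + item.tokens <= budget_tokens:
            kept_idx.add(i)
            used += item.tokens
    # Preserve the original ordering so the conversation still reads naturally.
    kept = [item for i, item in enumerate(items) if i in kept_idx]
    offloaded = [item for i, item in enumerate(items) if i not in kept_idx]
    return kept, offloaded
```

The offloaded list is what you would write to external storage and retrieve later only if a query calls for it.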
Pillar 2: Query Augmentation — Garbage In, Garbage Out
Why This Is Your First Line of Defense
Here’s a truth that most tutorials skip: your users will not interact with your system the way you tested it. They’ll send vague, messy, incomplete queries. Query augmentation is how you handle reality.
The best framing I’ve encountered puts it plainly: no amount of sophisticated retrieval, advanced reranking, or clever prompting can compensate for misunderstood user intent.
The Four Techniques You Need
1. Query Rewriting
Transform the raw user query into something the retrieval system can actually work with. A user asking “how do i make this work when my api call keeps failing?” should get rewritten to something like “API call failure troubleshooting: authentication headers, rate limiting, network timeout, 500 error.”
This works by restructuring unclear questions, removing irrelevant noise, and enhancing with domain-specific keywords.
2. Query Expansion
Generate multiple related queries from a single input. This is powerful for vague queries or keyword-based retrieval, but beware of three pitfalls: query drift (expanded queries diverge from original intent), over-expansion (too many terms reduce precision), and computational overhead (multiple queries increase latency).
3. Query Decomposition
Break complex, multi-faceted questions into simpler sub-queries. “Compare the pricing, features, and customer reviews of products X, Y, and Z” becomes a set of focused retrieval operations (for example, one per attribute: pricing, features, reviews) that are processed independently and then synthesized.
4. Query Agents
The most advanced form — an AI agent that handles the entire query pipeline. It analyzes the task, constructs queries dynamically based on the data schema, routes across multiple collections, evaluates retrieved information, and iteratively refines if results are insufficient.
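A rough sketch of the first two techniques, rewriting and expansion. The llm argument is an assumption: any callable that takes a prompt string and returns text, so you can plug in whichever client you use. The prompt wording and function names are illustrative only.

```python
from typing import Callable, List

REWRITE_PROMPT = (
    "Rewrite the user query for a search system. "
    "Keep the original intent, remove filler, add domain keywords.\n"
    "Query: {query}\nRewritten:"
)


def rewrite_query(query: str, llm: Callable[[str], str]) -> str:
    """Return a retrieval-friendly version of the query."""
    rewritten = llm(REWRITE_PROMPT.format(query=query)).strip()
    # Fall back to the raw query if the model returns nothing usable.
    return rewritten or query


def expand_query(query: str, llm: Callable[[str], str], max_variants: int = 3) -> List[str]:
    """Return the original query plus up to max_variants distinct rephrasings."""
    raw = llm(f"List up to {max_variants} alternative phrasings, one per line, of: {query}")
    seen = {query.lower()}
    variants = [query]
    for line in raw.splitlines():
        candidate = line.strip(" -\t")
        if candidate and candidate.lower() not in seen:
            seen.add(candidate.lower())
            variants.append(candidate)
        # Cap the list to limit over-expansion and extra latency.
        if len(variants) > max_variants:
            break
    return variants
```

Keeping the original query in the expanded list is a cheap guard against query drift: even if the variants wander, the user's own wording still gets searched.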
My Practical Advice
Do: Start with query rewriting. It’s the highest ROI technique. A simple LLM call that rewrites the user query before it hits your vector database will dramatically improve retrieval quality.
Don’t: Jump straight to query agents for simple use cases. The complexity isn’t worth it until you have multi-collection routing needs or genuinely complex queries that require decomposition.
Do: Build your system to handle the worst queries, not the best ones. Test with typos, incomplete thoughts, and domain-ignorant phrasing.
Pillar 3: Retrieval & Chunking — The Foundation Everyone Gets Wrong
The Chunking Sweet Spot
Chunking is the most important decision you’ll make for your retrieval system’s performance. Get it right and you get surgical precision. Get it wrong and even the best model fails.
The fundamental tension is between two competing priorities:
Retrieval Precision: Chunks need to be small and focused on a single idea for distinct, precise embeddings.
Contextual Richness: Chunks need to be large enough to be self-contained so the LLM can generate meaningful responses.
One of the most useful frameworks I’ve encountered presents this as a 2×2 matrix:
| | Low Contextual Richness | High Contextual Richness |
|---|---|---|
| High Retrieval Precision | Precise but Incomplete: easy to find, lacks context | The Sweet Spot: findable AND understandable |
| Low Retrieval Precision | The Failure Zone: neither findable nor useful | Rich but Unfindable: contains the answer but noisy embeddings |
Your goal is the upper-right quadrant: semantically complete paragraphs that are focused enough to be found and rich enough to be understood.
Simple vs. Advanced Chunking
Simple techniques (start here):
Fixed-size chunking (fast and simple; use overlap so that content cut at a chunk boundary still appears intact in the neighboring chunk)
Recursive chunking (splits by paragraphs → sentences → words; solid default for unstructured text)
Document-based chunking (uses inherent structure like Markdown headings or HTML tags)
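Here is a minimal sketch of the first two simple techniques, working on characters rather than tokens to keep it dependency-free. The separator list and function names are my own assumptions. Libraries that offer recursive splitting differ in the details, so treat this as an illustration of the idea rather than any particular implementation.

```python
from typing import List, Sequence

DEFAULT_SEPARATORS = ("\n\n", ". ", " ")  # paragraphs, then sentences, then words


def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> List[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop early enough that the last window is not fully contained in the previous one.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def recursive_chunks(
    text: str, max_chars: int, separators: Sequence[str] = DEFAULT_SEPARATORS
) -> List[str]:
    """Split on the coarsest separator first, recursing only into pieces that are too long."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to a hard character split.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            chunks.extend(recursive_chunks(piece, max_chars, rest))
    if current:
        chunks.append(current)
    return chunks
```

Note that splitting on ". " drops the separator at chunk boundaries, so a chunk that ends at a sentence break loses its final period. That is harmless for embeddings but worth knowing if you display chunks verbatim.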
Advanced techniques (graduate to these when simple isn’t cutting it):
Semantic chunking: splits based on meaning, grouping related sentences and creating new chunks when topics shift
Late chunking: embeds the entire document first, then derives chunk embeddings from context-rich token-level representations
Agentic chunking: an AI agent dynamically selects the best chunking strategy per document
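Of the three advanced techniques, semantic chunking is the easiest to sketch. The version below assumes you pass in an embed function (any callable mapping a sentence to a vector) and starts a new chunk whenever similarity between consecutive sentences drops below a threshold. The 0.5 default is an arbitrary placeholder you would tune on your own data.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.5,
) -> List[str]:
    """Group consecutive sentences, breaking where the topic appears to shift."""
    chunks: List[str] = []
    current: List[str] = []
    prev_vec = None
    for sentence in sentences:
        vec = embed(sentence)
        # A low similarity to the previous sentence is treated as a topic shift.
        if prev_vec is not None and cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Comparing only adjacent sentences is the simplest variant. Comparing against a running average of the current chunk is a common refinement when single sentences are noisy.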
Pre-Chunking vs. Post-Chunking
This is a critical architectural decision:
Pre-chunking (process everything upfront): Fast at query time because all work is done. But the chunking strategy is fixed — changing it means reprocessing your entire dataset.
Post-chunking (chunk after retrieval, in real-time): Highly flexible — you can create dynamic strategies specific to the user’s query. But it adds latency and requires more complex infrastructure.
My recommendation: Start with pre-chunking using recursive or semantic chunking. Move to post-chunking only when you have evidence that query-specific chunking would meaningfully improve your results and you can tolerate the latency.
Pillar 4: Prompting Techniques — The Craft Within the System
The Distinction That Matters
Prompt engineering focuses on how you phrase instructions. Context engineering focuses on structuring the information and knowledge you provide. The techniques below are most effective when combined with well-engineered context. A perfect prompt with garbage context still produces garbage.
The Techniques That Actually Move the Needle
Chain of Thought (CoT): Ask the model to think step-by-step. Essential when retrieved documents are dense or contradictory. Pro tip: make the reasoning highly specific to your use case. Don’t just say “think step by step” — tell it to evaluate the environment, repeat relevant information, and explain its importance to the current request.
Few-Shot Prompting: Provide examples of desired outputs in the context window. Pairing CoT with few-shot examples is especially effective: you guide both the reasoning process and the output format.
Tree of Thoughts (ToT): Instructs the model to explore multiple reasoning paths in parallel, like a decision tree. Useful in RAG when there are many potential evidence pieces and the model needs to weigh different answers from multiple retrieved documents.
ReAct Prompting: Combines Chain of Thought with actions — the model reasons (thinks) and acts dynamically in an interleaved manner. This is the backbone of most agentic loops.
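As an illustration of combining CoT with few-shot examples over retrieved context, here is a small prompt builder. The instruction wording, the section labels, and the build_prompt name are all my own assumptions. Adapt the reasoning instructions to your domain, as suggested above.

```python
from typing import List, Tuple

COT_INSTRUCTIONS = (
    "Answer using only the context below.\n"
    "Reason step by step: restate the facts from the context that bear on the "
    "question, explain why each one matters, then give the final answer on a "
    "line starting with 'Answer:'."
)


def build_prompt(
    question: str,
    context_docs: List[str],
    examples: List[Tuple[str, str, str]],
) -> str:
    """Assemble instructions, few-shot examples, retrieved context, and the question."""
    sections = [COT_INSTRUCTIONS]
    # Each example shows both the reasoning style and the output format.
    for ex_question, ex_reasoning, ex_answer in examples:
        sections.append(
            f"Example question: {ex_question}\n"
            f"Reasoning: {ex_reasoning}\n"
            f"Answer: {ex_answer}"
        )
    # Number the documents so the model can refer to them in its reasoning.
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(context_docs, start=1))
    sections.append(f"Context:\n{numbered}")
    sections.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(sections)
```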
Prompting for Tool Usage — The Part Most People Botch
When your LLM interacts with external tools, the quality of your tool descriptions is everything. The model’s decision to use a tool depends entirely on its description.
Do:
Start tool names with active verbs (get_current_weather, not weather_data)
Specify exact input arguments and their formats
Describe what the output looks like
Mention limitations explicitly (“only works for US cities”)
Include few-shot examples of correct tool selection
Don’t:
Dump every possible tool into the prompt. Dynamically filter and load only tools relevant to the current task.
Write vague descriptions. “This tool helps with data” tells the model nothing.
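Here is what those guidelines might look like in practice. The tool definitions use a JSON-Schema-style parameters object, which is a common shape for function calling, but the exact field names your provider expects may differ. The tags field and the select_tools helper are hypothetical additions for filtering, not part of any standard.

```python
from typing import Dict, List, Set

TOOLS: List[Dict] = [
    {
        "name": "get_current_weather",  # active verb first
        "description": (
            "Get the current weather for a US city. Only works for US cities. "
            "Returns JSON with temperature_f (number) and conditions (string)."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Austin'"},
                "state": {"type": "string", "description": "Two-letter state code, e.g. 'TX'"},
            },
            "required": ["city", "state"],
        },
        "tags": {"weather"},  # hypothetical field used only for filtering
    },
    {
        "name": "search_orders",
        "description": (
            "Search a customer's orders by email address. "
            "Returns a JSON list of orders with id, status, and total_usd."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "email": {"type": "string", "description": "Customer email address"},
            },
            "required": ["email"],
        },
        "tags": {"orders", "support"},
    },
]


def select_tools(task_tags: Set[str], tools: List[Dict], limit: int = 5) -> List[Dict]:
    """Return only the tools whose tags overlap the current task, capped at `limit`."""
    relevant = [tool for tool in tools if tool["tags"] & task_tags]
    return relevant[:limit]
```

How you derive task_tags is a separate design choice. A lightweight classifier or a keyword match on the user query is often enough to start.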
Pillar 5: Memory — What Separates Demos from Products
The Karpathy Analogy
Andrej Karpathy compared an LLM’s context window to a computer’s RAM and the model itself to the CPU. The context window is the agent’s active consciousness where working thoughts are held. But like a laptop with too many browser tabs, this RAM fills up fast.
The goal isn’t to shove more data into the prompt. It’s to design systems that make the most of the active context window while gracefully offloading everything else.
The Memory Architecture
Short-Term Memory: The agent’s immediate workspace — recent conversations, actions, and data packed directly into the prompt. Constrained by token limits. The challenge is efficiency: keep it streamlined without missing details important for next steps.
Long-Term Memory: Stored externally (typically in a vector database) for retrieval when needed. This is what allows persistent understanding over time. It includes:
Episodic memory: specific events and past interactions
Semantic memory: general knowledge and facts
Procedural memory: learned routines and successful workflows
Working Memory: A temporary holding area for multi-step task information (destination, dates, budget for a trip booking) that doesn’t belong in long-term storage but needs to persist across the current task.
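A bare-bones sketch of the short-term and working tiers, using only the standard library. The class names and the slot-based design for working memory are assumptions for illustration. Long-term memory would sit behind a vector database and is not shown here.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Tuple


class ShortTermMemory:
    """Keeps only the most recent turns so the prompt stays within budget."""

    def __init__(self, max_turns: int = 6) -> None:
        # deque with maxlen silently drops the oldest turn when full.
        self.turns: Deque[Tuple[str, str]] = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))

    def render(self) -> str:
        """Format the retained turns for inclusion in the prompt."""
        return "\n".join(f"{role}: {text}" for role, text in self.turns)


class WorkingMemory:
    """Holds task-scoped facts (e.g. destination, dates, budget) until the task ends."""

    def __init__(self) -> None:
        self.slots: Dict[str, Any] = {}

    def set(self, key: str, value: Any) -> None:
        self.slots[key] = value

    def missing(self, required: List[str]) -> List[str]:
        """Return the required slots that have not been filled yet."""
        return [key for key in required if key not in self.slots]

    def clear(self) -> None:
        """Call when the task completes; nothing here belongs in long-term storage."""
        self.slots.clear()
```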
Key Principles I’d Add to Any Design Doc
1. Prune and Refine Relentlessly
Memory isn’t write-once. Periodically scan long-term storage to remove duplicates, merge related entries, and discard outdated facts. If a memory is old and rarely accessed, it’s a deletion candidate. Customer support agent? Auto-prune resolved conversation logs older than 90 days and retain only summaries.
2. Be Selective About What You Store
Not every interaction deserves permanent storage. Have the LLM “reflect” on an interaction and assign an importance score before committing to memory. A bad piece of information in long-term storage leads to context pollution — the agent repeatedly making the same mistakes.
3. Tailor Architecture to the Task
A customer service bot needs strong episodic memory. A financial analysis agent needs robust semantic memory. Start with the simplest approach (conversational buffer with last N exchanges) and layer in complexity as the use case demands.
4. Master Retrieval, Not Just Storage
Effective memory is less about how much you store and more about how well you retrieve the right piece at the right time. Use reranking (LLM re-ordering results for relevance) and iterative retrieval (refining queries over multiple steps).
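The first two principles (prune relentlessly, store selectively) can be sketched with a small in-memory store. This assumes the importance score comes from an LLM reflection step that is not shown, and it uses a plain list where a real system would use a vector database. The thresholds and the LongTermStore name are placeholders.

```python
from dataclasses import dataclass
from typing import List

SECONDS_PER_DAY = 86_400


@dataclass
class Memory:
    text: str
    importance: float     # assumed to come from an LLM "reflection" step
    created_at: float     # Unix timestamp in seconds
    last_accessed: float  # Unix timestamp in seconds
    access_count: int = 0


class LongTermStore:
    def __init__(self, min_importance: float = 0.5) -> None:
        self.min_importance = min_importance
        self.items: List[Memory] = []

    def commit(self, text: str, importance: float, now: float) -> bool:
        """Store a memory only if it is important enough and not an exact duplicate."""
        if importance < self.min_importance:
            return False
        if any(memory.text == text for memory in self.items):
            return False
        self.items.append(Memory(text, importance, created_at=now, last_accessed=now))
        return True

    def prune(self, now: float, max_age_days: int = 90, min_accesses: int = 1) -> int:
        """Drop memories that are both stale and rarely accessed; return how many were removed."""
        cutoff = now - max_age_days * SECONDS_PER_DAY
        before = len(self.items)
        self.items = [
            memory for memory in self.items
            if not (memory.last_accessed < cutoff and memory.access_count < min_accesses)
        ]
        return before - len(self.items)
```

Passing now explicitly instead of calling time.time() inside the methods keeps the store easy to test and makes scheduled pruning jobs deterministic.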
Pillar 6: Tools — Giving Your Agent Hands
The Evolution Arc
Tools went from hacky prompt-engineered text commands to native function calling (structured JSON output with function names and arguments). This was the real breakthrough that made agents practical.
The Orchestration Loop
Giving an agent a tool is easy. Getting it to use that tool reliably, safely, and effectively is the real engineering challenge. The orchestration follows the Thought-Action-Observation cycle:
Tool Discovery: Agent knows what tools exist via system prompt descriptions
Tool Selection & Planning (Thought): Agent reasons about which tool to use, possibly chaining multiple tools
Argument Formulation (Action): Agent extracts parameters from the query and formats them correctly
Reflection (Observation): Tool output feeds back into context; agent decides if the task is complete or needs another step
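The cycle above can be sketched as a short loop. This assumes the model is a callable that returns a JSON string containing either a tool call or a final answer. That output format, the stub weather tool, and the run_agent name are all hypothetical. Real function-calling APIs return structured objects rather than raw JSON text.

```python
import json
from typing import Callable, Dict, List, Optional, Tuple


def get_current_weather(city: str) -> dict:
    """Stub tool for illustration; a real one would call a weather service."""
    return {"city": city, "temperature_f": 72}


TOOLS: Dict[str, Callable[..., dict]] = {"get_current_weather": get_current_weather}


def run_agent(
    model: Callable[[str], str],
    user_query: str,
    tools: Dict[str, Callable[..., dict]],
    max_steps: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Run Thought-Action-Observation until the model answers or steps run out."""
    context = [f"User: {user_query}"]
    for _ in range(max_steps):
        # Thought: the model sees the full context and decides what to do next.
        decision = json.loads(model("\n".join(context)))
        if "final_answer" in decision:
            return decision["final_answer"], context
        name = decision.get("tool")
        arguments = decision.get("arguments", {})
        if name not in tools:
            context.append(f"Observation: unknown tool {name!r}")
            continue
        # Action: call the tool, capturing failures so the model can react to them.
        try:
            result = tools[name](**arguments)
        except Exception as exc:
            result = {"error": str(exc)}
        # Observation: feed the result back into the context for the next step.
        context.append(f"Action: {name}({arguments})")
        context.append(f"Observation: {json.dumps(result)}")
    return None, context  # step budget exhausted without a final answer
```

The max_steps cap matters more than it looks. Without it, a model that keeps picking the wrong tool will loop until your budget runs out.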
MCP: The Integration Standard That Won
The Model Context Protocol (MCP), introduced by Anthropic in late 2024, has become the de facto standard for connecting AI applications to external systems. By late 2025, OpenAI, Google DeepMind, Microsoft, and thousands of production agent builders had all adopted it. Instead of building M×N custom integrations (M applications × N tools), MCP creates an M+N problem — each application and tool integrates once with the standard protocol.
In December 2025, Anthropic donated MCP to the newly formed Agentic AI Foundation under the Linux Foundation, with OpenAI and Block joining as co-founders and AWS, Google, Microsoft, Cloudflare, and Bloomberg as supporting members. This is no longer a vendor-specific bet — it’s an industry standard. Build to it accordingly.
The Do / Don’t Cheat Sheet
DO:
Treat context as a scarce resource. Even with large windows, curate aggressively.
Build for messy inputs. Real users don’t write clean queries. Implement query rewriting as your first line of defense.
Start simple, layer complexity. Single agent → multi-agent. Pre-chunking → post-chunking. Buffer memory → hybrid memory. Earn your complexity.
Monitor context hygiene. Build observability into your context pipeline. Log what’s in the context window. Watch for poisoning, distraction, confusion, and clash.
Write tool descriptions like your system depends on it. Because it does. Active verbs, explicit inputs/outputs, stated limitations.
Prune memory actively. Old, unaccessed memories are liabilities, not assets.
Combine prompting techniques. CoT + Few-shot is a baseline. ReAct for agentic loops. ToT when you need to weigh multiple evidence paths.
DON’T:
Don’t assume bigger context windows solve your problems. They introduce new failure modes. Performance degrades well before max capacity.
Don’t skip query augmentation. No retrieval system compensates for misunderstood intent.
Don’t chunk blindly. Understand the precision-richness tradeoff. Test your chunking strategy against real queries, not synthetic ones.
Don’t store everything in memory. Implement importance scoring and filtering before committing to long-term storage.
Don’t dump all tools into every prompt. Dynamically filter to only relevant tools per task. Context confusion from irrelevant tools is a real and common failure mode.
Don’t build multi-agent systems prematurely. Coordination overhead is non-trivial. Exhaust single-agent capabilities first.
Don’t test with ideal queries. Your benchmarks should include the worst queries your users will realistically send.
The Bigger Picture
We are no longer in the era of “prompt engineering” as the primary skill for building with LLMs. We’re in the era of systems engineering for AI. The model is one component. The retrieval pipeline, the memory architecture, the tool orchestration, the query processing — these are the systems that determine whether your application is an impressive demo or a reliable product.
The best articulation of this shift I’ve read captures it simply: we’re moving from being prompters who talk to a model to architects who build the world the model lives in.
The best AI systems aren’t born from bigger models. They’re born from better engineering.
Now go build something that works.

