DEV Community

Wassim Chegham
Prompt Stuffing Is Killing Your Agent

Classic RAG is like packing your entire wardrobe for a weekend trip. Sure, you'll have options, but good luck finding what you need.

In Part 1, we talked about why agents fall apart in production: compounding errors, the reliability tax, the gap between demo magic and real-world chaos. Now let's zoom into one of the biggest culprits — how most agents handle retrieval.

Because classic RAG has a mantra: "We always retrieve context." And that's exactly where the problems start.

The Problem: Retrieve Everything, Hope for the Best

Let's go back to our running example: a travel-planning agent. A user asks for a 4-day trip with hiking, a modest budget, and one fancy dinner. Reasonable request. Here's what classic RAG does with it:

It grabs everything. Weather data. Hotel options. Flight details. Restaurant suggestions. Trail maps. Local events. Currency exchange rates. It crams all of that into a single prompt and says, "Hey model, figure it out."

This is fragile for reasons that should be obvious, but let's spell them out anyway:

  • The model has to reason across too many dimensions at once. Flights, budget constraints, hiking trail difficulty, restaurant dress codes, weather windows, all competing for attention in one context window.
  • Constraints get missed. When you stuff a 6,000-token context blob into a prompt, the model has to juggle everything simultaneously. Your budget limit? Buried somewhere between the hotel listings and the weather forecast.
  • It breaks as complexity grows. A simple "book me a flight" query works fine. A multi-day, multi-constraint trip plan? The prompt becomes a minefield.
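To make the failure mode concrete, here's a minimal sketch of the "retrieve everything" pattern. The source names (`weather`, `hotels`, and so on) and their canned responses are hypothetical stand-ins for real API calls:

```python
# Hypothetical sketch of classic RAG prompt stuffing: every source gets
# queried and concatenated into one prompt, relevant or not.

def classic_rag_prompt(query: str, sources: dict) -> str:
    """Stuff every retrieved chunk into a single context blob."""
    context_parts = []
    for name, fetch in sources.items():
        context_parts.append(f"## {name}\n{fetch(query)}")
    context = "\n\n".join(context_parts)
    # The model now has to find the budget constraint somewhere in here.
    return f"Context:\n{context}\n\nUser request: {query}\nPlan the trip."

# Illustrative stand-ins for real retrieval calls.
sources = {
    "weather": lambda q: "Sunny, 18-24C all week.",
    "hotels": lambda q: "20 hotels, $60-$400/night...",
    "flights": lambda q: "14 flight options...",
    "restaurants": lambda q: "35 restaurants...",
}

prompt = classic_rag_prompt("4-day hiking trip, modest budget, one fancy dinner", sources)
print(len(prompt))  # grows with every source added, whether relevant or not
```

Every new source makes the prompt longer and the constraints harder to find, which is exactly the fragility described above.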

Here's the uncomfortable truth that a lot of RAG tutorials gloss over:

More context ≠ better answers. It means higher cost, slower responses, and more room for the model to get confused.

You're paying more tokens to get worse results. That's not a tradeoff — that's a bug in your architecture.

Classic RAG vs Agentic RAG comparison

Look at the left side. That's a prayer, not a pipeline. Now look at the right side. That's engineering.

The Solution: Agentic RAG, or Conditional Retrieval

Agentic RAG flips the model. Instead of "always retrieve," it uses conditional retrieval. The agent fetches information only when it actually needs it, validates what it got, and only then moves on.

The key insight is simple: retrieval should be a decision, not a reflex.

Here's what that looks like for our travel agent:

  1. The user asks for a 4-day hiking trip on a budget with one fancy dinner.
  2. The agent doesn't immediately fire off six API calls. It thinks first: "What do I need to know before I can even start planning?"
  3. It retrieves destination info: where can you hike for 4 days that fits the budget?
  4. It validates that against the constraints. Are there actually trails there? Is it the right season?
  5. Only then does it retrieve flight options. And it checks: does this fit the budget?
  6. Then hotels. Then activities. Each step, validated before moving on.
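The steps above can be sketched as a short pipeline where each retrieval is gated by a validation check. Everything here is illustrative: the `retrieve_*` functions return canned data standing in for real API calls, and the numbers are made up:

```python
# Sketch of conditional retrieval: each step fetches only what the current
# decision needs, validates it, and only then unlocks the next step.

def retrieve_destination(constraints):
    # Stand-in for a real destination search.
    return {"name": "Lakeside Ridge", "trail_days": 5, "season_ok": True}

def retrieve_flight(destination, max_price):
    # Stand-in for a real flight search (a real call would filter by max_price).
    return {"to": destination["name"], "price": 220}

def plan_trip(budget=1200):
    constraints = {"days": 4, "activity": "hiking"}
    plan, spent = {}, 0

    # Step 1: retrieve destination info, then validate before touching flights.
    dest = retrieve_destination(constraints)
    if dest["trail_days"] < constraints["days"] or not dest["season_ok"]:
        return None  # a real agent would adjust constraints and retry

    plan["destination"] = dest["name"]

    # Step 2: only now retrieve flights, and check them against the budget.
    flight = retrieve_flight(dest, max_price=budget - spent)
    if flight["price"] > budget - spent:
        return None

    spent += flight["price"]
    plan["flight"] = flight["price"]
    plan["remaining_budget"] = budget - spent
    return plan

print(plan_trip())
```

Notice that a budget violation at the flight step is caught at the flight step, before any hotel or activity retrieval happens.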

This is the "conditional retrieval + validation loop" pattern, and it removes a huge class of production bugs.

Conditional retrieval and validation loop

Notice the loops. The agent isn't just a straight pipeline — it pauses, checks constraints, and only then decides what to do next. Classic RAG has no loops. It's a one-shot gamble. Agentic RAG is iterative and self-correcting.

The Validation Loop in Practice

Let's make this concrete. Say the agent is looking for hotels for our hiking trip:

Hotel validation loop in practice

Each retrieval gets validated against specific constraints — budget, availability, location — before the agent moves on. If validation fails, the agent adjusts and re-retrieves. No silent failures. No hallucinated hotel that doesn't actually exist at that price.

Compare this to classic RAG, where the model gets a list of 20 hotels in one context dump and picks one that looks right. Maybe it's in budget. Maybe it's not. You won't know until the user tries to book it.
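The retrieve–validate–adjust loop can be sketched in a few lines. The hotel data, prices, and the "relax the budget by 20%" adjustment policy are all made up for illustration:

```python
# Sketch of the hotel validation loop: retrieve, check constraints,
# adjust parameters and re-retrieve on failure.

HOTELS = [
    {"name": "Summit Lodge", "price": 310, "near_trailhead": True},
    {"name": "Trail Inn", "price": 95, "near_trailhead": True},
    {"name": "City Plaza", "price": 80, "near_trailhead": False},
]

def search_hotels(max_price, near_trailhead):
    # Stand-in for a real hotel API call.
    return [h for h in HOTELS
            if h["price"] <= max_price and h["near_trailhead"] == near_trailhead]

def find_hotel(budget_per_night, max_attempts=3):
    max_price = budget_per_night
    for _ in range(max_attempts):
        results = search_hotels(max_price, near_trailhead=True)
        if results:  # validation: at least one real hotel matches the constraints
            return min(results, key=lambda h: h["price"])
        max_price = int(max_price * 1.2)  # relax the budget slightly and retry
    return None  # fail loudly instead of inventing a hotel

print(find_hotel(100))
```

The important part is the last line: when validation keeps failing, the agent returns nothing rather than a plausible-looking hotel that doesn't exist at that price.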

The Cost Case for Conditional Retrieval

Here's the part that makes your finance team happy: Agentic RAG is cheaper to run.

This seems counterintuitive. You're doing more steps — reason, retrieve, validate, repeat. But here's why it's actually cheaper:

  • Fewer tokens per request. You retrieve only the data you need for the current step, not the entire knowledge base dump. A focused hotel query is 200 tokens of context. The "everything at once" approach can easily be 4,000+.
  • Fewer tool calls overall. Conditional retrieval means you skip irrelevant sources. If the user's destination is already decided, you don't waste a call retrieving "top destinations for hiking."
  • Fewer retries. When the model gets confused by too much context, it gives bad answers. Bad answers trigger retries, clarifications, follow-up calls. Clean context → right answer the first time.
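A back-of-the-envelope calculation using the token figures from the bullets above (a few hundred tokens per focused step versus a 4,000-token dump) shows why the step-by-step approach still comes out ahead; the per-step numbers are illustrative:

```python
# Hypothetical per-request context sizes, per the bullets above.
stuffed_context = 4000                 # everything at once, on every request
focused_steps = [200, 250, 180, 220]   # destination, flights, hotels, dinner

agentic_total = sum(focused_steps)
print(agentic_total)  # still well under a single stuffed prompt
```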

But here's the piece that most tutorials miss: budgets are part of agent state.

If you're running agents in production, you need to track:

  • Token usage per step and per session
  • Number of tool calls
  • Retry counts
  • Total execution time

Your supervisor loop (the thing orchestrating the agent's steps) should enforce limits. If the token budget is hit, the agent should stop, summarize what it has so far, or ask the user for confirmation before continuing.
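Here's a minimal sketch of that idea: budget as explicit state, checked by the supervisor before every step. The step functions, limits, and token counts are all illustrative assumptions:

```python
# Hypothetical supervisor loop that treats budgets as first-class agent state.
from dataclasses import dataclass

@dataclass
class AgentBudget:
    max_tokens: int = 8000
    max_tool_calls: int = 10
    tokens_used: int = 0
    tool_calls: int = 0

    def exhausted(self) -> bool:
        return (self.tokens_used >= self.max_tokens
                or self.tool_calls >= self.max_tool_calls)

def supervise(steps, budget: AgentBudget):
    """Run steps in order; stop and summarize when the budget is hit."""
    results = []
    for step in steps:
        if budget.exhausted():
            return results, "budget hit: summarizing partial plan"
        result, tokens = step()      # each step reports its token cost
        budget.tokens_used += tokens
        budget.tool_calls += 1
        results.append(result)
    return results, "done"

# Illustrative steps with made-up token costs.
steps = [
    lambda: ("destination", 3000),
    lambda: ("flights", 3500),
    lambda: ("hotels", 2500),
    lambda: ("dinner", 1000),
]
results, status = supervise(steps, AgentBudget())
print(results, status)
```

With these numbers the agent completes three steps, hits the 8,000-token ceiling, and returns a partial plan instead of silently burning more budget.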

Think about it this way: we don't keep searching hotels forever just to find a $2 cheaper option. At some point, the cost of searching exceeds the savings. A well-designed agent knows when to stop.

This is where the "agentic" part really matters. A classic RAG pipeline has no concept of cost awareness: it retrieves, it stuffs, it's done. An agentic system can reason about whether the next retrieval is worth the cost.

Why This Matters in Production

In a demo, classic RAG works fine. The inputs are controlled. The context is small. The constraints are simple. You show it on stage, everyone claps, you ship it.

Then production happens.

Real users have complex, multi-constraint requests. The context grows. The prompt gets bloated. The model starts missing things. You add more retrieval to "fix" it, which makes the prompt bigger, which makes the model miss more things. It's a death spiral.

Agentic RAG breaks this cycle because:

  1. Each step is scoped. The model reasons about one thing at a time with only the context it needs.
  2. Validation catches errors early. A constraint violation at step 3 gets caught at step 3, not discovered when the user sees the final output.
  3. The agent is cost-aware. It doesn't burn through your API budget retrieving data it doesn't need.
  4. It scales with complexity. Adding a new constraint (e.g., "must be wheelchair accessible") means adding a validation check, not restructuring the entire prompt.
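Point 4 is worth seeing in code: in a validation-based design, a new constraint is one more check in a list, not a prompt rewrite. The field names and data here are hypothetical:

```python
# Sketch: constraints as a list of composable validation checks.
checks = [
    lambda hotel, req: hotel["price"] <= req["budget_per_night"],
    lambda hotel, req: hotel["available"],
]

# Adding "must be wheelchair accessible" is one line, not a prompt redesign:
checks.append(lambda hotel, req: hotel["wheelchair_accessible"])

def passes(hotel, req):
    return all(check(hotel, req) for check in checks)

hotel = {"price": 95, "available": True, "wheelchair_accessible": True}
print(passes(hotel, {"budget_per_night": 120}))
```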

Takeaways

If you're building agents that use retrieval (and most agents do), here's your checklist:

  • Make retrieval conditional, not automatic. The agent should decide whether to retrieve, not just retrieve by default.
  • Validate after every retrieval step. Check the results against your constraints before moving on.
  • Scope your context. Each step should get only the data it needs — not the full knowledge base.
  • Track your costs as state. Token usage, tool calls, retries, and execution time should be first-class values in your agent's state.
  • Enforce limits in your supervisor loop. Set budgets and let the agent know when to stop searching and start deciding.
  • Design for re-retrieval. When validation fails, the agent should be able to adjust parameters and try again, not crash or hallucinate.

For a deeper dive into advanced Agentic RAG patterns, check out Pamela Fox's session on the topic — it's an excellent companion to what we've covered here.

Are you using classic RAG or agentic RAG in your projects? Share your thoughts in the comments below!

Top comments (3)

Nova Elvaris

The "retrieval should be a decision, not a reflex" line nails it. I've been building automation workflows that route across multiple tools, and the single biggest improvement came from adding a planning step that decides which data sources to query before querying any of them. The classic RAG approach is basically SELECT * FROM everything — and we'd never accept that in a database query, so why do we accept it in our retrieval pipelines? One practical pattern that's worked well: have the agent emit a retrieval plan as structured data (which sources, what filters, in what order) before executing any of it. That plan becomes auditable and cacheable, and you can catch bad retrieval strategies before they waste tokens. Have you seen teams adopt that kind of explicit retrieval planning in production, or does most agentic RAG still rely on the model implicitly deciding what to fetch?

Apex Stack

The cost-awareness angle is what makes this click. I run a fleet of agents that do daily SEO audits, content publishing, and data analysis across 89K+ pages — and the biggest lesson was exactly this: retrieval should be a decision, not a reflex.

The validation loop pattern maps directly to what I do for site auditing. Instead of dumping all GSC data + page content + competitor data into one prompt, the agent checks indexing status first, then only crawls pages that actually changed, then only runs the LLM analysis on pages that failed the deterministic checks. Each step filters down the work for the next step.

The budget tracking as first-class state is underrated advice. I learned this the hard way — one agent with no cost ceiling burned through tokens doing unnecessary re-retrieval on data that hadn't changed since the last run. Adding a simple cache-check step ("has this data changed since last audit?") cut token usage by ~70%.

Curious if you've seen patterns for sharing retrieval results across agent steps without re-stuffing them into context. That's my current friction point — step 3 needs something step 1 retrieved, but I don't want to carry the full context forward.

Jakub

The validation loop pattern resonates a lot. We run autonomous agents that handle multi-step workflows (data collection, analysis, then action) and the single biggest improvement was exactly this shift -- making each step validate before triggering the next one.

One thing I'd add: the "scope your context" advice also applies to agent memory between runs. If your agent runs daily (like ours do for monitoring tasks), you don't want it re-ingesting yesterday's full state dump. A quick diff check ("what actually changed?") before any retrieval saves a surprising amount of tokens and keeps the agent focused on what matters.

The budget-as-state pattern is underrated too. We had an agent that would keep retrying failed API calls with increasingly creative workarounds, burning tokens on a problem that needed a human to fix. Adding a simple "if retries > 2, stop and report" rule was trivial but cut wasted spend significantly.