DEV Community

Ravi Teja Reddy Mandala

Your AI Agent Is Not Failing. Your System Design Is.

Everyone is blaming AI agents.

“They hallucinate.”
“They don’t scale.”
“They can’t handle production.”

That’s not the real problem.

The real problem?

We are treating AI agents like tools.

Instead of systems.

In production, nothing works in isolation.

Not your services.
Not your pipelines.
Not your on-call workflows.

But somehow…

We expect AI agents to just “figure it out.”

Here’s what I’ve seen in real systems:

AI fails when:

  • Context is fragmented
  • State is lost between steps
  • Decisions are not traceable
  • There are no guardrails

Not because the model is bad.

Most teams are building:

❌ Prompt → Response → Done

But production needs:

✅ Context → State → Memory → Feedback → Control
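As a minimal sketch of that pipeline (all names here are illustrative — `call_model` stands in for whatever LLM client you actually use):

```python
# Sketch of Context → State → Memory → Feedback → Control as a bounded loop.
import json

def call_model(prompt: str) -> str:
    # Placeholder: swap in your real LLM call.
    return '{"action": "done", "result": "ok"}'

def run_agent(task: str, max_steps: int = 5) -> dict:
    state = {"task": task, "history": []}           # State: explicit, inspectable
    for step in range(max_steps):                   # Control: bounded loop, no runaway agent
        context = json.dumps(state)                 # Context: assembled from state, not ad hoc
        raw = call_model(f"Task state: {context}")
        try:
            decision = json.loads(raw)              # Feedback: validate before trusting output
        except json.JSONDecodeError:
            state["history"].append({"step": step, "error": "invalid output"})
            continue                                # Record the failure and retry
        state["history"].append({"step": step, "decision": decision})  # Memory: every step traced
        if decision.get("action") == "done":
            return state
    raise RuntimeError("Agent exceeded step budget")  # Guardrail: fail loudly, not silently
```

The point isn't the specific code, it's that every arrow in the pipeline becomes an explicit, debuggable step.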

That’s the difference between:

👉 Demo AI
vs
👉 Production AI

The shift is simple, but most miss it:

AI agents are not features.

They are distributed systems with reasoning loops.

Until we design them that way…

We’ll keep blaming the model
for system problems.

#ai #sre #agents #devops #llm

Top comments (10)

Benjamin Nguyen

Nice, Ravi! I'm curious: are you following any other groups on dev.to? They have Google here. I've started following Snowflake.

Ravi Teja Reddy Mandala

Thanks Benjamin!

Yeah, I’ve been following a mix of AI and SRE-related communities here. Google and Snowflake are definitely good ones. I also keep an eye on posts around agents, devtools, and infra since that’s where most of the real-world discussions are happening.

Open to recommendations if you’ve found any good ones 👍

Benjamin Nguyen

Nice! You can check out some of my recent posts. Other people who write excellent posts show up in my comment sections. I'm curious:

Are you reading the articles about the situation of junior developers on dev.to? A few people have written posts about it recently.

Ravi Teja Reddy Mandala

Nice, I’ll check out your posts 👍

Yeah, I’ve seen some discussions around junior dev roles lately, interesting space. Feels like things are shifting more toward how developers use AI effectively rather than replacing them.

Curious to see how that evolves.

Benjamin Nguyen

Thank you! Yeah, I think companies will still need junior developers in the pipeline for the near future. The role is changing very quickly with AI these days.

Ravi Teja Reddy Mandala

Yeah, completely agree.

Feels like the role isn’t going away, it’s just evolving. Junior devs who learn how to work effectively with AI will probably ramp up much faster than before.

The expectations might shift, but the need for strong fundamentals is still there 👍

PEACEBINFLOW

This hits the nail on the head. We’re currently in a weird phase where we’re giving "Junior Developer" level autonomy to systems with "Zero-Day" infrastructure support.

The Reasoning Loop vs. The Static Script
The biggest mental shift is realizing that an agent isn't a function call; it’s a non-deterministic microservice. When we build a standard API, we map out every exit. With agents, we're basically dropping a traveler into a dark room with a flashlight and getting mad when they trip over the furniture we didn't tell them was there.

Fragmented Context is the Silent Killer
You're so right about state. In a distributed system, if Service A doesn't know what Service B did, the system fails. But with AI, we often expect it to "remember" or "infer" context that isn't explicitly in the current prompt window. That’s not a model hallucination; that’s a telemetry gap.

A Quick Reflection
It makes me wonder: will the next "10x Engineer" be the one who writes the best prompts, or the one who builds the most robust observability and feedback loops around the models?

If we don't design for the "Feedback → Control" part of your equation, we’re just building very expensive, very fast ways to make mistakes at scale.

Ravi Teja Reddy Mandala

“Non-deterministic microservice” is such a clean way to frame it. That shift alone changes how you design everything.

The “telemetry gap vs hallucination” point is 🔥. I think a lot of teams are still debugging agents like black boxes instead of distributed systems.

What’s been interesting to me is:
We over-invest in prompting, and under-invest in observability.

In traditional systems, we’d never deploy without:

  • traces
  • logs
  • metrics
  • clear state transitions

But with agents, we expect reasoning to just “work” without visibility.
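That visibility doesn't have to be heavy. A tiny sketch of step-level tracing, where every reasoning step emits a structured event (the field names are illustrative, not from any particular framework):

```python
# Illustrative sketch: one structured trace event per agent step, so
# reasoning is debuggable like any other distributed system.
import json
import time
import uuid

def trace_step(agent: str, step: int, state_in: dict, decision: dict) -> dict:
    event = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "agent": agent,
        "step": step,
        "state_in": state_in,      # what the agent saw
        "decision": decision,      # what it chose
    }
    print(json.dumps(event))       # ship to your real log/trace pipeline instead
    return event
```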

I’ve started thinking about agents as:
Reasoning loops + state + feedback + control, not just prompts.

Your point on “Feedback → Control” is key.
Without that, we’re basically scaling uncertainty.

Curious how you’re thinking about control mechanisms:

Are you using guardrails at each step?
Or more centralized orchestration with global context?

Feels like this is where the next level of agent systems will differentiate.

Apex Stack

This resonates hard. I run a fleet of AI agents that manage different parts of a large Astro site — SEO auditing, content generation, data pipelines — and the single biggest unlock was treating each agent as a node in a system, not a standalone tool.

The state management point is spot on. When I first set up my agents, they'd lose context between runs and make conflicting decisions. Adding persistent state (even just writing to shared files between sessions) and explicit feedback loops made them go from "cool demo" to actually useful in production.
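For reference, the "shared files between sessions" approach can be as simple as this sketch (file path and fields are illustrative):

```python
# Persist agent state to a shared JSON file so each run starts with
# the context of previous runs instead of starting cold.
import json
from pathlib import Path

STATE_FILE = Path("agent_state.json")  # illustrative shared location

def load_state() -> dict:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"runs": []}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2))

# Each run reads prior context, appends its result, and writes back:
state = load_state()
state["runs"].append({"run": len(state["runs"]) + 1, "note": "seo audit complete"})
save_state(state)
```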

Curious if you've found good patterns for inter-agent coordination — like when one agent's output feeds another's input. That handoff is where most of my early bugs lived.

Ravi Teja Reddy Mandala

This is a great callout. The “agent as a node” mindset is exactly where things start to click.

I’ve been seeing the same pattern, especially around state and handoffs. Most early failures aren’t model issues, they’re coordination issues. We treat outputs like final answers instead of intermediate artifacts.

One pattern that’s been working for me:

  • Treat every agent output as structured state, not free text
  • Introduce validation layers between agents (schema + sanity checks)
  • Add lightweight “contract enforcement” at handoff points

Almost like APIs between agents instead of loose chaining.
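A minimal sketch of that contract enforcement at a handoff point (the schema and fields are purely illustrative):

```python
# Validate Agent A's output against a typed contract before Agent B sees it.
import json

HANDOFF_SCHEMA = {"url": str, "score": float, "issues": list}  # illustrative contract

def validate_handoff(raw: str) -> dict:
    payload = json.loads(raw)                          # structured state, not free text
    for field, ftype in HANDOFF_SCHEMA.items():
        if not isinstance(payload.get(field), ftype):  # schema + sanity check
            raise ValueError(f"handoff violates contract: bad '{field}'")
    return payload

# Agent A's output is checked before it ever reaches Agent B:
ok = validate_handoff('{"url": "https://example.com", "score": 0.92, "issues": []}')
```

In practice you'd likely reach for a real schema library (e.g. Pydantic or JSON Schema), but even this much turns loose chaining into an API boundary with a clear failure mode.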

Also curious, how are you handling failure recovery?
Like when Agent B receives incomplete or low-confidence input from A — do you retry upstream, fallback, or let B adapt?

Feels like that’s where production systems either become resilient or chaotic.