I run about 30 projects from a home lab. Two servers, a handful of cron jobs, a few web apps, a couple of bots. Something breaks every day and I don't find out until I check manually.
So I built PrimeBus: a system that watches all of my projects for changes, detects when something breaks, generates two competing AI-powered fixes, scores them against each other, and either merges the winner or escalates to a human when it's not confident.
The Architecture: Signal → Detect → Resolve
Everything flows through NATS JetStream as structured events.
Signal. Watcher agents poll git repos every 30 seconds, check PyPI for dependency updates daily, and listen for application lifecycle events. When a new commit lands, the GitWatcher publishes a change.repo.{project}.push event.
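A push event might look roughly like this (the helper name and payload fields are my assumptions; the post doesn't show the actual schema):

```python
import json

def make_push_event(project: str, commit_sha: str, branch: str = "main"):
    """Build the (subject, payload) pair for a change.repo.{project}.push event.
    Hypothetical schema -- field names are illustrative, not PrimeBus's real ones."""
    subject = f"change.repo.{project}.push"
    payload = json.dumps({
        "project": project,
        "commit": commit_sha,
        "branch": branch,
    }).encode()
    return subject, payload

# With nats-py, publishing would look roughly like:
#   nc = await nats.connect("nats://localhost:4222")
#   js = nc.jetstream()
#   await js.publish(*make_push_event("myproject", sha))
```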
Detect. A TestRunner subscribes to push events, checks out the commit in an isolated git worktree, and runs the test suite. If tests fail, it publishes a digest.alert.
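The worktree isolation step can be sketched like this (assuming pytest as the runner; error handling and timeouts omitted):

```python
import subprocess

def worktree_add_cmd(repo_path: str, commit_sha: str, worktree_dir: str) -> list:
    # `git worktree add --detach` gives each test run its own checkout
    # without touching the main working tree.
    return ["git", "-C", repo_path, "worktree", "add", "--detach",
            worktree_dir, commit_sha]

def run_tests_in_worktree(repo_path: str, commit_sha: str, worktree_dir: str) -> bool:
    """Sketch of the TestRunner step: check out the commit in an isolated
    worktree, run the suite, clean up. Returns False on failure -- the
    caller would then publish a digest.alert."""
    subprocess.run(worktree_add_cmd(repo_path, commit_sha, worktree_dir), check=True)
    try:
        return subprocess.run(["pytest", "-q"], cwd=worktree_dir).returncode == 0
    finally:
        subprocess.run(["git", "-C", repo_path, "worktree",
                        "remove", "--force", worktree_dir])
```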
Resolve. When a test failure is detected, the FixGenerator fires two parallel Claude API calls with different strategies.
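The dual-call dispatch can be sketched as follows (the Claude call itself is stubbed out here so the flow is runnable; in the real system it would be something like an anthropic.AsyncAnthropic messages call):

```python
import asyncio

async def generate_fix(strategy: str, temperature: float, failure_context: str) -> str:
    # Placeholder for the actual Claude API call -- returns a labeled
    # stub diff so the concurrency pattern is visible.
    await asyncio.sleep(0)
    return f"[{strategy} diff @ T={temperature} for: {failure_context}]"

async def generate_variants(failure_context: str):
    """Fire both strategies concurrently via asyncio.gather."""
    return await asyncio.gather(
        generate_fix("minimal", 0.2, failure_context),  # Variant A
        generate_fix("robust", 0.7, failure_context),   # Variant B
    )

variant_a, variant_b = asyncio.run(generate_variants("test_parser failed"))
```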
The A/B Code Tournament
This is the core idea. Every fix attempt produces two competing variants:
- Variant A — Minimal strategy. Low temperature (0.2), smallest possible change that makes tests pass.
- Variant B — Robust strategy. Higher temperature (0.7), thorough fix that addresses root cause.
Both run concurrently via asyncio.gather. Each gets applied to its own git worktree. The ValidationHarness scores each on three dimensions:
| Dimension | Weight |
|---|---|
| Test pass rate | 60% |
| Diff size (smaller = better) | 25% |
| Clean apply | 15% |
The ScoringEngine picks the winner. Score ≥ 75 with confidence ≥ 0.7 → auto-merge. Below that → escalate to a human.
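The weighted score and the merge/escalate decision reduce to a few lines. A minimal sketch, assuming each dimension is pre-normalized to a 0-100 scale (with diff size inverted so smaller diffs score higher):

```python
# Weights from the scoring table above.
WEIGHTS = {"test_pass_rate": 0.60, "diff_size": 0.25, "clean_apply": 0.15}

def score_variant(metrics: dict) -> float:
    """Weighted score on a 0-100 scale. Assumes metrics are already
    normalized to 0-100 (diff_size inverted: smaller diff = higher score)."""
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)

def decide(score_a: float, score_b: float, confidence: float):
    """Pick the winner, then apply the auto-merge gate:
    score >= 75 and confidence >= 0.7, else escalate to a human."""
    winner, score = ("A", score_a) if score_a >= score_b else ("B", score_b)
    if score >= 75 and confidence >= 0.7:
        return winner, "auto-merge"
    return winner, "escalate"
```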
The Escalation Path
PrimeBus integrates with HumanRail, a human-in-the-loop task routing API. When confidence is low, it creates a task with the failure context, both variant diffs, and the scoring breakdown. A human reviewer approves, rejects, or provides guidance.
Every escalation outcome feeds back into prompt tuning. The system gets smarter with every fix cycle.
Why NATS JetStream
I evaluated Kafka, RabbitMQ, and Redis Streams. NATS won on three points:
- Operational simplicity. Single binary, 128MB Docker container. No Zookeeper, no JVM.
- Durable consumers. At-least-once delivery. Crash at 3 AM, restart at 3:01, pick up where you left off.
- Subject-based routing. Wildcards like change.repo.*.push mean adding a new telemetry source is just publishing to a new subject.
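NATS-style subject matching is easy to illustrate with a tiny matcher (a simplification of the real semantics: `*` matches exactly one dot-separated token, `>` matches one or more trailing tokens):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Simplified NATS subject matching for illustration only."""
    p_tokens, s_tokens = pattern.split("."), subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must match at least one remaining token.
            return len(s_tokens) > i
        if i >= len(s_tokens) or (p != "*" and p != s_tokens[i]):
            return False
    return len(p_tokens) == len(s_tokens)
```

So a single subscriber on change.repo.*.push sees pushes from every repo, and new sources need no subscriber changes.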
What Actually Runs in Production
13 agents in a single Python process, ~50MB RAM:
- GitWatcher — Polls 4 repos every 30s
- TestRunner — Isolated worktree test execution
- DepWatcher/DepUpdater — PyPI dependency monitoring + auto-bump
- FixGenerator — Dual Claude API variant generation
- ValidationHarness — Worktree apply + test + score
- ScoringEngine — A/B comparison + decision
- PRCreator — Git branch creation or HumanRail escalation
- FeedbackAgent — Outcome tracking + learning loop
- DigestAgent — Daily Discord summaries
- DiscordAlerter — Real-time failure/escalation alerts
- PerformanceAnalyzer — Claude-powered anomaly detection on app metrics
- ReactorAgent — Event-driven Claude CLI dispatcher with risk gates
Current throughput: ~280 events/day (180 external + 100 internal).
The Feedback Loop
Every fix outcome is tracked in SQLite: merged, reverted, or overridden by a human. This history gets injected into generation prompts. The system remembers what worked and biases toward those patterns.
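The tracking side of that loop could be as small as this (the table schema is my assumption, not PrimeBus's actual one):

```python
import sqlite3

def record_outcome(conn: sqlite3.Connection, project: str, variant: str, outcome: str):
    """Store one fix outcome: 'merged', 'reverted', or 'overridden'.
    Hypothetical schema for illustration."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fix_outcomes (project TEXT, variant TEXT, outcome TEXT)"
    )
    conn.execute(
        "INSERT INTO fix_outcomes VALUES (?, ?, ?)", (project, variant, outcome)
    )

def success_bias(conn: sqlite3.Connection, project: str) -> dict:
    """Count merged fixes per variant strategy -- the kind of history
    that gets injected back into generation prompts."""
    rows = conn.execute(
        "SELECT variant, COUNT(*) FROM fix_outcomes "
        "WHERE project = ? AND outcome = 'merged' GROUP BY variant",
        (project,),
    ).fetchall()
    return dict(rows)
```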
Before A/B tournaments: ~40% fix success rate. After: ~65%.
The Competitive Gap
No single product builds the full pipeline:
- Renovate/Dependabot → dependencies only
- Devin/SWE-agent → human-triggered, no A/B
- Datadog/Grafana → observe but don't act
- AlphaCode → proved A/B works, never productized
The gap is the orchestration layer: telemetry → detection → AI code gen → A/B validation → feedback. That's PrimeBus.
Key Takeaways
- A/B beats single-shot. Two competing strategies find better fixes than one attempt.
- Telemetry-first design. Build the bus to collect signals. Acting on them is just one subscriber.
- Human escalation is a feature. Systems that know their limits are safer than systems that don't.
- NATS JetStream is underrated. For hundreds-to-thousands of events/day, it's dramatically simpler than Kafka with the same durability.
If you're running multiple projects and want help building autonomous monitoring and self-healing infrastructure, Prime Automation Solutions builds exactly this kind of system.