I run about 30 projects from a home lab. Two servers, a handful of cron jobs, a few web apps, a couple of bots. Something breaks every day and I don't find out until I check manually.
So I built PrimeBus: a system that watches all of my projects for changes, detects when something breaks, generates two competing AI-powered fixes, scores them against each other, and either merges the winner or escalates to a human when it's not confident.
The Architecture: Signal → Detect → Resolve
Everything flows through NATS JetStream as structured events.
Signal. Watcher agents poll git repos every 30 seconds, check PyPI for dependency updates daily, and listen for application lifecycle events. When a new commit lands, the GitWatcher publishes a change.repo.{project}.push event.
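A push event might look roughly like this (the helper name and payload fields are my assumptions; the post doesn't show the actual schema):

```python
import json

def make_push_event(project: str, commit_sha: str, branch: str = "main"):
    """Build the (subject, payload) pair for a change.repo.{project}.push event.
    Hypothetical schema -- field names are illustrative, not PrimeBus's real ones."""
    subject = f"change.repo.{project}.push"
    payload = json.dumps({
        "project": project,
        "commit": commit_sha,
        "branch": branch,
    }).encode()
    return subject, payload

# With nats-py, publishing would look roughly like:
#   nc = await nats.connect("nats://localhost:4222")
#   js = nc.jetstream()
#   await js.publish(*make_push_event("myproject", sha))
```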
Detect. A TestRunner subscribes to push events, checks out the commit in an isolated git worktree, and runs the test suite. If tests fail, it publishes a digest.alert.
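The worktree isolation step can be sketched like this (assuming pytest as the runner; error handling and timeouts omitted):

```python
import subprocess

def worktree_add_cmd(repo_path: str, commit_sha: str, worktree_dir: str) -> list:
    # `git worktree add --detach` gives each test run its own checkout
    # without touching the main working tree.
    return ["git", "-C", repo_path, "worktree", "add", "--detach",
            worktree_dir, commit_sha]

def run_tests_in_worktree(repo_path: str, commit_sha: str, worktree_dir: str) -> bool:
    """Sketch of the TestRunner step: check out the commit in an isolated
    worktree, run the suite, clean up. Returns False on failure -- the
    caller would then publish a digest.alert."""
    subprocess.run(worktree_add_cmd(repo_path, commit_sha, worktree_dir), check=True)
    try:
        return subprocess.run(["pytest", "-q"], cwd=worktree_dir).returncode == 0
    finally:
        subprocess.run(["git", "-C", repo_path, "worktree",
                        "remove", "--force", worktree_dir])
```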
Resolve. When a test failure is detected, the FixGenerator fires two parallel Claude API calls with different strategies.
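The dual-call dispatch can be sketched as follows (the Claude call itself is stubbed out here so the flow is runnable; in the real system it would be something like an anthropic.AsyncAnthropic messages call):

```python
import asyncio

async def generate_fix(strategy: str, temperature: float, failure_context: str) -> str:
    # Placeholder for the actual Claude API call -- returns a labeled
    # stub diff so the concurrency pattern is visible.
    await asyncio.sleep(0)
    return f"[{strategy} diff @ T={temperature} for: {failure_context}]"

async def generate_variants(failure_context: str):
    """Fire both strategies concurrently via asyncio.gather."""
    return await asyncio.gather(
        generate_fix("minimal", 0.2, failure_context),  # Variant A
        generate_fix("robust", 0.7, failure_context),   # Variant B
    )

variant_a, variant_b = asyncio.run(generate_variants("test_parser failed"))
```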
The A/B Code Tournament
This is the core idea. Every fix attempt produces two competing variants:
- Variant A — Minimal strategy. Low temperature (0.2), smallest possible change that makes tests pass.
- Variant B — Robust strategy. Higher temperature (0.7), thorough fix that addresses root cause.
Both run concurrently via asyncio.gather. Each gets applied to its own git worktree. The ValidationHarness scores each on three dimensions:
| Dimension | Weight |
|---|---|
| Test pass rate | 60% |
| Diff size (smaller = better) | 25% |
| Clean apply | 15% |
The ScoringEngine picks the winner. Score ≥ 75 with confidence ≥ 0.7 → auto-merge. Below that → escalate to a human.
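The weighted score and the merge/escalate decision reduce to a few lines. A minimal sketch, assuming each dimension is pre-normalized to a 0-100 scale (with diff size inverted so smaller diffs score higher):

```python
# Weights from the scoring table above.
WEIGHTS = {"test_pass_rate": 0.60, "diff_size": 0.25, "clean_apply": 0.15}

def score_variant(metrics: dict) -> float:
    """Weighted score on a 0-100 scale. Assumes metrics are already
    normalized to 0-100 (diff_size inverted: smaller diff = higher score)."""
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)

def decide(score_a: float, score_b: float, confidence: float):
    """Pick the winner, then apply the auto-merge gate:
    score >= 75 and confidence >= 0.7, else escalate to a human."""
    winner, score = ("A", score_a) if score_a >= score_b else ("B", score_b)
    if score >= 75 and confidence >= 0.7:
        return winner, "auto-merge"
    return winner, "escalate"
```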
The Escalation Path
PrimeBus integrates with HumanRail, a human-in-the-loop task routing API. When confidence is low, it creates a task with the failure context, both variant diffs, and the scoring breakdown. A human reviewer approves, rejects, or provides guidance.
Every escalation outcome feeds back into prompt tuning. The system gets smarter with every fix cycle.
Why NATS JetStream
I evaluated Kafka, RabbitMQ, and Redis Streams. NATS won on three points:
- Operational simplicity. Single binary, 128MB Docker container. No Zookeeper, no JVM.
- Durable consumers. At-least-once delivery. Crash at 3 AM, restart at 3:01, pick up where you left off.
- Subject-based routing. Wildcards like change.repo.*.push mean adding a new telemetry source is just publishing to a new subject.
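NATS-style subject matching is easy to illustrate with a tiny matcher (a simplification of the real semantics: `*` matches exactly one dot-separated token, `>` matches one or more trailing tokens):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Simplified NATS subject matching for illustration only."""
    p_tokens, s_tokens = pattern.split("."), subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must match at least one remaining token.
            return len(s_tokens) > i
        if i >= len(s_tokens) or (p != "*" and p != s_tokens[i]):
            return False
    return len(p_tokens) == len(s_tokens)
```

So a single subscriber on change.repo.*.push sees pushes from every repo, and new sources need no subscriber changes.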
What Actually Runs in Production
13 agents in a single Python process, ~50MB RAM:
- GitWatcher — Polls 4 repos every 30s
- TestRunner — Isolated worktree test execution
- DepWatcher/DepUpdater — PyPI dependency monitoring + auto-bump
- FixGenerator — Dual Claude API variant generation
- ValidationHarness — Worktree apply + test + score
- ScoringEngine — A/B comparison + decision
- PRCreator — Git branch creation or HumanRail escalation
- FeedbackAgent — Outcome tracking + learning loop
- DigestAgent — Daily Discord summaries
- DiscordAlerter — Real-time failure/escalation alerts
- PerformanceAnalyzer — Claude-powered anomaly detection on app metrics
- ReactorAgent — Event-driven Claude CLI dispatcher with risk gates
Current throughput: ~280 events/day (180 external + 100 internal).
The Feedback Loop
Every fix outcome is tracked in SQLite: merged, reverted, or overridden by a human. This history gets injected into generation prompts. The system remembers what worked and biases toward those patterns.
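The tracking side of that loop could be as small as this (the table schema is my assumption, not PrimeBus's actual one):

```python
import sqlite3

def record_outcome(conn: sqlite3.Connection, project: str, variant: str, outcome: str):
    """Store one fix outcome: 'merged', 'reverted', or 'overridden'.
    Hypothetical schema for illustration."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fix_outcomes (project TEXT, variant TEXT, outcome TEXT)"
    )
    conn.execute(
        "INSERT INTO fix_outcomes VALUES (?, ?, ?)", (project, variant, outcome)
    )

def success_bias(conn: sqlite3.Connection, project: str) -> dict:
    """Count merged fixes per variant strategy -- the kind of history
    that gets injected back into generation prompts."""
    rows = conn.execute(
        "SELECT variant, COUNT(*) FROM fix_outcomes "
        "WHERE project = ? AND outcome = 'merged' GROUP BY variant",
        (project,),
    ).fetchall()
    return dict(rows)
```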
Before A/B tournaments: ~40% fix success rate. After: ~65%.
The Competitive Gap
No single product builds the full pipeline:
- Renovate/Dependabot → dependencies only
- Devin/SWE-agent → human-triggered, no A/B
- Datadog/Grafana → observe but don't act
- AlphaCode → proved A/B works, never productized
The gap is the orchestration layer: telemetry → detection → AI code gen → A/B validation → feedback. That's PrimeBus.
Key Takeaways
- A/B beats single-shot. Two competing strategies find better fixes than one attempt.
- Telemetry-first design. Build the bus to collect signals. Acting on them is just one subscriber.
- Human escalation is a feature. Systems that know their limits are safer than systems that don't.
- NATS JetStream is underrated. For hundreds-to-thousands of events/day, it's dramatically simpler than Kafka with the same durability.
If you're running multiple projects and want help building autonomous monitoring and self-healing infrastructure, Prime Automation Solutions builds exactly this kind of system.