
I Built an OS for AI Agents — Here's What I Learned

After 15 years as a solution architect and a catastrophic data loss that wiped my entire codebase, I rebuilt an agent runtime from scratch. 2,936 tests, 13 execution topologies, and a 7-agent adversarial audit later — here's the honest story.

Varun Pratap Bhardwaj

I spent the last six months building an operating system for AI agents. Not a framework. Not a wrapper around OpenAI's API. A runtime that handles routing, quality, memory, cost management, and execution topology — the 80% of agent infrastructure that every team rebuilds from scratch.

It's called Qualixar OS, it just went public, and I want to be honest about what works, what doesn't, and what I learned along the way.

Qualixar: seven open-source primitives · seven peer-reviewed papers · one reliability platform

The problem nobody warned me about

I'm a solution architect with 15 years of enterprise IT experience. When I started building multi-agent systems, the agents themselves were the easy part. Getting Claude to analyze a document or GPT-4o to classify a ticket took an afternoon.

Then I spent the next three months building everything around them.

Routing logic that could pick the right model based on cost, latency, and quality constraints — not just hardcoded model names. A judge pipeline that could evaluate whether an agent's output was actually good. Memory that persisted between sessions. A way to run agents in parallel, in pipelines, in hierarchies, in debate configurations, without rewriting the orchestration layer each time. Cost tracking that told me we'd burned through $400 before the pipeline even finished its second run.

The agents were maybe 15% of my codebase. The infrastructure was the rest.

I looked at LangGraph, CrewAI, AutoGen, OpenAI Swarm. They're good at what they do. But none of them solved the full operating problem: routing + quality + cost + memory + execution topology + security, in one coherent system. So I built one.

The core insight: agents need what programs got

Programs got operating systems because running directly on hardware was unsustainable. You needed process scheduling, memory management, I/O abstraction, security isolation.

Agents have the same problem. When you're orchestrating 6 agents across 3 providers with different cost profiles, quality requirements, and failure modes — you need the same abstractions. Scheduling (which agent runs when, in what topology). Memory management (what context survives between sessions). I/O abstraction (HTTP, MCP, CLI, webhooks — agents shouldn't care). Security (credential vaults, PII sanitization, SSRF protection).

That analogy drove the architecture. Qualixar OS isn't a library you import. It's a runtime you start. Agents register with it, and it handles everything else.

What shipped in v2.2.0

Here are the concrete numbers. I'm listing these because I'm tired of agent framework launches that say "powerful" and "scalable" without showing what that means.

Execution topologies: 13 built-in patterns — sequential, parallel, hierarchical, DAG, debate, mesh, star, grid, forest, circular, mixture-of-agents, maker, and hybrid. You declare the topology, the system handles the execution semantics. Debate topology, for example, runs N agents on the same input and synthesizes the outputs through a judge.
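To make the debate semantics concrete, here's a minimal sketch in TypeScript. This is not Qualixar's actual orchestrator; the agents and the majority-vote judge are toy stand-ins for the real fan-out-then-synthesize flow:

```typescript
// Debate topology sketch: run N agents on the same input in parallel,
// then let a judge synthesize their answers. Agents and judge here are
// illustrative stand-ins, not Qualixar OS internals.
type Agent = (input: string) => Promise<string>;
type Judge = (input: string, answers: string[]) => string;

async function debate(agents: Agent[], judge: Judge, input: string): Promise<string> {
  const answers = await Promise.all(agents.map((a) => a(input))); // fan out
  return judge(input, answers);                                   // synthesize
}

// Toy agents that each "vote"; the judge picks the majority answer.
const agents: Agent[] = [
  async () => "yes",
  async () => "yes",
  async () => "no",
];
const majorityJudge: Judge = (_input, answers) => {
  const counts = new Map<string, number>();
  for (const a of answers) counts.set(a, (counts.get(a) ?? 0) + 1);
  return [...counts.entries()].sort((x, y) => y[1] - x[1])[0][0];
};

const verdict = await debate(agents, majorityJudge, "Is the sky blue?");
console.log(verdict); // "yes"
```

In the real system the judge would be an LLM-backed evaluator rather than a vote counter, but the shape is the same: declare the topology, let the runtime own the fan-out and synthesis.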

Model routing: Cost-quality-latency constraint solver. You tell it "I need quality above 0.8, latency under 2 seconds, cost under $0.01 per call." It picks the model. When a provider goes down or changes pricing, routing adapts. Backed by a POMDP-based belief model (Forge AI) that can design agent teams automatically from a task description.
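A stripped-down version of that constraint filter looks like this. The catalog, model names, and numbers are invented for illustration, and the real router also carries the POMDP belief state this sketch ignores:

```typescript
// Constraint-based routing sketch: filter a model catalog by quality,
// latency, and cost constraints, then pick the cheapest survivor.
// All catalog values are made up for illustration.
interface ModelProfile {
  name: string;
  quality: number;    // 0..1 benchmark score
  latencyMs: number;  // p95 latency
  costPerCall: number;
}

interface Constraints {
  minQuality: number;
  maxLatencyMs: number;
  maxCostPerCall: number;
}

function routeModel(catalog: ModelProfile[], c: Constraints): ModelProfile | undefined {
  return catalog
    .filter((m) => m.quality >= c.minQuality
                && m.latencyMs <= c.maxLatencyMs
                && m.costPerCall <= c.maxCostPerCall)
    .sort((a, b) => a.costPerCall - b.costPerCall)[0]; // cheapest that qualifies
}

const catalog: ModelProfile[] = [
  { name: "frontier-xl", quality: 0.95, latencyMs: 3500, costPerCall: 0.03 },
  { name: "mid-tier",    quality: 0.85, latencyMs: 1200, costPerCall: 0.008 },
  { name: "tiny-fast",   quality: 0.60, latencyMs: 300,  costPerCall: 0.001 },
];

// "quality above 0.8, latency under 2 seconds, cost under $0.01 per call"
const pick = routeModel(catalog, { minQuality: 0.8, maxLatencyMs: 2000, maxCostPerCall: 0.01 });
console.log(pick?.name); // "mid-tier"
```

When a provider goes down, its entry drops out of the catalog and the same filter yields the next best model, which is what makes the adaptation automatic rather than a config change.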

Judge pipeline: Multi-judge consensus with few-shot calibration, built on research from AgentAssert — our contract-based reliability testing framework (arXiv:2602.22302). Not "did the agent return a response" but "is this response actually correct, complete, and safe." Few-shot examples in the judge prompts reduced calibration drift significantly.
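A minimal sketch of multi-judge consensus, assuming a simple mean-with-threshold rule. The dimensions, scores, and threshold are invented; the actual pipeline's calibration and weighting are more involved:

```typescript
// Multi-judge consensus sketch: several judges score the same output,
// and the verdict is the mean score gated by a pass threshold.
type JudgeScore = { judge: string; score: number }; // 0..1

function consensus(scores: JudgeScore[], threshold = 0.8): { mean: number; pass: boolean } {
  const mean = scores.reduce((sum, j) => sum + j.score, 0) / scores.length;
  return { mean, pass: mean >= threshold };
}

// "Correct, complete, and safe" as three judge dimensions.
const result = consensus([
  { judge: "correctness",  score: 0.9 },
  { judge: "completeness", score: 0.85 },
  { judge: "safety",       score: 0.95 },
]);
console.log(result.pass); // true
```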

Memory: SLM-Lite cognitive memory, powered by SuperLocalMemory — backed by 3 peer-reviewed papers (arXiv:2604.04514, arXiv:2603.14588, arXiv:2603.02240). SQLite-backed, fully local. Episodic memory, semantic recall via embeddings (new in v2.2.0), working memory with decay. No cloud dependency. Your agent's memory stays on your machine.
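Working memory with decay can be sketched as exponential decay over item age. The half-life and cutoff values here are invented for illustration, not SuperLocalMemory's actual parameters:

```typescript
// Working-memory decay sketch: each item's relevance decays exponentially
// with age, and recall keeps only items above a cutoff.
interface MemoryItem { text: string; storedAt: number } // epoch ms

function relevance(item: MemoryItem, now: number, halfLifeMs: number): number {
  const age = now - item.storedAt;
  return Math.pow(0.5, age / halfLifeMs); // 1.0 when fresh, 0.5 after one half-life
}

function recall(items: MemoryItem[], now: number, halfLifeMs: number, cutoff = 0.25): MemoryItem[] {
  return items.filter((i) => relevance(i, now, halfLifeMs) >= cutoff);
}

const hour = 3_600_000;
const now = Date.now();
const items: MemoryItem[] = [
  { text: "user prefers TypeScript", storedAt: now - 1 * hour },
  { text: "stale debugging note",    storedAt: now - 10 * hour },
];

// With a 2-hour half-life, the 10-hour-old note decays to ~0.03 and drops out.
const kept = recall(items, now, 2 * hour);
console.log(kept.map((i) => i.text)); // ["user prefers TypeScript"]
```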

Security: RBAC middleware on all enterprise routes. Credential vault with no plaintext exposure. PII sanitization in the output pipeline. SSRF protection on the new HTTP request tool. CSP headers. Request ID propagation for audit trails.

Communication: 7 channels — HTTP API, MCP protocol (native), CLI, Discord, Telegram, Webhook, Slack. Same agent, accessible from anywhere.

The boring stuff that matters: 2,936 tests across 213 test files. 761 source files, 161,810 lines. 852KB npm package. 25 CLI commands. 25 MCP tools. 24 dashboard tabs. 9 built-in tools. Native A2A protocol support. Programmatic API via createQosInstance(). Task execution streaming via SSE. Part of a research ecosystem with 7 peer-reviewed papers on arXiv.

Run it with npx qualixar-os.

The 7-perspective audit story

This is the part I'm most proud of, and it has nothing to do with writing code.

Before launch, I ran 7 independent AI agents (Claude Opus) against the entire codebase. Each agent got a different persona and a harsh audit prompt, with zero prior context about the repo. The perspectives:

1. Industry Architect — enterprise readiness, integration patterns

2. Agentic AI Specialist — framework design, agent orchestration

3. Academic Reviewer (PhD caliber) — algorithmic rigor, citation accuracy

4. Market Researcher — positioning, adoption barriers

5. Veteran AI/ML Architect (20 years hands-on) — production hardening

6. Competitive Intelligence — GitHub landscape analysis

7. Hardcore QA Tester — edge cases, failure modes, security

They came back with 154 raw findings. After deduplication, 76 unique issues.

The initial scores averaged 5.99 out of 10. Range: 5.0 to 7.05. They were brutal.

Some highlights of what they found:

  • RBAC middleware existed but wasn't wired to enterprise routes. Security theater.
  • The credential vault had a code path that could return plaintext secrets in an API response.
  • PII sanitization was implemented but not plugged into the chat output pipeline.
  • A race condition in the chat system: two concurrent messages to the same conversation could corrupt state because streams were keyed by conversation ID instead of message ID.
  • The README claimed framework adapter support that didn't match the actual source code.
  • Documentation called the strategy scoring system "RL Training" when it's actually weighted averaging — not reinforcement learning.
  • SSRF protection didn't exist on HTTP-based tools.
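The stream-keying race in particular is easy to illustrate. This is a minimal sketch that assumes nothing about the real chat code beyond what the finding describes:

```typescript
// Stream-keying sketch: when in-flight stream state is keyed by
// conversation ID, a second concurrent message overwrites the first;
// keying by a unique message ID keeps both. Pure illustration.
const byConversation = new Map<string, string>();
const byMessage = new Map<string, string>();

function startStream(conversationId: string, messageId: string, state: string) {
  byConversation.set(conversationId, state); // buggy: last writer wins
  byMessage.set(messageId, state);           // fixed: one entry per message
}

// Two concurrent messages land on the same conversation.
startStream("conv-1", "msg-a", "first reply stream");
startStream("conv-1", "msg-b", "second reply stream");

console.log(byConversation.size); // 1 -- the first stream's state was clobbered
console.log(byMessage.size);      // 2 -- both streams tracked independently
```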

None of these would have shown up in unit tests. Every single one would have been a production incident.

We fixed all 76 findings. Not "addressed." Fixed. RBAC wired. Vault plaintext removed. PII sanitization plugged in. Race condition resolved. Claims corrected. SSRF protection added.

Then we re-audited. Post-fix scores averaged 7.76 out of 10 (range: 7.0 to 8.5). Five GO verdicts, one Minor Revisions, one Conditional-GO. Not perfect — the Academic Reviewer wanted more formal verification, and the Competitive Intelligence agent noted the FSL license and zero-star fresh repo as adoption risks. Both fair points.

I think this process — adversarial multi-perspective audit by independent AI agents before launch — should be standard practice. It cost me one evening and caught issues that would have taken months to surface through user reports.

What doesn't work well yet

I'd be lying if I called this production-ready for everyone. Here's what's rough:

Dashboard UX is functional, not polished. 24 tabs is a lot of surface area. Some tabs feel like developer tools, not user interfaces. The streaming visualization works but needs design attention.

Some topologies are less battle-tested than others. Sequential, parallel, and hierarchical get the most exercise. Grid and circular topologies exist and pass tests, but I haven't run them on demanding real-world workloads yet.

Documentation has gaps. 72+ files sounds like a lot until you realize the system has 761 source files. The three new tutorials cover the common paths. The uncommon paths — custom topology creation, extending the judge pipeline, writing your own memory providers — still need proper guides.

Tested mainly on macOS. I develop on a Mac. The test suite runs on macOS. Linux should work (it's Node.js), and Docker is available, but I haven't done exhaustive cross-platform testing.

Fresh repo, early community. There are no Stack Overflow answers yet, no community plugins. You'd be an early adopter, with everything that implies.

The data loss story

On March 24, 2026, a catastrophic rm -rf command deleted my entire home directory. All code. All projects. All memory systems. Everything.

I won't go into the details of how it happened. What matters is what happened next.

I had architecture documents — 47 design decisions, interface specifications, database schemas — that survived because they'd been synced to a different location. No code survived. Just the blueprints.

I rebuilt from those blueprints. The second version came out cleaner. When you lose everything and rebuild from architecture docs, you don't carry forward the accumulated technical debt. You don't preserve the workarounds from when you didn't understand the problem yet. You build what you now know you should have built the first time.

The irony isn't lost on me: a system designed to be the memory and runtime backbone for AI agents was itself rebuilt from memory. Architecture survived code.

It also made me paranoid about safety in ways that shaped the product. The filesystem sandbox, the credential vault, the PII sanitization — these aren't checkboxes. They're scars.

The technical choices, briefly

TypeScript + Node.js. I know Rust would be faster. I chose developer accessibility over raw performance. If you can write JavaScript, you can extend this system. And at 852KB, the npm package suggests the cost of that accessibility is modest.

SQLite for everything. Memory, configuration, marketplace registry, agent state. One dependency. No database server. Runs everywhere.

FSL-1.1 (Functional Source License). Source available, free for non-competing use, converts to Apache 2.0 after 2 years. I know this limits adoption compared to MIT. It's a conscious trade-off while the project is young.

MCP as a first-class protocol. Every tool, every agent capability is accessible via the Model Context Protocol. This means any MCP-compatible client (Claude, Cursor, and a growing ecosystem) can use Qualixar OS natively.

What I'd do differently

Start with the judge pipeline, not the orchestrator. I built execution topologies first because they were architecturally interesting. In practice, the judge pipeline is what makes agent output trustworthy. If I were starting over, quality evaluation would be day one.

Write the paper earlier. The arXiv paper (2604.06392) forced me to formalize my thinking and cite related work properly. Several architectural improvements came directly from writing the paper, not from writing the code.

Run the adversarial audit earlier. The 7-perspective audit found issues I'd been blind to for months. Running it before launch was good. Running it every month would have been better.

Try it

npx qualixar-os

That starts the runtime with the dashboard on port 3000. From there you can define agents, pick a topology, and run tasks — through the CLI, the HTTP API, or MCP.

The GitHub repo has the source, quickstart guide, and the full architecture. The arXiv paper (DOI: 10.5281/zenodo.19454219) has the formal treatment — and it's one of 7 papers across the Qualixar research ecosystem covering agent orchestration, reliability testing, memory, evaluation, and skill verification.

If you build agent systems and have opinions about what's missing, I want to hear them. File an issue, start a discussion, or just tell me what's broken. The 7 AI auditors found 76 things. I'm sure humans will find more.

---

Qualixar OS is open source under FSL-1.1 (converts to Apache 2.0 after 2 years). Built by a solo developer. Backed by 7 peer-reviewed papers across agent orchestration, reliability, memory, evaluation, and skill testing. Not funded, not affiliated with any company. Just trying to solve the agent infrastructure problem properly.

