Engineering · April 18, 2026 · 12 min read

Our Multi-Agent Pipeline Explained

WarpFix uses six specialized agents working in sequence: a log parser, a classifier, a fingerprint engine, a patch generator, a sandbox validator, and a combined confidence scorer and pull request agent. Here is why we chose this architecture over a single monolithic LLM call.


WarpFix Engineering

Platform Architecture Team

Why Not Just One Big LLM Call?

The simplest approach to automated CI repair would be: take the entire CI log, send it to GPT-4 with the prompt "fix this," and apply whatever it returns. Some tools do exactly this.

This approach does not work reliably, for three reasons:

  1. Context window limits: CI logs can be thousands of lines. Stuffing everything into one prompt wastes tokens on irrelevant noise and often exceeds context limits.
  2. Compound errors: A single failure often has multiple root causes. One prompt cannot reliably triage, prioritize, and fix them all.
  3. No verification: The LLM might produce a syntactically valid patch that introduces a new bug. Without validation, you are shipping untested code changes.

WarpFix solves this with a multi-agent pipeline — six specialized agents, each responsible for one stage of the repair process.

The Six Agents

1. Log Parser

The Log Parser receives raw CI output and produces structured, noise-free diagnostic data. It handles:

- Stripping ANSI codes, timestamps, and progress indicators

- Identifying error boundaries (where one error ends and another begins)

- Extracting stack traces, file paths, and line numbers

- Detecting the build system (GitHub Actions, CircleCI, Jenkins, GitLab CI)

- Normalizing output format across CI providers

The parser uses deterministic rules, not an LLM. This keeps it fast (under 100ms) and predictable.
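
A minimal sketch of this stage in Python, assuming regex-based ANSI stripping and Python-style traceback locations (the production parser covers many more formats and CI providers):

```python
import re

# CI runners colorize output; this matches the common CSI escape sequences.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")
# Matches the Python traceback location format: File "path", line N
LOCATION_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+)')

def parse_log(raw: str) -> list[dict]:
    """Strip terminal noise, then extract (file, line) diagnostics."""
    clean = ANSI_RE.sub("", raw)
    return [
        {"path": m["path"], "line": int(m["line"])}
        for m in LOCATION_RE.finditer(clean)
    ]

log = '\x1b[31mFAILED\x1b[0m\n  File "src/app.py", line 12, in main\n'
print(parse_log(log))  # [{'path': 'src/app.py', 'line': 12}]
```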

2. Classifier

The Classifier takes the parsed output and categorizes the failure:

  • Type: compilation error, test failure, dependency conflict, runtime exception, linting violation, infrastructure issue
  • Severity: critical, high, medium, low
  • Affected files: which source files are implicated
  • Suggested approach: direct fix, dependency update, configuration change, or "needs human investigation"

The classifier uses a lightweight LLM call (Claude Haiku or equivalent) because it needs semantic understanding of error messages, but the output is structured JSON — not free-form text.
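
For illustration, a simplified version of that structured output as a Python dataclass; the field names and values below are illustrative, not our production schema:

```python
from dataclasses import dataclass, field

@dataclass
class Classification:
    failure_type: str              # e.g. "compilation", "test_failure", "dependency"
    severity: str                  # "critical" | "high" | "medium" | "low"
    affected_files: list[str] = field(default_factory=list)
    suggested_approach: str = "needs_human_investigation"

example = Classification(
    failure_type="test_failure",
    severity="high",
    affected_files=["src/auth/session.py"],
    suggested_approach="direct_fix",
)
```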

3. Fingerprint Engine

Before generating any fix, the fingerprint engine checks whether this exact error pattern has been seen before. If a high-confidence match exists, the pipeline can skip the expensive patch generation step entirely and use the cached fix.

This agent is purely database-driven — no LLM involved.
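
The core idea fits in a few lines: normalize away the volatile parts of an error, then hash what remains. A simplified sketch (our production normalizer handles many more patterns):

```python
import hashlib
import re

def fingerprint(error_text: str) -> str:
    """Hash a normalized error so identical root causes collide even
    when line numbers, counts, or addresses differ between runs."""
    normalized = re.sub(r"0x[0-9a-fA-F]+", "ADDR", error_text)  # pointers
    normalized = re.sub(r"\d+", "N", normalized)                # line numbers, counts
    return hashlib.sha256(normalized.encode()).hexdigest()

cache: dict[str, str] = {}  # fingerprint -> previously validated patch

fp = fingerprint('ImportError in "app.py", line 42')
patch = cache.get(fp)  # on a hit, reuse the fix and skip generation
```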

4. Patch Generator

For novel failures (no fingerprint match), the patch generator creates a fix. This is where the heavy LLM work happens:

- It receives the classified error, the affected source files (fetched from the repository via the GitHub API), and any similar fingerprints as context

- It produces a complete file-level patch with the minimum changes needed to fix the error

- It follows strict guardrails: no new files, no test modifications, no debug statements, no configuration changes without explicit approval

The patch generator uses the most capable model available (currently Claude Sonnet) because fix quality directly affects user trust.
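
To make the guardrails concrete, here is a deliberately naive check over a unified diff; the path and pattern heuristics below are simplifications, not our actual enforcement rules:

```python
def guardrail_violations(diff: str) -> list[str]:
    """Flag patches that break the guardrails listed above."""
    violations = []
    for line in diff.splitlines():
        if line.startswith("--- /dev/null"):
            violations.append("creates a new file")
        elif line.startswith("+++ ") and "/tests/" in line:
            violations.append("modifies a test file")
        elif line.startswith("+") and not line.startswith("+++") and "print(" in line:
            violations.append("adds a debug statement")
    return violations

diff = "--- a/src/app.py\n+++ b/src/app.py\n+    print(value)\n"
print(guardrail_violations(diff))  # ['adds a debug statement']
```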

5. Sandbox Validator

The validator runs the generated patch in an isolated Docker container that mirrors the repository's CI environment. It:

- Applies the patch to a fresh clone of the repository

- Runs the failing CI command

- Verifies the specific error is resolved

- Checks that no new errors were introduced

- Measures execution time and resource usage

If validation fails, the pipeline can retry with adjusted parameters or escalate to human review. The sandbox is destroyed after each run — no state leaks between validations.
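
A stripped-down sketch of the validation step; the image name and timeout below are placeholders, not our production configuration:

```python
import subprocess

def validate(patched_clone: str, ci_command: str) -> bool:
    """Re-run the failing CI command inside a throwaway container;
    --rm destroys it afterward, so no state leaks between runs."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{patched_clone}:/work",   # fresh clone with the patch applied
            "-w", "/work",
            "ci-mirror:latest",               # placeholder image mirroring the CI env
            "sh", "-c", ci_command,
        ],
        capture_output=True,
        timeout=600,                          # placeholder per-run budget
    )
    return result.returncode == 0
```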

6. Confidence Scorer and Pull Request Agent

The final agent combines signals from all previous stages to produce a confidence score:

- Fingerprint match strength (+30 points for exact match)

- Sandbox validation result (+25 points if tests pass)

- Patch complexity (-5 points per file changed, -10 for large diffs)

- Historical success rate for this error type

- Repository-specific learning (does this org tend to accept or modify similar fixes?)

Based on the score, the agent decides how to ship the fix:

  • High confidence (85+): Open PR with auto-merge label (if the org has opted in)
  • Medium confidence (60-84): Open PR for human review with detailed explanation
  • Low confidence (below 60): Comment on the failing workflow with analysis and suggested fix, but do not open a PR
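
Putting the weights and thresholds above together, a simplified scorer looks like this; the baseline, and how the historical and per-repo signals fold in, are illustrative assumptions:

```python
def confidence_score(
    exact_fingerprint: bool,
    sandbox_passed: bool,
    files_changed: int,
    large_diff: bool,
    base: int = 50,  # assumed baseline folding in historical and per-repo signals
) -> int:
    score = base
    if exact_fingerprint:
        score += 30
    if sandbox_passed:
        score += 25
    score -= 5 * files_changed
    if large_diff:
        score -= 10
    return score

def shipping_action(score: int) -> str:
    if score >= 85:
        return "open_pr_with_auto_merge"   # if the org has opted in
    if score >= 60:
        return "open_pr_for_review"
    return "comment_with_analysis"

print(shipping_action(confidence_score(True, True, 1, False)))  # open_pr_with_auto_merge
```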

Why This Architecture Works

The multi-agent approach provides several advantages over a monolithic system:

Cost efficiency: Only the patch generator uses expensive LLM calls. The parser, fingerprint engine, and validator are deterministic. For fingerprinted failures (70%+ of cases), the total LLM cost is near zero.

Debuggability: When a fix is wrong, we can trace exactly which agent made the mistake. Was the error misclassified? Was the patch correct but the sandbox environment misconfigured? This granularity is impossible with a single prompt.

Independent improvement: Each agent can be upgraded independently. We can swap the classifier model without touching the patch generator. We can improve the sandbox without changing the parser.

Parallelism: Some agents can run concurrently. While the patch generator works, the fingerprint engine can pre-load similar patterns. While the sandbox validates, the confidence scorer can compute partial scores.
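
A toy illustration of that overlap using asyncio; the sleeps stand in for a slow LLM call and a fast database query, and the function names are hypothetical:

```python
import asyncio

async def generate_patch(classified: dict) -> str:
    await asyncio.sleep(2.0)   # stands in for the slow LLM call
    return "patch"

async def preload_similar_fingerprints(classified: dict) -> list[str]:
    await asyncio.sleep(0.1)   # stands in for a fast database query
    return ["fp-a1", "fp-b2"]

async def repair(classified: dict):
    # Wall time is ~2.0s instead of ~2.1s run sequentially.
    patch, similar = await asyncio.gather(
        generate_patch(classified),
        preload_similar_fingerprints(classified),
    )
    return patch, similar

patch, similar = asyncio.run(repair({"type": "test_failure"}))
```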

Real-World Performance

Across our production workload:

  • Median repair time: 47 seconds (from webhook trigger to PR opened)
  • Fingerprint cache hit rate: 72%
  • Sandbox pass rate: 89% (patches that pass validation on first try)
  • Human acceptance rate: 94% (PRs merged without modification)
  • LLM cost per repair: $0.03 average (due to fingerprint caching)