v0.1.0 · 50 tests · MIT

Recursive Execution Harness Lab

A CLI-first Python research instrument that compares two agent architectures on multi-document synthesis tasks.

When does recursive, reference-indexed bounded execution outperform naive long-context prompting for long-running agent tasks?

The working claim: long-running AI agents should not stuff everything into growing context windows. They should execute over document references, bounded source slices, evidence stores, verification, and policy steps. This project measures whether that claim holds, and under what conditions it breaks.

Execution Modes Compared

Mode A: Long-Context

Concatenate all documents → one giant prompt → model answers → verifier checks the answer → revise once if needed → deterministic policy decides allow, revise, or deny.

Simple and common, but it degrades with scale. Risks: attention dilution, lost-in-the-middle effects, and no intermediate evidence artifacts before verification.
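As a rough illustration of Mode A, the sketch below concatenates documents into one prompt under a fixed character budget and reports which documents did not fit. The function name `build_long_context_prompt`, the `max_chars` parameter, and the `[doc_id]` header format are assumptions made for this example, not the harness's actual API.

```python
def build_long_context_prompt(
    question: str, docs: dict[str, str], max_chars: int
) -> tuple[str, list[str]]:
    """Concatenate whole documents into one prompt.

    Returns the prompt plus the ids of documents dropped because they
    did not fit in the remaining budget (hypothetical behaviour).
    """
    parts: list[str] = []
    dropped: list[str] = []
    remaining = max_chars
    for doc_id, text in docs.items():
        block = f"[{doc_id}]\n{text}\n"
        if len(block) <= remaining:
            parts.append(block)
            remaining -= len(block)
        else:
            # The baseline loses this document entirely once the budget is spent.
            dropped.append(doc_id)
    prompt = "".join(parts) + f"\nQuestion: {question}\n"
    return prompt, dropped
```

Tracking the dropped ids makes it possible to tell a weak answer apart from an answer that never saw the relevant source.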

Mode B: Recursive

Ingest → planner → bounded source slices → evidence cards → synthesizer → verifier → optional revision → deterministic policy gate → final answer with provenance trace.

More calls. More inspectable. Workers operate on bounded source slices, invalid provenance is rejected, and the delivered answer passes verification before policy.
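One way to picture "invalid provenance is rejected" is a check that every evidence card points at a real slice of a real document. The sketch below is a minimal stand-in: the dataclass fields `claim` and `quote`, the `doc_id:start-end` reference format, and the function `is_valid_provenance` are assumptions for illustration. Only the `EvidenceCard.source_ref` name comes from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceCard:
    claim: str
    quote: str
    source_ref: str  # assumed format: "<doc_id>:<start>-<end>" (character offsets)


def is_valid_provenance(card: EvidenceCard, corpus: dict[str, str]) -> bool:
    """Accept a card only if its quote appears inside the referenced slice."""
    try:
        doc_id, span = card.source_ref.rsplit(":", 1)
        start_s, end_s = span.split("-", 1)
        start, end = int(start_s), int(end_s)
    except ValueError:
        return False  # malformed reference
    text = corpus.get(doc_id)
    if text is None or not (0 <= start < end <= len(text)):
        return False  # unknown document or out-of-range slice
    return card.quote in text[start:end]
```

Cards that fail the check would be discarded before synthesis, so the synthesizer only sees evidence that traces back to the corpus.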

Pipeline Architecture

The recursive mode externalizes state into a sequential, reference-indexed execution graph:

Recursive execution pipeline: Corpus → Document Refs → Planner → Source Slices → Evidence Cards → Synthesizer → Verifier → Revision if needed → Deterministic Policy Gate → Final Answer

The checked-in diagram still shows the high-level flow. The current harness adds one bounded revision pass after failed verification and applies deterministic policy rules over verifier output.
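The verify, revise once, then gate sequence can be sketched as plain control flow. Everything named here (`VerifierReport`, `policy_gate`, `deliver`, and the specific rule that attribution errors deny while unsupported claims ask for revision) is a hypothetical illustration, not the harness's real rule set. It only shows that the final decision can be ordinary deterministic code over verifier output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class VerifierReport:
    unsupported_claims: int
    attribution_errors: int

    @property
    def passed(self) -> bool:
        return self.unsupported_claims == 0 and self.attribution_errors == 0


def policy_gate(report: VerifierReport) -> str:
    """Deterministic rules over verifier output (thresholds are assumed)."""
    if report.attribution_errors > 0:
        return "deny"
    if report.unsupported_claims > 0:
        return "revise"
    return "allow"


def deliver(
    answer: str,
    verify: Callable[[str], VerifierReport],
    revise: Callable[[str, VerifierReport], str],
    max_revisions: int = 1,
) -> tuple[str, str]:
    """Verify, revise at most `max_revisions` times, then apply the gate."""
    report = verify(answer)
    for _ in range(max_revisions):
        if report.passed:
            break
        answer = revise(answer, report)
        report = verify(answer)
    return answer, policy_gate(report)
```

With `max_revisions=1`, an answer that needs two fixes still ends in `revise`, which mirrors the single-pass cap described above.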

Quickstart

git clone https://github.com/rmax-ai/recursive-execution-harness-lab
cd recursive-execution-harness-lab
uv sync --extra dev

uv run rxh run --task benchmarks/research_synthesis/tasks/recursive_execution.yaml \
  --corpus benchmarks/research_synthesis/corpora/sample \
  --mode long-context --model gpt-5.5-thinking --out runs/baseline

uv run rxh run --task benchmarks/research_synthesis/tasks/recursive_execution.yaml \
  --corpus benchmarks/research_synthesis/corpora/sample \
  --mode recursive --model gpt-5.5-thinking --out runs/recursive

uv run rxh compare runs/baseline runs/recursive

If uv is unavailable on PATH after setup, use .venv/bin/rxh in place of uv run rxh.

Stack: Python 3.12+ · Typer CLI · Pydantic v2 · httpx · JSONL traces · Local filesystem · OpenAI-compatible providers

Key Metrics

Claim Support Rate

Fraction of major claims backed by evidence cards

Unsupported Claims

Claims the verifier flags as lacking source evidence

Source Attribution Errors

Citations that don't match actual documents

Evidence Coverage

Cited sources / relevant sources in the corpus

Policy Decision

Post-verification allow, revise, or deny decision

Token Usage

Total input + output across all LLM calls

Trace Completeness

Required trace events present / total required
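The ratio-style metrics above reduce to simple set and count arithmetic. The following is a minimal sketch under assumptions: the function names, the choice to count only cited sources that are also relevant, and the convention of returning 0.0 for an empty denominator are illustrative, not taken from the harness.

```python
def _ratio(numerator: int, denominator: int) -> float:
    # Assumed convention: an empty denominator yields 0.0 instead of raising.
    return numerator / denominator if denominator else 0.0


def claim_support_rate(claim_supported: list[bool]) -> float:
    """Fraction of major claims backed by at least one evidence card."""
    return _ratio(sum(claim_supported), len(claim_supported))


def evidence_coverage(cited: set[str], relevant: set[str]) -> float:
    """Cited relevant sources divided by all relevant sources in the corpus."""
    return _ratio(len(cited & relevant), len(relevant))


def trace_completeness(events: list[str], required: set[str]) -> float:
    """Required trace events that are present, divided by total required."""
    return _ratio(len(required & set(events)), len(required))
```

For example, three supported claims out of four give a claim support rate of 0.75, and citing two of four relevant sources gives an evidence coverage of 0.5 even if an irrelevant source is also cited.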

Threats to Validity

Every credible measurement instrument documents its limitations. These are ours:

  1. The verifier is model-based and may share blind spots with the generator.
  2. The corpus may favor one architecture over another.
  3. Recursive execution uses more explicit scaffolding, so prompt clarity may improve results independently of architecture.
  4. The baseline may be disadvantaged if context limits force document truncation, even though both modes now use the same verifier step.
  5. The revision loop is currently capped at one pass, so some fixable answers may still end in revise or deny.
  6. Results from research synthesis may not generalize to coding, customer support, or enterprise workflows.
  7. Token cost may vary by provider and caching strategy.
  8. Better long-context models may reduce the observed gap.

This project does not propose a new foundation model or a new agent framework. It proposes a measurement harness for an architectural question: when should long-running agents rely on larger context, and when should they externalize state into recursive execution, bounded source slicing, evidence stores, explicit verification, and runtime policy gates?

Inspiration & References

This project is informed by the emerging consensus that long-running agents need architectural patterns beyond raw context scaling.

Continual Learning for Long-Running Agents

Jackman Ong · Founding Research Engineer @ Prime Intellect

NVIDIA GTC 2026

Makes the case for Recursive Language Models (RLMs): agents that pass references into context instead of raw text, write code to access data programmatically, and delegate to sub-agents via control flow. Documents 'context rot': models with 1M-token context windows that lose ~50% of their reasoning capability. Frames RLMs as 'the next thinking': what chain-of-thought was to 2022, programmatic context manipulation will be to agent architectures.

Key Concepts → Manifestation in This Harness

| Concept | Talk Insight | In This Project |
| --- | --- | --- |
| Context Rot | Models drop from ~80% → ~36% on information retrieval as context grows to 1M tokens (MRCR benchmark) | The long-context baseline mode measures a single-prompt workflow under a fixed budget; it does not prove degradation by itself |
| Reference-Indexed Execution | Pass document identifiers through the workflow instead of assigning the whole corpus to every step | Planner-assigned refs drive bounded source-slice retrieval, and EvidenceCard.source_ref preserves provenance back to the original document |
| Compaction Avoidance | "Every time you end up with a compaction, the agent gets lost" | Recursive mode delegates to bounded sub-agents; no compaction needed |
| Programmatic Control Flow | Use for loops for 10,000 docs instead of 10,000 sequential tool calls | Planner → sequential bounded workers → synthesizer → verifier → optional revision |
| Verification Gates | LLM-as-judge on trajectories; detect unsupported claims | Both baseline and recursive runs are verified against source snippets keyed by source_ref, and failing answers get one revision pass before final policy |
| Runtime Policy Gates | Evaluate whether a verified answer is still safe to deliver | Both modes write a post-verification policy_decision.json artifact, but the decision itself is deterministic code over verifier output |
| Continual Learning | Harvest traces + feedback → train models on their specific harness | JSONL trace output captures every LLM call for future training loops |
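
To make the JSONL trace idea concrete, the sketch below appends one JSON object per LLM call to a trace file and sums token usage from it afterwards. The field names `input_tokens` and `output_tokens`, and the helper names `append_event` and `total_tokens`, are assumptions for illustration; the harness's actual trace schema may differ.

```python
import json
from pathlib import Path


def append_event(path: Path, event: dict) -> None:
    """Append one trace event as a single JSON line."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")


def total_tokens(path: Path) -> int:
    """Sum input and output tokens across all events in a JSONL trace."""
    total = 0
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            event = json.loads(line)
            # Events without token counts (hypothetical non-LLM steps) add zero.
            total += event.get("input_tokens", 0) + event.get("output_tokens", 0)
    return total
```

Because each line is an independent JSON object, a run that crashes midway still leaves a readable trace up to the last completed call.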