A CLI-first Python research instrument that compares two agent architectures on multi-document synthesis tasks.
Long-running AI agents should not stuff everything into growing context windows. They should execute over document references, bounded source slices, evidence stores, verification, and policy steps. This project measures whether that claim holds — and under what conditions it breaks.
Long-context baseline: concatenate all documents → one giant prompt → model answers → verifier checks the answer → revise once if needed → deterministic policy decides allow, revise, or deny.
Recursive mode: ingest → planner → bounded source slices → evidence cards → synthesizer → verifier → optional revision → deterministic policy gate → final answer with provenance trace.
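The bounded-slice and evidence-card steps could look roughly like the sketch below. It assumes documents are held in memory as plain strings. The `SourceRef` and `EvidenceCard` shapes, the `slice_source` helper, and the 200-character budget are hypothetical illustrations, not the harness's actual schema. The project lists Pydantic v2 in its stack; this sketch uses stdlib dataclasses only to stay dependency-free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRef:
    """Hypothetical reference: a document id plus a character window."""
    doc_id: str
    start: int
    end: int


@dataclass(frozen=True)
class EvidenceCard:
    """Hypothetical evidence card that keeps provenance via source_ref."""
    claim: str
    quote: str
    source_ref: SourceRef


def slice_source(corpus: dict[str, str], ref: SourceRef, max_chars: int = 200) -> str:
    """Return at most max_chars of the referenced window, never the whole corpus."""
    text = corpus[ref.doc_id]
    start = max(0, ref.start)
    end = min(len(text), ref.end, start + max_chars)
    return text[start:end]


# Illustrative in-memory corpus (assumption: real corpora live on disk).
corpus = {"doc-a": "Context grows. " * 50, "doc-b": "Slices stay bounded."}
ref = SourceRef(doc_id="doc-b", start=0, end=6)
card = EvidenceCard(claim="Slices are bounded", quote=slice_source(corpus, ref), source_ref=ref)
print(card.quote, card.source_ref.doc_id)
```

A worker only ever sees the output of `slice_source`, so its prompt size is capped by the slice budget rather than by corpus size, and each card can be traced back to its document through `source_ref`.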
The recursive mode externalizes state into a sequential, reference-indexed execution graph:
The checked-in diagram still shows the high-level flow. The current harness adds one bounded revision pass after failed verification and applies deterministic policy rules over verifier output.
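As a minimal sketch of the verify, revise-once, and policy-gate sequence: the `VerifierReport` fields, the specific rules in `decide_policy`, and the stub verifier and reviser are all assumptions made for illustration. The harness's real rules and schema may differ.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class VerifierReport:
    """Hypothetical verifier output; the real harness schema may differ."""
    unsupported_claims: int
    citation_errors: int


def decide_policy(report: VerifierReport) -> str:
    """Deterministic rules over verifier output (illustrative rules, no model call)."""
    if report.citation_errors > 0:
        return "deny"
    if report.unsupported_claims > 0:
        return "revise"
    return "allow"


def run_with_one_revision(
    answer: str,
    verify: Callable[[str], VerifierReport],
    revise: Callable[[str, VerifierReport], str],
) -> tuple[str, str]:
    """Verify, revise at most once on failure, then apply the policy gate."""
    report = verify(answer)
    if decide_policy(report) != "allow":
        answer = revise(answer, report)  # the single bounded revision pass
        report = verify(answer)
    return answer, decide_policy(report)


# Stubs standing in for LLM-backed verifier and reviser calls.
def fake_verify(answer: str) -> VerifierReport:
    return VerifierReport(unsupported_claims=answer.count("[unsupported]"), citation_errors=0)


def fake_revise(answer: str, report: VerifierReport) -> str:
    return answer.replace(" [unsupported]", "")


final_answer, decision = run_with_one_revision(
    "Claim A. Claim B [unsupported]", fake_verify, fake_revise
)
print(final_answer, "->", decision)
```

Keeping `decide_policy` as plain code means the same verifier report always yields the same decision, which makes the gate auditable independently of the model.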
git clone https://github.com/rmax-ai/recursive-execution-harness-lab
cd recursive-execution-harness-lab
uv sync --extra dev
uv run rxh run --task benchmarks/research_synthesis/tasks/recursive_execution.yaml \
--corpus benchmarks/research_synthesis/corpora/sample \
--mode long-context --model gpt-5.5-thinking --out runs/baseline
uv run rxh run --task benchmarks/research_synthesis/tasks/recursive_execution.yaml \
--corpus benchmarks/research_synthesis/corpora/sample \
--mode recursive --model gpt-5.5-thinking --out runs/recursive
uv run rxh compare runs/baseline runs/recursive
If `uv` is unavailable on `PATH` after setup, use `.venv/bin/rxh` in place of `uv run rxh`.
Stack: Python 3.12+ · Typer CLI · Pydantic v2 · httpx · JSONL traces · Local filesystem · OpenAI-compatible providers
Fraction of major claims backed by evidence cards
Claims the verifier flags as lacking source evidence
Citations that don't match actual documents
Cited sources / relevant sources in the corpus
Post-verification allow, revise, or deny decision
Total input + output across all LLM calls
Required trace events present / total required
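The ratio-style metrics above can be sketched as follows. The function names, the input shapes, and the event names are hypothetical, and reading "cited sources / relevant sources" as the share of relevant sources that were actually cited is an assumption.

```python
def ratio(numerator: int, denominator: int) -> float:
    """Safe division: an empty denominator yields 0.0 instead of raising."""
    return numerator / denominator if denominator else 0.0


def evidence_coverage(claims_to_cards: dict[str, list[str]]) -> float:
    """Fraction of major claims backed by at least one evidence card."""
    backed = sum(1 for cards in claims_to_cards.values() if cards)
    return ratio(backed, len(claims_to_cards))


def source_recall(cited: set[str], relevant: set[str]) -> float:
    """Relevant sources that were cited, over all relevant sources."""
    return ratio(len(cited & relevant), len(relevant))


def trace_completeness(events: list[str], required: set[str]) -> float:
    """Required trace events present, over total required events."""
    return ratio(len(required & set(events)), len(required))


coverage = evidence_coverage({"c1": ["e1"], "c2": [], "c3": ["e2", "e3"], "c4": ["e4"]})
recall = source_recall({"a", "b", "x"}, {"a", "b", "c", "d"})
completeness = trace_completeness(["plan", "verify"], {"plan", "verify", "policy"})
print(coverage, recall, round(completeness, 3))
```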
Every credible measurement instrument documents its limitations. These are ours:
This project does not propose a new foundation model or a new agent framework. It proposes a measurement harness for an architectural question: when should long-running agents rely on larger context, and when should they externalize state into recursive execution, bounded source slicing, evidence stores, explicit verification, and runtime policy gates?
This project is informed by the emerging consensus that long-running agents need architectural patterns beyond raw context scaling.
Jackman Ong · Founding Research Engineer @ Prime Intellect
Makes the case for Recursive Language Models (RLMs): agents that pass references into context instead of raw text, write code to access data programmatically, and delegate to sub-agents via control flow. Documents 'context rot', where models with 1M-token windows lose ~50% of their reasoning capability. Frames RLMs as 'the next thinking': what chain-of-thought was to 2022, programmatic context manipulation will be to agent architectures.
| Concept | Talk Insight | In This Project |
|---|---|---|
| Context Rot | Models drop from ~80% to ~36% on information retrieval as context grows to 1M tokens (MRCR benchmark) | The long-context baseline mode measures a single-prompt workflow under a fixed budget; it does not prove degradation by itself |
| Reference-Indexed Execution | Pass document identifiers through the workflow instead of assigning the whole corpus to every step | Planner-assigned refs drive bounded source-slice retrieval, and EvidenceCard.source_ref preserves provenance back to the original document |
| Compaction Avoidance | "Every time you end up with a compaction, the agent gets lost" | Recursive mode delegates to bounded sub-agents, so no compaction step is needed |
| Programmatic Control Flow | Use for loops for 10,000 docs instead of 10,000 sequential tool calls | Planner → sequential bounded workers → synthesizer → verifier → optional revision |
| Verification Gates | LLM-as-judge on trajectories; detect unsupported claims | Both baseline and recursive runs are verified against source snippets keyed by source_ref, and failing answers get one revision pass before final policy |
| Runtime Policy Gates | Evaluate whether a verified answer is still safe to deliver | Both modes write a post-verification policy_decision.json artifact, but the decision itself is deterministic code over verifier output |
| Continual Learning | Harvest traces + feedback → train models on their specific harness | JSONL trace output captures every LLM call for future training loops |
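To illustrate the JSONL trace idea from the last row, here is a minimal sketch that appends one JSON object per LLM call and then totals tokens from the file. The field names (`event`, `step`, `input_tokens`, `output_tokens`) and the helper functions are assumptions, not the harness's actual trace schema.

```python
import json
import tempfile
from pathlib import Path


def append_event(path: Path, event: dict) -> None:
    """Append one event as a single JSON line (the JSONL convention)."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, sort_keys=True) + "\n")


def read_events(path: Path) -> list[dict]:
    """Read the trace back, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical trace location; a temp directory keeps the example side-effect free.
trace_path = Path(tempfile.mkdtemp()) / "trace.jsonl"
append_event(trace_path, {"event": "llm_call", "step": "planner",
                          "input_tokens": 120, "output_tokens": 40})
append_event(trace_path, {"event": "llm_call", "step": "synthesizer",
                          "input_tokens": 300, "output_tokens": 90})

events = read_events(trace_path)
total_tokens = sum(e["input_tokens"] + e["output_tokens"] for e in events)
print(len(events), total_tokens)
```

Because each line is an independent JSON object, a trace can be appended to during a run and later streamed line by line for metrics or training data without loading the whole file.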