Benchmark B11

Context Recovery
5/6 with hint. Auto-detection restored in v0.2.16.

An agent writes 5 project facts to Iranti during Session 1. Session 2 starts with zero in-context knowledge — no system prompt, no prior conversation. The agent calls iranti_observe and attempts to reconstruct full working context from persistent memory alone.

Executed 2026-03-21 · 5 facts · 3 conditions · v0.2.16: auto-detection fixed, 1/6 gap unexplained

Results at a glance

5/5 · Facts recovered in full-recovery condition (exact entity)
3/5 · Facts recovered in partial-recovery condition (partial pattern)
0/5 · Cold start baseline: no observe() call, no memory
5/6 · Auto-detection (v0.2.16): no hints required

Finding: When the entity name matched exactly, iranti_observe injected all 5 facts and the agent answered correctly. Partial entity patterns yielded partial recovery. No observe call meant no memory: the agent fell back to training data.

What this measures

LLM agents have no persistent memory across sessions by default. When a session ends, everything in context is gone. The agent in the next session starts from scratch — it may hallucinate, pull from training data, or simply ask the user to re-explain what was already worked out.

B11 tests whether iranti_observe can serve as a reliable context recovery mechanism. The scenario is realistic: a project called Aurora has an ongoing technical workstream. 5 structured facts are written during Session 1. In Session 2, a new agent instance starts cold and must reconstruct working context without any human re-briefing.

Three conditions are tested: full recovery (exact entity match), partial recovery (pattern match returning a subset), and a cold start baseline (no memory call at all). The baseline is definitional — an agent that does not query memory cannot recover context from persistent storage.

Two named failure modes — entity drift and over-injection — were identified as sharp edges. Both are documented below.

The context recovery flow

Session 1 writes 5 facts via iranti_write. After the session break, Session 2 starts cold and calls iranti_observe to reconstruct context from the KB. No shared context exists between sessions except what is stored in Iranti.

Session 1 --iranti_write x5--> Iranti KB (project/aurora) --iranti_observe--> Session 2
(session break: zero in-context knowledge at Session 2 start)
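The flow can be sketched with a minimal in-memory stand-in for the Iranti KB. The `KB` class and its `write`/`observe` methods are illustrative assumptions that model the tool's observed behavior, not the real iranti API:

```python
# Minimal stand-in for the Iranti KB: a dict mapping entity -> {key: value}.
# KB, write, and observe are illustrative names, not the real iranti API.

class KB:
    def __init__(self):
        self.store = {}

    def write(self, entity, key, value):
        self.store.setdefault(entity, {})[key] = value

    def observe(self, entity):
        # Exact-match lookup: a miss returns {} rather than raising an error.
        return dict(self.store.get(entity, {}))

kb = KB()

# Session 1: write the 5 project facts.
facts = {
    "lead_architect": "Marcus Chen",
    "stack": "Rust + WebAssembly",
    "target_platform": "browser-native execution",
    "current_milestone": "M3: performance benchmarking",
    "risk_flag": "WASM memory model edge cases",
}
for k, v in facts.items():
    kb.write("project/aurora", k, v)

# --- session break: agent context is gone; only the KB persists ---

# Session 2: cold start, reconstruct context via observe.
recovered = kb.observe("project/aurora")
print(f"{len(recovered)}/5 facts recovered")  # 5/5 on an exact entity match
```

The only state that survives the break is `kb.store`, which is the point: Session 2 has nothing else to recover from.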

The written context — project/aurora

5 structured facts written to project/aurora during Session 1. These are the facts the agent in Session 2 must recover via iranti_observe.

Key (under project/aurora) | Value written
lead_architect | "Marcus Chen"
stack | "Rust + WebAssembly"
target_platform | "browser-native execution"
current_milestone | "M3: performance benchmarking"
risk_flag | "WASM memory model edge cases"

Recovery conditions

Three conditions tested across the same fact set. Each condition varies what observe() was called with — or whether it was called at all.

Condition | observe() call | Facts matched | Agent correct | Score | Result
Full recovery | Yes, exact entity | 5/5 | Yes | 5/5 | PASS
Partial recovery | Yes, partial pattern | 3/5 | Partially | 3/5 | PARTIAL
Cold start (baseline) | No | 0/5 | No | 0/5 | FAIL
Auto-detection (v0.2.16) | Yes, no hint (auto) | 5/6 | Partially | 5/6 | PARTIAL
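The scoring conditions can be sketched under one assumption: that partial recovery comes from a glob-style pattern selecting a subset of stored facts. The `observe` helper and its `fnmatch` filters are illustrative, not the real iranti semantics:

```python
from fnmatch import fnmatch

# In-memory stand-in for the KB (illustrative, not the real iranti API).
store = {
    "project/aurora": {
        "lead_architect": "Marcus Chen",
        "stack": "Rust + WebAssembly",
        "target_platform": "browser-native execution",
        "current_milestone": "M3: performance benchmarking",
        "risk_flag": "WASM memory model edge cases",
    }
}

def observe(entity_pattern, key_pattern="*"):
    """Return facts from every entity matching the glob-style pattern."""
    out = {}
    for entity, facts in store.items():
        if fnmatch(entity, entity_pattern):
            out.update({k: v for k, v in facts.items() if fnmatch(k, key_pattern)})
    return out

# Full recovery: exact entity, all keys -> 5/5.
full = observe("project/aurora")
assert len(full) == 5

# Partial recovery: a narrower pattern returns only a subset.
# "*e*" is an arbitrary filter chosen to match 3 of the 5 keys.
subset = observe("project/aurora", key_pattern="*e*")

# Cold start: observe is never called, so the agent holds 0/5 facts.
```

The baseline condition needs no code at all, which is exactly the definitional point made above.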

Facts recovered per condition

Visual comparison of how many of the 5 project facts were injected into context under each recovery condition.

Full recovery: 5/5
Partial recovery: 3/5
Cold start (baseline): 0/5

[bar chart: bar width = fraction of facts recovered; full width = 5/5, empty bar = 0/5 baseline with no observe call]

Named failure modes

Two sharp edges were identified during B11. Both are structural properties of how iranti_observe works, not transient bugs.

F1: Entity drift

If the agent uses a slightly different entity name at retrieve time — for example, project/aurora at write time but project/aurora_v2 at observe time — iranti_observe returns nothing. There is no fuzzy matching.

This is a sharp edge for long-running projects where entity naming evolves. The agent must use a consistent, stable entity identifier at both write and observe time. Any drift in naming convention — a suffix, a version tag, a case difference — will silently produce zero context injection with no error signal.
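A minimal model of why drift is silent: lookup is an exact string match, and a miss returns an empty result rather than an error. The `observe` helper below is a stand-in for illustration, not the real API:

```python
# Stand-in KB with two facts under the entity used at write time.
store = {
    "project/aurora": {
        "lead_architect": "Marcus Chen",
        "stack": "Rust + WebAssembly",
    }
}

def observe(entity):
    # Exact string match only: no fuzzy matching, no error signal on a miss.
    return dict(store.get(entity, {}))

assert len(observe("project/aurora")) == 2   # exact name: facts injected
assert observe("project/aurora_v2") == {}    # version suffix: silent empty result
assert observe("Project/Aurora") == {}       # case drift: silent empty result
```

From the agent's point of view, an empty result is indistinguishable from "nothing was ever written," which is what makes this failure mode sharp.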

F2: Over-injection

iranti_observe injects all matching facts regardless of their relevance to the current task. In a large knowledge base with many stored facts under a broad entity pattern, this can flood the agent's context window with unrelated information.

In B11's controlled scenario (5 facts, single project), this was not a problem. In production use with hundreds of stored facts per project entity, over-injection may consume significant context budget and degrade response quality by introducing irrelevant facts that the agent must filter mentally. Task-scoped observe patterns are advisable in dense KBs.
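One mitigation can be sketched as client-side scoping, assuming a key-prefix filter is available or emulated before injection. The `key_prefix` parameter is hypothetical; it is not a documented iranti option:

```python
# Stand-in KB: a dense project entity with ~200 accumulated facts.
store = {"project/aurora": {f"decision_{i}": f"archived decision {i}" for i in range(200)}}
store["project/aurora"].update({
    "current_milestone": "M3: performance benchmarking",
    "risk_flag": "WASM memory model edge cases",
})

def observe(entity, key_prefix=""):
    """Return matching facts, optionally scoped by a key prefix.

    key_prefix is a hypothetical filter illustrating task-scoped observation.
    """
    facts = store.get(entity, {})
    return {k: v for k, v in facts.items() if k.startswith(key_prefix)}

everything = observe("project/aurora")       # 202 facts: floods the context window
scoped = observe("project/aurora", "risk_")  # 1 fact relevant to a risk-review task
```

The scoped call trades recall for context budget: the agent sees only what the current task needs instead of mentally filtering hundreds of injected facts.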

Honest limitations

Limitation: Small n. Three conditions are enough to establish that observe()-based recovery works and to identify structural failure modes. They are not enough to characterize reliability across diverse project types, large knowledge bases, or adversarial entity naming patterns.
Limitation: Single-session simulation. The session break was simulated within a single benchmark program, not tested across genuinely separate processes or API sessions. True cross-session isolation, where each session is a separate network call with no shared state, was not verified here.
Limitation: Entity naming sensitivity is a sharp edge. The partial-recovery condition (3/5) was produced by a pattern match returning a subset, not by a naming mismatch. A full entity drift scenario (F1) would have produced 0/5 with no error signal. This risk is real and undercommunicated by the headline score.
Note: Definitional baseline. The cold start 0/5 result is not an empirical failure of Iranti; it is the expected result when the tool is not called. This baseline exists to establish what context recovery without memory looks like, not to benchmark a competing system.

Key findings

Finding: Exact entity match yields full recovery. When the agent observed project/aurora with the exact entity name used at write time, all 5 facts were injected and the agent answered correctly. Context recovery worked as designed.
Finding: Entity naming is load-bearing. The difference between 5/5 and 0/5 recovery is a single string match. There is no fuzzy resolution, no intent-based lookup, no synonym handling. Stable, explicit entity identifiers are a first-class operational requirement for observe()-based workflows.
Finding: Over-injection is a production concern, not a test concern. In B11's small KB, inject-all behavior caused no harm. In production KBs with hundreds of facts per project, context budget management requires task-scoped entity patterns or explicit observe filters.
Finding: Cold start baseline confirms the value of the tool. Without iranti_observe, the agent had 0/5 factual grounding and answered from training data. The observe call is not optional for cross-session continuity; it is the mechanism.

v0.2.16 Update: Auto-Detection Fixed

2026-03-21
Capability | v0.2.12 | v0.2.14 | v0.2.16
iranti_attend classifier | Broken (silent fail) | Fixed | Fixed
Entity auto-detection | Broken (0 candidates) | Broken | Fixed (confidence 0.82)
observe + hint | 5/6 | 5/6 | 5/6
observe (auto, no hint) | 0/6 | 0/6 | 5/6

sla_uptime: the 1/6 gap remains unexplained; the slash-handling defect was retracted.
Finding: Auto-detection is the most significant improvement in v0.2.16. Previously, the system could not identify the correct entity from raw context text: it returned 0 candidates, forcing every observe call to carry an explicit hint. Now the classifier resolves the entity automatically at confidence 0.82, returning 5/6 facts with no hint required. The iranti_attend pipeline now works end-to-end for the first time.
Note: 1/6 gap unexplained. The benchmark recorded a parse_error/invalid_json debug entry for one dropped fact (sla_uptime). Fresh v6.0 revalidation confirmed that slash-bearing values return correctly through query, search, observe, and attend, so slash handling is retracted as a product defect. The 1/6 gap remains and its root cause is currently unexplained.
Note: Attend noise from typescript_smoke resolved. A test-artifact entry (user/main/favorite_city, source: typescript_smoke) surfaced during v0.2.16 benchmarking under forceInject with no entityHints. This has since been resolved upstream; it is not an open product issue.
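The shape of the auto-detection problem can be sketched as token-overlap scoring over known entities. This is purely illustrative: the real v0.2.16 classifier and its 0.82 confidence score are internal to Iranti, and `auto_detect` is an invented helper:

```python
def auto_detect(context_text, entities, threshold=0.5):
    """Pick the stored entity whose name tokens best overlap the raw context.

    Illustrative only: shows the shape of entity auto-detection, not the
    actual iranti_attend classifier.
    """
    ctx = set(context_text.lower().replace("/", " ").split())
    best, best_score = None, 0.0
    for entity in entities:
        tokens = set(entity.lower().replace("/", " ").replace("_", " ").split())
        score = len(tokens & ctx) / len(tokens)
        if score > best_score:
            best, best_score = entity, score
    # Below the confidence threshold, fall back to requiring an explicit hint.
    return (best, best_score) if best_score >= threshold else (None, best_score)

entities = ["project/aurora", "project/borealis", "user/main"]
entity, conf = auto_detect("Resuming work on the Aurora project milestones", entities)
# entity == "project/aurora": both name tokens appear in the context text
```

The threshold fallback mirrors the pre-v0.2.16 behavior: when no candidate scores confidently, the caller must supply a hint.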
Raw data

Full trial execution records, session logs, entity payloads, and methodology notes in the benchmarking repository.