Benchmark B12

Interrupted Session Recovery
Data always survives. Recovery depends on retrieval.

B12 tests whether an AI agent can recover its working state after a session interruption. An agent wrote 8 facts to Iranti during a session, which was then interrupted. A fresh session opened and recovery was tested four ways. The finding: write durability is solid — all 8 facts survive — but how many you get back depends entirely on how the new session retrieves them.

Executed 2026-03-21 · n=8 facts, 4 modalities · v0.2.16

Results at a glance

8/8 · Facts surviving session break — write durability is solid
5/8 · Recovery via iranti_observe + hint — setup facts crowd out progress facts
8/8 · Recovery via explicit iranti_query — perfect, but requires knowing entity IDs upfront
Finding: The data is never the problem — write durability is solid. Recovery is a retrieval design question: the modality you choose determines what comes back.

What this measures

Long-running AI agent tasks are rarely uninterrupted. Sessions time out, contexts are reset, processes crash, or work is handed off to a different agent instance. When that happens, any working state that was not persisted is gone. Session recovery is the ability to reconstruct that working state — goals, progress, intermediate findings, open questions — in a fresh session, fast enough to continue without starting over.

B12 sets up a realistic scenario: an agent analyzing LLM multi-hop reasoning performance writes 8 facts to Iranti across the session — 4 describing the evaluation setup (high confidence) and 4 describing in-progress work (slightly lower confidence). The session is then interrupted. A new session opens, and recovery is tested four ways: no retrieval, handshake only, observe with a semantic hint, and explicit key-based query.

The benchmark isolates two distinct questions: first, did the data survive the session break at all (write durability)? Second, how much can a recovery session actually retrieve, and does the retrieval strategy matter? These turn out to have very different answers.

The four recovery modalities

Each modality represents a distinct retrieval strategy. Same underlying data in all four cases — the only variable is how the recovery session asks for it.

| Recovery method | Total | Setup (4) | Progress (4) |
| --- | --- | --- | --- |
| No Iranti | 0/8 | 0/4 | 0/4 |
| Handshake only | 0/8 | 0/4 | 0/4 |
| Observe + hint | 5/8 | 4/4 | 1/4 |
| Explicit query | 8/8 | 4/4 | 4/4 |

No Iranti and Handshake-only both score 0/8 for different reasons. No Iranti has no persistent storage at all. Handshake-only has the data but does not retrieve it — the handshake returns session metadata, not stored facts.
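The handshake-versus-query distinction can be sketched with a toy stand-in. The `iranti_handshake` and `iranti_query` names and payload shapes here are illustrative assumptions, not the real Iranti API: the point is only that a handshake returns session metadata, while stored facts come back only from an explicit lookup.

```python
# Toy stand-in KB: one stored fact, keyed by (entity, key).
# All names and payload shapes below are assumptions for illustration.
KB = {("eval-run", "evaluation_goal"): "compare models on multi-hop reasoning"}

def iranti_handshake():
    # Session metadata only -- no stored facts in the payload.
    return {"session_id": "s2", "kb_version": "0.2.16"}

def iranti_query(entity, key):
    # Explicit lookup returns the stored fact itself.
    return KB.get((entity, key))

# A recovery session that only shakes hands gets back zero stored facts,
# even though the fact is present and retrievable by explicit query.
facts_from_handshake = [v for v in iranti_handshake().values() if v in KB.values()]
fact_from_query = iranti_query("eval-run", "evaluation_goal")
```

This mirrors the 0/8 handshake-only row: the data is reachable, but that modality never asks for it.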

Session break and recovery flow

Session 1 writes 8 facts to Iranti, then is interrupted. The Iranti KB holds all 8 facts intact across the break. Session 2 opens fresh and attempts recovery via one of four modalities. The KB is the bridge — the question is only which retrieval path Session 2 uses.

[Flow diagram: Session 1 (iranti_write, 8 facts) → ⚡ interrupted, session break → Iranti KB (8/8 facts persist, write durability solid) → Session 2 (fresh context, recovery retrieval via the four modalities: No Iranti, Handshake only, Observe + hint, Explicit query)]

The 8 facts: setup vs. progress

Setup facts describe the evaluation configuration — stable, high-confidence (95). Progress facts describe in-flight work — findings, next steps, open questions — written at confidence 90. When the recovery session used iranti_observe with a semantic hint, all four setup facts were returned but only one of the four progress facts: 5 of 8 in total, even though all 8 facts were confirmed present in the KB after the session break. The per-fact observe+hint trace is in the raw data.

| Category | Key | Value | Conf. |
| --- | --- | --- | --- |
| setup | evaluation_goal | compare GPT-4o, Claude 3.5, Gemini 1.5 Pro on multi-hop reasoning | 95 |
| setup | dataset | HotpotQA bridge subset (questions 1–100) | 95 |
| setup | models_under_test | GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro | 95 |
| setup | primary_metric | exact_match on final answer | 95 |
| progress | preliminary_finding | GPT-4o outperforming others on bridge questions by ~8% EM | 90 |
| progress | next_step | run questions 81–100 to complete bridge subset | 90 |
| progress | open_question | whether bridge advantage holds on comparison questions (101–200) | 90 |
| progress | partial_result | questions 1–80 processed; GPT-4o EM=0.74 on bridge subset so far | 90 |

Why progress facts are harder to recover

When iranti_observe is called with a semantic hint, it returns results ranked by confidence. Setup facts were written at confidence 95; progress facts at confidence 90. With a bounded result set, the higher-confidence setup facts rank first and fill the available slots. Progress facts are present in the KB but are ranked lower — they get crowded out before the result window closes.

This is not a data loss problem. Every fact is retrievable with an explicit query. It is a retrieval design problem: if you rely on observe to surface everything, confidence ranking will consistently deprioritize lower-confidence entries even when those entries represent the most critical in-progress work.
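The crowding-out effect can be modeled in a few lines. This is a toy model of iranti_observe's ranking behavior inferred from the B12 results, not its real implementation, and the result-window size of 5 is an assumption chosen to reproduce the 4-setup/1-progress split:

```python
# The 8 facts from the benchmark, with their written confidence levels.
facts = (
    [{"key": k, "category": "setup", "conf": 95}
     for k in ["evaluation_goal", "dataset", "models_under_test", "primary_metric"]]
    + [{"key": k, "category": "progress", "conf": 90}
       for k in ["preliminary_finding", "next_step", "open_question", "partial_result"]]
)

def observe(facts, window=5):
    # Toy ranking: sort by confidence (descending), truncate to the window.
    # Window size 5 is an assumption, not a documented Iranti default.
    ranked = sorted(facts, key=lambda f: -f["conf"])
    return ranked[:window]

returned = observe(facts)
setup_hits = sum(f["category"] == "setup" for f in returned)
progress_hits = sum(f["category"] == "progress" for f in returned)
```

Under this model the outcome is deterministic, matching the benchmark's observation: all four confidence-95 setup facts fill the top of the window, and exactly one confidence-90 progress fact makes the cut.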

Finding: Confidence ranking is load-bearing. Setup facts (95) crowd out progress facts (90) in bounded observe results. The confidence delta is small but the effect is deterministic — all 4 setup facts come back, and only 1 of 4 progress facts makes the cut.
Finding: Progress facts are the most critical to recover. Setup facts describe a stable configuration that can often be reconstructed from other context. Progress facts — partial results, next steps, open questions — are the unique product of work done and cannot be reconstructed without the data.

Practical recommendations

Write early

Write facts to Iranti as soon as they are established, not at the end of a session. If the session is interrupted before a final write, facts written mid-session survive. Facts written only at cleanup do not.
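The write-early rule can be illustrated with a toy session, using a plain dict as a stand-in for Iranti. The function names and the "deferred flush" pattern are hypothetical; the point is only that state held in session memory for an end-of-session write does not survive an interruption:

```python
kb = {}       # stand-in for persistent Iranti storage
pending = []  # in-memory buffer for an end-of-session flush (the anti-pattern)

def write_now(kb, key, value):
    kb[key] = value  # persisted immediately; survives interruption

def defer(pending, key, value):
    pending.append((key, value))  # survives only if the final flush runs

write_now(kb, "preliminary_finding", "GPT-4o ahead by ~8% EM")
defer(pending, "partial_result", "questions 1-80 processed")

pending = None  # session interrupted: in-memory state is gone, flush never runs

survivors = sorted(kb)  # only the fact written mid-session remains
```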

Match confidence levels

If you want iranti_observe to surface progress facts alongside setup facts, write them at the same confidence level. The 5-point gap (90 vs 95) was enough to produce a lopsided recovery result. When in-progress work is equally critical, mark it equally confident.

Design for handoff

Before any long-running task, write a manifest fact that lists the entity IDs and key names the recovery session will need. Store the manifest at high confidence so observe surfaces it reliably. A recovery session that finds the manifest can then run explicit queries for everything else.
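The manifest pattern above can be sketched as follows. The entity IDs, key names, and `iranti_write`/`iranti_query` call shapes are illustrative assumptions; only the structure is the point: one high-confidence fact lists every (entity, key) the recovery session needs, and one known key then bootstraps full explicit-query recovery.

```python
import json

kb = {}  # stand-in for the Iranti KB

def iranti_write(entity, key, value, conf):
    kb[(entity, key)] = {"value": value, "conf": conf}

def iranti_query(entity, key):
    return kb[(entity, key)]["value"]

# Session 1: write working state, then a manifest listing every (entity, key),
# stored at high confidence so observe would surface it first.
state = {("eval-run", "next_step"): "run questions 81-100",
         ("eval-run", "partial_result"): "GPT-4o EM=0.74 so far"}
for (entity, key), value in state.items():
    iranti_write(entity, key, value, conf=90)
iranti_write("eval-run", "manifest",
             json.dumps([list(k) for k in state]), conf=95)

# Session 2: read the one known key, then explicit-query everything it lists.
manifest = json.loads(iranti_query("eval-run", "manifest"))
recovered = {tuple(entry): iranti_query(*entry) for entry in manifest}
```

The design choice: the recovery session only needs one piece of prior knowledge (where the manifest lives) instead of the full inventory of entity IDs and keys.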

Use explicit query for complete recovery

iranti_query with known entity IDs and key names gives 8/8 perfect recovery. If the recovery session has or can discover the entity IDs, prefer explicit query over observe for mission-critical state. Observe is useful for exploration; explicit query is the right tool for known recovery targets.

Honest limitations

Limitation: n=1 trial. This was a single test run, not a distribution. The 5/8 observe result and the 8/8 explicit query result are directionally meaningful but should not be read as precise rates. Confidence ranking effects may vary across different KB states, hint qualities, and result window sizes.
Limitation: Session break was simulated. The interruption was a clean context clear — not a crash, timeout, or hard process termination. Real-world interruptions may involve partially written state, in-flight writes that did not commit, or other failure modes not covered here.
Limitation: Controlled scenario. The 8 facts were purpose-written to test recovery. Real agent sessions write facts with varying structures, entity distributions, and confidence levels that may change observe ranking behavior in ways this test did not exercise.
Note: Explicit query requires prior knowledge. The 8/8 explicit query result is only achievable if the recovery session knows which entity IDs and keys to query. This knowledge must come from somewhere — a manifest, documentation, or an observe call that surfaces the right entry first.

Key findings

Finding: Write durability is solid. All 8 facts survived the session break intact. Iranti is a reliable persistence layer for agent working state — data is not the failure point in session recovery.
Finding: iranti_handshake is not a recovery tool. It returns session metadata, not stored facts. Agents that call only the handshake after a session break will recover 0/8 facts, even though all 8 are present and retrievable in the KB.
Finding: Observe + hint gives partial recovery, biased toward setup. Confidence ranking causes high-confidence setup facts (95) to crowd out lower-confidence progress facts (90) in bounded observe results. This produces a predictable and deterministic gap — not random, but structural.
Finding: Explicit query gives perfect recovery. With known entity IDs and key names, iranti_query retrieves all 8 facts without degradation. The tradeoff is that this requires the recovery session to know what to ask for — which must be planned for at write time.
Finding: Recovery is a design question, not a product bug. The gap between 5/8 and 8/8 is not caused by data loss or retrieval failure — the data is there. It is caused by retrieval strategy. Teams that design their agents with explicit handoff manifests and consistent confidence levels will get full recovery. Teams that rely on observe will get partial recovery, weighted toward their most confident (typically setup) facts.
Raw data

Full trial execution records, fact tables, per-modality recovery traces, and methodology notes in the benchmarking repository.