Benchmark B5

Knowledge Currency
Updating a fact is harder than writing one.

Long-lived agents depend on KB facts staying accurate as the world changes. B5 tests whether Iranti supports updating facts that already exist. The finding: an update is not a simple overwrite. Whether it succeeds depends on the weighted score gap and on source reliability, not on raw confidence alone.

Executed 2026-03-21 | n=6 test cases | LLM-arbitrated writes: transaction timeout

Results at a glance

2/6: Updates accepted (T1b deterministic, T5 large-gap cross-source)
2/6: Correct rejections (lower-conf or duplicate updates blocked)
2/6: Errors (LLM-arbitrated writes timed out: T1, T4)

Finding: There is no simple “update this fact” operation. All writes go through conflict detection. When the score gap is ≥10 points, resolution is deterministic and reliable. When the gap falls below the threshold, LLM arbitration is triggered, and under a real provider the DB transaction times out before the result can persist.

What this measures

A KB that only supports writing new facts is useful for accumulation but fragile for long-lived agents. The world changes. Decisions get revised. Facts go stale. An agent system that cannot update its own knowledge will eventually act on wrong information with high confidence.

B5 probes Iranti's update semantics directly: can a higher-confidence or more current value replace an existing fact? The answer depends on whether the conflict resolves deterministically (gap ≥10 points in weighted score) or routes to LLM arbitration. When the gap is large enough, the write succeeds regardless of source history. When it is not, LLM arbitration fires — and under a real provider, the DB transaction times out before the result can persist.

T2 and T3 produce correct rejections: a lower-confidence update should not displace a high-confidence fact, and a duplicate value with lower score should be deduped. T1b and T5 confirm that deterministic resolution works for both same-source and cross-source writes when the score gap is sufficient. T1 and T4 expose the transaction timeout defect on the LLM-arbitrated path.
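
The routing described above can be sketched as a small decision function. This is only an illustration of the behavior B5 observed, not Iranti's actual code: the function name, the return labels, and the exact 10.0 cutoff are assumptions. B5 reports T2 only as a "negative" gap, so treating every negative gap as a rejection is also an assumption.

```python
# Hypothetical sketch of the conflict-resolution routing observed in B5.
# The name, labels, and exact cutoff are illustrative, not Iranti's API.

DETERMINISTIC_GAP_PTS = 10.0  # approximate threshold reported in B5


def resolution_path(gap_pts: float, same_value: bool = False) -> str:
    """Return the path a write would take for a given weighted score gap."""
    if same_value and gap_pts <= 0:
        # T3: same value with a lower score is deduplicated.
        return "reject_duplicate"
    if gap_pts >= DETERMINISTIC_GAP_PTS:
        # T1b, T5: a large enough gap wins without any LLM call.
        return "accept_deterministic"
    if gap_pts < 0:
        # T2: the lower-scoring update loses (assumed for all negative gaps).
        return "reject_lower_score"
    # T1, T4: a small positive gap routes to LLM arbitration.
    return "llm_arbitration"
```

Under these assumptions, `resolution_path(2.9)` and `resolution_path(4.25)` land on the arbitration path, which is the path that timed out in T1 and T4.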

The six test cases

Each case attempts to update an existing KB fact with a new value. Teal = accepted or correctly rejected. Amber = error (transaction timeout). Gap = weighted score delta between update and existing value.

Case | Description | Gap | Outcome
T1 | New source, higher raw conf (92 vs 85). LLM arbitration: DB transaction timed out (~10s API, 5s window) | 2.9 pts | ERROR
T1b | Same source, higher conf (97 vs 85). Deterministic resolution: gap exceeded threshold | 10.4 pts | ACCEPTED
T2 | Lower-confidence update to high-confidence fact. Correct behavior: lower confidence lost | negative | REJECTED
T3 | Same value, lower confidence (duplicate detection). Duplicate value with lower score: deduplicated | same value | REJECTED
T4 | New source, small confidence increase (80 → 85). LLM arbitration: DB transaction timed out (~16s API, 5s window) | 4.25 pts | ERROR
T5 | New source, forced high confidence (70 → 99). Deterministic resolution: large gap bypasses LLM arbitration entirely | 24.6 pts | ACCEPTED

ACCEPTED and REJECTED rows (teal) are correct outcomes. ERROR rows (amber) mean LLM arbitration timed out; the incumbent was preserved by rollback.
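
As a quick sanity check, the six rows can be encoded as data and compared against the ~10-point rule. The tuple layout and helper name below are illustrative choices. The gaps and outcomes are copied from the table, and `None` stands in for the two cases where no numeric gap is reported.

```python
from collections import Counter

# (case, weighted score gap in points or None, observed outcome), from the table.
CASES = [
    ("T1", 2.9, "ERROR"),
    ("T1b", 10.4, "ACCEPTED"),
    ("T2", None, "REJECTED"),  # gap reported only as "negative"
    ("T3", None, "REJECTED"),  # same value, no numeric gap reported
    ("T4", 4.25, "ERROR"),
    ("T5", 24.6, "ACCEPTED"),
]
THRESHOLD_PTS = 10.0  # approximate deterministic threshold


def consistent_with_threshold(gap, outcome):
    """True if the observed outcome matches what the threshold rule predicts."""
    if gap is None:
        return outcome == "REJECTED"
    expected = "ACCEPTED" if gap >= THRESHOLD_PTS else "ERROR"
    return outcome == expected


tally = Counter(outcome for _, _, outcome in CASES)
all_consistent = all(consistent_with_threshold(g, o) for _, g, o in CASES)
```

Here `tally` reproduces the 2/2/2 split from the results summary, and `all_consistent` confirms that every numeric gap falls on the side of the threshold its outcome implies.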

Confidence gap vs. outcome

Each bar shows the weighted score gap between the incoming update and the existing fact. The dashed vertical line marks the ~10-point threshold above which Iranti resolves conflicts deterministically — bypassing LLM arbitration entirely. Below that line, arbitration runs and the DB transaction times out under a real provider.

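[Figure: horizontal bar chart of weighted score gap per case, on an axis from 0 to 28 pts, with a dashed vertical line at the 10-point deterministic threshold. T1: 2.9 pts. T1b: 10.4 pts. T2: negative. T3: same value. T4: 4.25 pts. T5: 24.6 pts.]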

Bar length = weighted score gap between update and existing fact. Teal = accepted or correctly rejected. Amber = LLM arbitration triggered a timeout. The vertical dashed line marks the 10-point threshold above which resolution is deterministic. T1b and T5 both clear this threshold and both were accepted. T1 and T4 fall below it and both errored.

Why source reliability matters

Iranti tracks source reliability as an accumulated signal: the more facts a source has written that were accepted and stable, the higher its reliability score. This is a sound design for preventing noisy or adversarial sources from overwriting high-quality facts.

The problem emerges when a new, correct source attempts to update a fact originally written by an established source. The new source has no accumulated reliability — even if its confidence on this specific fact is higher. When the score gap is below the deterministic threshold, LLM arbitration is triggered. Under a real LLM provider, the API latency (8–16 s observed) exceeds the 5,000 ms DB transaction window, causing a transaction timeout before the result can persist. The incumbent is preserved by rollback.
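
The timeout mechanics can be reproduced in miniature. The sketch below is a simulation, not Iranti code: it assumes the LLM call is awaited while the transaction is open, uses `asyncio.wait_for` as a stand-in for the DB transaction window, and scales time down so it runs quickly. All names are hypothetical.

```python
import asyncio

TX_WINDOW_MS = 5_000  # DB transaction window reported in B5
SECONDS_PER_SIMULATED_MS = 5e-5  # time scaling so the demo finishes quickly


async def arbitrate_inside_tx(api_latency_ms: float) -> str:
    """Simulate an LLM call awaited inside a transaction with a fixed window."""

    async def fake_llm_call() -> str:
        # Stand-in for the provider round trip.
        await asyncio.sleep(api_latency_ms * SECONDS_PER_SIMULATED_MS)
        return "llm_decision"

    try:
        await asyncio.wait_for(
            fake_llm_call(),
            timeout=TX_WINDOW_MS * SECONDS_PER_SIMULATED_MS,
        )
    except asyncio.TimeoutError:
        # Window expired first: the decision is lost and the incumbent stays.
        return "rollback"
    return "committed"


def run(api_latency_ms: float) -> str:
    return asyncio.run(arbitrate_inside_tx(api_latency_ms))
```

With the latencies from T1 and T4 (about 10,000 ms and 16,000 ms), `run` returns `"rollback"`. A call that finishes inside the window, such as a mock responding in 1,000 ms, returns `"committed"`.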

T5 (new) shows the workaround: with a confidence gap large enough to score 24.6 points above the existing fact, deterministic resolution fires without any LLM call — and the update succeeds even across different sources. The reliable update path is a large gap, not a same-source write.

The accumulation mechanic
  • Every accepted write increases the writing source's reliability score.
  • A new source starts with no history — it has no reliability advantage even if its value is better.
  • Deterministic resolution (gap ≥ ~10 pts) bypasses this bias and lets a sufficiently superior value win regardless of source history.
  • Below the threshold, LLM arbitration runs, and under a real provider the DB transaction times out before the result persists. The incumbent is preserved by rollback. The caller receives a timeout error with no arbitration reason, so the update is lost.
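
A toy model of the accumulation mechanic, for intuition only. B5 does not document how Iranti computes reliability, so the class name, the fixed increment per accepted write, and the zero starting value are all assumptions.

```python
from collections import defaultdict


class SourceLedger:
    """Hypothetical reliability ledger: each accepted write adds a fixed step."""

    def __init__(self, step: float = 1.0) -> None:
        self.step = step  # assumed increment; the real formula is not documented
        self._reliability = defaultdict(float)

    def record_accepted_write(self, source: str) -> None:
        # Every accepted write raises the writing source's reliability.
        self._reliability[source] += self.step

    def reliability(self, source: str) -> float:
        # A source with no history has no accumulated reliability.
        return self._reliability.get(source, 0.0)


ledger = SourceLedger()
for _ in range(3):
    ledger.record_accepted_write("established_source")
```

After three accepted writes, `established_source` carries accumulated reliability while a source the ledger has never seen reports `0.0`, however good its next value might be.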

The stale fact problem

Limitation: Silent persistence of wrong facts. When a legitimate update is rejected by LLM arbitration, the KB retains the original value with no flag that a challenge was attempted or that an alternative exists. A downstream agent querying that fact will receive a confident, high-score answer that may be outdated. There is no “contested” or “possibly outdated” marker on rejected updates.

This is not a bug in the traditional sense — the conflict detection worked as designed. But the design has a gap: it treats an update rejection as “fact unchanged” rather than “fact challenged.” For long-lived agents, that distinction matters. A fact that was challenged and survived arbitration is epistemically different from a fact that was never challenged.

Until a flagging or versioning mechanism exists, teams using Iranti for knowledge that changes over time should periodically rewrite facts from the original source with an elevated confidence score to force deterministic resolution, or use the same source identifier for all updates to avoid the established-source bias.
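
The first workaround can be planned on the caller side before any write is sent. The sketch below is hypothetical throughout: the record types are not Iranti's, the 0 to 100 confidence scale is inferred from the values in the test cases, and the 12-point bump comes from T1b alone (85 → 97 from the same source produced a 10.4-point gap). Whether the same bump clears the threshold at other confidence levels was not tested.

```python
from dataclasses import dataclass
from typing import Optional

ASSUMED_BUMP = 12  # from T1b only: 85 -> 97 gave a 10.4-point gap
MAX_CONFIDENCE = 100  # assumed scale, inferred from the test-case values


@dataclass(frozen=True)
class StoredFact:
    key: str
    value: str
    source: str
    confidence: int


@dataclass(frozen=True)
class RewriteRequest:
    key: str
    value: str
    source: str
    confidence: int


def plan_rewrite(fact: StoredFact, new_value: str,
                 bump: int = ASSUMED_BUMP) -> Optional[RewriteRequest]:
    """Build a same-source rewrite with elevated confidence, if there is headroom."""
    target = fact.confidence + bump
    if target > MAX_CONFIDENCE:
        # No headroom left: the write would likely fall to LLM arbitration.
        return None
    return RewriteRequest(fact.key, new_value, fact.source, target)
```

A fact stored at confidence 85 yields a request at 97, matching T1b. A fact already stored at 95 yields `None`, which marks the cases where this workaround runs out of room.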

Honest limitations

Limitation: Small test set (n=6). Six test cases document the behavior pattern and surface the transaction timeout defect. They are not enough to characterize the LLM arbitration boundary precisely or measure how the timeout rate varies with model latency distribution.

Limitation: Single session. All cases were tested within one session. Cross-session source reliability accumulation, where an established source built its history across many prior sessions, was not tested.

Limitation: LLM arbitration outcome is unobservable. T1 and T4 triggered LLM arbitration but timed out before results could persist. Whether the “established source bias” observed in v0.2.12 (mock LLM) still holds with a real provider is unknown: the test cases that would expose it cannot complete under current infrastructure.

Note: T2 and T3 are correct behavior. Rejecting a lower-confidence update and deduplicating a same-value lower-score entry are the intended semantics. These are not limitations; they confirm that basic conflict detection works correctly.

Key findings

Finding: No simple update operation. Every write to an existing key goes through conflict detection. There is no way to force an update without either exceeding the deterministic threshold or using the same source identifier as the original write.

Finding: LLM-arbitrated writes time out under a real provider (v0.2.16). In v0.2.12 (mock LLM), T1 and T4 returned a clean REJECTED with an LLM reason string. In v0.2.16 with a real OpenAI call, the same cases produce a transaction timeout error: the incumbent is preserved by rollback, but no reason is returned. Whether the “established source bias” from v0.2.12 still holds in v0.2.16 is unknown, because the LLM result never persists.

Finding: Deterministic resolution requires a ~10-point gap. T1b (same source, gap 10.4 pts) and T5 (cross-source, gap 24.6 pts) were both accepted deterministically. Updates below this gap route to LLM arbitration, which currently times out under a real provider. For cross-source updates, a large confidence gap is the only reliable path.

Finding: Duplicate detection works correctly. T3 confirmed that a same-value, lower-confidence write is correctly identified and rejected as a duplicate. The deduplication logic operates as expected.

Finding: No contested-fact flag exists. Rejected updates leave no trace in the KB. Agents querying updated keys receive the original value with no indication that a more recent alternative was proposed and rejected. This is the primary operational risk for long-lived agents.

Raw data

Full trial execution records, conflict resolution logs, and methodology notes are available in the benchmarking repository.