Facts change. Memory systems must keep up.
The benchmark uses 10 fact pairs, each written twice: v1 (original value), then v2 (updated value). The correct answer is always v2. Four verdicts are possible: v2-only (correct), both v1+v2 (mixed), v1-only (stale), or no match (miss). Scoring: any response containing v2 passes, so a mixed response counts as a pass even though it also carries the stale value.
Test design
Each conflict pair covers a real-world configuration update: budget approvals, rate limit changes, timeout extensions, capacity scaling, and compliance-driven policy changes. The values are structurally simple (one numeric field changes) so there is no ambiguity about what the correct answer is.
Write sequence: v1 is written first. v2 is written second. For Graphiti, v1 is timestamped one hour before v2 to provide temporal ordering context. For all other systems, writes happen sequentially in the same session.
Each namespace is isolated to one conflict pair — the same isolation strategy as C1. This ensures the conflict verdict reflects only the pair in question, not cross-contamination from other facts.
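The write sequence and namespace isolation described above can be sketched as a small planning function. This is a minimal illustration, not the benchmark's actual harness: the `Write` record, the `plan_writes` name, and the `conflict-<id>` namespace format are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Write:
    namespace: str                   # one namespace per conflict pair
    text: str                        # the fact as written to the memory system
    timestamp: Optional[datetime]    # only set for systems that take explicit timestamps


def plan_writes(pair_id: str, v1_text: str, v2_text: str,
                system: str, now: datetime) -> List[Write]:
    """Return the two writes for one conflict pair, v1 first, then v2."""
    namespace = f"conflict-{pair_id.lower()}"   # hypothetical naming scheme
    if system == "graphiti":
        # v1 is timestamped one hour before v2 to give temporal ordering context.
        v1_ts, v2_ts = now - timedelta(hours=1), now
    else:
        # Other systems rely only on sequential writes in the same session.
        v1_ts = v2_ts = None
    return [Write(namespace, v1_text, v1_ts), Write(namespace, v2_text, v2_ts)]


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
graphiti_plan = plan_writes("C02", "API write rate limit is 60 rpm",
                            "API write rate limit is 100 rpm", "graphiti", now)
other_plan = plan_writes("C02", "API write rate limit is 60 rpm",
                         "API write rate limit is 100 rpm", "mem0", now)
```

Keeping the plan as plain data makes it easy to check that both writes land in the same namespace and that only the Graphiti plan carries timestamps.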
Verdict definitions
v2-only (correct): Response contains the v2 value and does not contain the v1 value. Clean replacement.
both (mixed): Response contains both v1 and v2. The caller receives contradictory context and must disambiguate.
stale (v1-only): Response contains the v1 value only. The system returned outdated information.
miss (no match): Response contains neither v1 nor v2. The system failed to retrieve any relevant context.
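The four definitions map onto a small classifier. The sketch below assumes verdicts are decided by matching the v1 and v2 value strings in the response text. The digit-boundary check is an added assumption so that, for example, "60 rpm" does not match inside "160 rpm". The function names are illustrative, not taken from the benchmark.

```python
import re


def contains_value(response: str, value: str) -> bool:
    """True if `value` appears in `response` without extra digits on either side."""
    pattern = r"(?<!\d)" + re.escape(value) + r"(?!\d)"
    return re.search(pattern, response) is not None


def classify(response: str, v1: str, v2: str) -> str:
    """Map a response onto one of the four verdicts."""
    has_v1 = contains_value(response, v1)
    has_v2 = contains_value(response, v2)
    if has_v2 and not has_v1:
        return "v2-only"
    if has_v1 and has_v2:
        return "both"
    if has_v1:
        return "stale"
    return "miss"


def passes(verdict: str) -> bool:
    """Scoring rule: any response containing v2 passes."""
    return verdict in ("v2-only", "both")
```

Note that `passes` accepts `both`, which is why a system that always returns v1 alongside v2 can still score 100%.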
Per-conflict results
v1 → v2 transition for each conflict pair. Correct answer is always the v2 value.
| ID | Change | v1 → v2 | Iranti | Shodh | Mem0 | Graphiti |
|---|---|---|---|---|---|---|
| C01 | Project budget | $50,000→$75,000 | v2 ✓ | both | v2 ✓ | both |
| C02 | API write rate limit | 60 rpm→100 rpm | v2 ✓ | both | v2 ✓ | v2 ✓ |
| C03 | Max file upload size | 10 MB→25 MB | v2 ✓ | both | v2 ✓ | miss |
| C04 | Redis cache TTL | 900s→1800s | v2 ✓ | both | v2 ✓ | stale |
| C05 | JWT token expiry | 3600s→7200s | v2 ✓ | both | v2 ✓ | miss |
| C06 | Background workers | 4 procs→8 procs | v2 ✓ | both | miss | miss |
| C07 | Log rotation | 7 days→14 days | v2 ✓ | both | miss | stale |
| C08 | PostgreSQL max connections | 20→50 | v2 ✓ | both | v2 ✓ | miss |
| C09 | Webhook max retries | 3→5 | v2 ✓ | both | v2 ✓ | miss |
| C10 | Webhook timeout | 15000ms→30000ms | both | both | v2 ✓ | v2 ✓ |
System behavior analysis
Iranti uses entity+key addressing. When v2 is written to the same entity and key as v1, the write deterministically replaces the stored value. There is no accumulation — the old value is overwritten at the storage level.
9 of 10 pairs return v2-only. C10 (webhook timeout) returns both — this is the one case where the Iranti LLM arbitration layer was invoked on a same-entity, same-key update with a small confidence delta, which triggered accumulation rather than replacement. This maps to the B5 regression: conservative arbitration on close-gap updates.
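A toy model can make the two paths concrete: plain replacement at the same entity and key, and accumulation when an arbitration step sees a small confidence gap. This is only a sketch of the behavior described above, not Iranti's implementation. The `KeyedStore` class, the `min_gap` threshold, and the confidence values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Fact:
    value: str
    confidence: float


class KeyedStore:
    """Toy entity+key store with a hypothetical close-gap arbitration rule."""

    def __init__(self, min_gap: float = 0.05) -> None:
        self._facts: Dict[Tuple[str, str], List[Fact]] = {}
        self._min_gap = min_gap   # assumed threshold, not a documented setting

    def write(self, entity: str, key: str, value: str, confidence: float) -> None:
        slot = (entity, key)
        current = self._facts.get(slot, [])
        incoming = Fact(value, confidence)
        if current and abs(confidence - current[-1].confidence) < self._min_gap:
            # Close-gap update: conservative arbitration keeps both values.
            self._facts[slot] = current + [incoming]
        else:
            # Normal path: the new value replaces whatever was stored.
            self._facts[slot] = [incoming]

    def read(self, entity: str, key: str) -> List[str]:
        return [fact.value for fact in self._facts.get((entity, key), [])]


store = KeyedStore()
store.write("api", "write_rate_limit", "60 rpm", confidence=0.80)
store.write("api", "write_rate_limit", "100 rpm", confidence=0.95)
store.write("webhook", "timeout", "15000ms", confidence=0.90)
store.write("webhook", "timeout", "30000ms", confidence=0.91)
```

In this model the rate-limit slot ends up holding only the new value, while the webhook slot holds both, which mirrors the v2-only and both verdicts in the table.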
Shodh scores 100% because v2 is present in every response. But it never actually replaces v1 — it accumulates. Every query for a conflict pair returns both the original and the updated value, regardless of write order.
From the caller's perspective, the response contains contradictory information and the caller must apply their own disambiguation logic. For configuration-critical facts (e.g., "what is the current rate limit?") this means the LLM consuming the context has to choose between two values with no signal about which is authoritative.
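The caller-side burden can be sketched as follows. The example assumes a hypothetical `Record` shape in which each returned value may or may not carry a write time. Shodh's actual response format is not described here, so treat the fields as placeholders.

```python
from datetime import datetime
from typing import NamedTuple, Optional, Sequence


class Record(NamedTuple):
    value: str
    written_at: Optional[datetime]   # hypothetical field; may be absent


def disambiguate(records: Sequence[Record]) -> Optional[str]:
    """Pick one value from an accumulated result, or None if it is ambiguous."""
    if not records:
        return None
    if len({r.value for r in records}) == 1:
        return records[0].value          # no conflict to resolve
    if any(r.written_at is None for r in records):
        return None                      # no signal about which value is current
    return max(records, key=lambda r: r.written_at).value


unordered = [Record("60 rpm", None), Record("100 rpm", None)]
ordered = [Record("60 rpm", datetime(2025, 1, 1, 11)),
           Record("100 rpm", datetime(2025, 1, 1, 12))]
```

With no ordering signal, `disambiguate(unordered)` cannot choose. The same pair becomes resolvable only once the caller has some authority signal, such as a write time.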
Mem0 handles 8 conflicts cleanly with v2-only returns; the per-conflict table shows no mixed (both) verdicts. The 2 misses (C06: workers, C07: log rotation) returned neither v1 nor v2: vector similarity did not surface either version of the fact.
Mem0 uses semantic deduplication on write — when v2 is semantically similar to v1, it may update or replace the stored representation. When the similarity is above threshold, v1 is replaced. When it falls below, both are stored. The 2 misses are facts where neither version was returned, possibly due to collection indexing latency between writes.
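The threshold behavior can be illustrated with a toy write path over hand-made vectors. This is a sketch of similarity-gated deduplication in general, not Mem0's code. The `write_with_dedup` function, the 0.9 threshold, and the two-dimensional embeddings are assumptions chosen for readability.

```python
import math
from typing import List, Tuple

Entry = Tuple[str, List[float]]   # (fact text, embedding)


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def write_with_dedup(store: List[Entry], text: str, embedding: List[float],
                     threshold: float = 0.9) -> str:
    """Replace the first sufficiently similar entry, otherwise append."""
    for i, (_, existing) in enumerate(store):
        if cosine(existing, embedding) >= threshold:
            store[i] = (text, embedding)   # above threshold: v1 is replaced
            return "replaced"
    store.append((text, embedding))        # below threshold: both are kept
    return "added"


store: List[Entry] = []
first = write_with_dedup(store, "Redis cache TTL is 900s", [1.0, 0.0])
second = write_with_dedup(store, "Redis cache TTL is 1800s", [0.99, 0.1])
third = write_with_dedup(store, "Log rotation is 14 days", [0.0, 1.0])
```

The second write lands close enough to the first to replace it, while the third is dissimilar and is stored alongside.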
Graphiti uses temporal ordering for conflict resolution: v1 is timestamped at t-1h, v2 at t-0. Despite this, the per-conflict table shows only 2 clean v2-only returns (C02, C10), alongside 2 stale returns where the v1 value wins (C04, C07), 1 both (C01), and 5 complete misses (C03, C05, C06, C08, C09).
The core issue is the same as C1: entity extraction rephrases fact content into edge facts during ingestion. When v2 is extracted, if the numeric value is lost during extraction, the edge fact for v2 no longer contains the answer. The temporal ordering of the episode is correct — the extracted edge fact content is the problem.
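The failure mode can be reproduced with a toy model of edge facts. The sketch assumes each extracted edge carries a fact string and a `valid_at` time, and that retrieval returns either only the latest edge or all edges. Both are simplifying assumptions, and the `Edge` class and `verdict` function are illustrative rather than Graphiti's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Edge:
    fact: str            # text produced by entity extraction
    valid_at: datetime   # episode timestamp


def verdict(edges: List[Edge], v1: str, v2: str, latest_only: bool) -> str:
    """Classify what a caller would see, using plain substring matching."""
    if latest_only:
        selected = [max(edges, key=lambda e: e.valid_at)]
    else:
        selected = edges
    text = " ".join(e.fact for e in selected)
    has_v1, has_v2 = v1 in text, v2 in text
    if has_v2 and not has_v1:
        return "v2-only"
    if has_v1 and has_v2:
        return "both"
    return "stale" if has_v1 else "miss"


t = datetime(2025, 1, 1, 12)
v1_edge = Edge("Redis cache TTL is 900s", t - timedelta(hours=1))
lossy_v2 = Edge("Redis cache TTL was extended", t)   # numeric value lost in extraction
intact_v2 = Edge("Redis cache TTL is 1800s", t)
```

With the lossy v2 edge, correct temporal ordering still yields `miss` when only the latest edge is returned and `stale` when all edges are returned. With the intact edge, the same ordering yields `v2-only` or `both`.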
Verdict distribution
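The distribution can be recomputed from the per-conflict table. The sketch below transcribes each system's column in C01 to C10 order and tallies the verdicts. The `RESULTS` dictionary and helper names are illustrative.

```python
from collections import Counter
from typing import Dict, List

# Verdicts per system, transcribed from the per-conflict table (C01..C10).
RESULTS: Dict[str, List[str]] = {
    "Iranti":   ["v2", "v2", "v2", "v2", "v2", "v2", "v2", "v2", "v2", "both"],
    "Shodh":    ["both"] * 10,
    "Mem0":     ["v2", "v2", "v2", "v2", "v2", "miss", "miss", "v2", "v2", "v2"],
    "Graphiti": ["both", "v2", "miss", "stale", "miss",
                 "miss", "stale", "miss", "miss", "v2"],
}


def distribution(verdicts: List[str]) -> Counter:
    """Count how often each verdict occurs for one system."""
    return Counter(verdicts)


def pass_count(verdicts: List[str]) -> int:
    """Scoring rule: a response passes if it contains v2 (v2-only or both)."""
    return sum(v in ("v2", "both") for v in verdicts)


summary = {name: distribution(v) for name, v in RESULTS.items()}
passes = {name: pass_count(v) for name, v in RESULTS.items()}
```

Tallied this way, the table gives Iranti 9 v2-only and 1 both; Shodh 10 both; Mem0 8 v2-only and 2 miss; Graphiti 2 v2-only, 1 both, 2 stale, and 5 miss. Under the pass rule, that is 10, 10, 8, and 3 passes respectively.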
Key findings
Iranti uses entity+key addressing — v2 write deterministically replaces v1 at the same key. 9/10 clean v2-only returns.
Shodh scores 100% technically but returns BOTH old and new values on every query — the caller must disambiguate.
Mem0 misses 2 conflicts entirely (miss verdict): semantic similarity surfaces neither v1 nor v2 on those queries.
Graphiti shows 2 stale returns (the old v1 value) and 5 misses in the per-conflict table; temporal ordering only partially helps.