Two systems scored 100%.
One of them is wrong.
Iranti and Shodh both pass this benchmark — because the scoring counts any response containing the updated value as correct. But Shodh returns the old value too. On every single query. Your agent receives contradictory context and has to guess which value is authoritative.
What your agent actually receives from Shodh
You've just updated the API write rate limit from 60 to 100 requests per minute. You write the updated fact to your memory system. Later, your agent needs to enforce the limit and queries for the current value.
In Iranti, the write deterministically replaced the old value. The agent gets back 100 rpm. Done.
In Shodh, both facts are stored. The agent gets back both — and nothing in the response signals which one is current. It may apply the wrong limit, log a misleading rate, or produce non-deterministic behavior depending on which value the LLM picks from the context.
This happened on all 10 conflict pairs in the test — not as an edge case, but as the consistent behavior. Shodh accumulates; it does not replace.
Test design
10 fact pairs, each covering a real-world configuration update: budget approvals, rate limit changes, timeout extensions, capacity scaling, compliance-driven policy changes. The values are structurally simple — one numeric field changes — so there is no ambiguity about what the correct answer is.
Write sequence: v1 is written first, v2 is written second, same namespace. For Graphiti, v1 is timestamped one hour before v2 to give temporal ordering context. All namespaces are isolated per conflict pair — same isolation as C1.
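The write sequence can be sketched roughly as follows. This is a minimal illustration, assuming a hypothetical two-method adapter (`write` and `query` are made-up names, not the real API of any of the four systems) and a toy in-memory store that exists only to exercise the harness.

```python
from typing import Dict, List, Protocol


class MemorySystem(Protocol):
    """Hypothetical adapter interface; not the real API of any tested system."""

    def write(self, namespace: str, fact: str) -> None: ...
    def query(self, namespace: str, question: str) -> str: ...


class AppendOnlyToy:
    """Toy in-memory stand-in, used only to run the harness end to end."""

    def __init__(self) -> None:
        self._facts: Dict[str, List[str]] = {}

    def write(self, namespace: str, fact: str) -> None:
        self._facts.setdefault(namespace, []).append(fact)

    def query(self, namespace: str, question: str) -> str:
        # Returns everything stored in the namespace; the question is ignored.
        return " ".join(self._facts.get(namespace, []))


def run_pair(system: MemorySystem, pair_id: str, v1: str, v2: str, question: str) -> str:
    namespace = f"conflict-{pair_id}"  # one isolated namespace per conflict pair
    system.write(namespace, v1)        # v1 is always written first
    system.write(namespace, v2)        # v2 second, into the same namespace
    return system.query(namespace, question)
```

Because each pair gets its own namespace, a response for one conflict cannot be contaminated by facts written for another.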
Scoring is lenient: any response containing v2 counts as a pass. This is why Shodh scores 100% — the correct value is present. The "both" verdict is the footnote that makes the score misleading.
Verdict definitions
v2 only (shown as "v2 ✓" in the table): Response contains the updated value only. Clean replacement, with no ambiguity for the caller.
both: Response contains both old and new values. Passes the benchmark. Fails in production, because the caller must guess which is authoritative.
stale: Response contains only the outdated value. Fails the benchmark and returns wrong information.
miss: Response contains neither value. System failed to retrieve any relevant context.
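The four verdicts above, and the lenient pass rule from the test design, could be expressed along these lines. This is a sketch, not the benchmark's actual scorer: the names `Verdict`, `classify`, and `passes_lenient` are illustrative, and the whole-token matching is an assumption about how a value is detected in a response.

```python
import re
from enum import Enum


class Verdict(Enum):
    V2_ONLY = "v2 only"
    BOTH = "both"
    STALE = "stale"
    MISS = "miss"


def _mentions(response: str, value: str) -> bool:
    # Match the value as a whole token so "60 rpm" is not found inside "160 rpm".
    pattern = rf"(?<!\w){re.escape(value)}(?!\w)"
    return re.search(pattern, response) is not None


def classify(response: str, v1: str, v2: str) -> Verdict:
    has_v1 = _mentions(response, v1)
    has_v2 = _mentions(response, v2)
    if has_v1 and has_v2:
        return Verdict.BOTH
    if has_v2:
        return Verdict.V2_ONLY
    if has_v1:
        return Verdict.STALE
    return Verdict.MISS


def passes_lenient(verdict: Verdict) -> bool:
    # Lenient scoring: any response that contains v2 counts as a pass.
    return verdict in (Verdict.V2_ONLY, Verdict.BOTH)
```

Under this rule "both" and "v2 only" are indistinguishable in the headline score, which is why the verdict has to be reported alongside it.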
Per-conflict results
v1 → v2 for each pair. Correct answer is always v2. Note Shodh's column: 10/10 "both" is not the same as 10/10 "v2 only."
| ID | Change | v1 → v2 | Iranti | Shodh | Mem0 | Graphiti |
|---|---|---|---|---|---|---|
| C01 | Project budget | $50,000→$75,000 | v2 ✓ | both | v2 ✓ | both |
| C02 | API write rate limit | 60 rpm→100 rpm | v2 ✓ | both | v2 ✓ | v2 ✓ |
| C03 | Max file upload size | 10 MB→25 MB | v2 ✓ | both | v2 ✓ | miss |
| C04 | Redis cache TTL | 900s→1800s | v2 ✓ | both | v2 ✓ | stale |
| C05 | JWT token expiry | 3600s→7200s | v2 ✓ | both | v2 ✓ | miss |
| C06 | Background workers | 4 procs→8 procs | v2 ✓ | both | miss | miss |
| C07 | Log rotation | 7 days→14 days | v2 ✓ | both | miss | stale |
| C08 | PostgreSQL max connections | 20→50 | v2 ✓ | both | v2 ✓ | miss |
| C09 | Webhook max retries | 3→5 | v2 ✓ | both | v2 ✓ | miss |
| C10 | Webhook timeout | 15000ms→30000ms | both | both | v2 ✓ | v2 ✓ |
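To make the gap between the lenient score and a strict "v2 only" score concrete, the table can be tallied directly. The verdict lists below are transcribed from the table above, in C01 to C10 order; the `tally` helper is an illustrative name.

```python
from collections import Counter
from typing import Dict, List

# Verdicts per system, transcribed from the table (C01..C10).
RESULTS: Dict[str, List[str]] = {
    "Iranti": ["v2"] * 9 + ["both"],
    "Shodh": ["both"] * 10,
    "Mem0": ["v2", "v2", "v2", "v2", "v2", "miss", "miss", "v2", "v2", "v2"],
    "Graphiti": ["both", "v2", "miss", "stale", "miss",
                 "miss", "stale", "miss", "miss", "v2"],
}


def tally(verdicts: List[str]) -> Dict[str, int]:
    counts = Counter(verdicts)
    return {
        "lenient_pass": counts["v2"] + counts["both"],  # any response containing v2
        "v2_only": counts["v2"],                        # clean replacement only
        "both": counts["both"],
        "stale": counts["stale"],
        "miss": counts["miss"],
    }


for system, verdicts in RESULTS.items():
    print(system, tally(verdicts))
```

Run against the table, the lenient pass counts are 10, 10, 8, and 3 for Iranti, Shodh, Mem0, and Graphiti, while the strict v2-only counts are 9, 0, 8, and 2.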
Why each system behaves this way
Iranti uses entity+key addressing. Writing v2 to the same entity and key as v1 deterministically overwrites the stored value at the storage level — there is no accumulation by design. 9 of 10 pairs return v2-only.
The one "both" (C10: webhook timeout) is the known B5 regression: conservative LLM arbitration on a close-confidence update treated v2 as a challenger rather than a replacement, accumulating instead of overwriting. Direct writes (same entity+key, same source) are unaffected — this is an LLM arbitration edge case only.
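Entity+key addressing can be pictured as a map keyed by the (entity, key) pair. The sketch below is a toy model of that idea only: `KeyedStore` and its methods are hypothetical names, not Iranti's API, and it does not model the LLM arbitration path behind the C10 result.

```python
from typing import Dict, Optional, Tuple


class KeyedStore:
    """Toy model of entity+key addressing (hypothetical, not Iranti's API)."""

    def __init__(self) -> None:
        self._values: Dict[Tuple[str, str], str] = {}

    def write(self, entity: str, key: str, value: str) -> None:
        # Same (entity, key) means the previous value is replaced, not kept.
        self._values[(entity, key)] = value

    def read(self, entity: str, key: str) -> Optional[str]:
        return self._values.get((entity, key))


store = KeyedStore()
store.write("api", "write_rate_limit", "60 rpm")   # v1
store.write("api", "write_rate_limit", "100 rpm")  # v2 overwrites v1
print(store.read("api", "write_rate_limit"))
```

With this shape there is only ever one value per address, so a read cannot return v1 and v2 together.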
Shodh is an accumulative memory system. It does not replace facts — it appends them. A second write of the same information creates a second memory record alongside the first. Recall returns all matching records, regardless of recency.
This behavior is consistent and predictable — it is not a bug. But it means the caller is responsible for disambiguation. In an LLM-driven pipeline with no post-processing, the agent receives contradictory values and must choose, with no signal about which was written more recently or which is authoritative.
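One way a caller could handle disambiguation on top of an append-only store is to attach its own write order to each record and keep only the newest on recall. This is a sketch under that assumption: `Record`, `AppendStore`, and the `seq` field are hypothetical and do not describe Shodh's data model or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    text: str
    seq: int  # caller-assigned write order; an assumption, not a Shodh field


class AppendStore:
    """Toy append-only store: every write adds a record, nothing is replaced."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def write(self, text: str, seq: int) -> None:
        self._records.append(Record(text, seq))

    def recall(self) -> List[Record]:
        # Returns all records regardless of recency.
        return list(self._records)


def latest(records: List[Record]) -> Optional[Record]:
    # Caller-side post-processing: keep only the most recently written record.
    return max(records, key=lambda r: r.seq, default=None)


store = AppendStore()
store.write("API write rate limit is 60 rpm", seq=1)
store.write("API write rate limit is 100 rpm", seq=2)
print(len(store.recall()), latest(store.recall()).text)
```

Without a step like `latest`, both records reach the agent's context with equal standing.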
Mem0 uses semantic deduplication on write. When v2 is semantically similar enough to v1, Mem0 updates the existing record rather than creating a new one, producing clean v2-only returns. This works correctly on 8 of 10 pairs.
The 2 misses (workers, log rotation) returned neither value — the semantic similarity between v1 and v2 was too low to trigger deduplication, but recall also failed to surface either record for those queries. These are retrieval gaps, not conflict handling failures.
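The update-or-append decision described above can be illustrated with a similarity threshold. This sketch is not Mem0's implementation: it uses `difflib.SequenceMatcher` from the Python standard library as a crude stand-in for semantic similarity, and the `DedupStore` class and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    # Character-level ratio in [0, 1]; a stand-in for embedding similarity.
    return SequenceMatcher(None, a, b).ratio()


class DedupStore:
    """Toy write-time deduplication (hypothetical, not Mem0's API)."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.records: List[str] = []

    def write(self, text: str) -> None:
        for i, existing in enumerate(self.records):
            if similarity(existing, text) >= self.threshold:
                self.records[i] = text  # similar enough: update in place
                return
        self.records.append(text)       # otherwise: store as a new record


store = DedupStore()
store.write("API write rate limit is 60 rpm")
store.write("API write rate limit is 100 rpm")  # replaces the first record
store.write("Background workers: 8 procs")      # unrelated, so appended
print(store.records)
```

The threshold is the fragile part of this design: a v2 that falls just under it is stored next to v1 instead of replacing it.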
Graphiti was given the best possible setup: v1 timestamped at t−1h, v2 at t−0, so temporal ordering was explicit. Despite this, 2 pairs returned the stale value and 5 returned nothing.
The root cause is entity extraction: when Graphiti's LLM extracts edge facts from v2, numeric values are often rephrased or dropped. If the v2 edge fact no longer contains the updated number, the temporal ordering is irrelevant — the answer was lost at ingestion. See C1 for the full extraction analysis.
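A simple guard against this failure mode is to check, at ingestion, that every number in the source fact still appears in the extracted fact. The helper below is a hypothetical sketch, not part of Graphiti, and the example strings are invented for illustration.

```python
import re
from typing import Set

# Digits with optional thousands separators and an optional decimal part.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def numbers_in(text: str) -> Set[str]:
    # Strip commas so "75,000" and "75000" compare equal.
    return {m.replace(",", "") for m in _NUMBER.findall(text)}


def missing_numbers(source_fact: str, extracted_fact: str) -> Set[str]:
    """Numbers present in the source fact but absent from the extracted fact."""
    return numbers_in(source_fact) - numbers_in(extracted_fact)


lost = missing_numbers(
    "Redis cache TTL increased from 900s to 1800s",
    "Redis cache TTL was extended",  # invented example of a lossy extraction
)
print(sorted(lost))
```

A non-empty result means the updated value never made it into the stored fact, so no amount of temporal ordering at query time can recover it.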
Verdict distribution
Shodh's bar is entirely amber: 100% "both". That is a different result from Iranti's bar, which is 90% teal ("v2 only").
Key findings
Iranti uses entity+key addressing — v2 write deterministically replaces v1 at the same key. 9/10 clean v2-only returns.
Shodh scores 100% technically but returns BOTH old and new values on every query — the caller must disambiguate.
Mem0 misses 2 conflicts entirely ("miss" verdict): recall surfaces neither v1 nor v2 on those queries.
Graphiti shows 2 stale returns (the old v1 value) and 5 total misses, so explicit temporal ordering only partially helps.