C3 — Conflict Resolution

Two systems scored 100%.
One of them is wrong.

Iranti and Shodh both pass this benchmark — because the scoring counts any response containing the updated value as correct. But Shodh returns the old value too. On every single query. Your agent receives contradictory context and has to guess which value is authoritative.

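System	Score	Verdicts
Iranti	100%	9 v2 · 1 both
Shodh	100%*	10 both
Mem0	80%	7 v2 · 1 both · 2 miss
Graphiti	40%	2 v2 · 2 both · 2 stale · 4 miss

* Passes under lenient scoring, but every response also contains the old value.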

The silent failure

What your agent actually receives from Shodh

You've just updated the API write rate limit from 60 to 100 requests per minute. You write the updated fact to your memory system. Later, your agent needs to enforce the limit and queries for the current value.

In Iranti, the write deterministically replaced the old value. The agent gets back 100 rpm. Done.

In Shodh, both facts are stored. The agent gets back both — and nothing in the response signals which one is current. It may apply the wrong limit, log a misleading rate, or produce non-deterministic behavior depending on which value the LLM picks from the context.

This happened on all 10 conflict pairs in the test — not as an edge case, but as the consistent behavior. Shodh accumulates; it does not replace.

Iranti — injected context

entity: api/rate-limits
key: writeRpm
value: 100
confidence: 0.95

→ Agent enforces 100 rpm. Correct.

Shodh — injected context

memory 1: API write rate limit is 60 requests per minute per key.

memory 2: API write rate limit increased to 100 requests per minute per key.

→ Agent sees both. No signal about which is current.
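The difference between the two contexts comes down to write semantics. The following is a minimal sketch of that contrast, not either system's real API: `KeyedStore` and `AppendStore` are hypothetical in-memory stand-ins, and the topic string used for the append store is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KeyedStore:
    """Entity+key addressing: a second write to the same key replaces the first."""
    facts: dict = field(default_factory=dict)

    def write(self, entity: str, key: str, value: int) -> None:
        self.facts[(entity, key)] = value  # overwrite in place

    def recall(self, entity: str, key: str) -> list:
        return [self.facts[(entity, key)]] if (entity, key) in self.facts else []


@dataclass
class AppendStore:
    """Accumulative memory: every write adds a record; recall returns all matches."""
    records: list = field(default_factory=list)

    def write(self, topic: str, value: int) -> None:
        self.records.append((topic, value))  # never replaces earlier records

    def recall(self, topic: str) -> list:
        return [value for stored_topic, value in self.records if stored_topic == topic]


keyed = KeyedStore()
keyed.write("api/rate-limits", "writeRpm", 60)
keyed.write("api/rate-limits", "writeRpm", 100)

appended = AppendStore()
appended.write("api write rate limit", 60)   # hypothetical topic label
appended.write("api write rate limit", 100)

print(keyed.recall("api/rate-limits", "writeRpm"))  # [100]
print(appended.recall("api write rate limit"))      # [60, 100]
```

The keyed store has nowhere to keep the old value, so the caller never sees it. The append store returns both values in write order, but nothing in the returned list says that order means anything.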

Test design

10 fact pairs, each covering a real-world configuration update: budget approvals, rate limit changes, timeout extensions, capacity scaling, compliance-driven policy changes. The values are structurally simple — one numeric field changes — so there is no ambiguity about what the correct answer is.

Write sequence: v1 is written first, v2 is written second, same namespace. For Graphiti, v1 is timestamped one hour before v2 to give temporal ordering context. All namespaces are isolated per conflict pair — same isolation as C1.
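The write sequence can be sketched as a small plan builder. This is an illustration under stated assumptions, not the benchmark's actual harness: `ConflictPair`, `write_plan`, and the `c3-<id>` namespace scheme are hypothetical names. It encodes only what is described above: v1 before v2, one shared namespace within a pair, a separate namespace per pair, and a one-hour gap between timestamps (which only Graphiti consumes).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ConflictPair:
    pair_id: str
    v1_text: str
    v2_text: str


def write_plan(pair: ConflictPair, now: datetime) -> list:
    """Return (namespace, text, timestamp) tuples in the order they are written."""
    namespace = f"c3-{pair.pair_id.lower()}"  # hypothetical per-pair namespace
    return [
        (namespace, pair.v1_text, now - timedelta(hours=1)),  # v1 written first
        (namespace, pair.v2_text, now),                       # v2 written second
    ]


pair = ConflictPair(
    "C02",
    "API write rate limit is 60 requests per minute per key.",
    "API write rate limit increased to 100 requests per minute per key.",
)
plan = write_plan(pair, datetime(2025, 1, 1, tzinfo=timezone.utc))  # arbitrary fixed time
for namespace, text, written_at in plan:
    print(namespace, written_at.isoformat(), text)
```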

Scoring is lenient: any response containing v2 counts as a pass. This is why Shodh scores 100% — the correct value is present. The "both" verdict is the footnote that makes the score misleading.

Verdict definitions

v2 ✓

Response contains the updated value only. Clean replacement — no ambiguity for the caller.

both

Response contains both old and new values. Passes the benchmark. Fails in production — caller must guess which is authoritative.

stale

Response contains only the outdated value. Fails the benchmark and returns wrong information.

miss

Response contains neither value. System failed to retrieve any relevant context.
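The four verdicts and the lenient pass rule can be sketched as follows. This is an assumed reconstruction from the definitions above, not the benchmark's scoring code: `mentions`, `classify`, `lenient_score`, and `strict_score` are hypothetical names, values are matched as whole numbers, and the strict variant is included only to show how much the score depends on whether "both" counts as a pass.

```python
import re


def mentions(response: str, value: str) -> bool:
    """True if `value` appears as a whole number, not inside a longer one."""
    return re.search(rf"(?<!\d){re.escape(value)}(?!\d)", response) is not None


def classify(response: str, v1: str, v2: str) -> str:
    """Map a response to one of the four verdicts: v2, both, stale, miss."""
    has_old, has_new = mentions(response, v1), mentions(response, v2)
    if has_new and has_old:
        return "both"
    if has_new:
        return "v2"
    if has_old:
        return "stale"
    return "miss"


def lenient_score(verdicts: list) -> float:
    """Benchmark rule: any response containing v2 passes."""
    return sum(v in {"v2", "both"} for v in verdicts) / len(verdicts)


def strict_score(verdicts: list) -> float:
    """Hypothetical stricter rule: only a clean v2-only response passes."""
    return sum(v == "v2" for v in verdicts) / len(verdicts)


shodh_response = (
    "API write rate limit is 60 requests per minute per key. "
    "API write rate limit increased to 100 requests per minute per key."
)
print(classify(shodh_response, "60", "100"))  # both

shodh = ["both"] * 10
iranti = ["v2"] * 9 + ["both"]
print(lenient_score(shodh), strict_score(shodh))    # 1.0 0.0
print(lenient_score(iranti), strict_score(iranti))  # 1.0 0.9
```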

Per-conflict results

v1 → v2 for each pair. Correct answer is always v2. Note Shodh's column: 10/10 "both" is not the same as 10/10 "v2 only."

ID	Change	v1 → v2	Iranti	Shodh	Mem0	Graphiti
C01	Project budget	$50,000→$75,000	v2 ✓	both	v2 ✓	both
C02	API write rate limit	60 rpm→100 rpm	v2 ✓	both	v2 ✓	v2 ✓
C03	Max file upload size	10 MB→25 MB	v2 ✓	both	v2 ✓	miss
C04	Redis cache TTL	900s→1800s	v2 ✓	both	v2 ✓	stale
C05	JWT token expiry	3600s→7200s	v2 ✓	both	v2 ✓	miss
C06	Background workers	4 procs→8 procs	v2 ✓	both	miss	miss
C07	Log rotation	7 days→14 days	v2 ✓	both	miss	stale
C08	PostgreSQL max connections	20→50	v2 ✓	both	v2 ✓	miss
C09	Webhook max retries	3→5	v2 ✓	both	v2 ✓	miss
C10	Webhook timeout	15000ms→30000ms	both	both	v2 ✓	v2 ✓

Why each system behaves this way

Iranti: 9 v2-only · 1 both

Iranti uses entity+key addressing. Writing v2 to the same entity and key as v1 deterministically overwrites the stored value at the storage level — there is no accumulation by design. 9 of 10 pairs return v2-only.

The one "both" (C10: webhook timeout) is the known B5 regression: conservative LLM arbitration on a close-confidence update treated v2 as a challenger rather than a replacement, accumulating instead of overwriting. Direct writes (same entity+key, same source) are unaffected — this is an LLM arbitration edge case only.

Shodh: 10 both · 0 v2-only

Shodh is an accumulative memory system. It does not replace facts; it appends them. A second write about the same fact, even one carrying an updated value, creates a second memory record alongside the first. Recall returns all matching records, regardless of recency.

This behavior is consistent and predictable — it is not a bug. But it means the caller is responsible for disambiguation. In an LLM-driven pipeline with no post-processing, the agent receives contradictory values and must choose, with no signal about which was written more recently or which is authoritative.
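One way a caller could take on that responsibility is to refuse to act when the recalled memories disagree. The guard below is a minimal sketch under stated assumptions, not a Shodh feature: `single_value` and `AmbiguousContext` are hypothetical names, and it naively treats every number in the recalled text as a candidate value, which only works for single-number facts like the ones in this test.

```python
import re


class AmbiguousContext(Exception):
    """Raised when recalled memories do not agree on exactly one numeric value."""


def single_value(memories: list) -> int:
    """Return the one number all memories agree on, or raise if they conflict."""
    values = {int(n) for text in memories for n in re.findall(r"\d+", text)}
    if len(values) != 1:
        raise AmbiguousContext(f"expected one value, found {sorted(values)}")
    return values.pop()


shodh_context = [
    "API write rate limit is 60 requests per minute per key.",
    "API write rate limit increased to 100 requests per minute per key.",
]

try:
    limit = single_value(shodh_context)
except AmbiguousContext as err:
    print(err)  # expected one value, found [60, 100]
```

The guard detects the contradiction but cannot resolve it: without a recency or authority signal in the response, failing loudly is the most it can do.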

Mem0: 7 v2-only · 1 both · 2 miss

Mem0 uses semantic deduplication on write. When v2 is semantically similar enough to v1, Mem0 updates the existing record rather than creating a new one — producing clean v2-only returns. This works correctly on 7 of 10 pairs.

The 2 misses (workers, log rotation) returned neither value — the semantic similarity between v1 and v2 was too low to trigger deduplication, but recall also failed to surface either record for those queries. These are retrieval gaps, not conflict handling failures.
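The update-or-append decision described above can be illustrated with a toy model. This is not Mem0's API or algorithm: `DedupStore` is hypothetical, token overlap (Jaccard) stands in for semantic similarity, the 0.5 threshold is arbitrary, and the wording of the background-worker pair is invented for the example.

```python
import re


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word tokens, a crude stand-in for semantic similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


class DedupStore:
    def __init__(self, threshold: float = 0.5):  # threshold is an assumption
        self.threshold = threshold
        self.records = []

    def write(self, text: str) -> str:
        for i, existing in enumerate(self.records):
            if similarity(existing, text) >= self.threshold:
                self.records[i] = text  # similar enough: update in place
                return "updated"
        self.records.append(text)       # too different: stored as a new record
        return "added"


rate_limits = DedupStore()
print(rate_limits.write("API write rate limit is 60 requests per minute per key."))
print(rate_limits.write("API write rate limit increased to 100 requests per minute per key."))
# added, then updated: one record remains, holding the v2 text

workers = DedupStore()
print(workers.write("Run 4 background worker processes."))  # invented phrasing
print(workers.write("Scaled the job pool to 8 procs."))     # invented phrasing
# added, then added: the rewording shares no tokens, so both records remain
```

In this toy model, whether a conflict is resolved depends on how the update is phrased rather than on which fact it refers to.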

Graphiti: 2 v2 · 2 both · 2 stale · 4 miss

Graphiti was given the best possible setup: v1 timestamped at t−1h, v2 at t−0, so temporal ordering was explicit. Despite this, 2 pairs returned the stale value and 4 returned nothing.

The root cause is entity extraction: when Graphiti's LLM extracts edge facts from v2, numeric values are often rephrased or dropped. If the v2 edge fact no longer contains the updated number, the temporal ordering is irrelevant — the answer was lost at ingestion. See C1 for the full extraction analysis.

Verdict distribution

Shodh's distribution is entirely "both" (10/10). That is a different result from Iranti's 9/10 v2-only.

System	v2 only	both values	stale	miss
Iranti	9/10	1/10	0/10	0/10
Shodh	0/10	10/10	0/10	0/10
Mem0	7/10	1/10	0/10	2/10
Graphiti	2/10	2/10	2/10	4/10

Key findings

Iranti uses entity+key addressing — v2 write deterministically replaces v1 at the same key. 9/10 clean v2-only returns.

Shodh scores 100% technically but returns BOTH old and new values on every query — the caller must disambiguate.

Mem0 misses 2 conflicts entirely ("miss" verdict): semantic recall surfaces neither v1 nor v2 on those queries.

Graphiti shows 2 stale returns (returns old v1 value) and 4 total misses — temporal ordering only partially helps.
