Every claim has a script behind it.
Including where we fall short.
Iranti runs a public benchmark suite across recall accuracy, injection efficiency, conflict resolution, and cross-session persistence — against Shodh, Mem0, and Graphiti on the same corpus. Results include honest weakness disclosure.
Four systems. Four dimensions. Same corpus.
20 config-heavy facts, 40 recall questions. Identical input written to every system. Scored deterministically — no LLM judge.
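The "no LLM judge" claim can be sketched as a plain deterministic scorer: a recall question passes only if the expected value appears verbatim in what the system returned. This is a minimal illustration under assumed fact shapes, not the suite's actual scoring code.

```python
# Minimal sketch of a deterministic recall scorer (no LLM judge).
# A trial is a (expected_value, retrieved_context) pair; it passes
# only when the expected value appears verbatim in the context.
def score_recall(trials):
    """Return recall accuracy as a percentage over all trials."""
    hits = sum(1 for expected, context in trials if expected in context)
    return 100.0 * hits / len(trials)

trials = [
    ("3600", "jwt_expiry_seconds: 3600"),            # exact value surfaced: hit
    ("3600", "JWT expiry is issued by myapp.prod"),  # value lost in rephrasing: miss
]
print(score_recall(trials))  # 50.0
```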
| Benchmark | Iranti | Shodh | Mem0 | Graphiti |
|---|---|---|---|---|
| C1: Recall accuracy | 100% | 100% | 80% | 57% |
| C2: Pool efficiency | 5.0 | 1.39 | 4.44 | 1.22 |
| C3: Conflict resolution | 100% | 100%* | 80% | 40% |
| C4: Cross-session | 100% | 100% | 75% | 57% |
Shodh scores 100% on conflict resolution — but never replaces old values. Every query returns both v1 and v2, leaving the caller to disambiguate. This is accumulation, not resolution. See C3 for full breakdown →
Cognee excluded — Python 3.14 incompatible (requires <3.14). Will be re-evaluated when a compatible release is available.
- **C1: Recall accuracy.** Isolated namespace — 1 fact per query scope.
- **C2: Pool efficiency.** All 20 facts in one namespace — find the needle. `efficiency = accuracy% / avg_tok_per_query`
- **C3: Conflict resolution.** Write v1 then v2 — correct answer is v2 (latest).
- **C4: Cross-session persistence.** Write → kill process → recall in fresh subprocess.
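The pool-efficiency formula can be checked directly against the headline table, using the per-system figures quoted on this page (Iranti: 100% accuracy at ~20 tokens per query; Shodh: 92% at ~66):

```python
# efficiency = accuracy% / avg_tok_per_query, applied to the accuracy
# and token figures reported on this page.
def pool_efficiency(accuracy_pct, avg_tokens_per_query):
    return accuracy_pct / avg_tokens_per_query

print(round(pool_efficiency(100, 20), 2))  # 5.0  (Iranti)
print(round(pool_efficiency(92, 66), 2))   # 1.39 (Shodh)
```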
Core capability coverage.
13 benchmarks covering retrieval, persistence, conflict, discovery, relationships, recovery, and continuity. Each page has full methodology and raw trial data.
B1: No accuracy gap vs. long-context reading at 2,000 entities (~107k tokens). Structured retrieval at a fraction of the token cost.
B2: Facts written by one agent are retrieved by a completely independent process with a different identity. Provenance preserved.
B3: 3/3: deterministic resolution, close-gap escalation, and equal-confidence contradictory escalation all pass. High-confidence challengers win cleanly; ambiguous conflicts escalate to human review.
B4: Oracle lookups, multi-hop entity chains, and vector-backed search all pass. Foundation for structured KB reasoning.
B5: Direct write path works. LLM arbitration on ambiguous updates is a regression in v0.3.2: conservative scoring silently rejects same-source updates that previously resolved. Only large confidence gaps trigger updates.
B6: Write-then-query is solid: 6/6 writes, provenance intact, zero contamination. The bulk ingest endpoint regressed in v0.3.2, crashing or extracting nothing. Direct write path is the reliable surface.
B7: 9/9 episodic recall tasks pass on v0.3.2, plus partial temporal ordering. A substantial improvement over prior bounded findings: episodic memory via structured KB is a viable pattern.
B8: 6/6 coordination tests pass. Zero missed cross-agent writes. Shared KB as a coordination layer holds up.
B9: 9/9: relationship writes, one-hop traversal, and deep graph traversal all pass cleanly.
B10: Source and confidence are visible on all reads. Agent/writer identity attribution and whoKnows are MCP-only, not exposed on the REST API. Core lineage works; full attribution is bounded.
B11: 5/5 full recovery with explicit hints. 3/5 partial recovery. Cold-start without hints: 0/5 — bounded.
B12: 8/8 full session recovery. 5/8 partial session context. Recovery quality scales with available prior state.
B13: 4/5 facts preserved across versions, 3/3 post-upgrade writes, conflict state intact, API surface stable.
Honest disclosure.
Every system in this suite has real weaknesses. We include Iranti's own as prominently as the competitors'. The benchmark data for each finding is linked.
B11: 5/5 full recovery with explicit entity hints. 0/5 without hints. Autonomous cold-start recovery is a known open problem — the system does not know what it does not know without a starting anchor.
B5: Conservative LLM scoring silently rejects same-source updates that previously resolved. Only large confidence gaps trigger overwrites. Direct writes are unaffected — this is a bounded LLM path regression.
B6: The bulk /ingest endpoint crashes or extracts nothing. The direct write path (iranti_write per fact) is solid and is the recommended surface. The endpoint regression is documented and tracked.
C3: Shodh scores 100% on conflict resolution — but only because it returns both v1 and v2 on every query. It never replaces a fact; it accumulates them. The caller receives mixed context and must disambiguate.
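When a system returns both v1 and v2, the disambiguation burden lands on the caller. A hypothetical caller-side workaround, assuming each returned entry carries a `value` and an ISO-8601 `written_at` timestamp (a record shape invented here for illustration):

```python
# Hypothetical caller-side disambiguation: when a memory system returns
# every version of a fact, keep only the entry with the newest timestamp.
from datetime import datetime

def latest_value(entries):
    """entries: list of dicts with 'value' and ISO-8601 'written_at' keys."""
    newest = max(entries, key=lambda e: datetime.fromisoformat(e["written_at"]))
    return newest["value"]

mixed = [
    {"value": "timeout=30", "written_at": "2025-01-01T10:00:00"},  # stale v1
    {"value": "timeout=60", "written_at": "2025-01-01T11:00:00"},  # latest v2
]
print(latest_value(mixed))  # timeout=60
```

This only works when timestamps are exposed and trustworthy, which is exactly the extra contract the accumulation approach forces on every caller.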
C2: When all 20 facts share one namespace, Shodh returns full memory text per query — averaging 66 tokens versus Iranti's 20. Accuracy holds at 92% but the injection volume collapses efficiency from competitive to 1.39.
C1: Mem0 misses F03, F11, F14, and F17 across both HIGH and LOW risk tiers. The failure pattern: structured config facts with specific numeric values, where vector similarity surfaces semantically related context but not the exact value.
C4: Mem0 scores 80% same-session (C1) but 75% in a fresh subprocess. The 5-point drop suggests minor Chroma read variance or initialization inconsistency across processes — not a fundamental persistence failure.
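The C4 protocol itself (write, kill the process, recall in a fresh subprocess) can be sketched with a file-backed store standing in for whichever system is under test; the fresh interpreter guarantees no in-memory state can leak between sessions:

```python
# Sketch of the C4 cross-session protocol: write in one process, then
# recall from a brand-new subprocess. The JSON file stands in for the
# persistent store of the system under test; paths and keys are illustrative.
import json, os, subprocess, sys, tempfile

store = os.path.join(tempfile.mkdtemp(), "kb.json")

# Session 1: write the fact, then let this "session" end.
with open(store, "w") as f:
    json.dump({"jwt_expiry_seconds": 3600}, f)

# Session 2: a fresh interpreter with zero shared memory recalls it.
out = subprocess.run(
    [sys.executable, "-c",
     f"import json; print(json.load(open({store!r}))['jwt_expiry_seconds'])"],
    capture_output=True, text=True,
)
print(out.stdout.strip())  # 3600
```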
C1/C4: Graphiti's LLM entity extraction rephrases facts into semantic relationships. "JWT expiry is 3600 seconds" becomes "JWT token expiry is issued by myapp.prod" — the number is lost. 57% recall on config facts.
C3: Graphiti uses temporal ordering (v1 at t-1h, v2 at t-0) but still returns stale v1 on 2/10 pairs and misses 4/10 entirely. The graph's temporal awareness does not fully surface the latest value in LLM-rephrased edge facts.
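The number-loss failure mode described above suggests a simple check: after a system stores a rephrased version of a fact, every numeric value from the original must still be recoverable. A sketch of that check, illustrative only and not the suite's or Graphiti's code:

```python
# Value-preservation check: rephrasing is acceptable only if the specific
# numeric values from the original fact survive in the stored text.
import re

def values_survive(original_fact, stored_text):
    numbers = re.findall(r"\d+", original_fact)
    return all(n in stored_text for n in numbers)

# The rephrasing example quoted above drops the number entirely.
print(values_survive("JWT expiry is 3600 seconds",
                     "JWT token expiry is issued by myapp.prod"))  # False
print(values_survive("JWT expiry is 3600 seconds",
                     "jwt_expiry_seconds: 3600"))                  # True
```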
Want the product story behind the benchmarks?
The product page translates these results into the buyer-facing case: structured injection, exact retrieval, deterministic conflict resolution, and operator-visible behavior.