Benchmarks

Every claim has a script behind it.
Including where we fall short.

Iranti runs a public benchmark suite across recall accuracy, injection efficiency, conflict resolution, and cross-session persistence — against Shodh, Mem0, and Graphiti on the same corpus. Results include honest weakness disclosure.

Competitive Suite — vs. the field

Four systems. Four dimensions. Same corpus.

20 config-heavy facts, 40 recall questions. Identical input written to every system. Scored deterministically — no LLM judge.
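The no-LLM-judge claim is easy to make concrete. A minimal sketch of a deterministic scorer (function names and the normalization rule are ours, not the suite's actual code): a recall answer passes only if the expected value appears verbatim, after whitespace/case normalization, in the returned context.

```python
# Illustrative deterministic scorer: same inputs always produce the same score.
# No LLM judge anywhere in the loop.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise can't flip a score."""
    return " ".join(text.lower().split())

def score_recall(expected: str, returned: str) -> bool:
    """Pass iff the normalized expected value appears in the returned context."""
    return normalize(expected) in normalize(returned)

# A config-heavy fact with a specific numeric value (value from the C1/C4 notes):
assert score_recall("3600 seconds", "JWT expiry is 3600   seconds")
# A semantically related rephrasing that drops the number should fail:
assert not score_recall("3600 seconds", "JWT token expiry is issued by myapp.prod")
```

Substring-after-normalization is the simplest such rule; the point is that pass/fail is a pure function of the returned text.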

Benchmark                 Iranti   Shodh    Mem0   Graphiti
C1  Recall accuracy       100%     100%     80%    57%
C2  Pool efficiency       5.0      1.39     4.44   1.22
C3  Conflict resolution   100%     100%*    80%    40%
C4  Cross-session         100%     100%     75%    57%
* Shodh scores 100% on conflict resolution — but never replaces old values. Every query returns both v1 and v2, leaving the caller to disambiguate. This is accumulation, not resolution. See C3 for full breakdown →

Cognee excluded — incompatible with Python 3.14 (requires <3.14). It will be re-evaluated when a compatible release is available.

C1: Recall accuracy

Isolated namespace — 1 fact per query scope

Full methodology →
Iranti: 100%
Shodh: 100%
Mem0: 80%
Graphiti: 57%
C2: Pool efficiency

All 20 facts in one namespace — find the needle

Full methodology →
Iranti: 5.0
Shodh: 1.39
Mem0: 4.44
Graphiti: 1.22

efficiency = accuracy% / avg_tok_per_query
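Worked through with the numbers reported in the weakness disclosures below (~20 tokens injected per query for Iranti at 100% accuracy, ~66 for Shodh at 92%), the formula reproduces the chart:

```python
def pool_efficiency(accuracy_pct: float, avg_tokens_per_query: float) -> float:
    """C2 score: accuracy percentage divided by average injected tokens per query."""
    return accuracy_pct / avg_tokens_per_query

# Iranti: 100% accuracy at ~20 tokens/query -> 5.0
# Shodh:   92% accuracy at ~66 tokens/query -> 1.39
iranti = round(pool_efficiency(100, 20), 2)  # 5.0
shodh = round(pool_efficiency(92, 66), 2)    # 1.39
```

High accuracy at low injection volume wins; returning the full memory text per query drags the score down even when accuracy holds.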

C3: Conflict resolution

Write v1 then v2 — correct answer is v2 (latest)

Full methodology →
Iranti: 100%
Shodh: 100% (returns both)
Mem0: 80%
Graphiti: 40%
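The v2-only versus both-values distinction can also be checked deterministically. A sketch (classifier and sample values are illustrative, not the suite's code):

```python
def score_conflict(returned: str, v1: str, v2: str) -> str:
    """Classify a conflict-resolution answer; the correct outcome is v2 only."""
    has_v1, has_v2 = v1 in returned, v2 in returned
    if has_v2 and not has_v1:
        return "pass"   # latest value surfaced, old value gone
    if has_v2 and has_v1:
        return "both"   # accumulation, not resolution (the Shodh pattern)
    return "fail"       # stale v1, or nothing at all

# Hypothetical fact pair: v1 = "timeout=30", then v2 = "timeout=60".
assert score_conflict("timeout=60", "timeout=30", "timeout=60") == "pass"
assert score_conflict("timeout=30 timeout=60", "timeout=30", "timeout=60") == "both"
assert score_conflict("timeout=30", "timeout=30", "timeout=60") == "fail"
```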
C4: Cross-session persistence

Write → kill process → recall in fresh subprocess

Full methodology →
Iranti: 100%
Shodh: 100%
Mem0: 75%
Graphiti: 57%
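The shape of this test is worth sketching. The version below substitutes a JSON file for the real system's durable store; the point is the harness (write, spawn a fresh interpreter, recall), not the storage:

```python
# Illustrative C4 harness: recall happens in a brand-new interpreter, so no
# in-memory state can leak across; only durable storage can answer.
import json
import subprocess
import sys
from pathlib import Path

STORE = Path("store.json")  # stand-in for the system under test

def write_fact(key: str, value: str) -> None:
    """Write one fact to the durable store (parent process)."""
    data = json.loads(STORE.read_text()) if STORE.exists() else {}
    data[key] = value
    STORE.write_text(json.dumps(data))

def recall_in_fresh_process(key: str) -> str:
    """Recall the fact from a fresh subprocess with no shared state."""
    child = (
        "import json, sys; "
        f"print(json.load(open({str(STORE)!r}))[sys.argv[1]])"
    )
    out = subprocess.run([sys.executable, "-c", child, key],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

write_fact("jwt_expiry", "3600 seconds")
assert recall_in_fresh_process("jwt_expiry") == "3600 seconds"
```

Any system that passes only because the writer process is still alive fails this harness by construction.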
Internal benchmarks — B1 through B13

Core capability coverage.

13 benchmarks covering retrieval, persistence, conflict, discovery, relationships, recovery, and continuity. Each page has full methodology and raw trial data.

B1 (PASS): Entity retrieval at scale

Null accuracy gap vs. long-context reading at 2,000 entities, ~107k tokens. Structured retrieval at a fraction of the token cost.

Read →
B2 (PASS): Cross-process persistence

Facts written by one agent retrieved by a completely independent process with a different identity. Provenance preserved.

Read →
B3 (PASS): Conflict resolution

3/3: deterministic resolution, close-gap escalation, and equal-confidence contradictory escalation all pass. High-confidence challengers win cleanly; ambiguous conflicts escalate to human review.

Read →
B4 (PASS): Multi-hop discovery

Oracle lookups, multi-hop entity chains, and vector-backed search all pass. Foundation for structured KB reasoning.

Read →
B5 (PARTIAL): Knowledge update

Direct write path works. LLM arbitration on ambiguous updates is a regression in v0.3.2 — conservative scoring silently rejects same-source updates that previously resolved. Only large confidence gaps trigger updates.

Read →
B6 (PARTIAL): Ingest pipeline

Write-then-query is solid: 6/6 writes, provenance intact, zero contamination. Bulk ingest endpoint regressed in v0.3.2 — crashes or extracts nothing. Direct write path is the reliable surface.

Read →
B7 (PASS): Episodic memory

9/9 episodic recall tasks pass on v0.3.2, plus partial temporal ordering. Substantial improvement over prior bounded findings — episodic memory via structured KB is a viable pattern.

Read →
B8 (PASS): Agent coordination

6/6 coordination tests pass. Zero missed cross-agent writes. Shared KB as coordination layer holds up.

Read →
B9 (PASS): Relationship traversal

9/9: relationship writes, one-hop traversal, and deep graph traversal all pass cleanly.

Read →
B10 (PARTIAL): Knowledge provenance

Source and confidence visible on all reads. Agent/writer identity attribution and whoKnows are MCP-only — not exposed on the REST API. Core lineage works; full attribution is bounded.

Read →
B11 (PARTIAL): Context recovery

5/5 full recovery with explicit hints. 3/5 partial recovery. Cold-start without hints: 0/5 — bounded.

Read →
B12 (PARTIAL): Session recovery

8/8 full session recovery. 5/8 partial session context. Recovery quality scales with available prior state.

Read →
B13 (PASS): Upgrade continuity

4/5 facts preserved across versions, 3/3 post-upgrade writes, conflict state intact, API surface stable.

Read →
Where things fall short

Honest disclosure.

Every system in this suite has real weaknesses. We include Iranti's own as prominently as the competitors'. The benchmark data for each finding is linked.

Automatic context recovery without hints fails cold

B11: 5/5 full recovery with explicit entity hints. 0/5 without hints. Autonomous cold-start recovery is a known open problem — the system does not know what it does not know without a starting anchor.

LLM-arbitrated updates regressed in v0.3.2

B5: Conservative LLM scoring silently rejects same-source updates that previously resolved. Only large confidence gaps trigger overwrites. Direct writes are unaffected — this is a bounded LLM path regression.

Bulk ingest endpoint broken in v0.3.2

B6: The bulk /ingest endpoint crashes or extracts nothing. The direct write path (iranti_write per fact) is solid and is the recommended surface. The endpoint regression is documented and tracked.

Conflict resolution returns both old and new values

C3: Shodh scores 100% on conflict resolution — but only because it returns both v1 and v2 on every query. It never replaces a fact; it accumulates them. The caller receives mixed context and must disambiguate.

Token bloat in shared-pool retrieval (66 tok/query)

C2: When all 20 facts share one namespace, Shodh returns the full memory text on every query, averaging 66 tokens versus Iranti's 20. Accuracy holds at 92%, but the injection volume drops its efficiency score to 1.39.

Semantic search misses 20% of high-risk config facts

C1: Mem0 misses F03, F11, F14, and F17 across the HIGH and LOW risk tiers. The failure pattern: structured config facts with specific numeric values, where vector similarity surfaces semantically related context but not the exact fact.

Cross-session consistency drops 5 points vs same-session

C4: Mem0 scores 80% same-session (C1) but 75% in a fresh subprocess. The 5-point drop suggests minor Chroma read variance or initialization inconsistency across processes — not a fundamental persistence failure.

Entity extraction loses numeric values in config-heavy facts

C1/C4: Graphiti's LLM entity extraction rephrases facts into semantic relationships. "JWT expiry is 3600 seconds" becomes "JWT token expiry is issued by myapp.prod" — the number is lost. 57% recall on config facts.

Conflict resolution returns stale values

C3: Graphiti uses temporal ordering (v1 at t-1h, v2 at t-0) but still returns stale v1 on 2/10 pairs and misses 4/10 entirely. The graph's temporal awareness does not fully surface the latest value in LLM-rephrased edge facts.

Want the product story behind the benchmarks?

The product page translates these results into the buyer-facing case: structured injection, exact retrieval, deterministic conflict resolution, and operator-visible behavior.