Every claim has a script behind it.
Including where we fall short.
Iranti runs a public benchmark suite across recall accuracy, injection efficiency, conflict resolution, and cross-session persistence — against Shodh, Mem0, and Graphiti on the same corpus. Results include honest weakness disclosure.
Four systems. Four dimensions. Same corpus.
20 config-heavy facts, 40 recall questions. Identical input written to every system. Scored deterministically — no LLM judge.
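The "no LLM judge" claim can be sketched as a plain deterministic scorer: a recall question passes only if the expected value appears verbatim in what the system returned. This is a minimal illustration under assumed fact shapes, not the suite's actual scoring code.

```python
# Minimal sketch of a deterministic recall scorer (no LLM judge).
# A trial is a (expected_value, retrieved_context) pair; it passes
# only when the expected value appears verbatim in the context.
def score_recall(trials):
    """Return recall accuracy as a percentage over all trials."""
    hits = sum(1 for expected, context in trials if expected in context)
    return 100.0 * hits / len(trials)

trials = [
    ("3600", "jwt_expiry_seconds: 3600"),            # exact value surfaced: hit
    ("3600", "JWT expiry is issued by myapp.prod"),  # value lost in rephrasing: miss
]
print(score_recall(trials))  # 50.0
```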
| Benchmark | Iranti | Shodh | Mem0 | Graphiti |
|---|---|---|---|---|
| C1: Recall accuracy | 100% | 100% | 80% | 57% |
| C2: Pool efficiency | 5.0 | 1.39 | 4.44 | 1.22 |
| C3: Conflict resolution | 100% | 100%* | 80% | 40% |
| C4: Cross-session | 100% | 100% | 75% | 57% |
Shodh scores 100% on conflict resolution — but never replaces old values. Every query returns both v1 and v2, leaving the caller to disambiguate. This is accumulation, not resolution. See C3 for full breakdown →
Cognee excluded — Python 3.14 incompatible (requires <3.14). Will be re-evaluated when a compatible release is available.
- **C1: Recall accuracy.** Isolated namespace — 1 fact per query scope.
- **C2: Pool efficiency.** All 20 facts in one namespace — find the needle. `efficiency = accuracy% / avg_tok_per_query`
- **C3: Conflict resolution.** Write v1 then v2 — correct answer is v2 (latest).
- **C4: Cross-session persistence.** Write → kill process → recall in fresh subprocess.
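The pool-efficiency formula can be checked directly against the headline table, using the per-system figures quoted on this page (Iranti: 100% accuracy at ~20 tokens per query; Shodh: 92% at ~66):

```python
# efficiency = accuracy% / avg_tok_per_query, applied to the accuracy
# and token figures reported on this page.
def pool_efficiency(accuracy_pct, avg_tokens_per_query):
    return accuracy_pct / avg_tokens_per_query

print(round(pool_efficiency(100, 20), 2))  # 5.0  (Iranti)
print(round(pool_efficiency(92, 66), 2))   # 1.39 (Shodh)
```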
Core capability coverage.
13 benchmarks covering retrieval, persistence, conflict, discovery, relationships, recovery, and continuity. Each page has full methodology and raw trial data.
B1: No accuracy gap vs. long-context reading at 2,000 entities (~107k tokens). Structured retrieval at a fraction of the token cost.
B2: Facts written by one agent are retrieved by a completely independent process with a different identity. Provenance preserved.
B3: 3/3: deterministic resolution, close-gap escalation, and equal-confidence contradictory escalation all pass. High-confidence challengers win cleanly; ambiguous conflicts escalate to human review.
B4: Oracle lookups, multi-hop entity chains, and vector-backed search all pass. Foundation for structured KB reasoning.
B5: Direct write path works. LLM arbitration on ambiguous updates is a regression in v0.3.2: conservative scoring silently rejects same-source updates that previously resolved. Only large confidence gaps trigger updates.
B6: Write-then-query is solid: 6/6 writes, provenance intact, zero contamination. The bulk ingest endpoint regressed in v0.3.2, crashing or extracting nothing. Direct write path is the reliable surface.
B7: 9/9 episodic recall tasks pass on v0.3.2, plus partial temporal ordering. A substantial improvement over prior bounded findings: episodic memory via structured KB is a viable pattern.
B8: 6/6 coordination tests pass. Zero missed cross-agent writes. Shared KB as a coordination layer holds up.
B9: 9/9: relationship writes, one-hop traversal, and deep graph traversal all pass cleanly.
B10: Source and confidence are visible on all reads. Agent/writer identity attribution and whoKnows are MCP-only, not exposed on the REST API. Core lineage works; full attribution is bounded.
B11: 5/5 full recovery with explicit hints. 3/5 partial recovery. Cold-start without hints: 0/5 — bounded.
B12: 8/8 full session recovery. 5/8 partial session context. Recovery quality scales with available prior state.
B13: 4/5 facts preserved across versions, 3/3 post-upgrade writes, conflict state intact, API surface stable.
Honest disclosure.
Every system in this suite has real weaknesses. We include Iranti's own as prominently as the competitors'. The benchmark data for each finding is linked.
B11: 5/5 full recovery with explicit entity hints. 0/5 without hints. Autonomous cold-start recovery is a known open problem — the system does not know what it does not know without a starting anchor.
B5: Conservative LLM scoring silently rejects same-source updates that previously resolved. Only large confidence gaps trigger overwrites. Direct writes are unaffected — this is a bounded LLM path regression.
B6: The bulk /ingest endpoint crashes or extracts nothing. The direct write path (iranti_write per fact) is solid and is the recommended surface. The endpoint regression is documented and tracked.
C3: Shodh scores 100% on conflict resolution — but only because it returns both v1 and v2 on every query. It never replaces a fact; it accumulates them. The caller receives mixed context and must disambiguate.
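When a system returns both v1 and v2, the disambiguation burden lands on the caller. A hypothetical caller-side workaround, assuming each returned entry carries a `value` and an ISO-8601 `written_at` timestamp (a record shape invented here for illustration):

```python
# Hypothetical caller-side disambiguation: when a memory system returns
# every version of a fact, keep only the entry with the newest timestamp.
from datetime import datetime

def latest_value(entries):
    """entries: list of dicts with 'value' and ISO-8601 'written_at' keys."""
    newest = max(entries, key=lambda e: datetime.fromisoformat(e["written_at"]))
    return newest["value"]

mixed = [
    {"value": "timeout=30", "written_at": "2025-01-01T10:00:00"},  # stale v1
    {"value": "timeout=60", "written_at": "2025-01-01T11:00:00"},  # latest v2
]
print(latest_value(mixed))  # timeout=60
```

This only works when timestamps are exposed and trustworthy, which is exactly the extra contract the accumulation approach forces on every caller.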
C2: When all 20 facts share one namespace, Shodh returns full memory text per query — averaging 66 tokens versus Iranti's 20. Accuracy holds at 92% but the injection volume collapses efficiency from competitive to 1.39.
C1: Mem0 misses F03, F11, F14, and F17 across both HIGH and LOW risk tiers. The failure pattern: structured config facts with specific numeric values, where vector similarity surfaces semantically related context but not the exact value.
C4: Mem0 scores 80% same-session (C1) but 75% in a fresh subprocess. The 5-point drop suggests minor Chroma read variance or initialization inconsistency across processes — not a fundamental persistence failure.
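The C4 protocol itself (write, kill the process, recall in a fresh subprocess) can be sketched with a file-backed store standing in for whichever system is under test; the fresh interpreter guarantees no in-memory state can leak between sessions:

```python
# Sketch of the C4 cross-session protocol: write in one process, then
# recall from a brand-new subprocess. The JSON file stands in for the
# persistent store of the system under test; paths and keys are illustrative.
import json, os, subprocess, sys, tempfile

store = os.path.join(tempfile.mkdtemp(), "kb.json")

# Session 1: write the fact, then let this "session" end.
with open(store, "w") as f:
    json.dump({"jwt_expiry_seconds": 3600}, f)

# Session 2: a fresh interpreter with zero shared memory recalls it.
out = subprocess.run(
    [sys.executable, "-c",
     f"import json; print(json.load(open({store!r}))['jwt_expiry_seconds'])"],
    capture_output=True, text=True,
)
print(out.stdout.strip())  # 3600
```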
C1/C4: Graphiti's LLM entity extraction rephrases facts into semantic relationships. "JWT expiry is 3600 seconds" becomes "JWT token expiry is issued by myapp.prod" — the number is lost. 57% recall on config facts.
C3: Graphiti uses temporal ordering (v1 at t-1h, v2 at t-0) but still returns stale v1 on 2/10 pairs and misses 4/10 entirely. The graph's temporal awareness does not fully surface the latest value in LLM-rephrased edge facts.
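The number-loss failure mode described above suggests a simple check: after a system stores a rephrased version of a fact, every numeric value from the original must still be recoverable. A sketch of that check, illustrative only and not the suite's or Graphiti's code:

```python
# Value-preservation check: rephrasing is acceptable only if the specific
# numeric values from the original fact survive in the stored text.
import re

def values_survive(original_fact, stored_text):
    numbers = re.findall(r"\d+", original_fact)
    return all(n in stored_text for n in numbers)

# The rephrasing example quoted above drops the number entirely.
print(values_survive("JWT expiry is 3600 seconds",
                     "JWT token expiry is issued by myapp.prod"))  # False
print(values_survive("JWT expiry is 3600 seconds",
                     "jwt_expiry_seconds: 3600"))                  # True
```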
Want the product story behind the benchmarks?
The product page translates these results into the buyer-facing case: structured injection, exact retrieval, deterministic conflict resolution, and operator-visible behavior.