Multi-hop Entity Reasoning
Oracle path: 4/4. Search path: 1/4.
B4 tests whether Iranti supports chained entity lookups — hop 1 resolves an entity, hop 2 uses that result to find another. The finding is a clear split: when entity IDs are known upfront, chains work perfectly. When hop 2 requires search-based discovery, it consistently fails.
Results at a glance
What this measures
Multi-hop entity reasoning is the ability to chain lookups: resolve entity A, extract a property from A, use that property to locate entity B, then continue. This pattern appears constantly in real-world agent work — traversing org charts, tracing dependency graphs, following relationship chains across a knowledge base.
B4 constructs two-hop chains and tests them three ways. The baseline arm gives the model a plain text document and lets it read the answer directly — no Iranti calls, just context. The oracle arm provides entity IDs at each hop so Iranti can do exact lookups. The search arm withholds the hop-2 ID and forces the model to discover the target entity by querying on attribute values.
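The difference between the two Iranti arms can be sketched with a toy in-memory KB. The helper names (`get_entity`, `search_entities`) and the entity graph are illustrative assumptions, not Iranti's actual API:

```python
# Toy knowledge base; entity IDs and fields are invented for illustration.
KB = {
    "person:ada": {"id": "person:ada", "institution": "inst:acme", "manager_id": "person:bo"},
    "person:bo": {"id": "person:bo", "institution": "inst:acme", "manager_id": None},
}

def get_entity(entity_id):
    """Oracle path: exact lookup by a known entity ID."""
    return KB[entity_id]

def search_entities(field, value):
    """Search path: discover entities by matching an attribute value."""
    return [e for e in KB.values() if e.get(field) == value]

# Hop 1 is identical in both arms: resolve entity A by its known ID.
hop1 = get_entity("person:ada")

# Oracle arm: the hop-2 ID is supplied, so hop 2 is another exact lookup.
hop2_oracle = get_entity(hop1["manager_id"])

# Search arm: the hop-2 ID is withheld; the target must be discovered by
# querying on an attribute value extracted at hop 1. This is the step that failed.
hop2_search = search_entities("institution", hop1["institution"])
```

Both arms run the same chain logic; only the retrieval primitive at hop 2 changes.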
The gap between oracle (4/4) and search (1/4) isolates the exact failure: not the chaining logic, not the model, not the hop-1 retrieval — just search-based entity discovery at intermediate hops.
The baseline outperforming search-based Iranti is expected at small KB sizes. As KB size grows beyond what fits in context, the ordering is expected to invert. This benchmark does not test that crossover point.
The two-hop chain
Hop 1 uses an exact entity ID — the oracle path (solid teal) works perfectly. Hop 2 requires discovering the next entity by attribute value — the search path (dashed amber) fails consistently. The broken step is search, not the chain logic.
Results across all three arms
Four multi-hop chains tested per arm. The oracle arm matches the baseline. The search arm fails 3 of 4.
| Arm | Score | Per-question breakdown |
|---|---|---|
| Baseline (context-reading) | 4/4 | Model scans a plain text document; all 4 multi-hop entity chains resolved correctly. |
| Iranti oracle path | 4/4 | Entity IDs known upfront; exact lookup chains work perfectly across both hops. |
| Iranti search path | 1/4 | Search-based entity discovery at hop 2 fails; consistently returns oldest KB entries, ignores recent writes. |
| Total (3 arms) | 9/12 | Oracle path accounts for 4 of the 5 correct Iranti answers; the search arm contributes only 1. |
All chains used the same underlying entity graph. The only difference between oracle and search arms is whether the hop-2 entity ID was supplied or had to be discovered through search.
Why did search fail? Three hypotheses
The search arm consistently returned the oldest KB entries and ignored recently written ones. The raw results do not conclusively determine the cause — these are the three most plausible explanations from the observed behavior.
Vector embeddings may not generate immediately for newly written entries. When the benchmark writes an entity and then queries for it seconds later, the embedding index may not yet include it — causing search to fall back to older, already-indexed entries.
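A minimal simulation of this hypothesis, with an explicit indexing pass standing in for asynchronous embedding generation (the class and its behavior are assumptions, not Iranti internals):

```python
class LaggedIndex:
    """Writes land in the store immediately; search only sees indexed entries."""

    def __init__(self):
        self.store = []    # durable writes
        self.indexed = []  # what search can actually see

    def write(self, entry):
        self.store.append(entry)

    def reindex(self):
        # Stand-in for the background embedding pass catching up.
        self.indexed = list(self.store)

    def search(self, keyword):
        return [e for e in self.indexed if keyword in e]

idx = LaggedIndex()
idx.write("old: Ada is affiliated with Acme Labs")
idx.reindex()                            # the old entry gets embedded
idx.write("new: Ada moved to Initech")   # fresh write, not yet embedded

stale = idx.search("Ada")   # only the old, already-indexed entry comes back
idx.reindex()
fresh = idx.search("Ada")   # both entries appear once the index catches up
```

If this is the cause, retrying the query after a delay should recover the new entry.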
Search may index auto-generated summaries rather than structured field values. If the summary does not faithfully reproduce the written value, attribute-value searches will miss the correct entry even when it is present in the store.
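This failure mode can be shown with a deliberately lossy summarizer (the `summarize` function is hypothetical; whether Iranti's summaries actually drop field values is exactly what this hypothesis questions):

```python
def summarize(entity):
    # Lossy by construction: the institution value never reaches the summary.
    return f"{entity['name']} is a researcher."

store = [{"name": "Ada", "institution": "Acme Labs"}]
summaries = [summarize(e) for e in store]

# An attribute-value search over summaries misses an entity that is
# present in the store; a search over the structured field finds it.
summary_hits = [s for s in summaries if "Acme Labs" in s]
field_hits = [e for e in store if e["institution"] == "Acme Labs"]
```

Comparing field-text queries against summary-text queries on the same entries would confirm or rule this out.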
Older KB entries may have accumulated higher confidence scores over repeated reads. When search returns the top-5 by relevance, score-weighted ranking may consistently surface high-confidence old entries over lower-scored new ones, regardless of semantic match quality.
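Under an assumed score-weighted ranking (final score = relevance × confidence; the formula is illustrative, not Iranti's documented one), an aged entry can outrank a better semantic match:

```python
entries = [
    {"text": "old entry", "relevance": 0.55, "confidence": 0.95},  # many past reads
    {"text": "new entry", "relevance": 0.80, "confidence": 0.40},  # freshly written
]

# 0.55 * 0.95 = 0.5225 beats 0.80 * 0.40 = 0.32,
# so the semantically weaker match ranks first.
ranked = sorted(entries, key=lambda e: e["relevance"] * e["confidence"], reverse=True)
```

Any multiplicative or additive blend of the two scores produces the same inversion once the confidence gap is large enough.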
These hypotheses are not mutually exclusive. All three could contribute simultaneously. Isolating the actual cause requires a targeted follow-up benchmark that controls for write timing, summary content, and entry age independently.
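One way to structure that follow-up is a full-factorial grid that varies each suspected factor independently; the factor names and levels below are placeholders for whatever the harness actually controls:

```python
import itertools

factors = {
    "index_delay_s": [0, 60],              # hypothesis 1: give embeddings time to generate
    "query_target": ["field", "summary"],  # hypothesis 2: what the query text matches
    "entry_age": ["fresh", "aged"],        # hypothesis 3: confidence accumulation
}

# 2 x 2 x 2 = 8 trial configurations; a failure pattern that tracks
# exactly one axis implicates the corresponding hypothesis.
trials = [dict(zip(factors, combo)) for combo in itertools.product(*factors.values())]
```

Interactions between axes (e.g. failures only when entries are both fresh and queried via summaries) would indicate that more than one mechanism is at work.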
v0.2.16 Update: Search Restored
The search regression that caused 1/4 in v0.2.12 and a full crash in v0.2.14 has been fixed in v0.2.16. Vector scoring is now active (scores 0.35–0.74), and the three original failing queries — find-by-institution, find-by-prior-employer, find-by-institution-peer — now work correctly. The retest used direct attribute values, which is now the reliable path; semantic paraphrase queries still fail.
| Version | iranti_search | Vector score | Direct attribute | Semantic paraphrase |
|---|---|---|---|---|
| 0.2.12 | Degraded | 0 (disabled) | Partial | Fails |
| 0.2.14 | Crashes | N/A | N/A | N/A |
| 0.2.16 | Operational | 0.35–0.74 | Works ✓ | Fails |
Full trial execution records, per-question scores, entity graphs, and methodology notes are available in the benchmarking repository.