Benchmark Update · April 2026
58.4% end-to-end QA, 93.6% retrieval: five iterations in the open
We ran the full LongMemEval end-to-end QA evaluation five times. Each iteration changed something specific, and we're publishing every number. HA5H retrieves memories, Claude generates an answer, GPT-4o judges correctness. 500 questions. Official evaluation script.
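A minimal sketch of that loop, with hypothetical wrappers (`ha5h_search`, `claude_answer`, `gpt4o_judge`) standing in for the retrieval API, the Anthropic API, and the official LongMemEval judge run against GPT-4o — names are ours for illustration, not the harness's actual interfaces:

```python
from typing import Callable, Iterable, List

def run_eval(
    questions: Iterable[dict],
    ha5h_search: Callable[[str, int], List[str]],     # query, top_k -> memories
    claude_answer: Callable[[str, List[str]], str],   # question, memories -> answer
    gpt4o_judge: Callable[[str, str, str], bool],     # question, gold, answer -> correct?
    top_k: int = 10,
) -> float:
    results = []
    for q in questions:                                  # 500 LongMemEval questions
        memories = ha5h_search(q["question"], top_k)     # retrieval (R@10)
        answer = claude_answer(q["question"], memories)  # generation
        results.append(gpt4o_judge(q["question"], q["answer"], answer))
    return sum(results) / len(results)                   # end-to-end QA accuracy
```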
Retrieval R@10: 93.6% · End-to-end QA (v5): 58.4%
The full progression
| Category (questions) | v1 | v2 | v4 | v5 | v1→v5 (pp) |
|---|---|---|---|---|---|
| Single-session assistant (56) | 80.4% | 82.1% | 94.6% | 83.9% | +3.5 |
| Single-session user (70) | 77.1% | 80.0% | 82.9% | 78.6% | +1.5 |
| Knowledge update (78) | 79.5% | 82.1% | 80.8% | 80.8% | +1.3 |
| Multi-session (133) | 39.1% | 36.1% | 45.1% | 43.6% | +4.5 |
| Temporal reasoning (133) | 28.6% | 41.4% | 34.6% | 42.1% | +13.5 |
| Single-session preference (30) | 30.0% | 43.3% | 26.7% | 43.3% | +13.3 |
What changed at each step
v1 (52%): Naive prompt. "Answer based on these memories. If you don't have enough information, say so." Works for factual lookups (~80%) but fails on everything requiring reasoning. The "say so" escape hatch let Claude dodge 68% of hard questions.
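For reference, the v1 prompt had roughly this shape (paraphrased; the exact production string isn't reproduced here):

```python
# Paraphrase of the v1 prompt; {memories} and {question} are filled per item.
V1_PROMPT = """\
Answer based on these memories. If you don't have enough information, say so.

Memories:
{memories}

Question: {question}"""
```

The second sentence is the escape hatch that v4 later replaces with "You MUST give your best answer."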
v2 (56.4%): Type-aware chain-of-thought. Detect temporal, preference, multi-hop, and knowledge-update questions. Route each to a specialized prompt with step-by-step instructions. Temporal jumped from 28.6% to 41.4%. Preference jumped from 30% to 43.3%. Multi-session dropped 3pp because the structured reasoning made Claude more cautious about partial evidence.
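One way that routing can look, assuming a simple keyword heuristic for type detection — the real classifier and prompt bodies are richer; the strings below are condensed stand-ins:

```python
# Condensed stand-ins for the type-specific chain-of-thought prompts.
PROMPTS = {
    "temporal":         "List the relevant events with their dates, order them, then answer.",
    "preference":       "Collect every stated preference, resolve conflicts, then answer.",
    "multi_hop":        "Gather evidence from each session, combine it, then answer.",
    "knowledge_update": "Find the most recent statement of this fact, then answer.",
    "factual":          "Answer directly from the memories.",
}

def classify(question: str) -> str:
    """Rough keyword-based question-type detection (illustrative only)."""
    q = question.lower()
    if any(w in q for w in ("when", "how long", "before", "after")):
        return "temporal"
    if any(w in q for w in ("prefer", "favorite", "like best", "recommend")):
        return "preference"
    if any(w in q for w in ("how many", "total", "altogether")):
        return "multi_hop"
    if any(w in q for w in ("now", "currently", "still")):
        return "knowledge_update"
    return "factual"
```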
v4 (57.6%): Three structural changes.
- Answer-committed prompts: Removed all "say you don't know" escape hatches. Every prompt type now says "You MUST give your best answer." Single-session assistant went from 82.1% to 94.6%.
- Session-level context: Instead of retrieving isolated memories, we now fetch the full conversation session around each match. This gives Claude the surrounding context needed to answer questions about conversations.
- Two-pass aggregation retrieval: For "how many" and "total" questions, a second retrieval pass finds sessions missed by the first (a sketch follows this list). Multi-session went from 36.1% to 45.1%.
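Here is one plausible shape of that two-pass retrieval, with hypothetical callables standing in for HA5H search, session expansion, and query reformulation; the exact reformulation strategy we use isn't spelled out here:

```python
from typing import Callable, Dict, List

def retrieve_for_aggregation(
    question: str,
    search: Callable[[str, int], List[dict]],             # memory-level search (HA5H)
    expand: Callable[[dict], dict],                        # memory -> its full session
    reformulate: Callable[[str, List[dict]], List[str]],   # derive follow-up queries
    top_k: int = 10,
) -> List[dict]:
    """Two-pass retrieval for "how many" / "total" questions (sketch)."""
    sessions: Dict[str, dict] = {}

    # Pass 1: normal retrieval, expanded to whole sessions.
    for m in search(question, top_k):
        if m["session_id"] not in sessions:
            sessions[m["session_id"]] = expand(m)

    # Pass 2: re-query with follow-ups derived from pass-1 results
    # to catch sessions the first pass missed.
    for follow_up in reformulate(question, list(sessions.values())):
        for m in search(follow_up, top_k):
            if m["session_id"] not in sessions:
                sessions[m["session_id"]] = expand(m)

    return list(sessions.values())
```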
v5 (58.4%): Selective session expansion. v4 applied session expansion to all questions, which diluted signal for temporal and preference categories. v5 only expands sessions for aggregation and multi-hop questions. Temporal and preference questions get focused, memory-level retrieval. Result: temporal recovered from 34.6% to 42.1%, preference recovered from 26.7% to 43.3%, while multi-session held at 43.6%.
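In code, the v5 change is a small routing decision layered on the type classifier; a sketch using the same hypothetical helpers as above:

```python
from typing import Callable, List

# Only aggregation / multi-hop questions (routed to "multi_hop" in the
# classifier sketch above) get full-session context in v5; everything
# else keeps focused, memory-level retrieval.
EXPAND_SESSIONS = {"multi_hop"}

def retrieve_v5(
    question: str,
    qtype: str,
    search: Callable[[str, int], List[dict]],
    expand: Callable[[dict], dict],
    top_k: int = 10,
) -> List[dict]:
    memories = search(question, top_k)
    if qtype not in EXPAND_SESSIONS:
        return memories                      # temporal / preference: stay focused
    sessions = {}
    for m in memories:                       # dedupe by session before expanding
        if m["session_id"] not in sessions:
            sessions[m["session_id"]] = expand(m)
    return list(sessions.values())
```

The design choice is the whole story of v5: expansion is opt-in per question type rather than global, which is what let temporal and preference recover without giving back the multi-session gains.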
The tradeoffs
No single configuration is best for every category. The data shows a clear tension:
- Session expansion + answer-committed prompts (v4): best for factual categories (assistant 94.6%, user 82.9%, multi-session 45.1%)
- Focused retrieval + type-aware prompts (v5): best for reasoning categories (temporal 42.1%, preference 43.3%)
v5 picks the best strategy per question type. The overall score (58.4%) is higher than any single fixed approach, but several categories would be higher still if the type detector always chose the right strategy for each question.
What we learned
68% of failures were excessive caution, not bad retrieval. The single biggest improvement came from removing the "say you don't know" instruction. Claude had the right memories but was refusing to commit to answers.
Context strategy must match question type. Session expansion helps aggregation but hurts temporal precision. The right retrieval strategy depends on what the question is asking for. One-size-fits-all doesn't work.
The retrieval ceiling is real. Full-context GPT-4o (all sessions in the context window, no retrieval) scores 60-64% on LongMemEval. Our 58.4% with retrieval is approaching that ceiling. Going beyond it requires better reasoning, not better retrieval.
Why we publish everything
Because the trajectory matters more than the number. 52% → 56.4% → 57.6% → 58.4% with clear explanations of what worked and what didn't is more useful than a single polished score. You can see what each change actually does, including the regressions we caused and fixed.
93.6% retrieval is real. 58.4% end-to-end QA is our best so far. Every number on this page is reproducible.