Benchmark Update · April 2026

58.4% end-to-end QA, 93.6% retrieval: five iterations in the open

We ran the full LongMemEval end-to-end QA evaluation five times. Each iteration changed something specific, and we're publishing every number. HA5H retrieves memories, Claude generates an answer, GPT-4o judges correctness. 500 questions. Official evaluation script.
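The loop behind every number on this page can be sketched as follows. The injected callables stand in for the HA5H, Claude, and GPT-4o calls; their names and signatures are illustrative, not the actual APIs.

```python
# Three-stage loop: retrieve (HA5H) -> answer (Claude) -> judge (GPT-4o).
# The callables are injected so the loop stays model-agnostic; names
# and signatures here are assumptions for illustration.

def run_eval(questions, retrieve, answer, judge, k=10):
    """Return end-to-end QA accuracy over a list of question dicts."""
    correct = 0
    for q in questions:
        memories = retrieve(q["question"], top_k=k)   # top-k memory snippets
        hypothesis = answer(q["question"], memories)  # generated answer
        if judge(q["question"], q["gold"], hypothesis):
            correct += 1
    return correct / len(questions)
```

The same loop runs unchanged across all five iterations; only the prompt and context construction inside `answer` vary.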

93.6% · Retrieval R@10
58.4% · End-to-end QA (v5)

The full progression

| Category | v1 | v2 | v4 | v5 | v1→v5 |
|---|---:|---:|---:|---:|---:|
| Single-session assistant (56) | 80.4% | 82.1% | 94.6% | 83.9% | +3.5 |
| Single-session user (70) | 77.1% | 80.0% | 82.9% | 78.6% | +1.5 |
| Knowledge update (78) | 79.5% | 82.1% | 80.8% | 80.8% | +1.3 |
| Multi-session (133) | 39.1% | 36.1% | 45.1% | 43.6% | +4.5 |
| Temporal reasoning (133) | 28.6% | 41.4% | 34.6% | 42.1% | +13.5 |
| Single-session preference (30) | 30.0% | 43.3% | 26.7% | 43.3% | +13.3 |
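The headline scores fall directly out of the table: weight each category's accuracy by its question count and divide by 500. A quick sanity check, with counts and percentages taken verbatim from the table:

```python
# Recompute each version's overall score from the per-category table.
counts = [56, 70, 78, 133, 133, 30]           # questions per category
scores = {                                     # per-category accuracy, %
    "v1": [80.4, 77.1, 79.5, 39.1, 28.6, 30.0],
    "v2": [82.1, 80.0, 82.1, 36.1, 41.4, 43.3],
    "v4": [94.6, 82.9, 80.8, 45.1, 34.6, 26.7],
    "v5": [83.9, 78.6, 80.8, 43.6, 42.1, 43.3],
}

for version, accs in scores.items():
    correct = sum(round(a / 100 * n) for a, n in zip(accs, counts))
    print(version, round(correct / sum(counts) * 100, 1))
# -> v1 52.0, v2 56.4, v4 57.6, v5 58.4
```

The recomputed totals match the 52% → 56.4% → 57.6% → 58.4% progression discussed below.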

What changed at each step

v1 (52%): Naive prompt. "Answer based on these memories. If you don't have enough information, say so." Works for factual lookups (~80%) but fails on most questions requiring reasoning. The "say so" escape hatch did the most damage: 68% of failures were Claude declining to commit rather than answering wrongly.
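v1 amounted to a single template. The quoted instruction is verbatim from this post; the surrounding scaffolding is an illustrative reconstruction, not the actual prompt file.

```python
# v1 template: the instruction sentence is quoted from the post; the
# Memories/Question scaffolding is a reconstruction for illustration.

V1_PROMPT = (
    "Answer based on these memories. "
    "If you don't have enough information, say so.\n\n"
    "Memories:\n{memories}\n\n"
    "Question: {question}"
)

def build_v1_prompt(memories, question):
    formatted = "\n".join(f"- {m}" for m in memories)
    return V1_PROMPT.format(memories=formatted, question=question)
```

Later versions keep the scaffolding but drop the "say so" clause, which turned out to be the single largest win.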

v2 (56.4%): Type-aware chain-of-thought. Detect temporal, preference, multi-hop, and knowledge-update questions. Route each to a specialized prompt with step-by-step instructions. Temporal jumped from 28.6% to 41.4%. Preference jumped from 30% to 43.3%. Multi-session dropped 3pp because the structured reasoning made Claude more cautious about partial evidence.
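A minimal sketch of the v2 router. The four detected types come from the post; the keyword heuristics and prompt wording are assumptions — the real classifier could be anything from regexes to an LLM call.

```python
# Route each question to a type-specific chain-of-thought prompt.
# Detection here is keyword-based purely for illustration.

COT_PROMPTS = {
    "temporal": "Work out the dates and their order step by step, then answer.",
    "preference": "List every stated preference first, then answer.",
    "multi_hop": "Collect each intermediate fact, then combine them.",
    "knowledge_update": "When facts conflict, trust the most recent one.",
    "default": "Answer based on these memories.",
}

def detect_type(question):
    q = question.lower()
    if any(w in q for w in ("when", "how long", "before or after")):
        return "temporal"
    if any(w in q for w in ("prefer", "favorite", "recommend")):
        return "preference"
    if any(w in q for w in ("total", "all the", "how many")):
        return "multi_hop"
    if "currently" in q:
        return "knowledge_update"
    return "default"

def route(question):
    return COT_PROMPTS[detect_type(question)]
```

Routing misfires are one plausible source of the multi-session regression: a multi-hop question classified as "default" loses its step-by-step scaffolding.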

v4 (57.6%): Three structural changes, the most consequential being session expansion — retrieved memories bring in their surrounding session as context — applied uniformly to every question type. Multi-session jumped from 36.1% to 45.1% and single-session assistant hit 94.6%, but temporal fell to 34.6% and preference to 26.7%.

v5 (58.4%): Selective session expansion. v4 applied session expansion to all questions, which diluted signal for temporal and preference categories. v5 only expands sessions for aggregation and multi-hop questions. Temporal and preference questions get focused, memory-level retrieval. Result: temporal recovered from 34.6% to 42.1%, preference recovered from 26.7% to 43.3%, while multi-session held at 43.6%.
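v5's dispatch can be sketched like this. The data shapes — memory dicts carrying a `session_id`, and a `sessions` map from id to transcript — are assumptions about the store, not HA5H's actual schema.

```python
# Expand to full sessions only for types that aggregate evidence;
# temporal and preference questions keep focused memory snippets.

EXPAND_TYPES = {"aggregation", "multi_hop"}

def build_context(question_type, memories, sessions):
    """memories: retrieved snippet dicts; sessions: id -> full transcript."""
    if question_type in EXPAND_TYPES:
        session_ids = sorted({m["session_id"] for m in memories})
        return [sessions[sid] for sid in session_ids]  # whole sessions
    return [m["text"] for m in memories]               # snippets only
```

The branch is the whole trick: aggregation questions get everything a session said, while temporal questions avoid the surrounding chatter that diluted v4.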

The tradeoffs

No single configuration is best for every category. The data shows a clear tension: v4's uniform session expansion wins multi-session (45.1%) and single-session assistant (94.6%), while the focused retrieval of v2 and v5 wins temporal and preference (both back above 42% in v5).

v5 resolves this by picking the best strategy per question type. The overall score (58.4%) is higher than any single approach achieves, but each category could still be better if we could perfectly predict which strategy to use.

What we learned

68% of failures were excessive caution, not bad retrieval. The single biggest improvement came from removing the "say you don't know" instruction. Claude had the right memories but was refusing to commit to answers.

Context strategy must match question type. Session expansion helps aggregation but hurts temporal precision. The right retrieval strategy depends on what the question is asking for. One-size-fits-all doesn't work.

The retrieval ceiling is real. Full-context GPT-4o (all sessions in the context window, no retrieval) scores 60-64% on LongMemEval. Our 58.4% with retrieval is approaching that ceiling. Going beyond it requires better reasoning, not better retrieval.

Why we publish everything

Because the trajectory matters more than the number. 52% → 56.4% → 57.6% → 58.4% with clear explanations of what worked and what didn't is more useful than a single polished score. You can see what each change actually does, including the regressions we caused and fixed.

93.6% retrieval is real. 58.4% end-to-end QA is our best so far. Every number on this page is reproducible.