Developer exposes flawed AI memory benchmark after discovering 98.3% score was meaningless
A developer building Bastra Recall, an open-source memory server for Claude that stores notes in a local Markdown vault, initially reported a retrieval accuracy of 98.3%. The figure proved misleading: the benchmark queried each memory with the exact trigger phrases embedded in that memory's record, so the test effectively contained its own answers. To correct this, the developer designed a more rigorous test in which six AI persona agents with distinct writing styles generated 180 paraphrased queries across 30 stored memories, an average of six per memory. Adding a local embedding layer improved retrieval of heavily paraphrased queries from 63.1% to 79.6%, a gain of 16.5 percentage points, while the previously celebrated trigger-phrase feature provided no measurable benefit on paraphrased inputs. The developer concluded that retrieval benchmarks must test whether retrieval survives paraphrase, rather than whether it recalls exact wording, to reflect how AI systems actually query stored information over time.
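The gap between exact-wording recall and paraphrase survival can be illustrated with a minimal sketch. It is not Bastra Recall's code or the developer's benchmark: the `Memory` record, the toy word-overlap retriever, and the two sample memories and their queries are all hypothetical, chosen only to show why a test that reuses stored trigger phrases can score perfectly while paraphrased queries miss.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Memory:
    # Hypothetical record shape; the real vault format is not described in the summary.
    memory_id: str
    text: str
    trigger_phrases: List[str]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split on whitespace."""
    return set(text.lower().split())


def searchable_tokens(memory: Memory) -> Set[str]:
    """Words from the memory body plus words from its trigger phrases."""
    tokens = tokenize(memory.text)
    for phrase in memory.trigger_phrases:
        tokens |= tokenize(phrase)
    return tokens


def retrieve(query: str, memories: List[Memory]) -> Optional[str]:
    """Toy keyword retriever: return the id of the memory with the most shared words,
    or None when no memory shares any word with the query."""
    query_tokens = tokenize(query)
    best_id, best_score = None, 0
    for memory in memories:
        score = len(query_tokens & searchable_tokens(memory))
        if score > best_score:
            best_id, best_score = memory.memory_id, score
    return best_id


def hit_rate(cases: List[Tuple[str, str]], memories: List[Memory]) -> float:
    """Fraction of (query, expected_id) cases where the top result is the expected memory."""
    hits = sum(retrieve(query, memories) == expected for query, expected in cases)
    return hits / len(cases)


# Illustrative data only; not taken from the original benchmark.
memories = [
    Memory("m1", "Deploy the staging server with docker compose", ["deploy staging"]),
    Memory("m2", "Rotate the API keys every ninety days", ["rotate keys"]),
]

# Flawed test: each query is a trigger phrase copied from the record it should find.
exact_cases = [("deploy staging", "m1"), ("rotate keys", "m2")]

# Stricter test: the same needs expressed in different words.
paraphrased_cases = [
    ("how do I push a build to our test environment", "m1"),
    ("when should we rotate credentials", "m2"),
]

exact_rate = hit_rate(exact_cases, memories)
paraphrase_rate = hit_rate(paraphrased_cases, memories)
print(f"exact-wording hit rate: {exact_rate:.1%}")
print(f"paraphrase hit rate:    {paraphrase_rate:.1%}")
```

In this toy setup the exact-wording test scores perfectly because the queries were copied from the records themselves, while one of the two paraphrases shares no words with its target and is missed. An embedding-based retriever, as described in the summary, is one way to narrow that gap.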
This is an AI-generated summary. ShortSingh links to the original source for the complete article.